Edge universality of correlation matrices

Natesh S. Pillai, Jun Yin

Introduction

The aim of this paper is to prove the edge universality of correlation matrices. The data matrix $\widetilde{X}=(\widetilde{x}_{ij})$ is an $M\times N$ matrix with independent centered real-valued entries. The entries in each column $j$ are all assumed to be identically distributed:

Furthermore, the entries $q_{ij}$ have a subexponential decay, that is, there exists a constant $\vartheta>0$ such that for $u>1$ ,

Notice that all our constants may depend on $\theta$ and $\vartheta$ , but we will subsume this dependence in the notation.

The matrix ${\widetilde{X}}^{\dagger}\widetilde{X}$ is the usual covariance matrix. The $j$ th column of $\widetilde{X}$ is denoted by $\widetilde{\mathbf{x}}_{j}$ . Define the $M\times N$ matrix $X=(x_{ij})$

Since we are mainly interested in correlation matrices, without loss of generality, henceforth we will assume that

Covariance matrices are ubiquitous in modern multivariate statistics where the advance of technology has led to a profusion of high-dimensional data sets. See John01 , John07 , John08 , NYcov and the references therein for motivation and applications in a wide variety of fields. Correlation matrices are sometimes preferred in certain statistical applications. For instance, the classic exploratory method Principal Component Analysis (PCA) is not invariant to change of scale in the matrix entries. Therefore, it is often recommended first to standardize the matrix entries and then perform PCA on the resulting correlation matrix John01 .
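The scale issue mentioned above can be illustrated numerically. The following is a minimal sketch, assuming that standardization means dividing each column of the (already centered) data matrix by its Euclidean norm; the helper names `standardize_columns` and `spectrum`, the matrix sizes and the column scales are hypothetical choices made only for this illustration.

```python
import numpy as np

def standardize_columns(raw):
    """Divide each column by its Euclidean norm (assumed normalization)."""
    return raw / np.linalg.norm(raw, axis=0)

def spectrum(X):
    """Eigenvalues of X^T X in ascending order."""
    return np.linalg.eigvalsh(X.T @ X)

rng = np.random.default_rng(0)
M, N = 200, 50
raw = rng.standard_normal((M, N)) / np.sqrt(M)   # centered entries, variance 1/M
scales = rng.uniform(0.1, 10.0, size=N)          # hypothetical per-column units
rescaled = raw * scales                          # same data, different units

# The covariance-type spectrum changes when the columns are rescaled ...
cov_eigs = spectrum(raw)
cov_eigs_rescaled = spectrum(rescaled)

# ... while the spectrum of the standardized matrix does not.
corr_eigs = spectrum(standardize_columns(raw))
corr_eigs_rescaled = spectrum(standardize_columns(rescaled))

print("covariance spectra agree: ", np.allclose(cov_eigs, cov_eigs_rescaled))
print("correlation spectra agree:", np.allclose(corr_eigs, corr_eigs_rescaled))
```

Since the columns of the standardized matrix have unit norm, the diagonal of the resulting matrix is identically one, which is what makes its spectrum insensitive to the units of the individual variables.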

Recent progress in random matrix theory has led to a wealth of techniques for proving universality of various matrix ensembles (see EKYY11 , EKYY12 , ESY1 , ESY2 , ESY3 , ESY4 , EPRSY , ESYY , EYYBulkuni , EYYgenwig , EYYrigid , Joha12 , KY1 , KY2 , TaoVu09 , TaoVu10 and the references therein). Here the word universality refers to the phenomenon that the asymptotic distributions of various functionals of covariance/correlation matrices (such as eigenvalues, eigenvectors, etc.) are identical to those of Gaussian covariance/correlation matrices. Thus, harnessing these methods to obtain universality results in statistical problems is an important step, since these results let us calculate the exact asymptotic distributions of various test statistics without imposing restrictive distributional assumptions on the matrix entries. For instance, an important consequence of universality is that in some cases one can perform various hypothesis tests under the assumption that the matrix entries are not normally distributed but use the same test statistic as in the Gaussian case.

In this context, in a recent paper NYcov we studied the asymptotic distribution of the eigenvalues of the covariance matrix ${\widetilde{X}}^{\dagger}\widetilde{X}$ under the assumptions of (1) and (2). In NYcov , we proved that the Stieltjes transform of the empirical eigenvalue distribution of the sample covariance matrix is given by the Marcenko–Pastur law MP uniformly up to the edges of the spectrum with an error of order $(N\eta)^{-1}$ , where $\eta$ is the imaginary part of the spectral parameter in the Stieltjes transform. From this strong local Marcenko–Pastur law, we derived the following results: (1) rigidity of eigenvalues, (2) delocalization of eigenvectors, (3) universality of eigenvalues in the bulk and (4) universality of eigenvalues at the edges. Furthermore, in our proof of edge universality of eigenvalues for covariance matrices (see Theorem 7.5 of NYcov ), we gave a sufficient criterion for checking whether two matrices of the form $Q^{\dagger}Q$ (where $Q$ is a data matrix) have the same asymptotic eigenvalue distribution at the edge (see Section 3 for details). Here ${Q}^{\dagger}Q$ could be quite general, including covariance and correlation matrices.

Verifying the above criterion for correlation matrices is much more complicated, owing to the fact that even though the matrix has the same form ${X}^{\dagger}X$ as above, the entries of $X$ are not independent. Fortunately, in NYcov , as a byproduct, we also proved the strong Marcenko–Pastur law, the rigidity of eigenvalues and the delocalization of eigenvectors for correlation matrices (see Lemma 2.3 in Section 2 below or Theorem 1.5 of NYcov ). In this paper, we complete the research program initiated in NYcov by proving the edge universality of correlation matrices. There are few papers which study the asymptotics of correlation matrices, as compared to the relatively large literature on covariance matrices. The asymptotic distribution of the largest (appropriately rescaled) eigenvalue of the Gaussian correlation matrix was only very recently established by Bao1 . As will be explained below, we also obtain this result as a special case of our main result and, more importantly, we do not need this result in our proof (see Remark 1.3). The almost sure convergence of the largest and smallest eigenvalues of the correlation matrix was established in Jiang . The very recent paper Bao1 , relying on our results in NYcov , shows that the asymptotic distribution of the largest or smallest eigenvalue of the correlation matrix is given by the Tracy–Widom law, under the assumption that the data matrix $X$ satisfies (1) and its entries have symmetric distributions. In particular, the authors of Bao1 use the above-mentioned sufficient criterion for edge universality developed in NYcov . However, the assumption that the matrix entries are symmetric is very restrictive and not natural in statistical applications. In this paper we build on our previous work NYcov and prove edge universality of correlation matrices under the assumptions (1) and (2) alone. Furthermore, we believe that all of our main results should hold if one replaces the subexponential tail decay of the matrix entries by a uniform bound on the $p$ th moment $(p>4)$ of the matrix entries (e.g., $p=13$ will suffice), as proved in EKYY12 for Wigner matrices.

The central ideas in this paper are based on the general machinery for proving universality established in a series of recent papers EKYY11 , EKYY12 , ESY1 , ESY2 , ESY3 , ESY4 , EPRSY , ESYY , EYYBulkuni , EYYgenwig , EYYrigid , KY1 , KY2 , where the authors Yau, Erdős et al. study the distribution of eigenvalues and eigenvectors by studying the Green’s functions (resolvent) of the random matrices.

The proof of this paper is based on the comparison of Green’s functions first initiated in EYYBulkuni , but, as mentioned earlier, the key obstacle to be surmounted is the strong dependence of the entries of the correlation matrix. We achieve this via a novel argument which involves comparing the moments of the product of the entries of the standardized data matrix to those of the raw data matrix (see Section 3 for a summary of the key ideas). Our proof strategy may be extended for proving the edge universality of other random matrix ensembles with dependent entries and hence is of independent interest. Furthermore, it will be interesting to see if bulk universality of correlation matrices can be established using the methods developed in this paper.

Let us state the main result now. We denote $\lambda_{i}$ , $1\leq i\leq N$ , as the eigenvalues of $X^{\dagger}X$ and $\lambda_{\alpha}=0$ for $\min\{N,M\}+1\leq\alpha\leq\max\{N,M\}$ . We order them as

Analogously, let $\widetilde{\lambda}_{\alpha}$ denote the eigenvalues of the matrix ${\widetilde{X}}^{\dagger}{\widetilde{X}}$ .

The following is the main result of this paper. It shows that the largest and smallest $k$ eigenvalues of the correlation matrix, after appropriate centering and rescaling, converge in distribution to those of the corresponding covariance matrix.

An analogous result holds for the $k$ smallest eigenvalues.

In Pech07 , Sod1 and Sosh1 , Peche, Soshnikov and Sodin proved that for some covariance matrices (including the Wishart matrix), the largest and smallest $k$ eigenvalues, after appropriate centering and rescaling, converge in distribution to the Tracy–Widom law (here we use the term Tracy–Widom law as in Sosh1 ), whose density is a smooth function. Combining this with our recent result on the universality of covariance matrices in NYcov , we have the following immediate corollary of Theorem 1.1:

Let $X$ denote the correlation matrix as defined in (1)–(4). For any fixed $k>0$ , we have

Thus, as a special case, we also obtain the TW law for the Gaussian correlation matrices.
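As a rough numerical illustration of the statements above, one can compare the largest eigenvalue of a covariance matrix with that of the corresponding correlation matrix for non-Gaussian, non-symmetric entries. The sketch below is not part of the proof. It assumes the standard normalization in which the raw entries have variance $M^{-1}$, so that with $d=N/M$ the upper edge of the limiting spectrum is $(1+\sqrt{d})^{2}$, and it takes the correlation-type matrix to be the column-normalized data matrix; the entry distribution (centered exponential), the sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 600, 300
d = N / M

# Centered, skewed entries with unit variance and subexponential tails.
q = rng.exponential(1.0, size=(M, N)) - 1.0
X_tilde = q / np.sqrt(M)                          # raw data matrix, variance 1/M
X = X_tilde / np.linalg.norm(X_tilde, axis=0)     # assumed column standardization

# Largest eigenvalues of the covariance and correlation matrices.
top_cov = np.linalg.eigvalsh(X_tilde.T @ X_tilde)[-1]
top_corr = np.linalg.eigvalsh(X.T @ X)[-1]

# Upper spectral edge under the assumed normalization.
edge = (1.0 + np.sqrt(d)) ** 2

print(f"upper edge            : {edge:.4f}")
print(f"largest cov eigenvalue : {top_cov:.4f}")
print(f"largest corr eigenvalue: {top_corr:.4f}")
```

A single sample only shows that both eigenvalues sit close to the same edge; checking the limiting law itself would require many repetitions and the appropriate centering and rescaling, which are not attempted here.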

Although the current paper builds on our recent work NYcov , it is mostly self-contained and for the reader’s convenience, we will recall all of the needed results from NYcov . The rest of the paper is organized as follows. In Section 2, after establishing some notation, we give the key results establishing the strong Marcenko–Pastur law and rigidity of eigenvalues for correlation matrices, as obtained from NYcov . In Section 3 we give a brief proof sketch illustrating the key ideas. In Section 4 we give the proof of the main results and in Section 5 we prove some technical lemmas which constitute the key ingredients in the proof of the main result. For the rest of the paper the letter $C$ will denote a generic constant whose value might change from one line to the next, but will be independent of everything else. The notation $O_{\varepsilon}(N^{a})$ will be used to denote $O(N^{a+C{\varepsilon}})$ .

Preliminaries

In this paper we adopt the notation used in NYcov . Define the Green function of ${X}^{\dagger}X$ by

The Stieltjes transform of the empirical eigenvalue distribution of ${X}^{\dagger}X$ is given by

The Marcenko–Pastur (henceforth abbreviated by MP) law is given by

The function $m_{W}$ depends on $d$ and has the closed form solution

where $\sqrt{\ }$ denotes the square root on the complex plane whose branch cut is the negative real line. We also define the classical location of the eigenvalues with $\rho_{W}$ as follows:

Let $\zeta>0$ . We say that an event $\Omega$ holds with $\zeta$ -high probability if there exists a constant $C>0$ such that

Let us first give the following large deviation lemma for independent random variables (see EYYBulkuni , Appendix B for a proof).

Thus, the main result of NYcov (see Theorem 1.5 of NYcov ) is applicable for the correlation matrix $X$ , yielding the following strong local MP law and rigidity of eigenvalues:

Let $X=[x_{ij}]$ be the correlation matrix given by (4). Then for any $\zeta>0$ there exists a constant $C_{\zeta}$ such that the following events hold with $\zeta$ -high probability.

The Stieltjes transform of the empirical eigenvalue distribution of $X^{\dagger}X$ satisfies

where ${mS}(C_{\zeta})$ is defined as the set

The individual matrix elements of the Green function satisfy

The smallest nonzero and largest eigenvalues of $X^{\dagger}X$ satisfy

Rigidity of the eigenvalues: recall $\gamma_{j}$ in (12). For any $1\leq j\leq\min\{M,N\}$ , let $\widetilde{j}=\min\{\min\{N,M\}+1-j,j\}.$ Then

An analogous result holds for the smallest eigenvalues $\widetilde{\lambda}^{{\mathbf{v}}}_{\min\{M,N\}}$ and $\widetilde{\lambda}^{{\mathbf{w}}}_{\min\{M,N\}}$ .

As remarked in NYcov , Theorem 2.4 can be extended to finite correlation functions of extreme eigenvalues as follows:

for all $k$ fixed and sufficiently large $N$ . We remark that edge universality is usually formulated in terms of joint distributions of edge eigenvalues as in (2) with fixed parameters $s_{1},s_{2},\ldots$ etc. However, we note that Theorem 2.4 holds uniformly in these parameters, and thus they may depend on $N$ .
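The objects introduced in this section can be checked numerically. The following sketch assumes the usual conventions, which are not reproduced in the text above: the Green function is $G(z)=(X^{\dagger}X-z)^{-1}$, the empirical Stieltjes transform is $m(z)=N^{-1}\operatorname{Tr}G(z)$, and $m_{W}$ is the root with positive imaginary part of $dz\,m^{2}+(z+d-1)m+1=0$ with $d=N/M$. The function names, matrix sizes and the spectral parameter are hypothetical.

```python
import numpy as np

def green_function(X, z):
    """Resolvent (X^T X - z)^{-1} of the N x N matrix X^T X."""
    N = X.shape[1]
    return np.linalg.inv(X.T @ X - z * np.eye(N))

def stieltjes_empirical(X, z):
    """Normalized trace of the Green function."""
    return np.trace(green_function(X, z)) / X.shape[1]

def stieltjes_mp(z, d):
    """Assumed closed form of m_W: the root with positive imaginary part."""
    lam_minus = (1.0 - np.sqrt(d)) ** 2
    lam_plus = (1.0 + np.sqrt(d)) ** 2
    root = np.sqrt(complex((z - lam_minus) * (z - lam_plus)))
    candidates = [(1.0 - d - z + s * root) / (2.0 * d * z) for s in (1, -1)]
    return max(candidates, key=lambda m: m.imag)

rng = np.random.default_rng(2)
M, N = 800, 400
d = N / M
raw = rng.standard_normal((M, N)) / np.sqrt(M)
X = raw / np.linalg.norm(raw, axis=0)            # assumed column standardization

z = 2.0 + 0.5j
m_emp = stieltjes_empirical(X, z)
m_eig = np.mean(1.0 / (np.linalg.eigvalsh(X.T @ X) - z))   # same quantity via eigenvalues
m_mp = stieltjes_mp(z, d)

print("m(z) from resolvent  :", m_emp)
print("m(z) from eigenvalues:", m_eig)
print("m_W(z)               :", m_mp)
```

The spectral parameter here has a macroscopic imaginary part; the local law stated above concerns much smaller values of $\eta$, which this sketch does not probe.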

Key ideas and proof sketch

Our basic strategy is the so-called “Green function comparison” method initiated in a recent series of papers including EYYBulkuni , EYYgenwig , EYYrigid for proving universality for (generalized) Wigner matrices. The Green function comparison method has subsequently been applied to proving the spectral universality of adjacency matrices of random graphs EKYY11 , EKYY12 , the universality of eigenvectors of Wigner matrices KY1 , as well as the spectrum of additive finite-rank deformations of Wigner matrices and the isotropic local semicircle law KY2 .

In this paper, we will show that (2.4) and (2) still hold with $\widetilde{X}^{\mathbf{v}}$ and $\widetilde{X}^{\mathbf{w}}$ replaced by the correlation matrix $X$ and the corresponding covariance matrix $\widetilde{X}$ , that is, Theorem 1.1. To show this result, we introduce a sufficient criterion for (2.4) and (2) derived in NYcov (see Theorem 7.5 of NYcov ).

Consider two matrix ensembles $X^{{\mathbf{v}}},X^{{\mathbf{w}}}$ (these could be covariance, correlation or more general matrices; throughout the paper we use $X$ for the correlation matrix and $\widetilde{X}$ for the covariance matrix, and this is the only instance in which we denote a generic matrix by $X$ , for compactness of notation) and let their respective Green functions and empirical Stieltjes transforms [see (6) and (7)] be denoted by $G^{{\mathbf{v}}},G^{{\mathbf{w}}}$ and $m^{{\mathbf{v}}},m^{{\mathbf{w}}}$ . To prove that the asymptotic distributions of the extreme eigenvalues of the matrix ensembles $X^{{\mathbf{v}}},X^{{\mathbf{w}}}$ are identical in the sense of (2.4) and (2), it suffices to show the following NYcov :

(i) The matrices $X^{{\mathbf{v}}},X^{{\mathbf{w}}}$ satisfy the strong Marcenko–Pastur law and the rigidity of eigenvalues as given in Lemma 2.3.

(ii) The difference of the expectation of smooth functionals of the corresponding Green functions ( $G^{{\mathbf{v}}},G^{{\mathbf{w}}}$ and $m^{{\mathbf{v}}},m^{{\mathbf{w}}}$ ) evaluated at the spectral edge must vanish asymptotically. More precisely, as pointed out in NYcov , it suffices to establish Theorems 3.1 and 3.2 below for the matrices $X^{{\mathbf{v}}},X^{{\mathbf{w}}}$ .

and $\eta_{0}=N^{-2/3-{\varepsilon}}$ , we have

where the second term in the left-hand side above is obtained by changing the arguments of $F$ in the first term from $m^{\mathbf{v}}$ to $m^{\mathbf{w}}$ and keeping all the other parameters fixed.

Theorems 3.1 and 3.2 yield the edge universality of the $k$ -point correlation functions at the edge for $k=1$ and $k\geq 1$ , respectively.

Thus, to complete the proof of Theorem 1.1, by the Green function comparison method it suffices to show (i) and (ii) above for

where $X^{\dagger}X$ denotes the correlation matrix and $\widetilde{X}^{\dagger}\widetilde{X}$ is the corresponding covariance matrix. Here condition (i) is guaranteed by Theorem 2.3.

Verifying condition (ii) entails the heart of this paper. In previous works mentioned earlier, the authors use a Lindeberg replacement strategy, as in Chat , TaoVu09 . These proofs proceed via showing that the distribution of some smooth functional of the Green function (e.g., $G_{ii}$ , $m$ and $\langle{\mathbf{x}}_{1},G{\mathbf{x}}_{1}\rangle$ ) of the two matrix ensembles is identical asymptotically provided that the first two (in some cases up to four) moments of all matrix elements of these two ensembles are identical. For instance, if one needs to show the edge universality of two covariance matrices $\widetilde{X}{}^{{\mathbf{v}}}$ and $\widetilde{X}{}^{{\mathbf{w}}}$ , the basic strategy is to express

where $F$ is a smooth function and $\widetilde{G}_{\gamma}$ denotes the Green function of the ensemble $\widetilde{X}_{\gamma}$ (with $\widetilde{X}_{0}=\widetilde{X}^{{\mathbf{v}}}$ ) which is obtained from $\widetilde{X}_{\gamma-1}$ by replacing the distribution of the $ij$ th entry of $\widetilde{X}_{\gamma-1}[{ij}]$ with $\widetilde{X}^{{\mathbf{w}}}[{ij}]$ [here $\gamma=i+(j-1)M$ ] so that $\widetilde{X}_{MN}=\widetilde{X}^{{\mathbf{w}}}$ . The next step is to obtain an estimate

for each of the $N^{2}$ terms in the sum (28). Usually (29) is obtained by resolvent expansions, perturbation theory and the fact that $\widetilde{X}_{\gamma}$ and $\widetilde{X}_{\gamma-1}$ differ by a single entry and the first few moments of these two distributions are the same.

But clearly the above method does not work in our case, since the entries within the same column are not independent and, therefore, one cannot replace the distribution of a single entry of a column without changing the distribution of all the other $M-1$ entries. To circumvent this, in NYcov a new telescoping argument consisting of $O(N)$ ensembles was used for the comparison of Green functions. The idea is that instead of replacing entries one at a time, one can replace the entries of the data matrix column by column and thus require only $O(N)$ ensembles. This argument from NYcov is adapted here along with new insights for dealing with nonindependence of the entries and is outlined below.
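The column-by-column telescoping described above can be sketched as follows. This is only an illustration of the bookkeeping: the interpolating matrix with index $\gamma$ takes its first $\gamma$ columns from one ensemble and the remaining columns from the other, so that $N$ single-column swaps connect the two endpoints. The functional used here (the empirical Stieltjes transform at a fixed $z$), the coupling of the two ensembles through the same raw data, and the names `hybrid` and `stieltjes` are assumptions made for the example.

```python
import numpy as np

def stieltjes(X, z):
    """Empirical Stieltjes transform of X^T X at z."""
    eigs = np.linalg.eigvalsh(X.T @ X)
    return np.mean(1.0 / (eigs - z))

def hybrid(X_v, X_w, gamma):
    """Columns j <= gamma come from X_w, columns j > gamma from X_v (1-based j)."""
    out = X_v.copy()
    out[:, :gamma] = X_w[:, :gamma]
    return out

rng = np.random.default_rng(3)
M, N = 60, 30
X_w = (rng.exponential(1.0, size=(M, N)) - 1.0) / np.sqrt(M)   # raw data matrix
X_v = X_w / np.linalg.norm(X_w, axis=0)                        # standardized columns

z = 2.9 + 0.05j
values = [stieltjes(hybrid(X_v, X_w, g), z) for g in range(N + 1)]

# N increments, each produced by swapping a single column.
increments = [values[g - 1] - values[g] for g in range(1, N + 1)]
total = values[0] - values[N]

print("number of swaps     :", len(increments))
print("telescoping residual:", abs(sum(increments) - total))
```

The identity checked here is purely algebraic; the content of the argument lies in bounding each of the $N$ increments, which is what the estimates in the text provide.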

Now we set $X^{\mathbf{v}}=X,X^{\mathbf{w}}=\widetilde{X}$ . For $1\leq\gamma\leq N$ , let $X_{\gamma}$ denote the random matrix whose $j$ th column is the same as that of $X^{{\mathbf{v}}}$ if $j>\gamma$ and that of $X^{{\mathbf{w}}}$ otherwise. In particular, we can choose $X_{0}=X^{{\mathbf{v}}}=X$ and $X_{N}=X^{{\mathbf{w}}}=\widetilde{X}$ , where $X$ is the correlation matrix and $\widetilde{X}$ the corresponding covariance matrix of $X$ . As before, we define

Clearly, (25) will follow from (3) and the following estimate:

for some $\delta>0$ . Our strategy to obtain (31) is the following. First notice that

Let $X^{(\gamma)}$ be the $M\times(N-1)$ matrix obtained by removing the $\gamma$ th column of $X_{\gamma}$ , which has the same distribution as the $M\times(N-1)$ matrix obtained by removing the $\gamma$ th column of $X_{\gamma-1}$ . Define

In Lemma 4.1 we will establish (31) by showing that

Once (31) is verified, the main result follows by virtue of Theorems 3.1 and 3.2 as mentioned in the beginning of this section. Notice that since the columns of the data matrix $X^{\mathbf{v}}$ , $X^{\mathbf{w}}$ are assumed to be independent, $\mu$ is independent of the $\gamma$ th column of $X^{\mathbf{v}}$ , $X^{\mathbf{w}}$ or, equivalently, the $\gamma$ th column of $X_{\gamma}$ , $X_{\gamma-1}$ .

Thus, it boils down to establishing (3) in the case $X_{0}=X^{{\mathbf{v}}}=X$ and $X_{N}=X^{{\mathbf{w}}}=\widetilde{X}$ . Our proof relies on the key observation that even if the entries of the $\gamma$ th column vector ${\mathbf{x}}_{\gamma}$ are not independent, the difference between the moments of the entries of the standardized vector ${\mathbf{x}}_{\gamma}$ and its unnormalized counterpart $\widetilde{\mathbf{x}}_{\gamma}$ is at least an order of magnitude smaller than those of $\widetilde{\mathbf{x}}_{\gamma}$ . For instance, since ${\mathbf{x}}_{i\gamma}=O(N^{-1/2})$ for $1\leq i\leq M$ , for two independent ensembles of covariance matrices $\widetilde{X}{}^{{\mathbf{v}}}$ and $\widetilde{X}{}^{{\mathbf{w}}}$ satisfying (1) and (2), we have the bound

On the other hand, if $\widetilde{\mathbf{x}}_{\gamma}$ is the unnormalized counterpart of ${\mathbf{x}}_{\gamma}$ , as shown in Lemma 5.5,

The above observation combined with a resolvent expansion—detailed in Lemmas 4.3, 5.4 and 5.5—gives (3).

Proof of the main result

In this section we will prove (3) in the case $X_{0}=X^{{\mathbf{v}}}=X$ and $X_{N}=X^{{\mathbf{w}}}=\widetilde{X}$ . As discussed above, it implies (25) in Theorem 3.1. Similarly, one can prove (3.1) and (3.2) in Theorems 3.1 and 3.2, which complete the proof of Theorem 1.1, the main result of this paper.

It is easy to see that (3) is a direct consequence of the following lemma.

where $\widetilde{x}_{i1}$ are i.i.d. random variables with mean zero and variance $M^{-1}$ whose tails have a subexponential decay as given by (2).

Let $\widetilde{X}$ be the random matrix whose entries have the same distribution as $X$ except for the first column, and the first column of $\widetilde{X}$ is given by

where $\widetilde{x}_{i1}$ are as in (36). The columns of $\widetilde{X}$ are also assumed to be mutually independent. Let $m,\widetilde{m}$ denote the empirical Stieltjes transforms of $X^{\dagger}X$ , $\widetilde{X}^{\dagger}\widetilde{X}$ .

Then for any function $F$ satisfying (24), there exists $\delta>0$ , ${\varepsilon}_{0}>0$ depending only on $C_{1}$ such that for any ${\varepsilon}<{\varepsilon}_{0}$ and for any real number $E$ satisfying

Note: In this lemma $X$ and $\widetilde{X}$ are neither pure correlation nor pure covariance matrices, but their respective first columns are distributed according to the standardized data matrix and raw data matrix.

Under condition (37) (see NYcov ), we have the bound

First we collect some properties on submatrices of a generic $M\times N$ matrix $Q$ which can be proved using standard results from linear algebra. Let $Q^{(1)}$ be the $M\times(N-1)$ matrix obtained by removing the first column of $Q$ . Define

Then by definition, $G_{Q}^{(1)}$ is a $(N-1)\times(N-1)$ matrix, $\mathcal{G}_{Q}^{(1)}$ is a $M\times M$ matrix and we have the identity

Using the Cauchy interlacing theorem (see Equation (8.5) of ESYY ), it can be shown that

Proof of Lemma 4.1 First we note that from Theorem 1.5 of NYcov , the conclusions of Theorem 2.3 hold for both $X$ and $\widetilde{X}$ .
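The submatrix properties collected before the proof can be verified numerically. The sketch below assumes the natural definitions $G_{Q}^{(1)}=(Q^{(1)\dagger}Q^{(1)}-z)^{-1}$ and $\mathcal{G}_{Q}^{(1)}=(Q^{(1)}Q^{(1)\dagger}-z)^{-1}$ . Since the two matrices share their nonzero eigenvalues, their traces differ by $(M-N+1)/z$ , and eigenvalue interlacing bounds $|\operatorname{Tr}G_{Q}-\operatorname{Tr}G_{Q}^{(1)}|$ by a constant times $\eta^{-1}$ . The explicit constant $\pi$ used in the check, as well as the sizes and the value of $z$ , are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 40, 25
Q = rng.standard_normal((M, N)) / np.sqrt(M)   # generic M x N matrix
Q1 = Q[:, 1:]                                  # first column removed
z = 1.5 + 0.2j
eta = z.imag

# Green functions of the full matrix and of the minor (assumed definitions).
G = np.linalg.inv(Q.T @ Q - z * np.eye(N))
G1 = np.linalg.inv(Q1.T @ Q1 - z * np.eye(N - 1))    # (N-1) x (N-1)
calG1 = np.linalg.inv(Q1 @ Q1.T - z * np.eye(M))     # M x M

# Trace identity: the extra M-N+1 zero eigenvalues each contribute -1/z.
trace_gap = np.trace(G1) - np.trace(calG1)
expected_gap = (M - N + 1) / z

# Interlacing consequence: removing one column changes the trace by O(1/eta).
interlacing_gap = abs(np.trace(G) - np.trace(G1))

print("Tr G1 - Tr calG1   :", trace_gap)
print("(M - N + 1) / z    :", expected_gap)
print("|Tr G - Tr G1|     :", interlacing_gap, "<=", np.pi / eta)
```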

Let $X^{(1)}$ be the $M\times(N-1)$ matrix obtained by removing the first column of $X$ . Define

where $F^{(s)}$ denotes the $s$ th derivative of $F$ and $y_{k}$ ’s are defined as

where ${\mathbf{x}}_{1}$ denotes the first column of $X$ . Define the quantity

First, recall the following identity (see (6.23) of NYcov ):

Furthermore, as proved in Lemma 2.5 of NYcov ,

Fix $\zeta>0$ . From (19), Remark 4.2 and the bound $|G_{11}|\leq|m_{W}|+O(1)$ , it follows that for $z=E+i\eta_{0}$ ,

with $\zeta$ -high probability (see Definition 2.1). Therefore, with $\zeta$ -high probability, we have the identity

Define $y$ to be the l.h.s. of (4) multiplied by $\eta_{0}$ , that is,

Since ${\mathbf{x}}_{1}$ satisfies (15), (16) and (17), and $\mathcal{G}^{(1)}$ is independent of ${\mathbf{x}}_{1}$ , using Lemma 2.2, we infer that for some $C_{\zeta}>0$

with $\zeta$ -high probability. Using its definition, we bound $\operatorname{Tr}(\mathcal{G}^{(1)})^{2}$ as

where for the last two inequalities we have used (41), (42), (18) and (39). Similarly, we bound the last term of (52) with

Equation (50) and the fact $|z|+|m_{W}(z)|=O(1)$ yields that

holds with $\zeta$ -high probability. Consequently, using (24) and (4), we see that the expansion

holds with $\zeta$ -high probability. From the bounds on $y_{k}$ ’s obtained above, equation (4) follows.
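The argument above rests on expressing $G_{11}$ through a quadratic form in the first column ${\mathbf{x}}_{1}$ and the resolvent $\mathcal{G}^{(1)}$ of the matrix with that column removed. A minimal numerical check of one standard form of this relation, $G_{11}=(-z-z\langle{\mathbf{x}}_{1},\mathcal{G}^{(1)}{\mathbf{x}}_{1}\rangle)^{-1}$ , is sketched below. This form follows from the Schur complement formula and is stated here as an assumption, since the display from NYcov is not reproduced in the text; the sizes and the spectral parameter are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 30, 20
raw = rng.standard_normal((M, N)) / np.sqrt(M)
X = raw / np.linalg.norm(raw, axis=0)     # assumed column standardization
z = 2.9 + 0.05j

x1 = X[:, 0]      # first column
X1 = X[:, 1:]     # matrix with the first column removed

# Full Green function and the M x M resolvent of the minor.
G = np.linalg.inv(X.T @ X - z * np.eye(N))
calG1 = np.linalg.inv(X1 @ X1.T - z * np.eye(M))

# Quadratic form <x1, calG1 x1>; x1 is real, so no conjugation is needed.
quad_form = x1 @ calG1 @ x1

# Schur complement expression for the (1,1) entry of G.
G11_schur = 1.0 / (-z - z * quad_form)

print("G_11 from inversion       :", G[0, 0])
print("G_11 from Schur complement:", G11_schur)
```

Because $\mathcal{G}^{(1)}$ does not involve ${\mathbf{x}}_{1}$ , all of the dependence of $G_{11}$ on the first column enters through this single quadratic form, which is what makes a column-by-column comparison tractable.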

Now we estimate $\widetilde{G}$ , which is defined as

Let $\widetilde{X}{}^{(1)}$ be the $M\times(N-1)$ matrix obtained by removing the first column of $\widetilde{X}$ and $\widetilde{\mathbf{x}}_{1}$ denote its first column. Proceeding as in the previous calculations,

Notice that $\mu$ appears in (4) because the entries of $\widetilde{X}^{(1)}$ and $X^{(1)}$ are assumed to be identically distributed.

The symmetric matrices $Y$ and $Z$ are independent of ${\mathbf{x}}_{1}$ and $\widetilde{\mathbf{x}}_{1}$ . Clearly, $YZ=ZY$ . Therefore, using the fact that $z$ , $m_{W}\sim 1$ , we can write

where $C_{k,n}=O(1)$ . Let $\mathcal{Y}=({\mathbf{x}}_{1},Y{\mathbf{x}}_{1})$ and $\mathcal{Z}=({\mathbf{x}}_{1},Z{\mathbf{x}}_{1})$ . Then (4) can be written as

Define $\widetilde{\mathcal{Y}}=(\widetilde{\mathbf{x}}_{1},Y\widetilde{\mathbf{x}}_{1})$ and $\widetilde{\mathcal{Z}}=(\widetilde{\mathbf{x}}_{1},Z\widetilde{\mathbf{x}}_{1})$ . Using (4) and proceeding similarly as before, we obtain that (4) also holds for the case when $G$ , $\mathcal{Y}$ and $\mathcal{Z}$ are replaced with $\widetilde{G}$ , $\widetilde{\mathcal{Y}}$ and $\widetilde{\mathcal{Z}}$ , respectively. The following is the key technical lemma of this paper whose proof is deferred to the next section.

for some constant $C$ . Let $\mathcal{A}$ be of the form

where $Y_{i}=Y$ or $Y^{*}$ and $Z_{j}=Z$ or $Z^{*}$ with $Y,Z$ as defined in (58) and $a,b$ are integers with $1\leq a\leq 3,1\leq a+b\leq 3$ . Then, under the assumptions of Lemma 4.1, we have

where $\widetilde{\mathcal{A}}$ is obtained by replacing ${\mathbf{x}}$ with $\widetilde{\mathbf{x}}$ in (61).

Taking the difference of (4) and the equation obtained from (4) by replacing $G$ , $\mathcal{Y}$ and $\mathcal{Z}$ with $\widetilde{G}$ , $\widetilde{\mathcal{Y}}$ and $\widetilde{\mathcal{Z}}$ , we deduce that the difference

Finally, we are ready to give the proof of the main result of this paper.

Proof of Theorem 1.1 By the Green function comparison theorem discussed in Section 3, it only remains to prove that Theorems 3.1 and 3.2 hold for the case

For simplicity, we will only prove (25) of Theorem 3.1; the rest can be proved using almost identical arguments.

For $1\leq\gamma\leq N$ , let $X_{\gamma}$ denote the random matrix whose $j$ th column is the same as that of $X^{{\mathbf{v}}}$ if $j>\gamma$ and that of $X^{{\mathbf{w}}}$ otherwise; in particular, $X_{0}=X^{{\mathbf{v}}}$ and $X_{N}=X^{{\mathbf{w}}}$ . As before, we define

Applying Lemma 4.1 to $X_{\gamma}$ and $X_{\gamma-1}$ gives the estimate

for some $\delta>0$ . Now (25) follows from (4) and (64) and the proof is finished.

Moment computations

In this section we prove Lemma 4.3. For notational convenience, let us denote ${\mathbf{x}}={\mathbf{x}}_{1},\widetilde{\mathbf{x}}=\widetilde{\mathbf{x}}_{1}$ . We will also write

Recall $\mu$ from (44). For the rest of this section, $a,b$ will denote two integers with

Before stating the key results of this section, let us first give some definitions.

For any partition $A$ of the set $\{1,2,\ldots,2a+2b\}$ , and a vector $\mathbf{k}=(k_{1},k_{2},\ldots,k_{2a+2b})$ with $k_{i}\in\{1,2,\ldots,M\}$ , define the binary function $\mathcal{I}(A,\mathbf{k})$ as follows. The function $\mathcal{I}(A,\mathbf{k})$ is equal to 1 if (1) for any $i,j$ in the same block of $A$ we have $k_{i}=k_{j}$ , and (2) for any $i,j$ in different blocks of $A$ we have $k_{i}\neq k_{j}$ ; otherwise $\mathcal{I}(A,\mathbf{k})=0$ .

Given a partition $A$ of the set $\{1,2,\ldots,2a+2b\}$ , let $\mathcal{N}(A,1)$ be the number of the blocks in $A$ that contain only one element of the set $\{1,2,\ldots,2a+2b\}$ . Let $\mathcal{N}(A,2)$ be the number of the blocks in $A$ of the form $\{k_{2i-1},k_{2i}\}$ with $i>a$ . Note that $\mathcal{N}(A,2)$ depends on $a$ and $b$ in addition to $A$ . Let ${\mathbf{I}}_{(A,3)}$ be equal to one if and only if $a+b=3$ and $A$ is composed of $2$ blocks with three elements in each block.
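The combinatorial quantities just defined are easy to compute for small examples. The following sketch represents a partition as a list of Python sets of 1-based positions. It reads the condition "blocks of the form $\{k_{2i-1},k_{2i}\}$ with $i>a$" as "blocks equal to the pair of positions $\{2i-1,2i\}$ with $i>a$", which is an interpretation rather than something stated explicitly; all function names are hypothetical.

```python
def indicator(partition, k):
    """I(A, k): 1 iff k is constant on each block and distinct across blocks."""
    block_values = []
    for block in partition:
        values = {k[i - 1] for i in block}      # positions are 1-based
        if len(values) != 1:
            return 0
        block_values.append(values.pop())
    return int(len(set(block_values)) == len(block_values))

def count_singletons(partition):
    """N(A, 1): number of blocks with exactly one element."""
    return sum(1 for block in partition if len(block) == 1)

def count_adjacent_pairs(partition, a):
    """N(A, 2): number of blocks equal to {2i-1, 2i} with i > a (assumed reading)."""
    count = 0
    for block in partition:
        if len(block) == 2:
            lo, hi = sorted(block)
            if lo % 2 == 1 and hi == lo + 1 and hi // 2 > a:
                count += 1
    return count

def is_two_triples(partition, a, b):
    """I_(A,3): 1 iff a + b = 3 and A consists of two blocks of three elements."""
    return int(a + b == 3 and len(partition) == 2
               and all(len(block) == 3 for block in partition))

# Example with a = 1, b = 2, so the index set is {1, ..., 6}.
a, b = 1, 2
A = [{1}, {2, 3}, {4, 5, 6}]
print(indicator(A, (5, 7, 7, 2, 2, 2)))   # constant on blocks, distinct across blocks
print(indicator(A, (5, 5, 5, 2, 2, 2)))   # blocks {1} and {2, 3} share a value
print(count_singletons(A), count_adjacent_pairs(A, a), is_two_triples(A, a, b))
```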

The proof of Lemma 4.3 relies on Lemmas 5.4 and 5.5 stated below and proved at the end of this section.

Recall the matrices $Y,Z$ from (58). Then for any ${\varepsilon}>0$ the following estimate

holds with $\zeta$ -high probability for any fixed $\zeta>0$ . The result also holds if any of the $Y,Z$ are replaced by their complex conjugates $Y^{*},Z^{*}$ , respectively.

Let $\widetilde{y}_{i}$ be i.i.d. random variables such that

and have a subexponential decay as in (2). Let $A$ be a partition of the set $\{1,2,\ldots,2a+2b\}$ and let

Then for any vector $\mathbf{k}=(k_{1},k_{2},\ldots,k_{2a+2b})$ and for any ${\varepsilon}>0$ , we have

With the above two lemmas in hand, we are now ready to give the proof of Lemma 4.3.

Proof of Lemma 4.3 We will only prove the case when

The other cases can be proved similarly. First, let us write (61) as

where the summation index $A$ ranges over all the partitions of the set $\{1,2,\ldots,2a+2b\}$ . Taking expectations, and using the fact that ${\mathbf{x}}$ is independent of $Y$ , $Z$ and $\mu$ , leads to

Now we claim that the terms in the r.h.s. of (5) are bounded by $O_{\varepsilon}(N^{-7/6})$ . Indeed, note that $\mathcal{N}(A,1)>0$ implies ${\mathbf{I}}_{(A,3)}=0$ . Therefore, the worst-case scenario is the case in which

since by definition we have $\mathcal{N}(A,2)\leq b$ . But it is easy to see the above scenario cannot occur, since if the first two conditions hold, then it follows that $\mathcal{N}(A,1)=0$ or $2$ . Thus, we have finished the proof of Lemma 4.3.

Proof of Lemma 5.4 Note that all of the bounds in this lemma hold with $\zeta$ -high probability, not in expectation. For simplicity, we will subsume this in the notation.

First let us prove a slightly different result. Define the binary function $\widetilde{\mathcal{I}}(A,\mathbf{k})$ [similar to $\mathcal{I}(A,\mathbf{k})$ ] as follows. $\widetilde{\mathcal{I}}(A,\mathbf{k})$ is equal to 1 in the following scenarios: (1) for any $i,j$ in the same block of $A$ we have $k_{i}=k_{j}$ , (2) if $i,j$ are in different blocks of $A$ , we have $k_{i}\neq k_{j}$ except that if one of the indices $i,j$ is in the block of $A$ which contains exactly two elements, then $k_{i}$ is allowed to be equal to $k_{j}$ . In all other instances $\widetilde{\mathcal{I}}(A,\mathbf{k})=0$ . For instance, in the previous example (65), we have

Let us first prove (5) when ${\mathbf{I}}_{(A,3)}=0$ . Define the functions

where $\alpha_{i}\in\{1,2\}$ and $m_{i}\leq 2a+b$ .

To this end, we will use the following 2–1–3 rule:

2: If the index $i$ appears in a block of $A$ which contains exactly two elements, first sum up over the index $k_{i}$ . Then estimate the remaining terms with absolute sum. For example, let $A=\{\{1\},\{2,3\},\{4\}\}$ . Recall that $Y=Z^{2}$ ,

1: Next do the summation over the index $k_{i}$ if $i$ appears in the block of $A$ which contains only one element as follows:

In the above inequalities, we have used the Cauchy–Schwarz inequality and the fact that $Z$ is a symmetric matrix. Note that each summation of the above kind brings an extra $N^{1/2}$ factor.

3: Finally, sum up over the other indices. After the first two steps, (5) will be reduced to a product of terms of the following form:

If $m+n=2$ , then using the Cauchy–Schwarz inequality, (72) can be estimated as

For $m+n>2$ , we bound $m+n-2$ of them [ $|(Z^{m_{i}})_{kk}|$ or $\sqrt{(|Z|^{2n_{j}})_{kk}}$ ] by the maximum as follows:

to reduce to the case of $m+n=2$ and use the bound (73).

Let us give an example in the case $a=1$ , $b=2$ and $A=\{\{1\},\{2,3\},\{4,5,6\}\}$ . Then the term (5) in this case reduces to

where the above inequality is obtained by applying rule 2. Next, applying rule 1 yields

and, finally, applying rule 3 leads to the bound

Using this 2–1–3 rule described above, we obtain (71). By the definition of the 2–1–3 rule, it is easy to see that

Recall $\eta_{0}=N^{-2/3-{\varepsilon}}$ . Using (4) and (54), we deduce that if $\alpha_{i}m_{i}\neq 1$ , then

For $\alpha_{i}m_{i}=1$ , using (41), (42), (18) and $m_{W}=O(1)$ , we see that $g_{1}(1)=O_{\varepsilon}(N)$ . Thus,

Combining equations (71)–(75), we have the bound

Now notice that by the definition, the term $g_{1}(1)$ in (71) can only be created during the first step of the 2–1–3 rule, that is, the 2 rule, and, therefore, we deduce that

which completes the proof of the claim made in (5) for the case ${\mathbf{I}}_{(A,3)}=0$ .

Now consider the case ${\mathbf{I}}_{(A,3)}=1$ . Using the fact that $Y,Z$ are symmetric matrices and the relation $Y=Z^{2}$ , we deduce that the term

reduces to one of the following situations:

for $m_{i}\in\{1,2\},i\in\{1,2,3\}$ . We bound the first scenario above as

where in the last inequality we have used the fact that $\sum_{i}m_{i}=2a+b$ . For the second case in (5), first we note

Summarizing the above computations, and noticing that $\mathcal{N}(A,1)=\mathcal{N}(A,2)=0$ when ${\mathbf{I}}_{(A,3)}=1$ , we obtain the bound

proving the claim (5) when ${\mathbf{I}}_{(A,3)}=1$ .

Now we return to the proof of Lemma 5.4. One can see that for any partition $A$ of the set $\{1,2,\ldots,2a+2b\}$ and a vector $\mathbf{k}$ , the function $\mathcal{I}(A,\mathbf{k})$ can be written as a linear combination of the functions $\widetilde{\mathcal{I}}(A_{i},\mathbf{k})$ for some partitions $A_{i}$ of the set $\{1,2,\ldots,2a+2b\}$ such that

Using large deviation bounds, it is easy to see that for any ${\varepsilon}>0$

where $C_{n}=C_{a,b,n}$ is a combinatorial factor. Using (80), the r.h.s. of equation (5) may be expressed as

Since $n_{0},a,b=O(1)$ , the combinatorial factors do not increase with $N$ , that is, $C_{n}=O(1)$ , and, thus, we can bound

as follows. Notice that the number of distinct indices $k_{i}$ in (83) is equal to the number of blocks in the partition $A$ . Thus, for a given set of values for the indices $r_{1},r_{2},\ldots,r_{n}$ , the term (83) is nonzero only if at least $\mathcal{N}(A,1)$ of the indices $r_{j}$ belong to the set $\{k_{1},k_{2},\ldots,k_{2a+2b}\}$ . The above observation also implies that for (83) to be nonzero we must have

Therefore, the number of nonzero terms in the sum

is $O((N^{1/2})^{n-\mathcal{N}(A,1)})$ , and each of these terms is of size $O_{\varepsilon}(N^{-(a+b)-n})$ , yielding

Combining (5) with (84) and the observation made in (85), we obtain that

obtaining (5.5), and the proof is finished.

Acknowledgments

The authors would like to thank Jiefeng Jiang, two anonymous referees, the Associate Editor and the Editor for very useful comments.