Edge universality of correlation matrices

Natesh S. Pillai, Jun Yin

Introduction

The aim of this paper is to prove the edge universality of correlation matrices. The data matrix X~=(x~ij)\widetilde{X}=(\widetilde{x}_{ij}) is an M×NM\times N matrix with independent centered real-valued entries. The entries in each column jj all are assumed to be identically distributed:

Furthermore, the entries qijq_{ij} have a subexponential decay, that is, there exists a constant ϑ>0\vartheta>0 such that for u>1u>1,

Notice that all our constants may depend on θ\theta and ϑ\vartheta, but we will subsume this dependence in the notation.

The matrix X~X~{\widetilde{X}}^{\dagger}\widetilde{X} is the usual covariance matrix. The jjth column of X~\widetilde{X} is denoted by x~j\widetilde{\mathbf{x}}_{j}. Define the matrix M×NM\times N matrix X=(xij)X=(x_{ij})

Since we are mainly interested in correlation matrices, without loss of generality, henceforth we will assume that

Covariance matrices are ubiquitous in modern multivariate statistics where the advance of technology has led to a profusion of high-dimensional data sets. See John01 , John07 , John08 , NYcov and the references therein for motivation and applications in a wide variety of fields. Correlation matrices are sometimes preferred in certain statistical applications. For instance, the classic exploratory method Principal Component Analysis (PCA) is not invariant to change of scale in the matrix entries. Therefore, it is often recommended first to standardize the matrix entries and then perform PCA on the resulting correlation matrix John01 .

Recent progress in random matrix theory has led to a wealth of techniques for proving universality of various matrix ensembles (see EKYY11 , EKYY12 , ESY1 , ESY2 , ESY3 , ESY4 , EPRSY , ESYY , EYYBulkuni , EYYgenwig , EYYrigid , Joha12 , KY1 , KY2 , TaoVu09 , TaoVu10 and the references therein). Here the word universality refers to the phenomenon that the asymptotic distributions of various functionals of covariance/correlation matrices (such as eigenvalues, eigenvector, etc.) are identical to those Gaussian covariance/correlation matrices. Thus, harnessing these methods to obtain universality results in statistical problems is an important step, since these results let us calculate the exact asymptotic distributions of various test statistics without having restrictive distributional assumptions of the matrix entries. For instance, an important consequence of universality is that in some cases one can perform various hypothesis tests under the assumption that the matrix entries are not normally distributed but use the same test statistic as in the Gaussian case.

In this context, in a recent paper NYcov we studied the asymptotic distribution of the eigenvalues of the covariance matrix X~X~{\widetilde{X}}^{\dagger}\widetilde{X} under the assumptions of (1) and (2). In NYcov , we proved that the Stieltjes transform of the empirical eigenvalue distribution of the sample covariance matrix is given by the Marcenko–Pastur law MP uniformly up to the edges of the spectrum with an error of order (Nη)1(N\eta)^{-1}, where η\eta is the imaginary part of the spectral parameter in the Stieltjes transform. From this strong local Marcenko–Pastur law, we derived the following results: (1) rigidity of eigenvalues (2) delocalization of eigenvectors (3) universality of eigenvalues in the bulk and (4) universality of eigenvalues at the edges. Furthermore, in our proof of edge universality of eigenvalues for covariance matrices (see Theorem 7.5 of NYcov ), we gave a sufficient criterion for checking whether two matrices of form QQQ^{\dagger}Q (QQ is a data matrix) have the same asymptotic eigenvalue distribution at the edge (see Section 3 for details). Here QQ{Q}^{\dagger}Q could be quite general, including covariance and correlation matrices.

Verifying the above criteria for correlation matrices is much more complicated, owing to the fact that even if it has the same form XX{X}^{\dagger}X as above, the matrix entries of XX are not independent. Fortunately in NYcov , as a byproduct, we also proved the strong Marcenko–Pastur law, the rigidity of eigenvalues and delocalization of eigenvectors of correlation matrices (see Lemma 2.3 in Section 2 below or Theorem 1.5 of NYcov ). In this paper, we complete the research program initiated in NYcov by proving the edge universality of correlation matrices. There are not many papers which study the asymptotics of the correlation matrices as compared to the relatively large literature on covariance matrices. The asymptotic distribution of the largest (appropriately rescaled) eigenvalue of the Gaussian correlation matrix was only very recently established by Bao1 . As will be explained below, we also obtain this result as a special case of our main result and, more importantly, we do not need this result in our proof (see Remark 1.3). The almost sure convergence of the largest and smallest eigenvalues of the correlation matrix was established in Jiang . The very recent paper Bao1 , relying on our results in NYcov , shows that the asymptotic distribution of the largest or smallest eigenvalue of the correlation matrix is given by the Tracy–Widom law, under the assumption that the data matrix XX satisfies (1) and its entries have symmetric distributions. In particular, the authors in Bao1 use the above mentioned sufficiency criteria for edge universality developed in NYcov . Furthermore, the assumption that the matrix entries are symmetric is very restrictive and not natural in statistical applications. In this paper we will build on our previous work NYcov and prove edge universality of correlation matrices just under the assumptions (1) and (2). Furthermore, we believe that all of our main results should hold if one replaces the subexponential tail decay of the matrix entries by a uniform bound on the ppth moment (p>4)(p>4) of the matrix entries (e.g., p=13p=13 will suffice), as proved in EKYY12 for Wigner matrices.

The central ideas in this paper are based on the general machinery for proving universality established in a series of recent papers EKYY11 , EKYY12 , ESY1 , ESY2 , ESY3 , ESY4 , EPRSY , ESYY , EYYBulkuni , EYYgenwig , EYYrigid , KY1 , KY2 , where the authors Yau, Erdős et al. study the distribution of eigenvalues and eigenvectors by studying the Green’s functions (resolvent) of the random matrices.

The proof of this paper is based on the comparison of Green’s functions first initiated in EYYBulkuni , but, as mentioned earlier, the key obstacle to be surmounted is the strong dependence of the entries of the correlation matrix. We achieve this via a novel argument which involves comparing the moments of the product of the entries of the standardized data matrix to those of the raw data matrix (see Section 3 for a summary of the key ideas). Our proof strategy may be extended for proving the edge universality of other random matrix ensembles with dependent entries and hence is of independent interest. Furthermore, it will be interesting to see if bulk universality of correlation matrices can be established using the methods developed in this paper.

Let us state the main result now. We denote λi\lambda_{i}, 1iN1\leq i\leq N, as the eigenvalues of XXX^{\dagger}X and λα=0\lambda_{\alpha}=0 for min{N,M}+1αmax{N,M}\min\{N,M\}+1\leq\alpha\leq\max\{N,M\}. We order them as

Analogously, let λ~α\widetilde{\lambda}_{\alpha} denote the eigenvalues values of the matrix X~X~{\widetilde{X}}^{\dagger}{\widetilde{X}}.

The following is the main result of this paper. It shows that the largest and smallest kk eigenvalues of the correlation matrix, after appropriate centering and rescaling, converge in distribution to those of the corresponding covariance matrix.

An analogous result holds for the kk smallest eigenvalues.

In Pech07 , Sod1 and Sosh1 , Peche, Soshnikov and Sodin proved that for some covariance matrices (including the Wishart matrix), the largest and smallest kk eigenvalues after appropriate centering and rescaling converge in distribution to the Tracy–Widom lawHere we use the term Tracy–Widom law as in Sosh1 . whose density is a smooth function. Combining with our recent result on the universality of covariance matrices in NYcov , we have the following immediate corollary for Theorem 1.1:

Let XX denote the correlation matrix as defined in (1)–(4). For any fixed k>0k>0, we have

Thus, as a special case, we also obtain the TW law for the Gaussian correlation matrices.

Although the current paper builds on our recent work NYcov , it is mostly self-contained and for the reader’s convenience, we will recall all of the needed results from NYcov . The rest of the paper is organized as follows. In Section 2, after establishing some notation, we give the key results establishing the strong Marcenko–Pastur law and rigidity of eigenvalues for correlation matrices, as obtained from NYcov . In Section 3 we give a brief proof sketch illustrating the key ideas. In Section 4 we give the proof of the main results and in Section 5 we prove some technical lemmas which constitute the key ingredients in the proof of the main result. For the rest of the paper the letter CC will denote a generic constant whose value might change from one line to the next, but will be independent of everything else. The notation Oε(Na)O_{\varepsilon}(N^{a}) will be used to denote O(Na+Cε)O(N^{a+C{\varepsilon}}).

Preliminaries

We will adopt the notation used in this paper from NYcov . Define the Green function of XX{X}^{\dagger}X by

The Stieltjes transform of the empirical eigenvalue distribution of XX{X}^{\dagger}X is given by

The Marcenko–Pastur (henceforth abbreviated by MP) law is given by

The function mWm_{W} depends on dd and has the closed form solution

where  \sqrt{\ } denotes the square root on a complex plane whose branch cut is the negative real line. We also define the classical location of the eigenvalues with ρW\rho_{W} as follows:

Let ζ>0\zeta>0. We say that an event Ω\Omega holds with ζ\zeta-high probability if there exists a constant C>0C>0 such that

Let us first give the following large deviation lemma for independent random variables (see EYYBulkuni , Appendix B for a proof).

Thus, the main result of NYcov (see Theorem 1.5 of NYcov ) is applicable for the correlation matrix XX, yielding the following strong local MP law and rigidity of eigenvalues:

Let X=[xij]X=[x_{ij}] be the correlation matrix given by (4). Then for any ζ>0\zeta>0 there exists a constant CζC_{\zeta} such that the following events hold with ζ\zeta-high probability.

The Stieltjes transform of the empirical eigenvalue distribution of XXX^{\dagger}X satisfies

where mS(Cζ){mS}(C_{\zeta}) defined as the set

The individual matrix elements of the Green function satisfy

The smallest nonzero and largest eigenvalues of XXX^{\dagger}X satisfy

Rigidity of the eigenvalues: recall γj\gamma_{j} in (12). For any 1j\breakmin{M,N}1\leq j\leq\break\min\{M,N\}, let j~=min{min{N,M}+1j,j}.\widetilde{j}=\min\{\min\{N,M\}+1-j,j\}. Then

An analogous result holds for the smallest eigenvalues λ~min{M,N}v\widetilde{\lambda}^{{\mathbf{v}}}_{\min\{M,N\}} and λ~min{M,N}w\widetilde{\lambda}^{{\mathbf{w}}}_{\min\{M,N\}}.

As remarked in NYcov , Theorem 2.4 can be extended to finite correlation functions of extreme eigenvalues as follows:

for all kk fixed and sufficiently large NN. We remark that edge universality is usually formulated in terms of joint distributions of edge eigenvalues as in (2) with fixed parameters s1,s2,s_{1},s_{2},\ldots etc. However, we note that Theorem 2.4 holds uniformly in these parameters, and thus they may depend on NN.

Key ideas and proof sketch

Our basic strategy is the so-called “Green function comparison” method initiated in a recent series of papers including EYYBulkuni , EYYgenwig , EYYrigid for proving universality for (generalized) Wigner matrices. The Green function comparison method has subsequently been applied to proving the spectral universality of adjacency matrices of random graphs EKYY11 , EKYY12 , the universality of eigenvectors of Wigner matrices KY1 , as well as the the spectrum of additive finite-rank deformations of Wigner matrices and the isotropic local semicircle law KY2 .

In this paper, we will show that (2.4) and (2) still hold with X~v\widetilde{X}^{\mathbf{v}} and X~w\widetilde{X}^{\mathbf{w}} replaced by the correlation matrix XX and the corresponding covariance matrix X~\widetilde{X}, that is, Theorem 1.1. To show this result, we introduce a sufficient criteria for (2.4) and (2) derived in NYcov (see Theorem 7.5 of NYcov ).

Consider two matrix ensembles Xv,XwX^{{\mathbf{v}}},X^{{\mathbf{w}}} (could be covariance, correlation or more general matrixNotice that throughout the paper we use XX for the correlation matrix and X~\widetilde{X} for the covariance matrix. This is the only instance we denote a generic matrix by XX for compactness of notation.) and let their respective Green functions and empirical Stieltjes transforms [see (6) and (7)] be denoted by Gv,GwG^{{\mathbf{v}}},G^{{\mathbf{w}}} and mv,mwm^{{\mathbf{v}}},m^{{\mathbf{w}}}. To prove that the asymptotic distribution of the extreme eigenvalues of the matrix ensembles Xv,XwX^{{\mathbf{v}}},X^{{\mathbf{w}}} are identical in the sense of (2.4) and (2), it suffices to show the following NYcov : {longlist}

The matrices Xv,XwX^{{\mathbf{v}}},X^{{\mathbf{w}}} satisfy the strong Marcenko–Pastur law and the rigidity of eigenvalues as given in Lemma 2.3.

The difference of the expectation of smooth functionals of the corresponding Green functions (Gv,GwG^{{\mathbf{v}}},G^{{\mathbf{w}}} and mv,mwm^{{\mathbf{v}}},m^{{\mathbf{w}}}) evaluated at the spectral edge must vanish asymptotically. More precisely, as pointed out in NYcov , it suffices to establish Theorems 3.1 and 3.2 below for the matrices Xv,XwX^{{\mathbf{v}}},X^{{\mathbf{w}}}.

and η0=N2/3ε\eta_{0}=N^{-2/3-{\varepsilon}}, we have

where the second term in the left-hand side above is obtained by changing the arguments of FF in the first term from mvm^{\mathbf{v}} to mwm^{\mathbf{w}} and keeping all the other parameters fixed.

Theorems 3.1 and 3.2 yield the edge universality of the kk-point correlation functions at the edge for k=1k=1 and k1k\geq 1, respectively.

Thus, to complete the proof of Theorem 1.1, by the Green function comparison method it suffices to show (i) and (ii) above for

where XXX^{\dagger}X denotes the correlation matrix and X~X~\widetilde{X}^{\dagger}\widetilde{X} is the corresponding covariance matrix. Here condition (i) is guaranteed by Theorem 2.3.

Verifying condition (ii) entails the heart of this paper. In previous works mentioned earlier, the authors use a Lindeberg replacement strategy, as in Chat , TaoVu09 . These proofs proceed via showing that the distribution of some smooth functional of the Green function (e.g., GiiG_{ii}, mm and x1,Gx1\langle{\mathbf{x}}_{1},G{\mathbf{x}}_{1}\rangle) of the two matrix ensembles is identical asymptotically provided that the first two (in some cases up to four) moments of all matrix elements of these two ensembles are identical. For instance, if one needs to show the edge universality of two covariance matrices X~v\widetilde{X}{}^{{\mathbf{v}}} and X~w\widetilde{X}{}^{{\mathbf{w}}}, the basic strategy is to express

where FF is a smooth function and G~γ\widetilde{G}_{\gamma} denotes the Green function of the ensemble X~γ\widetilde{X}_{\gamma} (with X~0=X~v\widetilde{X}_{0}=\widetilde{X}^{{\mathbf{v}}}) which is obtained from X~γ1\widetilde{X}_{\gamma-1} by replacing the distribution of the ijijth entry of X~γ1[ij]\widetilde{X}_{\gamma-1}[{ij}] with X~w[ij]\widetilde{X}^{{\mathbf{w}}}[{ij}] [here γ=i+(j1)M\gamma=i+(j-1)M] so that X~MN=X~w\widetilde{X}_{MN}=\widetilde{X}^{{\mathbf{w}}}. The next step is to obtain an estimate

for each of the N2N^{2} terms in the sum (28). Usually (29) is obtained by resolvent expansions, perturbation theory and the fact that X~γ\widetilde{X}_{\gamma} and X~γ1\widetilde{X}_{\gamma-1} differ by a single entry and the first few moments of these two distributions are the same.

But clearly the above method does not work in our case, since the entries within the same column are not independent and, therefore, one cannot replace the distribution of a single entry of a column without changing the distribution of all the other M1M-1 entries. To circumvent this, in NYcov a new telescoping argument consisting of O(N)O(N) ensembles was used for the comparison of Green functions. The idea is that instead of replacing entries one at a time, one can replace the entries of the data matrix column by column and thus require only O(N)O(N) ensembles. This argument from NYcov is adapted here along with new insights for dealing with nonindependence of the entries and is outlined below.

Now we set Xv=X,Xw=X~X^{\mathbf{v}}=X,X^{\mathbf{w}}=\widetilde{X}. For 1γN1\leq\gamma\leq N, let XγX_{\gamma} denote the random matrix whose jjth column is the same as that of XvX^{{\mathbf{v}}} if j>γj>\gamma and that of XwX^{{\mathbf{w}}} otherwise. In particular, we can choose X0=Xv=XX_{0}=X^{{\mathbf{v}}}=X and XN=Xw=X~X_{N}=X^{{\mathbf{w}}}=\widetilde{X}, where XX is correlation matrix and X~\widetilde{X} the corresponding covariance matrix of XX. As before, we define

Clearly, (25) will follow from (3) and the following estimate:

for some δ>0\delta>0. Our strategy to obtain (31) is the following. First notice that

Let X(γ)X^{(\gamma)} be the M×(N1)M\times(N-1) matrix obtained by removing the γ\gammath column of XγX_{\gamma}, which has the same distribution of the M×(N1)M\times(N-1) matrix obtained by removing the γ\gammath column of Xγ1X_{\gamma-1}. Define

In Lemma 4.1 we will establish (31) by showing that

Once (31) is verified, the main result follows by virtue of Theorems 3.1 and 3.2 as mentioned in the beginning of this section. Notice that since the columns of the data matrix XvX^{\mathbf{v}}, XwX^{\mathbf{w}} are assumed to be independent, μ\mu is independent of the γ\gammath column of XvX^{\mathbf{v}}, XwX^{\mathbf{w}} or, equivalently, the γ\gammath column of XγX_{\gamma}, Xγ1X_{\gamma-1}.

Thus, it boils down to establishing (3) in the case X0=Xv=XX_{0}=X^{{\mathbf{v}}}=X and XN=Xw=X~X_{N}=X^{{\mathbf{w}}}=\widetilde{X}. Our proof relies on the key observation that even if the entries of the γ\gammath column vector xγ{\mathbf{x}}_{\gamma} are not independent, the difference between the moments of the entries of the standardized vector xγ{\mathbf{x}}_{\gamma} and its unnormalized counterpart x~γ\widetilde{\mathbf{x}}_{\gamma} is at least an order of magnitude smaller than those of x~γ\widetilde{\mathbf{x}}_{\gamma}. For instance, since xiγ=O(N1/2){\mathbf{x}}_{i\gamma}=O(N^{-1/2}) for 1iM1\leq i\leq M, for two independent ensembles of covariance matrices X~v\widetilde{X}{}^{{\mathbf{v}}} and X~w\widetilde{X}{}^{{\mathbf{w}}} satisfying (1) and (2), we have the bound

On the other hand, if x~γ\widetilde{\mathbf{x}}_{\gamma} is the unnormalized counterpart of xγ{\mathbf{x}}_{\gamma}, as shown in Lemma 5.5,

The above observation combined with a resolvent expansion—detailed in Lemmas 4.3, 5.4 and 5.5—gives (3).

Proof of the main result

In this section we will prove (3) in the case X0=Xv=XX_{0}=X^{{\mathbf{v}}}=X and XN=Xw=X~X_{N}=X^{{\mathbf{w}}}=\widetilde{X}. As discussed above, it implies (25) in Theorem 3.1. Similarly, one can prove (3.1) and (3.2) in Theorems 3.1 and 3.2, which complete the proof of Theorem 1.1, the main result of this paper.

It is easy to see that (3) is a direct consequence of the following lemma.

where x~i1\widetilde{x}_{i1} are i.i.d. random variables with mean zero and variance M1M^{-1} and have an exponentially decay in the tails as given by (2).

Let X~\widetilde{X} be the random matrix whose entries have the same distribution as XX except for the first column, and the first column of X~\widetilde{X} is given by

where x~i1\widetilde{x}_{i1} are as in (36). The columns of X~\widetilde{X} are also assumed to be mutually independent. Let m,m~m,\widetilde{m} denote the empirical Stieltjes transforms of XXX^{\dagger}X, X~X~\widetilde{X}^{\dagger}\widetilde{X}.

Then for any function FF satisfying (24), there exists δ>0\delta>0, ε0>0{\varepsilon}_{0}>0 depending only on C1C_{1} such that for any ε<ε0{\varepsilon}<{\varepsilon}_{0} and for any real number EE satisfying

Note: In this lemma XX and X~\widetilde{X} are neither pure correlation nor pure covariance matrices, but their respective first columns are distributed according to the standardized data matrix and raw data matrix.

Under condition (37) (see NYcov ), we have the bound

First we collect some properties on submatrices of a generic M×NM\times N matrix QQ which can be proved using standard results from linear algebra. Let Q(1)Q^{(1)} be the M×(N1)M\times(N-1) matrix obtained by removing the first column of QQ. Define

Then by definition, GQ(1)G_{Q}^{(1)} is a (N1)×(N1)(N-1)\times(N-1) matrix, GQ(1)\mathcal{G}_{Q}^{(1)} is a M×MM\times M matrix and we have the identity

Using the Cauchy interlacing theorem (see Equation (8.5) of ESYY ), it can be shown that

Proof of Lemma 4.1 First we note that from Theorem 1.5 of NYcov , the conclusions of Theorem 2.3 hold for both XX and X~\widetilde{X}.

Let X(1)X^{(1)} be the M×(N1)M\times(N-1) matrix obtained by removing the first column of XX. Define

where F(s)F^{(s)} denotes the ssth derivative of FF and yky_{k}’s are defined as

where x1{\mathbf{x}}_{1} denotes the first column of XX. Define the quantity

First, recall the following identity (see (6.23) of NYcov ):

Furthermore, as proved in Lemma 2.5 of NYcov ,

Fix ζ>0\zeta>0. From (19), Remark 4.2 and the bound G11mW+O(1)|G_{11}|\leq|m_{W}|+O(1), it follows that for z=E+iη0z=E+i\eta_{0},

with ζ\zeta-high probability (see Definition 2.1). Therefore, with ζ\zeta-high probability, we have the identity

Define yy to be the l.h.s. of (4) multiplied by η0\eta_{0}, that is,

Since x1{\mathbf{x}}_{1} satisfies (15), (16) and (17), and G(1)\mathcal{G}^{(1)} is independent of x1{\mathbf{x}}_{1}, using Lemma 2.2, we infer that for some Cζ>0C_{\zeta}>0

with ζ\zeta-high probability. Using its definition, we bound Tr(G(1))2\operatorname{Tr}(\mathcal{G}^{(1)})^{2} as

where for the last two inequalities we have used (41), (42), (18) and (39). Similarly, we bound the last term of (52) with

Equation (50) and the fact z+mW(z)=O(1)|z|+|m_{W}(z)|=O(1) yields that

holds with ζ\zeta-high probability. Consequently, using (24) and (4), we see that the expansion

holds with ζ\zeta-high probability. From the bounds on yky_{k}’s obtained above, equation (4) follows.

Now we estimate G~\widetilde{G}, which is defined as

Let X~(1)\widetilde{X}{}^{(1)} be the M×(N1)M\times(N-1) matrix obtained by removing the first column of X~\widetilde{X} and x~1\widetilde{\mathbf{x}}_{1} denote its first column. Proceeding as in the previous calculations,

Notice that μ\mu appears in (4) because the entries of X~(1)\widetilde{X}^{(1)} and X(1)X^{(1)} are assumed to be identically distributed.

The symmetric matrices YY and ZZ are independent of x1{\mathbf{x}}_{1} and x~1\widetilde{\mathbf{x}}_{1}. Clearly, YZ=ZYYZ=ZY. Therefore, using the fact that zz, mW1m_{W}\sim 1, we can write

where Ck,n=O(1)C_{k,n}=O(1). Let Y=(x1,Yx1)\cal Y=({\mathbf{x}}_{1},Y{\mathbf{x}}_{1}) and Z=(x1,Zx1)\cal Z=({\mathbf{x}}_{1},Z{\mathbf{x}}_{1}). Then (4) can be written as

Define Y~=(x~1,Yx~1)\widetilde{\cal Y}=(\widetilde{\mathbf{x}}_{1},Y\widetilde{\mathbf{x}}_{1}) and Z~=(x~1,Zx~1)\widetilde{\cal Z}=(\widetilde{\mathbf{x}}_{1},Z\widetilde{\mathbf{x}}_{1}). Using (4) and proceeding similarly as before, we obtain that (4) also holds for the case when GG, Y\cal Y and Z\cal Z are replaced with G~\widetilde{G}, Y~\widetilde{\cal Y} and Z~\widetilde{\cal Z}, respectively. The following is the key technical lemma of this paper whose proof is deferred to the next section.

for some constant CC. Let A\cal A be of the form

where Yi=YY_{i}=Y or YY^{*} and Zj=ZZ_{j}=Z or ZZ^{*} with Y,ZY,Z as defined in (58) and a,ba,b are integers with 1a3,1a+b31\leq a\leq 3,1\leq a+b\leq 3. Then, under the assumptions of Lemma 4.1, we have

where A~\widetilde{\cal A} is obtained by replacing x{\mathbf{x}} with x~\widetilde{\mathbf{x}} in (61).

Taking the difference of (4) and the equation obtained by replacing (4) with G~\widetilde{G}, Y~\widetilde{\cal Y} and Z~\widetilde{\cal Z}, we deduce that the difference

Finally, we are ready to give the proof of the main result of this paper: {pf*}Proof of Theorem 1.1 By the Green function comparison theorem discussed in Section 3, it only remains to prove that Theorems 3.1 and 3.2 hold for the case

For simplicity, we will only prove (25) of Theorem 3.1; the rest can be proved using almost identical arguments.

For 1γN1\leq\gamma\leq N, let XγX_{\gamma} denote the random matrix whose jjth column is the same as that of XvX^{{\mathbf{v}}} if jγj\geq\gamma and that of XwX^{{\mathbf{w}}} otherwise; in particular, X0=XvX_{0}=X^{{\mathbf{v}}} and XN=XwX_{N}=X^{{\mathbf{w}}}. As before, we define

Applying Lemma 4.1 on XγX_{\gamma} and Xγ1X_{\gamma-1} gives the estimate

for some δ>0\delta>0. Now (25) follows from (4) and (64) and the proof is finished.

Moment computations

In this section we prove Lemma 4.3. For notational convenience, let us denote x=x1,x~=x~1{\mathbf{x}}={\mathbf{x}}_{1},\widetilde{\mathbf{x}}=\widetilde{\mathbf{x}}_{1}. We will also write

Recall μ\mu from (44). For the rest of this section, a,ba,b will denote two integers with

Before stating the key results of this section, let us first give some definitions.

For any partition AA of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\}, and a vector k={k1,k2,,k2a+2b},ki{1,2,,M}\mathbf{k}=\{k_{1},k_{2},\ldots,k_{2a+2b}\},k_{i}\in\{1,2,\ldots,M\}, define the binary function I(A,k)\mathcal{I}(A,\mathbf{k}) as follows. The function I(A,k)\mathcal{I}(A,\mathbf{k}) is equal to 1 if (1) for any i,ji,j in the same block of AA we have ki=kjk_{i}=k_{j}, (2) if i,ji,j are in different blocks of AA, we have kikjk_{i}\neq k_{j}; otherwise I(A,k)=0\mathcal{I}(A,\mathbf{k})=0.

Given a partition AA of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\}, let N(A,1)\mathcal{N}(A,1) be the number of the blocks in AA that contain only one element of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\}. Let N(A,2)\mathcal{N}(A,2) be the number of the blocks in AA of the form {k2i1,k2i}\{k_{2i-1},k_{2i}\} with i>ai>a. Note that N(A,2)\mathcal{N}(A,2) depends on aa and bb in addition to AA. Let I(A,3){\mathbf{I}}_{(A,3)} be equal to one if and only if a+b=3a+b=3 and AA is composed of 22 blocks with three elements in each block.

The proof of Lemma 4.3 relies on Lemmas 5.4 and 5.5 stated below and proved at the end of this section.

Recall the matrices Y,ZY,Z from (58). Then for any ε>0{\varepsilon}>0 the following estimate

holds with ζ\zeta-high probability for any fixed ζ>0\zeta>0. The result also holds if any of the Y,ZY,Z are replaced by their complex conjugates Y,ZY^{*},Z^{*}, respectively.

Let y~i\widetilde{y}_{i} be i.i.d. random variables such that

and have a subexponential decay as in (2). Let AA be a partition of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\} and let

Then for any vector k=(k1,k2,,k2a+2b)\mathbf{k}=(k_{1},k_{2},\ldots,k_{2a+2b}) and for any ε>0{\varepsilon}>0, we have

With the above two lemmas in hand, we are now ready to give the proof of Lemma 4.3. {pf*}Proof of Lemma 4.3 We will only prove the case when

The other cases can be proved similarly. First, let us write (61) as

where the summation index AA ranges over all the partitions of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\}. Taking expectations, and using the fact that x{\mathbf{x}} is independent of YY, ZZ and μ\mu, leads to

Now we claim that the terms in the r.h.s. of (5) are bounded by Oε(N7/6)O_{\varepsilon}(N^{-7/6}). Indeed, note that N(A,1)>0\mathcal{N}(A,1)>0 implies I(A,3)=0{\mathbf{I}}_{(A,3)}=0. Therefore, the worse case scenario is the case in which

since by definition we have N(A,2)b\mathcal{N}(A,2)\leq b. But it is easy to see the above scenario cannot occur, since if the first two conditions hold, then it follows that N(A,1)=0\mathcal{N}(A,1)=0 or 22. Thus, we have finished the proof of Lemma 4.3.

Proof of Lemma 5.4 Note that all of the bounds in this lemma hold with ζ\zeta-high probability, not in expectation. For simplicity, we will subsume this in the notation.

First let us prove a slightly different result. Define the binary function I~(A,k)\widetilde{\mathcal{I}}(A,\mathbf{k}) [similar to I(A,k)\mathcal{I}(A,\mathbf{k})] as follows. I~(A,k)\widetilde{\mathcal{I}}(A,\mathbf{k}) is equal to 1 in the following scenarios: (1) for any i,ji,j in the same block of AA we have ki=kjk_{i}=k_{j}, (2) if i,ji,j are in different blocks of AA, we have kikjk_{i}\neq k_{j} except that if one of the indices i,ji,j is in the block of AA which contains exactly two elements, then kik_{i} is allowed to be equal to kjk_{j}. In all other instances I~(A,k)=0\widetilde{\mathcal{I}}(A,\mathbf{k})=0. For instance, in the previous example (65), we have

Let us first prove (5) when I(A,3)=0{\mathbf{I}}_{(A,3)}=0. Define the functions

where αi{1,2}\alpha_{i}\in\{1,2\} and mi2a+bm_{i}\leq 2a+b.

To this end, we will use the following 2–1–3 rule:

2: If the index ii appears in a block of AA which contains exactly two elements, first sum up over the index kik_{i}. Then estimate the remaining terms with absolute sum. For example, let A={{1},{2,3},{4}}A=\{\{1\},\{2,3\},\{4\}\}. Recall that Y=Z2Y=Z^{2},

1: Next do the summation over the index kik_{i} if ii appears in the block of AA which contains only one element as follows:

In the above inequalities, we have used the Cauchy–Schwarz and the fact that ZZ is a symmetric matrix. Note that each summation of the above kind brings an extra N1/2N^{1/2} factor.

3: Finally, sum up over the other indices. After the first two steps, (5) will be reduced to the product of following terms:

If m+n=2m+n=2, then using the Cauchy–Schwarz inequality, (72) can be estimated as

For m+n>2m+n>2, we bound m+n2m+n-2 of them [(Zmi)kk|(Z^{m_{i}})_{kk}| or (Z2nj)kk\sqrt{(|Z|^{2n_{j}})_{kk}}] by the maximum as follows:

to reduce to the case of m+n=2m+n=2 and use the bound (73).

Let us give an example in the case a=1a=1, b=2b=2 and A={{1},{2,3},{4,5,6}}A=\{\{1\},\{2,3\},\{4,5,6\}\}. Then the term (5) in this case reduces to

where the above inequality is obtained by applying rule 2. Next, applying rule 1 yields

and, finally, applying rule 3 leads to the bound

Using this 2–1–3 rule described above, we obtain (71). By the definition of the 2–1–3 rule, it is easy to see that

Recall η0=N2/3ε\eta_{0}=N^{-2/3-{\varepsilon}}. Using (4) and (54), we deduce that if αimi1\alpha_{i}m_{i}\neq 1, then

For αimi=1\alpha_{i}m_{i}=1, using (41), (42), (18) and mW=O(1)m_{W}=O(1), we see that g1(1)=Oε(N)g_{1}(1)=O_{\varepsilon}(N). Thus,

Combining equations (71)–(75), we have the

Now notice that by the definition, the term g1(1)g_{1}(1) in (71) can only be created during the first step of the 2–1–3 rule, that is, the 2 rule, and, therefore, we deduce that

which completes the proof of the claim made in (5) for the case I(A,3)=0{\mathbf{I}}_{(A,3)}=0.

Now consider the case I(A,3)=1{\mathbf{I}}_{(A,3)}=1. Using the fact that Y,ZY,Z are symmetric matrices and the relation Y=Z2Y=Z^{2}, we deduce that the term

reduces to one of the following situations:

for mi{1,2},i{1,2,3}m_{i}\in\{1,2\},i\in\{1,2,3\}. We bound the first scenario above as

where in the last inequality we have used the fact that imi=2a+b\sum_{i}m_{i}=2a+b. For the second case in (5), first we note

Summarizing the above computations, and noticing that N(A,1)=\breakN(A,2)=0\mathcal{N}(A,1)=\break\mathcal{N}(A,2)=0 when I(A,3)=1{\mathbf{I}}_{(A,3)}=1, we obtain the bound

proving the claim (5) when I(A,3)=1{\mathbf{I}}_{(A,3)}=1.

Now we return to prove Lemma 5.4. One can see that for any partition AA of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\} and a vector k\mathbf{k}, the function I(A,k)\mathcal{I}(A,\mathbf{k}) can be written as linear combinations of the functions I~(Ai,k)\widetilde{\mathcal{I}}(A_{i},\mathbf{k}) for some partitions AiA_{i}’s of the set {1,2,,2a+2b}\{1,2,\ldots,2a+2b\} such that

Using large deviation bounds, it is easy to see that for any ε>0{\varepsilon}>0

where Cn=Ca,b,nC_{n}=C_{a,b,n} is a combinatorial factor. Using (80), the r.h.s. of equation (5) may be expressed as

Since n0,a,b=O(1)n_{0},a,b=O(1), the combinatorial factors do not increase with NN, that is, Cn=O(1)C_{n}=O(1), and, thus, we can bound

as follows. Notice that the number of distinct indices kik_{i} in (83) is equal to the number of blocks in the partition AA. Thus, for a given set of values for the indices r1,r2,,rnr_{1},r_{2},\ldots,r_{n}, the term (83) is nonzero only if at least N(A,1)\mathcal{N}(A,1) of the indices rjr_{j} belong to the set {k1,k2,,k2a+2b}\{k_{1},k_{2},\ldots,k_{2a+2b}\}. The above observation also implies that for (83) to be nonzero we must have

Therefore, the number of nonzero terms in the sum

is O((N1/2)nN(A,1))O((N^{1/2})^{n-\mathcal{N}(A,1)}), and each of these terms are of the size Oε(N(a+b)n)O_{\varepsilon}(N^{-(a+b)-n}), yielding

Combining (5) with (84) and the observation made in (85), we obtain that

obtaining (5.5), and the proof is finished.

Acknowledgments

The authors would like to thank Jiefeng Jiang, two anonymous referees, the Associate Editor and the Editor for very useful comments.

References