Edge universality of correlation matrices
Natesh S. Pillai, Jun Yin
Introduction
The aim of this paper is to prove the edge universality of correlation matrices. The data matrix is an matrix with independent centered real-valued entries. The entries in each column all are assumed to be identically distributed:
Furthermore, the entries have a subexponential decay, that is, there exists a constant such that for ,
Notice that all our constants may depend on and , but we will subsume this dependence in the notation.
The matrix is the usual covariance matrix. The th column of is denoted by . Define the matrix matrix
Since we are mainly interested in correlation matrices, without loss of generality, henceforth we will assume that
Covariance matrices are ubiquitous in modern multivariate statistics where the advance of technology has led to a profusion of high-dimensional data sets. See John01 , John07 , John08 , NYcov and the references therein for motivation and applications in a wide variety of fields. Correlation matrices are sometimes preferred in certain statistical applications. For instance, the classic exploratory method Principal Component Analysis (PCA) is not invariant to change of scale in the matrix entries. Therefore, it is often recommended first to standardize the matrix entries and then perform PCA on the resulting correlation matrix John01 .
Recent progress in random matrix theory has led to a wealth of techniques for proving universality of various matrix ensembles (see EKYY11 , EKYY12 , ESY1 , ESY2 , ESY3 , ESY4 , EPRSY , ESYY , EYYBulkuni , EYYgenwig , EYYrigid , Joha12 , KY1 , KY2 , TaoVu09 , TaoVu10 and the references therein). Here the word universality refers to the phenomenon that the asymptotic distributions of various functionals of covariance/correlation matrices (such as eigenvalues, eigenvector, etc.) are identical to those Gaussian covariance/correlation matrices. Thus, harnessing these methods to obtain universality results in statistical problems is an important step, since these results let us calculate the exact asymptotic distributions of various test statistics without having restrictive distributional assumptions of the matrix entries. For instance, an important consequence of universality is that in some cases one can perform various hypothesis tests under the assumption that the matrix entries are not normally distributed but use the same test statistic as in the Gaussian case.
In this context, in a recent paper NYcov we studied the asymptotic distribution of the eigenvalues of the covariance matrix under the assumptions of (1) and (2). In NYcov , we proved that the Stieltjes transform of the empirical eigenvalue distribution of the sample covariance matrix is given by the Marcenko–Pastur law MP uniformly up to the edges of the spectrum with an error of order , where is the imaginary part of the spectral parameter in the Stieltjes transform. From this strong local Marcenko–Pastur law, we derived the following results: (1) rigidity of eigenvalues (2) delocalization of eigenvectors (3) universality of eigenvalues in the bulk and (4) universality of eigenvalues at the edges. Furthermore, in our proof of edge universality of eigenvalues for covariance matrices (see Theorem 7.5 of NYcov ), we gave a sufficient criterion for checking whether two matrices of form ( is a data matrix) have the same asymptotic eigenvalue distribution at the edge (see Section 3 for details). Here could be quite general, including covariance and correlation matrices.
Verifying the above criteria for correlation matrices is much more complicated, owing to the fact that even if it has the same form as above, the matrix entries of are not independent. Fortunately in NYcov , as a byproduct, we also proved the strong Marcenko–Pastur law, the rigidity of eigenvalues and delocalization of eigenvectors of correlation matrices (see Lemma 2.3 in Section 2 below or Theorem 1.5 of NYcov ). In this paper, we complete the research program initiated in NYcov by proving the edge universality of correlation matrices. There are not many papers which study the asymptotics of the correlation matrices as compared to the relatively large literature on covariance matrices. The asymptotic distribution of the largest (appropriately rescaled) eigenvalue of the Gaussian correlation matrix was only very recently established by Bao1 . As will be explained below, we also obtain this result as a special case of our main result and, more importantly, we do not need this result in our proof (see Remark 1.3). The almost sure convergence of the largest and smallest eigenvalues of the correlation matrix was established in Jiang . The very recent paper Bao1 , relying on our results in NYcov , shows that the asymptotic distribution of the largest or smallest eigenvalue of the correlation matrix is given by the Tracy–Widom law, under the assumption that the data matrix satisfies (1) and its entries have symmetric distributions. In particular, the authors in Bao1 use the above mentioned sufficiency criteria for edge universality developed in NYcov . Furthermore, the assumption that the matrix entries are symmetric is very restrictive and not natural in statistical applications. In this paper we will build on our previous work NYcov and prove edge universality of correlation matrices just under the assumptions (1) and (2). Furthermore, we believe that all of our main results should hold if one replaces the subexponential tail decay of the matrix entries by a uniform bound on the th moment of the matrix entries (e.g., will suffice), as proved in EKYY12 for Wigner matrices.
The central ideas in this paper are based on the general machinery for proving universality established in a series of recent papers EKYY11 , EKYY12 , ESY1 , ESY2 , ESY3 , ESY4 , EPRSY , ESYY , EYYBulkuni , EYYgenwig , EYYrigid , KY1 , KY2 , where the authors Yau, Erdős et al. study the distribution of eigenvalues and eigenvectors by studying the Green’s functions (resolvent) of the random matrices.
The proof of this paper is based on the comparison of Green’s functions first initiated in EYYBulkuni , but, as mentioned earlier, the key obstacle to be surmounted is the strong dependence of the entries of the correlation matrix. We achieve this via a novel argument which involves comparing the moments of the product of the entries of the standardized data matrix to those of the raw data matrix (see Section 3 for a summary of the key ideas). Our proof strategy may be extended for proving the edge universality of other random matrix ensembles with dependent entries and hence is of independent interest. Furthermore, it will be interesting to see if bulk universality of correlation matrices can be established using the methods developed in this paper.
Let us state the main result now. We denote , , as the eigenvalues of and for . We order them as
Analogously, let denote the eigenvalues values of the matrix .
The following is the main result of this paper. It shows that the largest and smallest eigenvalues of the correlation matrix, after appropriate centering and rescaling, converge in distribution to those of the corresponding covariance matrix.
An analogous result holds for the smallest eigenvalues.
In Pech07 , Sod1 and Sosh1 , Peche, Soshnikov and Sodin proved that for some covariance matrices (including the Wishart matrix), the largest and smallest eigenvalues after appropriate centering and rescaling converge in distribution to the Tracy–Widom lawHere we use the term Tracy–Widom law as in Sosh1 . whose density is a smooth function. Combining with our recent result on the universality of covariance matrices in NYcov , we have the following immediate corollary for Theorem 1.1:
Let denote the correlation matrix as defined in (1)–(4). For any fixed , we have
Thus, as a special case, we also obtain the TW law for the Gaussian correlation matrices.
Although the current paper builds on our recent work NYcov , it is mostly self-contained and for the reader’s convenience, we will recall all of the needed results from NYcov . The rest of the paper is organized as follows. In Section 2, after establishing some notation, we give the key results establishing the strong Marcenko–Pastur law and rigidity of eigenvalues for correlation matrices, as obtained from NYcov . In Section 3 we give a brief proof sketch illustrating the key ideas. In Section 4 we give the proof of the main results and in Section 5 we prove some technical lemmas which constitute the key ingredients in the proof of the main result. For the rest of the paper the letter will denote a generic constant whose value might change from one line to the next, but will be independent of everything else. The notation will be used to denote .
Preliminaries
We will adopt the notation used in this paper from NYcov . Define the Green function of by
The Stieltjes transform of the empirical eigenvalue distribution of is given by
The Marcenko–Pastur (henceforth abbreviated by MP) law is given by
The function depends on and has the closed form solution
where denotes the square root on a complex plane whose branch cut is the negative real line. We also define the classical location of the eigenvalues with as follows:
Let . We say that an event holds with -high probability if there exists a constant such that
Let us first give the following large deviation lemma for independent random variables (see EYYBulkuni , Appendix B for a proof).
Thus, the main result of NYcov (see Theorem 1.5 of NYcov ) is applicable for the correlation matrix , yielding the following strong local MP law and rigidity of eigenvalues:
Let be the correlation matrix given by (4). Then for any there exists a constant such that the following events hold with -high probability.
The Stieltjes transform of the empirical eigenvalue distribution of satisfies
where defined as the set
The individual matrix elements of the Green function satisfy
The smallest nonzero and largest eigenvalues of satisfy
Rigidity of the eigenvalues: recall in (12). For any , let Then
An analogous result holds for the smallest eigenvalues and .
As remarked in NYcov , Theorem 2.4 can be extended to finite correlation functions of extreme eigenvalues as follows:
for all fixed and sufficiently large . We remark that edge universality is usually formulated in terms of joint distributions of edge eigenvalues as in (2) with fixed parameters etc. However, we note that Theorem 2.4 holds uniformly in these parameters, and thus they may depend on .
Key ideas and proof sketch
Our basic strategy is the so-called “Green function comparison” method initiated in a recent series of papers including EYYBulkuni , EYYgenwig , EYYrigid for proving universality for (generalized) Wigner matrices. The Green function comparison method has subsequently been applied to proving the spectral universality of adjacency matrices of random graphs EKYY11 , EKYY12 , the universality of eigenvectors of Wigner matrices KY1 , as well as the the spectrum of additive finite-rank deformations of Wigner matrices and the isotropic local semicircle law KY2 .
In this paper, we will show that (2.4) and (2) still hold with and replaced by the correlation matrix and the corresponding covariance matrix , that is, Theorem 1.1. To show this result, we introduce a sufficient criteria for (2.4) and (2) derived in NYcov (see Theorem 7.5 of NYcov ).
Consider two matrix ensembles (could be covariance, correlation or more general matrixNotice that throughout the paper we use for the correlation matrix and for the covariance matrix. This is the only instance we denote a generic matrix by for compactness of notation.) and let their respective Green functions and empirical Stieltjes transforms [see (6) and (7)] be denoted by and . To prove that the asymptotic distribution of the extreme eigenvalues of the matrix ensembles are identical in the sense of (2.4) and (2), it suffices to show the following NYcov : {longlist}
The matrices satisfy the strong Marcenko–Pastur law and the rigidity of eigenvalues as given in Lemma 2.3.
The difference of the expectation of smooth functionals of the corresponding Green functions ( and ) evaluated at the spectral edge must vanish asymptotically. More precisely, as pointed out in NYcov , it suffices to establish Theorems 3.1 and 3.2 below for the matrices .
and , we have
where the second term in the left-hand side above is obtained by changing the arguments of in the first term from to and keeping all the other parameters fixed.
Theorems 3.1 and 3.2 yield the edge universality of the -point correlation functions at the edge for and , respectively.
Thus, to complete the proof of Theorem 1.1, by the Green function comparison method it suffices to show (i) and (ii) above for
where denotes the correlation matrix and is the corresponding covariance matrix. Here condition (i) is guaranteed by Theorem 2.3.
Verifying condition (ii) entails the heart of this paper. In previous works mentioned earlier, the authors use a Lindeberg replacement strategy, as in Chat , TaoVu09 . These proofs proceed via showing that the distribution of some smooth functional of the Green function (e.g., , and ) of the two matrix ensembles is identical asymptotically provided that the first two (in some cases up to four) moments of all matrix elements of these two ensembles are identical. For instance, if one needs to show the edge universality of two covariance matrices and , the basic strategy is to express
where is a smooth function and denotes the Green function of the ensemble (with ) which is obtained from by replacing the distribution of the th entry of with [here ] so that . The next step is to obtain an estimate
for each of the terms in the sum (28). Usually (29) is obtained by resolvent expansions, perturbation theory and the fact that and differ by a single entry and the first few moments of these two distributions are the same.
But clearly the above method does not work in our case, since the entries within the same column are not independent and, therefore, one cannot replace the distribution of a single entry of a column without changing the distribution of all the other entries. To circumvent this, in NYcov a new telescoping argument consisting of ensembles was used for the comparison of Green functions. The idea is that instead of replacing entries one at a time, one can replace the entries of the data matrix column by column and thus require only ensembles. This argument from NYcov is adapted here along with new insights for dealing with nonindependence of the entries and is outlined below.
Now we set . For , let denote the random matrix whose th column is the same as that of if and that of otherwise. In particular, we can choose and , where is correlation matrix and the corresponding covariance matrix of . As before, we define
Clearly, (25) will follow from (3) and the following estimate:
for some . Our strategy to obtain (31) is the following. First notice that
Let be the matrix obtained by removing the th column of , which has the same distribution of the matrix obtained by removing the th column of . Define
In Lemma 4.1 we will establish (31) by showing that
Once (31) is verified, the main result follows by virtue of Theorems 3.1 and 3.2 as mentioned in the beginning of this section. Notice that since the columns of the data matrix , are assumed to be independent, is independent of the th column of , or, equivalently, the th column of , .
Thus, it boils down to establishing (3) in the case and . Our proof relies on the key observation that even if the entries of the th column vector are not independent, the difference between the moments of the entries of the standardized vector and its unnormalized counterpart is at least an order of magnitude smaller than those of . For instance, since for , for two independent ensembles of covariance matrices and satisfying (1) and (2), we have the bound
On the other hand, if is the unnormalized counterpart of , as shown in Lemma 5.5,
The above observation combined with a resolvent expansion—detailed in Lemmas 4.3, 5.4 and 5.5—gives (3).
Proof of the main result
In this section we will prove (3) in the case and . As discussed above, it implies (25) in Theorem 3.1. Similarly, one can prove (3.1) and (3.2) in Theorems 3.1 and 3.2, which complete the proof of Theorem 1.1, the main result of this paper.
It is easy to see that (3) is a direct consequence of the following lemma.
where are i.i.d. random variables with mean zero and variance and have an exponentially decay in the tails as given by (2).
Let be the random matrix whose entries have the same distribution as except for the first column, and the first column of is given by
where are as in (36). The columns of are also assumed to be mutually independent. Let denote the empirical Stieltjes transforms of , .
Then for any function satisfying (24), there exists , depending only on such that for any and for any real number satisfying
Note: In this lemma and are neither pure correlation nor pure covariance matrices, but their respective first columns are distributed according to the standardized data matrix and raw data matrix.
Under condition (37) (see NYcov ), we have the bound
First we collect some properties on submatrices of a generic matrix which can be proved using standard results from linear algebra. Let be the matrix obtained by removing the first column of . Define
Then by definition, is a matrix, is a matrix and we have the identity
Using the Cauchy interlacing theorem (see Equation (8.5) of ESYY ), it can be shown that
Proof of Lemma 4.1 First we note that from Theorem 1.5 of NYcov , the conclusions of Theorem 2.3 hold for both and .
Let be the matrix obtained by removing the first column of . Define
where denotes the th derivative of and ’s are defined as
where denotes the first column of . Define the quantity
First, recall the following identity (see (6.23) of NYcov ):
Furthermore, as proved in Lemma 2.5 of NYcov ,
Fix . From (19), Remark 4.2 and the bound , it follows that for ,
with -high probability (see Definition 2.1). Therefore, with -high probability, we have the identity
Define to be the l.h.s. of (4) multiplied by , that is,
Since satisfies (15), (16) and (17), and is independent of , using Lemma 2.2, we infer that for some
with -high probability. Using its definition, we bound as
where for the last two inequalities we have used (41), (42), (18) and (39). Similarly, we bound the last term of (52) with
Equation (50) and the fact yields that
holds with -high probability. Consequently, using (24) and (4), we see that the expansion
holds with -high probability. From the bounds on ’s obtained above, equation (4) follows.
Now we estimate , which is defined as
Let be the matrix obtained by removing the first column of and denote its first column. Proceeding as in the previous calculations,
Notice that appears in (4) because the entries of and are assumed to be identically distributed.
The symmetric matrices and are independent of and . Clearly, . Therefore, using the fact that , , we can write
where . Let and . Then (4) can be written as
Define and . Using (4) and proceeding similarly as before, we obtain that (4) also holds for the case when , and are replaced with , and , respectively. The following is the key technical lemma of this paper whose proof is deferred to the next section.
for some constant . Let be of the form
where or and or with as defined in (58) and are integers with . Then, under the assumptions of Lemma 4.1, we have
where is obtained by replacing with in (61).
Taking the difference of (4) and the equation obtained by replacing (4) with , and , we deduce that the difference
Finally, we are ready to give the proof of the main result of this paper: {pf*}Proof of Theorem 1.1 By the Green function comparison theorem discussed in Section 3, it only remains to prove that Theorems 3.1 and 3.2 hold for the case
For simplicity, we will only prove (25) of Theorem 3.1; the rest can be proved using almost identical arguments.
For , let denote the random matrix whose th column is the same as that of if and that of otherwise; in particular, and . As before, we define
Applying Lemma 4.1 on and gives the estimate
for some . Now (25) follows from (4) and (64) and the proof is finished.
Moment computations
In this section we prove Lemma 4.3. For notational convenience, let us denote . We will also write
Recall from (44). For the rest of this section, will denote two integers with
Before stating the key results of this section, let us first give some definitions.
For any partition of the set , and a vector , define the binary function as follows. The function is equal to 1 if (1) for any in the same block of we have , (2) if are in different blocks of , we have ; otherwise .
Given a partition of the set , let be the number of the blocks in that contain only one element of the set . Let be the number of the blocks in of the form with . Note that depends on and in addition to . Let be equal to one if and only if and is composed of blocks with three elements in each block.
The proof of Lemma 4.3 relies on Lemmas 5.4 and 5.5 stated below and proved at the end of this section.
Recall the matrices from (58). Then for any the following estimate
holds with -high probability for any fixed . The result also holds if any of the are replaced by their complex conjugates , respectively.
Let be i.i.d. random variables such that
and have a subexponential decay as in (2). Let be a partition of the set and let
Then for any vector and for any , we have
With the above two lemmas in hand, we are now ready to give the proof of Lemma 4.3. {pf*}Proof of Lemma 4.3 We will only prove the case when
The other cases can be proved similarly. First, let us write (61) as
where the summation index ranges over all the partitions of the set . Taking expectations, and using the fact that is independent of , and , leads to
Now we claim that the terms in the r.h.s. of (5) are bounded by . Indeed, note that implies . Therefore, the worse case scenario is the case in which
since by definition we have . But it is easy to see the above scenario cannot occur, since if the first two conditions hold, then it follows that or . Thus, we have finished the proof of Lemma 4.3.
Proof of Lemma 5.4 Note that all of the bounds in this lemma hold with -high probability, not in expectation. For simplicity, we will subsume this in the notation.
First let us prove a slightly different result. Define the binary function [similar to ] as follows. is equal to 1 in the following scenarios: (1) for any in the same block of we have , (2) if are in different blocks of , we have except that if one of the indices is in the block of which contains exactly two elements, then is allowed to be equal to . In all other instances . For instance, in the previous example (65), we have
Let us first prove (5) when . Define the functions
where and .
To this end, we will use the following 2–1–3 rule:
2: If the index appears in a block of which contains exactly two elements, first sum up over the index . Then estimate the remaining terms with absolute sum. For example, let . Recall that ,
1: Next do the summation over the index if appears in the block of which contains only one element as follows:
In the above inequalities, we have used the Cauchy–Schwarz and the fact that is a symmetric matrix. Note that each summation of the above kind brings an extra factor.
3: Finally, sum up over the other indices. After the first two steps, (5) will be reduced to the product of following terms:
If , then using the Cauchy–Schwarz inequality, (72) can be estimated as
For , we bound of them [ or ] by the maximum as follows:
to reduce to the case of and use the bound (73).
Let us give an example in the case , and . Then the term (5) in this case reduces to
where the above inequality is obtained by applying rule 2. Next, applying rule 1 yields
and, finally, applying rule 3 leads to the bound
Using this 2–1–3 rule described above, we obtain (71). By the definition of the 2–1–3 rule, it is easy to see that
Recall . Using (4) and (54), we deduce that if , then
For , using (41), (42), (18) and , we see that . Thus,
Combining equations (71)–(75), we have the
Now notice that by the definition, the term in (71) can only be created during the first step of the 2–1–3 rule, that is, the 2 rule, and, therefore, we deduce that
which completes the proof of the claim made in (5) for the case .
Now consider the case . Using the fact that are symmetric matrices and the relation , we deduce that the term
reduces to one of the following situations:
for . We bound the first scenario above as
where in the last inequality we have used the fact that . For the second case in (5), first we note
Summarizing the above computations, and noticing that when , we obtain the bound
proving the claim (5) when .
Now we return to prove Lemma 5.4. One can see that for any partition of the set and a vector , the function can be written as linear combinations of the functions for some partitions ’s of the set such that
Using large deviation bounds, it is easy to see that for any
where is a combinatorial factor. Using (80), the r.h.s. of equation (5) may be expressed as
Since , the combinatorial factors do not increase with , that is, , and, thus, we can bound
as follows. Notice that the number of distinct indices in (83) is equal to the number of blocks in the partition . Thus, for a given set of values for the indices , the term (83) is nonzero only if at least of the indices belong to the set . The above observation also implies that for (83) to be nonzero we must have
Therefore, the number of nonzero terms in the sum
is , and each of these terms are of the size , yielding
Combining (5) with (84) and the observation made in (85), we obtain that
obtaining (5.5), and the proof is finished.
Acknowledgments
The authors would like to thank Jiefeng Jiang, two anonymous referees, the Associate Editor and the Editor for very useful comments.