Random matrices: Law of the determinant
Hoi H. Nguyen, Van Vu
Introduction
Let be an by random matrix whose entries , are independent real random variables of zero mean and unit variance. We will refer to the entries as the atom variables.
As determinant is one of the most fundamental matrix functions, it is a basic problem in the theory of random matrices to study the distribution of and indeed this study has a long and rich history. The earliest paper we find on the subject is a paper of Szekeres and Turán SzT from 1937, in which they studied an extremal problem. In the 1950s, there is a series of papers FT , NRR , Turan , Pre devoted to the computation of moments of fixed orders of (see also Gbook ). The explicit formula for higher moments gets very complicated and is in general not available, except in the case when the atom variables have some special distribution (see, e.g., Dembo ).
One can use the estimate for the moments and Markov inequality to obtain an upper bound on . However, no lower bound was known for a long time. In particular, Erdős asked whether is nonzero with probability tending to one. In 1967, Komlós Kom , Kom1 addressed this question, proving that almost surely for random Bernoulli matrices (where the atom variables are i.i.d. Bernoulli, taking values with probability ). His method also works for much more general models. Following Kom , the upper bound on the probability that has been improved in KKS , TVdet , TVsing , BVW . However, these results do not say much about the value of itself.
In a recent paper TVdet , Tao and the second author proved that for Bernoulli random matrices, with probability tending to one (as tends to infinity),
for any function tending to infinity with . This shows that almost surely, is , but does not provide any distributional information. For related works concerning other models of random matrices, we refer to Ro .
In Goodman , Goodman considered random Gaussian matrices where the atom variables are i.i.d. standard Gaussian variables. He noticed that in this case the determinant is a product of independent Chi-square variables. Therefore, its logarithm is the sum of independent variables and, thus, one expects a central limit theorem to hold. In fact, using properties of Chi-square distribution, it is not very hard to prove
We refer the reader to RW , Section 4, for further discussion on this model.
In G2 , Girko stated that (2) holds for general random matrices under the additional assumption that the fourth moment of the atom variables is 3. Twenty years later, he claimed a much stronger result which replaced the above assumption by the assumption that the atom variables have bounded th moment G . However, there are points which are not clear in these papers and we have not found any researcher who can explain the whole proof to us. In our own attempt, we could not pass the proof of Theorem 2 in G . In particular, definition (3.7) of this paper requires the matrix \Xi\bigl{(}{1\atop k}\bigr{)} to be invertible, but this assumption can easily fail.
In this paper, we provide a transparent proof for the central limit theorem of the log-determinant. The next question to consider, naturally, is the rate of convergence. We are able to obtain a rate which we believe to be near optimal.
We say that a random variable satisfies condition C0 (with positive constants ) if
Assume that all atom variables satisfy condition C0 with some positive constants . Then
Here and later, . In the remaining part of the paper, we will actually prove the following equivalent form:
The reader is invited to consult Figure 1 for our simulation. To give some feeling about (5), let us consider the case when are i.i.d. standard Gaussian. For , let be the subspace generated by the first rows of . Let denote the distance from to , where is the th row vector of . Then, by the “base times height” formula, we have
As the are i.i.d. standard Gaussian, are independent Chi-square random variables of degree . Thus, the right-hand side of (7) is a sum of independent random variables. Notice that has mean and variance and is very strongly concentrated. Thus, with high probability is roughly and so it is easy to show that has mean close to and variance . So the variance of is . To get the precise value , one needs to carry out some careful (but rather routine) calculation, which we leave as an exercise.
The reason for which we think that the rate might be near optimal is that (as the reader will see though the proofs) is only an asymptotic value of the variance of . This approximation has an error term of order at least and since , it seems that one cannot have rate of convergence better than . It is a quite interesting question whether one can obtain a polynomial rate by replacing and by other, relatively simple, functions of .
Our arguments rely on recent developments in random matrix theory and look quite different from those in Girko’s papers. In particular, we benefit from the arguments developed in TVdet , TVhard , TVlocal . We also use Talagrand’s famous concentration inequality frequently to obtain most of the large deviation results needed in this paper.
Our approach and main lemmas
We first make two extra assumptions about . We assume that the entries are bounded in absolute value by for some constant and has full rank with probability one. We will prove Theorem 1.1 under these two extra assumptions. In Appendix, we will explain why we can implement these assumptions without violating the generality of Theorem 1.1.
Assume that all atom variables satisfy condition C0 and are bounded in absolute value by for some constant . Assume furthermore that has full rank with probability one. Then
In the first, and main, step of the proof, we prove the claim of Theorem 2.1 but with the last rows being replaced by Gaussian rows (for some properly chosen constant ). We remark that the replacement trick was also used in G , but for an entirely different reason. Our reason here is that for the last few rows, Lemma 2.4 is not very effective.
For any constant the following holds for any sufficiently large constant . Let be an by matrix whose entries , are independent real random variables of zero mean, unit variance and absolute values at most . Assume furthermore that has full rank with probability one and the components of the last rows of are independent standard Gaussian random variables. Then
In the second (and simpler) step of the proof, we carry out a replacement procedure, replacing the Gaussian rows by the original rows one at a time,j and show that the replacement does not effect the central limit theorem. This step is motivated by the Lindeberg replacement method used in TVlocal .
We present the verification of Theorem 2.1 using Theorem 2.2 in Section 8. In the rest of this section, we focus on the proof of Theorem 2.2.
Notice that in the setting of this theorem, the variables are no longer independent. However, with some work, we can make the RHS of (7) into a sum of martingale differences plus a negligible error, which lays ground for an application of a central limit theorem of martingales. (In G , Girko also used the CLT for martingales via the base times height formula, but his analysis looks very different from ours.) We are going to use the following theorem, due to Machkouri and Ouchti MO .
There exists an absolute constant such that the following holds. Assume that are martingale differences with respect to the nested -algebras . Let , and . Assume that with probability one for all , where is a sequence of positive real numbers. Then we have
To make use of this theorem, we need some preparation. Conditioning on the first rows , we can view as the distance from a random vector to . Since has full rank with probability one, with probability one for all . The following is a direct corollary of TVlocal , Lemma 43.
For any constant there is a constant depending on such that the following holds. Assume that is a subspace of dimension . Let be a random vector whose components are independent variables of zero mean and unit variance and absolute values at most . Denote by the distance from to . Then we have
where is a sufficiently large constant (which may depend on ). We will use shorthand to denote , the co-dimension of (and the expected value of ),
We next consider each term of the right-hand side of (7) where . Using the Taylor expansion, we write
By applying Lemma 2.4 with and by choosing sufficiently large, we have with probability at least [the probability here is with respect to the random th row, fixing the first rows arbitrarily]
Thus, with probability at least
Hence, by a uniform bound, the following holds with probability at least :
again by having sufficiently large.
With probability at least
Theorem 2.2 follows from the above four lemmas and the following trivial fact (used repeatedly and with proper scaling):
The reader is invited to fill in the simple details using the following observation:
We will prove Lemma 2.6 using Theorem 2.3. Lemma 2.7 will be verified by the moment method and Lemma 2.8 by elementary properties of Chi-square variables. The key to the proof of Lemmas 2.6 and 2.7 is an estimate on the entries of the projection matrix onto the space , presented in Section 4.
Proof of Lemmas 2.6 and 2.7: Opening
We recall from the previous section that . Denote by the projection matrix onto the orthogonal complement . A standard fact in linear algebra is
where are the coordinates of the vector and
By (11) we have and
Because and , and the are mutually independent, we can show by using a routine calculation that [see (6) from Section 6]
where is the -algebra generated by the first rows of .
The reason we split into the sum of and is that and its variance can be easily computed.
To complete the proof of Lemma 2.7 from Lemma 3.1, it suffices to show that the sum of the is negligible,
Our main technical tool will be the following lemma.
Noticing that is uniformly bounded (by condition C0), it follows that with probability ,
Proof of Lemmas 2.6 and 2.7: Mid game
The key idea for proving Lemma 3.2 is to establish a good upper bound for . For this, we need some new tools. Our main ingredient is the following delocalization result, which is a variant of a result from TVhard (see also E and TVsurvey for recent surveys), asserting that with high probability all unit vectors in the orthogonal complement of a random subspace with high dimension have small infinity norm.
For any constant the following holds for all sufficiently large constant . Assume that the components of , where , are independent random variables of mean zero, variance one and bounded in absolute value by . Then with probability , the following holds for all unit vectors of the space :
Proof of Lemma 3.2 assuming Lemma 4.1 Write
Note that as ,
for some unit vector .
Thus, if , then and, hence, by Lemma 4.1
We now focus on the infinity norm of and follow an argument from TVhard .
Proof of Lemma 4.1 By the union bound, it suffices to show that with probability at least , where is the first coordinate of .
Let be the matrix formed by the first rows of . Assume that is a unit vector, then
Let be the first column of , and be the matrix obtained by deleting from . Clearly,
where is the vector obtained from by deleting .
We next invoke the following result, which is a variant of TVhard , Lemma 4.1. This lemma was proved using a method of Guionet and Zeitouni GZ , based on Talagrand’s inequality.
For any constant the following holds for all sufficiently large constant . Let be a random matrix of size by , where the entries are independent random variables of mean zero, variance one and bounded in absolute value by . Then for any , there exist singular values of in the interval , for some absolute constant , with probability at least .
We can prove Lemma 4.2 by following the arguments in TVhard , Lemma 4.1, almost word by word.
By the interlacing law and Lemma 4.2, we conclude that has singular values in the interval with probability .
Let be the space spanned by the left singular vectors of these singular values, and let be the orthogonal projection onto . By definition, the spectral norm of is bounded,
here we used the fact that is independent from , and thus from .
On the other hand, since the dimension of is , Lemma 2.4 implies that with probability .
Proof of Lemma 2.6: End game
Recall from (10) that conditioned on any first rows, with probability . So, by paying an extra term of in probability, it suffices to justify Lemma 2.6 for the sequence .
On the other hand, the sequence is not a martingale difference sequence, so we slightly modify to and prove the claim for the sequence , here we recall that is the -algebra generated by the first rows of . In order to show that this modification has no effect whatsoever, we first demonstrate that is extremely small.
Recall from (12) that . By the Cauchy–Schwarz inequality and the assumption that are bounded in absolute value by , we have with probability one
To justify Lemma 2.6 for the sequence , we apply Theorem 2.3.
The key point here is that thanks to the indicator function in the definition of and the fact that the difference between and is negligible, is bounded by with probability one, so the conditions in Theorem 2.3 are satisfied with
We need to estimate with respect to the sequence . However, thanks to the observations above, and are very close, and so it suffices to compute these values with respect to the sequence .
Also, recall from Section 4 that with probability ,
This bound, together with (13) and (5), imply that with probability one
which in turn implies that with probability .
Using (5) again, because , we deduce that
With another application of (5), we obtain
By the conclusion of Theorem 2.3 and setting sufficiently large, we conclude
Proof of Lemma 2.7: End game
Our goal is to justify Lemma 3.1, which together with (14) verify Lemma 2.7.
We will show that the variance is small and then use Chebyshev’s inequality. The proof is based on a series of routine, but somewhat tedious calculations. We first show that the expectations of the ’s are zero, and so are the covariances by an elementary manipulation. The variances will be bounded from above by the Cauchy–Schwarz inequality.
We start with the formula . Observe that
Expanding each term, using the fact that and , we have
As , and the ’s are mutually independent with each other and with every row of index at most [and in particular with ’s], every term in the last formula is zero, and so we infer that and , confirming (13). With the same reasoning, we can also infer that the covariance for all .
It is thus enough to work with the diagonal terms . We have
After a series of cancellations, and because of condition C0, we have
where the first two rows consist of the squares of the terms appearing in (after deleting several sums of zero expected value), and each of the following rows was obtained by expanding the product of each term with the rest in the order of their appearance.
Because , one has for all . Recall furthermore that and for all . We next estimate the terms under consideration one by one as follows.
First, the sums , , , and can be bounded by, and so by .
Second, by applying the Cauchy–Schwarz inequality if needed, one can bound the sums , , and by , and so by .
.
where we applied Lemma 3.2 in the last estimate.
To complete the proof, we note from the estimate of of Section 5 and from Lemma 3.2 that . Thus, by Chebyshev’s inequality
Proof of Lemma 2.8
We recall that, with , is a Chi-square random variable of degree . Let us first consider the lower tail; it suffices to show
By properties of the normal distribution, it is easy to show that and are at least with probability , so we can omit these terms from the sum. It now suffices to show that
Flipping the inequality inside the probability (by changing the sign of the RHS and swapping the denominators and numerators in the logarithms of the LHS) and using the Laplace transform trick (based on the fact that the are independent), we see that the probability in question is at most
Recall that is a Chi-square random variable with degree of freedom , so . Therefore, the numerator in the previous formula is .
The proof for the upper tail is similar (in fact simpler as we do not need to treat the first two terms separately) and we omit the details.
Deduction of Theorem 2.1 from Theorem 2.2
Our plan is to replace one by one the last Gaussian rows of by vectors of components having zero mean, unit variance and satisfying condition C0. Our key tool here is the classical Berry–Eseen inequality. In order to apply this lemma, we will make a crucial use of Lemma 4.1.
Assume that is a unit vector. Assume that are independent random variables of mean zero, variance one and satisfying condition C0. Then we have
where is an absolute constant depending on the parameters appearing in (3).
We remark that in the original setting of Berry and Esseen, it suffices to assume the finite third moment.
In application, plays the role of the normal vector of the hyperplane spanned by the remaining rows of , and , where is the vector to be replaced.
For the deduction, it is enough to show the following.
Let be a random matrix with atom variables satisfying condition C0 and nonsingular with probability one. Assume furthermore that has at least one and at most Gaussian rows. Let be the random matrix obtained from by replacing a Gaussian row vector of by a random vector whose coordinates are independent atom variables satisfying condition C0 such that the resulting matrix is nonsingular with probability one. Then
Clearly, Theorem 1.1 follows from Theorem 2.2 by applying Lemma 8.2 times.
Proof of Lemma 8.2 Without loss of generality, we can assume that is obtained from by replacing the last row . As is nonsingular, .
By Lemma 4.1, by paying an extra term of in probability (which will be absorbed by the eventual bound ), we may also assume that the normal vector of satisfies
where and are the distance from and to , respectively.
Appendix: Simplifying the model: Deducing Theorem 1.1 from Theorem 2.1
In this section we show that the two extra assumptions that and has full rank with probability one do not violate the generality of Theorem 1.1.
To start with, we need a very weak lower bound on .
It follows from TVcir , Theorem 2.1, that there is a constant such that . Since is the product of its singular values, the bound follows.
The above bound is extremely weak. By modifying the proof in TVdet , one can actually prove the Tao–Vu lower bound (1) for random matrices satisfying C0. Also, sharper bounds on the least singular value are obtained in TVsmooth , RV . However, for the arguments in this section, we only need the bound on Lemma .3.
Let us start with the assumption . We can achieve this assumption using the standard truncation method (see BS or TVlocal ). In what follows, we sketch the idea.
Notice that by condition C0, we have, with probability at least , that all entries of have absolute value at most , for some constant which may depend on the constants in C0.
We replace the variable by the variable , for all and let be the random matrix formed by . Since with probability at least , , it is easy to show that if satisfies the claim of Theorem 1.1, then so does .
While the entries of are bounded by , there is still one problem we need to address, namely, that the new variables do not have mean 0 and variance one. We can achieve this by a simple normalization trick. First observe that by property C0, taking sufficiently large, it is easy to show that has absolute value at most and , where is the standard deviation of . Now define
Note that now does have mean zero and variance one. Let and be the corresponding matrices of and , respectively.
where is the matrix formed by .
Since , by Hadamard’s bound . On the other hand, we have by Lemma .3 that . It thus follows that
We can prove a matching lower bound by the same argument. From here, we conclude that if satisfies the conclusion of Theorem 1.1, then so does .
To pass from to , we apply the Brunn–Minkowski inequality again,
where is the matrix form by . Noting that and , we infer that and are comparable with high probability
Now we address the assumption that has full rank with probability one. Notice that this is usually not true when the have discrete distribution (such as Bernoulli). However, we find the following simple trick that makes the assumption valid for our study.
Instead of the entry , consider where is uniform on the interval $\varepsilonn^{-1000n}A_{n}^{\prime}a_{ij}^{\prime}$ has full rank with probability one. On the other hand, it is easy to show that by the Brunn–Minkowski inequality and Hadamard’s bound
Furthermore, by Lemma .3, with probability , and so we can conclude as in the previous argument.