Random matrices: Law of the determinant

Hoi H. Nguyen, Van Vu

Introduction

Let $A_{n}$ be an $n$ by $n$ random matrix whose entries $a_{ij},1\leq i,j\leq n$ , are independent real random variables of zero mean and unit variance. We will refer to the entries $a_{ij}$ as the atom variables.

As the determinant is one of the most fundamental matrix functions, it is a basic problem in the theory of random matrices to study the distribution of $\det A_{n}$, and indeed this study has a long and rich history. The earliest paper we find on the subject is a paper of Szekeres and Turán SzT from 1937, in which they studied an extremal problem. In the 1950s, there was a series of papers FT, NRR, Turan, Pre devoted to the computation of moments of fixed orders of $\det A_{n}$ (see also Gbook). The explicit formula for higher moments gets very complicated and is in general not available, except in the case when the atom variables have some special distribution (see, e.g., Dembo).

One can use the estimate for the moments and Markov inequality to obtain an upper bound on $|\det A_{n}|$ . However, no lower bound was known for a long time. In particular, Erdős asked whether $\det A_{n}$ is nonzero with probability tending to one. In 1967, Komlós Kom , Kom1 addressed this question, proving that almost surely $|\det A_{n}|>0$ for random Bernoulli matrices (where the atom variables are i.i.d. Bernoulli, taking values $\pm 1$ with probability $1/2$ ). His method also works for much more general models. Following Kom , the upper bound on the probability that $\det A_{n}=0$ has been improved in KKS , TVdet , TVsing , BVW . However, these results do not say much about the value of $|\det A_{n}|$ itself.

In a recent paper TVdet , Tao and the second author proved that for Bernoulli random matrices, with probability tending to one (as $n$ tends to infinity),

for any function $\omega(n)$ tending to infinity with $n$ . This shows that almost surely, $\log|\det A_{n}|$ is $(\frac{1}{2}+o(1))n\log n$ , but does not provide any distributional information. For related works concerning other models of random matrices, we refer to Ro .

In Goodman , Goodman considered random Gaussian matrices where the atom variables are i.i.d. standard Gaussian variables. He noticed that in this case the determinant is a product of independent Chi-square variables. Therefore, its logarithm is the sum of independent variables and, thus, one expects a central limit theorem to hold. In fact, using properties of Chi-square distribution, it is not very hard to prove

We refer the reader to RW , Section 4, for further discussion on this model.

In G2, Girko stated that (2) holds for general random matrices under the additional assumption that the fourth moment of the atom variables is 3. Twenty years later, he claimed a much stronger result which replaced the above assumption by the assumption that the atom variables have bounded $(4+\delta)$th moment G. However, there are points which are not clear in these papers and we have not found any researcher who can explain the whole proof to us. In our own attempt, we could not get past the proof of Theorem 2 in G. In particular, definition (3.7) of this paper requires the matrix $\Xi\bigl({1\atop k}\bigr)$ to be invertible, but this assumption can easily fail.

In this paper, we provide a transparent proof for the central limit theorem of the log-determinant. The next question to consider, naturally, is the rate of convergence. We are able to obtain a rate which we believe to be near optimal.

We say that a random variable $\xi$ satisfies condition C0 (with positive constants $C_{1},C_{2}$ ) if

Assume that all atom variables $a_{ij}$ satisfy condition C0 with some positive constants $C_{1},C_{2}$ . Then

Here and later, $\Phi(x)={\mathbf{P}}({\mathbf{N}}(0,1)<x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}\exp(-t^{2}/2)\,dt$ . In the remaining part of the paper, we will actually prove the following equivalent form:

The reader is invited to consult Figure 1 for our simulation. To give some feeling about (5), let us consider the case when $a_{ij}$ are i.i.d. standard Gaussian. For $0\leq i\leq n-1$, let $V_{i}$ be the subspace generated by the first $i$ rows of $A_{n}$. Let $\Delta_{i+1}$ denote the distance from $\mathbf{a}_{i+1}$ to $V_{i}$, where $\mathbf{a}_{i+1}=(a_{i+1,1},\ldots,a_{i+1,n})$ is the $(i+1)$th row vector of $A_{n}$. Then, by the “base times height” formula, we have

$$\log\det A_{n}^{2}=\sum_{i=0}^{n-1}\log\Delta_{i+1}^{2}. \tag{7}$$

As the $a_{ij}$ are i.i.d. standard Gaussian, $\Delta_{i+1}^{2}$ are independent Chi-square random variables of degree $n-i$ . Thus, the right-hand side of (7) is a sum of independent random variables. Notice that $\Delta^{2}_{i+1}$ has mean $n-i$ and variance $O(n-i)$ and is very strongly concentrated. Thus, with high probability $\log\Delta_{i+1}^{2}$ is roughly $\log((n-i)+O(\sqrt{n-i}))$ and so it is easy to show that $\log\Delta_{i+1}^{2}$ has mean close to $\log(n-i)$ and variance $O(\frac{1}{n-i})$ . So the variance of $\sum_{i=0}^{n-1}\log\Delta_{i+1}^{2}$ is $O(\log n)$ . To get the precise value $\sqrt{2\log n}$ , one needs to carry out some careful (but rather routine) calculation, which we leave as an exercise.
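The Gaussian heuristic above can be checked numerically. The following is a minimal sketch, not the authors' simulation: it assumes the "base times height" identity reads $\log\det A_{n}^{2}=\sum_{i}\log\Delta_{i+1}^{2}$ and that the statistic in (5) is $(\log\det A_{n}^{2}-\log(n-1)!)/\sqrt{2\log n}$. The helper names (`log_abs_det`, `row_distances`, `gaussian_matrix`) and the sizes are illustrative choices.

```python
import math
import random
import statistics

def log_abs_det(rows):
    """log|det| via Gaussian elimination with partial pivoting."""
    a = [list(r) for r in rows]
    n = len(a)
    total = 0.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        total += math.log(abs(pivot))
        for r in range(c + 1, n):
            f = a[r][c] / pivot
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return total

def row_distances(rows):
    """Gram-Schmidt: distance from each row to the span of the rows before it."""
    basis, dists = [], []
    for row in rows:
        residual = list(row)
        for u in basis:
            coeff = sum(r * x for r, x in zip(residual, u))
            residual = [r - coeff * x for r, x in zip(residual, u)]
        norm = math.sqrt(sum(r * r for r in residual))
        dists.append(norm)
        basis.append([r / norm for r in residual])
    return dists

def gaussian_matrix(n, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

rng = random.Random(0)

# Check the identity on a single 40 x 40 Gaussian sample.
A = gaussian_matrix(40, rng)
lhs = 2.0 * log_abs_det(A)                             # log det(A)^2
rhs = sum(math.log(d * d) for d in row_distances(A))   # sum of log Delta^2
print("identity gap:", abs(lhs - rhs))

# Sample the normalized statistic; math.lgamma(n) equals log (n-1)!.
n, trials = 30, 200
samples = [
    (2.0 * log_abs_det(gaussian_matrix(n, rng)) - math.lgamma(n))
    / math.sqrt(2.0 * math.log(n))
    for _ in range(trials)
]
sample_mean = statistics.mean(samples)
sample_std = statistics.pstdev(samples)
print("mean:", sample_mean, "std:", sample_std)
```

Since the normalization is only asymptotic and the scale is logarithmic in $n$, one should expect only rough agreement with a standard Gaussian at such small sizes.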

The reason we think that the rate $\log^{-1/3+o(1)}n$ might be near optimal is that (as the reader will see through the proofs) $2\log n$ is only an asymptotic value of the variance of $\log|\det A_{n}|$. This approximation has an error term of order at least $\Omega(1)$, and since $\sqrt{2\log n+\Omega(1)}-\sqrt{2\log n}=\Omega(\log^{-1/2}n)$, it seems that one cannot have a rate of convergence better than $\log^{-1/2+o(1)}n$. It is quite an interesting question whether one can obtain a polynomial rate by replacing $\log(n-1)!$ and $2\log n$ by other, relatively simple, functions of $n$.
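The elementary estimate behind this remark can be written out as follows. Here $c>0$ is a hypothetical fixed constant standing in for the $\Omega(1)$ error in the variance; it is introduced only for illustration.

```latex
\sqrt{2\log n+c}-\sqrt{2\log n}
  =\frac{c}{\sqrt{2\log n+c}+\sqrt{2\log n}}
  \geq\frac{c}{2\sqrt{2\log n+c}}
  =\Omega\bigl(\log^{-1/2}n\bigr).
```

So an $\Omega(1)$ error in the variance shifts the normalizing standard deviation by at least a constant multiple of $\log^{-1/2}n$.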

Our arguments rely on recent developments in random matrix theory and look quite different from those in Girko’s papers. In particular, we benefit from the arguments developed in TVdet , TVhard , TVlocal . We also use Talagrand’s famous concentration inequality frequently to obtain most of the large deviation results needed in this paper.

Our approach and main lemmas

We first make two extra assumptions about $A_{n}$. We assume that the entries $a_{ij}$ are bounded in absolute value by $\log^{\beta}n$ for some constant $\beta>0$ and that $A_{n}$ has full rank with probability one. We will prove Theorem 1.1 under these two extra assumptions. In the Appendix, we will explain why we can impose these assumptions without loss of generality in Theorem 1.1.

Assume that all atom variables $a_{ij}$ satisfy condition C0 and are bounded in absolute value by $\log^{\beta}n$ for some constant $\beta$ . Assume furthermore that $A_{n}$ has full rank with probability one. Then

In the first, and main, step of the proof, we prove the claim of Theorem 2.1 but with the last $\log^{\alpha}n$ rows being replaced by Gaussian rows (for some properly chosen constant $\alpha$ ). We remark that the replacement trick was also used in G , but for an entirely different reason. Our reason here is that for the last few rows, Lemma 2.4 is not very effective.

For any constant $\beta>1$ the following holds for any sufficiently large constant $\alpha>0$ . Let $A_{n}$ be an $n$ by $n$ matrix whose entries $a_{ij},1\leq i\leq n_{0},1\leq j\leq n$ , are independent real random variables of zero mean, unit variance and absolute values at most $\log^{\beta}n$ . Assume furthermore that $A_{n}$ has full rank with probability one and the components of the last $\log^{\alpha}n$ rows of $A$ are independent standard Gaussian random variables. Then

In the second (and simpler) step of the proof, we carry out a replacement procedure, replacing the Gaussian rows by the original rows one at a time, and show that the replacement does not affect the central limit theorem. This step is motivated by the Lindeberg replacement method used in TVlocal.

We present the verification of Theorem 2.1 using Theorem 2.2 in Section 8. In the rest of this section, we focus on the proof of Theorem 2.2.

Notice that in the setting of this theorem, the variables $\Delta_{i}$ are no longer independent. However, with some work, we can make the RHS of (7) into a sum of martingale differences plus a negligible error, which lays the groundwork for an application of a central limit theorem for martingales. (In G, Girko also used the CLT for martingales via the base times height formula, but his analysis looks very different from ours.) We are going to use the following theorem, due to Machkouri and Ouchti MO.

There exists an absolute constant $L$ such that the following holds. Assume that $X_{1},\ldots,X_{m}$ are martingale differences with respect to the nested $\sigma$ -algebras $\mathcal{E}_{0},\mathcal{E}_{1},\ldots,\mathcal{E}_{m-1}$ . Let $v_{m}^{2}:=\sum_{i=0}^{m-1}{\mathbf{E}}(X_{i+1}^{2}|\mathcal{E}_{i})$ , and $s_{m}^{2}:=\sum_{i=1}^{m}{\mathbf{E}}(X_{i}^{2})$ . Assume that ${\mathbf{E}}(|X_{i+1}^{3}||\mathcal{E}_{i})\leq\gamma_{i}{\mathbf{E}}(X_{i+1}^{2}|\mathcal{E}_{i})$ with probability one for all $i$ , where $(\gamma_{i})_{1}^{m}$ is a sequence of positive real numbers. Then we have

To make use of this theorem, we need some preparation. Conditioning on the first $i$ rows $\mathbf{a}_{1},\ldots,\mathbf{a}_{i}$ , we can view $\Delta_{i+1}$ as the distance from a random vector to $V_{i}:=\operatorname{Span}(\mathbf{a}_{1},\ldots,\mathbf{a}_{i})$ . Since $A_{n}$ has full rank with probability one, $\dim V_{i}=i$ with probability one for all $i$ . The following is a direct corollary of TVlocal , Lemma 43.

For any constant $\beta>0$ there is a constant $C_{3}>0$ depending on $\beta$ such that the following holds. Assume that $V\subset{\mathbf{R}}^{n}$ is a subspace of dimension $\dim(V)\leq n-4$ . Let $\mathbf{a}$ be a random vector whose components are independent variables of zero mean and unit variance and absolute values at most $\log^{\beta}n$ . Denote by $\Delta$ the distance from $\mathbf{a}$ to $V$ . Then we have

where $\alpha$ is a sufficiently large constant (which may depend on $\beta$ ). We will use shorthand $k_{i}$ to denote $n-i$ , the co-dimension of $V_{i}$ (and the expected value of $\Delta_{i}^{2}$ ),

We next consider each term of the right-hand side of (7) where $0\leq i<n_{0}$ . Using the Taylor expansion, we write

By applying Lemma 2.4 with $t=k_{i}^{1/8}\geq\log^{\alpha/8}n$ and by choosing $\alpha$ sufficiently large, we have with probability at least $1-O(\exp(-\log^{2}n))$ [the probability here is with respect to the random $(i+1)$ th row, fixing the first $i$ rows arbitrarily]

Thus, with probability at least $1-O(\exp(-\log^{2}n))$

Hence, by the union bound, the following holds with probability at least $1-n\cdot O(\exp(-\log^{2}n))=1-O(\exp(-\log^{2}n/2))$:

again by having $\alpha$ sufficiently large.

With probability at least $1-O(\exp(-\log^{2}n/2))$

Theorem 2.2 follows from the above four lemmas and the following trivial fact (used repeatedly and with proper scaling):

The reader is invited to fill in the simple details using the following observation:

We will prove Lemma 2.6 using Theorem 2.3. Lemma 2.7 will be verified by the moment method and Lemma 2.8 by elementary properties of Chi-square variables. The key to the proof of Lemmas 2.6 and 2.7 is an estimate on the entries of the projection matrix onto the space $V_{i}^{\bot}$ , presented in Section 4.

Proof of Lemmas 2.6 and 2.7: Opening

We recall from the previous section that $X_{i+1}=\frac{\Delta_{i+1}^{2}-k_{i}}{k_{i}}$ . Denote by $P_{i}=(p_{st}(i))_{s,t}$ the projection matrix onto the orthogonal complement $V_{i}^{\bot}$ . A standard fact in linear algebra is

where $a_{1}=a_{i+1,1},\ldots,a_{n}=a_{i+1,n}$ are the coordinates of the vector $\mathbf{a}_{i+1}$ and

By (11) we have $\sum_{s}q_{ss}(i)=1$ and $\sum_{s,t}q_{st}(i)^{2}=\frac{1}{k_{i}}.$
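Both identities use only the fact that $P_{i}$ is an orthogonal projection of rank $k_{i}$. The sketch below assumes the normalization $q_{st}(i)=p_{st}(i)/k_{i}$ (the relation recalled later in the proof of Lemma 3.2) and uses that $P_{i}$ is symmetric and idempotent, so $\operatorname{tr}P_{i}=k_{i}$.

```latex
\sum_{s}q_{ss}(i)=\frac{\operatorname{tr}P_{i}}{k_{i}}=\frac{k_{i}}{k_{i}}=1,
\qquad
\sum_{s,t}q_{st}(i)^{2}
  =\frac{\operatorname{tr}(P_{i}P_{i}^{T})}{k_{i}^{2}}
  =\frac{\operatorname{tr}(P_{i}^{2})}{k_{i}^{2}}
  =\frac{\operatorname{tr}P_{i}}{k_{i}^{2}}
  =\frac{1}{k_{i}}.
```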

Because ${\mathbf{E}}a_{s}=0$ and ${\mathbf{E}}a_{s}^{2}=1$ , and the $a_{s}$ are mutually independent, we can show by using a routine calculation that [see (6) from Section 6]

where $\mathcal{E}_{i}$ is the $\sigma$ -algebra generated by the first $i$ rows of $A_{n}$ .

The reason we split $-\frac{X_{i+1}^{2}}{2}+\frac{1}{k_{i}}$ into the sum of $Y_{i+1}$ and $Z_{i+1}$ is that ${\mathbf{E}}(Y_{i+1}|\mathcal{E}_{i})=0$ and its variance can be easily computed.

To complete the proof of Lemma 2.7 from Lemma 3.1, it suffices to show that the sum of the $Z_{i}$ is negligible,

Our main technical tool will be the following lemma.

Noticing that ${\mathbf{E}}a_{s}^{4}$ is uniformly bounded (by condition C0), it follows that with probability $1-O(n^{-100})$ ,

Proof of Lemmas 2.6 and 2.7: Mid game

The key idea for proving Lemma 3.2 is to establish a good upper bound for $|q_{ss}(i)|$ . For this, we need some new tools. Our main ingredient is the following delocalization result, which is a variant of a result from TVhard (see also E and TVsurvey for recent surveys), asserting that with high probability all unit vectors in the orthogonal complement of a random subspace with high dimension have small infinity norm.

For any constant $\beta>0$ the following holds for all sufficiently large constant $\alpha>0$ . Assume that the components of $\mathbf{a}_{1},\ldots,\mathbf{a}_{n_{1}}$ , where $n_{1}:=n-n\log^{-4\alpha}n$ , are independent random variables of mean zero, variance one and bounded in absolute value by $\log^{\beta}n$ . Then with probability $1-O(n^{-100})$ , the following holds for all unit vectors $\mathbf{v}$ of the space $V_{n_{1}}^{\bot}$ :

Proof of Lemma 3.2 assuming Lemma 4.1 Write

Note that as $q_{st}(i)=p_{st}(i)/k_{i}$ ,

for some unit vector $\mathbf{v}\in V_{i}^{\bot}$ .

Thus, if $i>n_{1}$ , then $V_{i}^{\bot}\subset V_{n_{1}}^{\bot}$ and, hence, by Lemma 4.1

We now focus on the infinity norm of $\mathbf{v}$ and follow an argument from TVhard .

Proof of Lemma 4.1 By the union bound, it suffices to show that $|v_{1}|=O(\log^{-2\alpha}n)$ with probability at least $1-O(n^{-101})$ , where $v_{1}$ is the first coordinate of $\mathbf{v}$ .

Let $B$ be the matrix formed by the first $n_{1}$ rows $\mathbf{a}_{1},\ldots,\mathbf{a}_{n_{1}}$ of $A$ . Assume that $\mathbf{v}\in V_{n_{1}}^{\bot}$ is a unit vector, then

Let $\mathbf{w}$ be the first column of $B$ , and $B^{\prime}$ be the matrix obtained by deleting $\mathbf{w}$ from $B$ . Clearly,

where $\mathbf{v}^{\prime}$ is the vector obtained from $\mathbf{v}$ by deleting $v_{1}$ .

We next invoke the following result, which is a variant of TVhard, Lemma 4.1. This lemma was proved using a method of Guionnet and Zeitouni GZ, based on Talagrand’s inequality.

For any constant $\beta>0$ the following holds for all sufficiently large constant $\alpha>0$ . Let $A_{n}$ be a random matrix of size $n$ by $n$ , where the entries $a_{ij}$ are independent random variables of mean zero, variance one and bounded in absolute value by $\log^{\beta}n$ . Then for any $n/\log^{\alpha}n\leq k\leq n/2$ , there exist $2k$ singular values of $A_{n}$ in the interval $[0,ck/\sqrt{n}]$ , for some absolute constant $c$ , with probability at least $1-O(n^{-101})$ .

We can prove Lemma 4.2 by following the arguments in TVhard , Lemma 4.1, almost word by word.

By the interlacing law and Lemma 4.2, we conclude that $B^{\prime}$ has $n-n_{1}$ singular values in the interval $[0,c(n-n_{1})/\sqrt{n}]$ with probability $1-O(n^{-101})$ .

Let $H$ be the space spanned by the left singular vectors of these singular values, and let $\pi$ be the orthogonal projection onto $H$ . By definition, the spectral norm of $\pi B^{\prime}$ is bounded,

here we used the fact that $\mathbf{w}$ is independent from $B^{\prime}$ , and thus from $\pi$ .

On the other hand, since the dimension of $H$ is $n-n_{1}$ , Lemma 2.4 implies that $\|\pi\mathbf{w}\|\geq\sqrt{n-n_{1}}/2$ with probability $1-4\exp(-(n-n_{1})/16)=1-O(n^{-\omega(1)})$ .

Proof of Lemma 2.6: End game

Recall from (10) that conditioned on any first $i$ rows, $|X_{i}|=O(k_{i}^{-3/8})$ with probability $1-O(\exp(-\log^{2}n/2))$ . So, by paying an extra term of $O(\exp(-\log^{2}n/2))$ in probability, it suffices to justify Lemma 2.6 for the sequence $X_{i}^{\prime}:=X_{i}\cdot\mathbf{I}_{|X_{i}|=O(k_{i}^{-3/8})}$ .

On the other hand, the sequence $X_{i+1}^{\prime}$ is not a martingale difference sequence, so we slightly modify $X_{i+1}^{\prime}$ to $X_{i+1}^{\prime\prime}:=X_{i+1}^{\prime}-{\mathbf{E}}(X_{i+1}^{\prime}|\mathcal{E}_{i})$ and prove the claim for the sequence $X_{i+1}^{\prime\prime}$. Here we recall that $\mathcal{E}_{i}$ is the $\sigma$-algebra generated by the first $i$ rows of $A_{n}$. In order to show that this modification has essentially no effect, we first demonstrate that ${\mathbf{E}}(X_{i+1}^{\prime}|\mathcal{E}_{i})$ is extremely small.

Recall from (12) that $X_{i+1}=\sum_{s,t}q_{st}(i)a_{s}a_{t}-1$ . By the Cauchy–Schwarz inequality and the assumption that $a_{s}$ are bounded in absolute value by $\log^{O(1)}n$ , we have with probability one

To justify Lemma 2.6 for the sequence $X_{i+1}^{\prime\prime}$ , we apply Theorem 2.3.

The key point here is that thanks to the indicator function in the definition of $X_{i+1}^{\prime}$ and the fact that the difference between $X_{i+1}^{\prime\prime}$ and $X_{i+1}^{\prime}$ is negligible, $X_{i+1}^{\prime\prime}$ is bounded by $O(k_{i}^{-3/8})$ with probability one, so the conditions ${\mathbf{E}}(|X_{i+1}^{\prime\prime}|^{3}|\mathcal{E}_{i})\leq\gamma_{i}{\mathbf{E}}({X_{i+1}^{\prime\prime}}^{2}|\mathcal{E}_{i})$ in Theorem 2.3 are satisfied with

We need to estimate $s_{n_{0}},v_{n_{0}}$ with respect to the sequence $X_{i+1}^{\prime\prime}$ . However, thanks to the observations above, $X_{i+1}$ and $X_{i+1}^{\prime\prime}$ are very close, and so it suffices to compute these values with respect to the sequence $X_{i+1}$ .

Also, recall from Section 4 that with probability $1-O(n^{-100})$ ,

This bound, together with (13) and (5), implies that with probability one

which in turn implies that $v_{n_{0}}^{2}=2\log n+O(\log\log n)$ with probability $1-O(n^{-100})$ .

Using (5) again, because $n^{-100}n^{2}\log^{O(1)}n=o(1)$ , we deduce that

With another application of (5), we obtain

By the conclusion of Theorem 2.3 and setting $\alpha$ sufficiently large, we conclude

Proof of Lemma 2.7: End game

Our goal is to justify Lemma 3.1, which together with (14) verifies Lemma 2.7.

We will show that the variance $\operatorname{Var}(\sum_{i<n_{0}}Y_{i+1})$ is small and then use Chebyshev’s inequality. The proof is based on a series of routine, but somewhat tedious calculations. We first show that the expectations of the $Y_{i+1}$ ’s are zero, and so are the covariances ${\mathbf{E}}(Y_{i+1}Y_{j+1})$ by an elementary manipulation. The variances $\operatorname{Var}(Y_{i+1})$ will be bounded from above by the Cauchy–Schwarz inequality.

We start with the formula $X_{i+1}^{2}=(\sum_{s,t}q_{st}(i)a_{s}a_{t})^{2}-2\sum_{s,t}q_{st}(i)a_{s}a_{t}+1$ . Observe that

Expanding each term, using the fact that $\sum_{s}q_{ss}(i)=1$ and $\sum_{s,t}q_{st}(i)^{2}=\frac{1}{k_{i}}$ , we have

As ${\mathbf{E}}a_{s}=0,{\mathbf{E}}a_{s}^{2}=1$, and the $a_{s}$’s are mutually independent and independent of every row of index at most $i$ [and in particular of the $q_{st}(i)$’s], every term in the last formula is zero, and so we infer that ${\mathbf{E}}(Y_{i+1})=0$ and ${\mathbf{E}}(Y_{i+1}|\mathcal{E}_{i})=0$, confirming (13). With the same reasoning, we can also infer that the covariance ${\mathbf{E}}(Y_{i+1}Y_{j+1})=0$ for all $j<i$.

It is thus enough to work with the diagonal terms $\operatorname{Var}(Y_{i+1})$ . We have

After a series of cancellations, and because of condition C0, we have

where the first two rows consist of the squares of the terms appearing in $Y_{i+1}$ (after deleting several sums of zero expected value), and each of the following rows was obtained by expanding the product of each term with the rest in the order of their appearance.

Because $\sum_{s,t}q_{st}(i)^{2}=\frac{1}{k_{i}}$, one has $\max_{s,t}|q_{st}(i)|\leq\frac{1}{\sqrt{k_{i}}}$. Recall furthermore that $\sum_{s}q_{ss}(i)=1$ and $0\leq q_{ss}(i)$ for all $s$. We next estimate the terms under consideration one by one as follows.

First, the sums $\sum_{s}q_{ss}^{3}(i),\sum_{s}q_{ss}(i)^{4}$ , $\sum_{s,t}q_{ss}(i)q_{st}(i)^{2},\sum_{s,t}q_{ss}(i)^{2}q_{st}(i)^{2}$ , $\sum_{s,t}q_{ss}(i)q_{tt}(i)q_{st}(i)^{2}$ , and $\sum_{s,t}|q_{st}(i)^{3}q_{ss}(i)|$ can be bounded by $\max_{s,t}|q_{st}(i)|\sum_{s,t}q_{st}^{2}(i)$ , and so by $k_{i}^{-3/2}$ .

Second, by applying the Cauchy–Schwarz inequality if needed, one can bound the sums $\sum_{s,t_{1},t_{2}}q_{st_{1}}(i)^{2}q_{st_{2}}(i)^{2}$ , $\sum_{s_{1},t_{1},s_{2},t_{2}}|q_{s_{1}t_{1}}(i)q_{s_{1}t_{2}}(i)q_{s_{2}t_{1}}(i)q_{s_{2}t_{2}}(i)|$ , and $\sum_{s,t_{1},t_{2}}|q_{ss}(i)q_{st_{1}}(i)q_{st_{2}}(i)q_{t_{1}t_{2}}(i)|$ by $2(\sum_{s,t}q_{st}^{2}(i))^{2}$ , and so by $2k_{i}^{-2}$ .

$\sum_{s,t_{1},t_{2}}q_{ss}(i)^{2}q_{t_{1}t_{1}}(i)q_{t_{2}t_{2}}(i)=(\sum_{s}q_{ss}(i)^{2})(\sum_{t}q_{tt}(i))^{2}=\sum_{s}q_{ss}(i)^{2}.$

$\sum_{s,t}q_{ss}(i)^{2}q_{tt}(i)+\sum_{s,t}q_{ss}(i)^{3}q_{tt}(i)\leq 2(\sum_{s}q_{ss}(i)^{2})(\sum_{t}q_{tt}(i))=2\sum_{s}q_{ss}(i)^{2}.$

$\sum_{s,t}|q_{ss}(i)q_{tt}(i)q_{st}(i)|\leq\sum_{s,t}q_{ss}(i)(q_{tt}(i)^{2}+q_{st}(i)^{2})\leq\sum_{t}q_{tt}(i)^{2}+\max_{s}q_{ss}(i)\sum_{s,t}q_{st}(i)^{2}\leq\sum_{t}q_{tt}(i)^{2}+k_{i}^{-3/2}.$

$\sum_{s,t}|q_{ss}(i)^{2}q_{tt}(i)q_{st}(i)|\leq\sup_{s,t}|q_{st}(i)|\sum_{s,t}q_{ss}(i)^{2}q_{tt}(i)\leq\sum_{s}q_{ss}(i)^{2}/\sqrt{k_{i}}$ .

where we applied Lemma 3.2 in the last estimate.

To complete the proof, we note from the estimate of $s_{n_{0}}^{2}$ of Section 5 and from Lemma 3.2 that $|\sum_{i<n_{0}}{\mathbf{E}}Y_{i+1}|=O(\log\log n)$ . Thus, by Chebyshev’s inequality

Proof of Lemma 2.8

We recall that, with $i\geq n_{0}$ , $\Delta_{i+1}^{2}$ is a Chi-square random variable of degree $n-i$ . Let us first consider the lower tail; it suffices to show

By properties of the normal distribution, it is easy to show that $\Delta_{n}^{2}$ and $\Delta_{n-1}^{2}$ are at least $\exp(-\frac{\sqrt{2}}{4}\log^{c}n)$ with probability $1-\exp(-\Omega(\log^{c}n))$ , so we can omit these terms from the sum. It now suffices to show that

Flipping the inequality inside the probability (by changing the sign of the RHS and swapping the denominators and numerators in the logarithms of the LHS) and using the Laplace transform trick (based on the fact that the $\Delta^{2}_{i}$ are independent), we see that the probability in question is at most

Recall that $\Delta_{i+1}^{2}$ is a Chi-square random variable with $n-i$ degrees of freedom, so ${\mathbf{E}}\frac{1}{\Delta^{2}_{i+1}}=\frac{1}{n-i-2}$. Therefore, the numerator in the previous formula is $\frac{(n-n_{0})(n-n_{0}-1)}{2}\leq\log^{2\alpha}n$.
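For completeness, the inverse moment used here follows from the Chi-square density. The sketch below writes $m$ for the number of degrees of freedom (a symbol introduced only for this computation, so $m=n-i$ above) and assumes $m>2$, which is why the two terms with the fewest degrees of freedom are handled separately.

```latex
{\mathbf{E}}\frac{1}{\chi^{2}_{m}}
  =\frac{1}{2^{m/2}\Gamma(m/2)}\int_{0}^{\infty}x^{m/2-2}e^{-x/2}\,dx
  =\frac{2^{m/2-1}\,\Gamma(m/2-1)}{2^{m/2}\,\Gamma(m/2)}
  =\frac{1}{2(m/2-1)}
  =\frac{1}{m-2}.
```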

The proof for the upper tail is similar (in fact simpler as we do not need to treat the first two terms separately) and we omit the details.

Deduction of Theorem 2.1 from Theorem 2.2

Our plan is to replace one by one the last $n-n_{0}$ Gaussian rows of $A_{n}$ by vectors of components having zero mean, unit variance and satisfying condition C0. Our key tool here is the classical Berry–Esseen inequality. In order to apply it, we will make crucial use of Lemma 4.1.

Assume that $\mathbf{v}=(v_{1},\ldots,v_{n})$ is a unit vector. Assume that $b_{1},\ldots,b_{n}$ are independent random variables of mean zero, variance one and satisfying condition C0. Then we have

where $c$ is a constant depending only on the parameters appearing in (3).

We remark that in the original setting of Berry and Esseen, it suffices to assume the finite third moment.

In application, $\mathbf{v}$ plays the role of the normal vector of the hyperplane spanned by the remaining $n-1$ rows of $A$ , and $\Delta_{n}=|v_{1}b_{1}+\cdots+v_{n}b_{n}|$ , where $(b_{1},\ldots,b_{n})=\mathbf{b}$ is the vector to be replaced.

For the deduction, it is enough to show the following.

Let $A_{n}$ be a random matrix with atom variables satisfying condition C0 and nonsingular with probability one. Assume furthermore that $A_{n}$ has at least one and at most $\log^{\alpha}n$ Gaussian rows. Let $B_{n}$ be the random matrix obtained from $A_{n}$ by replacing a Gaussian row vector $\mathbf{a}$ of $A_{n}$ by a random vector $\mathbf{b}=(b_{1},\ldots,b_{n})$ whose coordinates are independent atom variables satisfying condition C0 such that the resulting matrix is nonsingular with probability one. Then

Clearly, Theorem 2.1 follows from Theorem 2.2 by applying Lemma 8.2 $\log^{\alpha}n$ times.

Proof of Lemma 8.2 Without loss of generality, we can assume that $B_{n}$ is obtained from $A_{n}$ by replacing the last row $\mathbf{a}_{n}$ . As $A_{n}$ is nonsingular, $\dim(V_{n-1})=n-1$ .

By Lemma 4.1, by paying an extra term of $O(n^{-100})$ in probability (which will be absorbed by the eventual bound $\log^{-2\alpha}n$ ), we may also assume that the normal vector $\mathbf{v}$ of $V_{n-1}$ satisfies

where $\Delta_{n}$ and $\Delta^{\prime}_{n}$ are the distances from $\mathbf{a}_{n}$ and $\mathbf{b}_{n}$ to $V_{n-1}$, respectively.

Appendix: Simplifying the model: Deducing Theorem 1.1 from Theorem 2.1

In this section we show that the two extra assumptions that $|a_{ij}|\leq\log^{\beta}n$ and $A_{n}$ has full rank with probability one do not violate the generality of Theorem 1.1.

To start with, we need a very weak lower bound on $|\det A_{n}|$ .

It follows from TVcir , Theorem 2.1, that there is a constant $C$ such that ${\mathbf{P}}(\sigma_{n}(A_{n})\leq n^{-C})\leq n^{-1}$ . Since $|\det A_{n}|$ is the product of its singular values, the bound follows.

The above bound is extremely weak. By modifying the proof in TVdet, one can actually prove the Tao–Vu lower bound (1) for random matrices satisfying C0. Also, sharper bounds on the least singular value are obtained in TVsmooth, RV. However, for the arguments in this section, we only need the bound in Lemma .3.

Let us start with the assumption $|a_{ij}|\leq\log^{\beta}n$ . We can achieve this assumption using the standard truncation method (see BS or TVlocal ). In what follows, we sketch the idea.

Notice that by condition C0, we have, with probability at least $1-\exp(-\log^{10}n)$, that all entries of $A_{n}$ have absolute value at most $\log^{\beta}n$, for some constant $\beta>0$ which may depend on the constants in C0.

We replace the variable $a_{ij}$ by the variable $a_{ij}^{\prime}:=a_{ij}{\mathbf{I}}_{|a_{ij}|\leq\log^{\beta}n}$ , for all $1\leq i,j\leq n$ and let $A_{n}^{\prime}$ be the random matrix formed by $a_{ij}^{\prime}$ . Since with probability at least $1-\exp(-\log^{10}n)$ , $A_{n}=A_{n}^{\prime}$ , it is easy to show that if $A_{n}^{\prime}$ satisfies the claim of Theorem 1.1, then so does $A_{n}$ .

While the entries of $A_{n}^{\prime}$ are bounded by $\log^{\beta}n$, there is still one problem we need to address, namely, that the new variables $a_{ij}^{\prime}$ do not have mean zero and variance one. We can achieve this by a simple normalization trick. First observe that by property C0, taking $\beta$ sufficiently large, it is easy to show that $\mu_{ij}={\mathbf{E}}a_{ij}^{\prime}$ has absolute value at most $n^{-\omega(1)}$ and $|1-\sigma_{ij}|\leq n^{-\omega(1)}$, where $\sigma_{ij}$ is the standard deviation of $a^{\prime}_{ij}$. Now define

$$a_{ij}^{\prime\prime}:=a_{ij}^{\prime}-\mu_{ij},\qquad a_{ij}^{\prime\prime\prime}:=\frac{a_{ij}^{\prime\prime}}{\sigma_{ij}}.$$

Note that $a_{ij}^{\prime\prime\prime}$ now does have mean zero and variance one. Let $A_{n}^{\prime\prime}$ and $A_{n}^{\prime\prime\prime}$ be the corresponding matrices of $a_{ij}^{\prime\prime}$ and $a_{ij}^{\prime\prime\prime}$ , respectively.
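The truncate, center and rescale steps can be illustrated on a toy discrete variable. This is a minimal sketch under stated assumptions: it takes the centering to be $a''=a'-\mu$ and the rescaling to be $a'''=a''/\sigma$, and the distribution, the threshold `bound` (standing in for $\log^{\beta}n$) and the helper `mean_and_sd` are all hypothetical choices made for illustration.

```python
import math

def mean_and_sd(values, probs):
    """Exact mean and standard deviation of a finite discrete distribution."""
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, math.sqrt(var)

# Toy atom variable: standardize an arbitrary distribution with one rare large value.
probs = [0.4995, 0.4995, 0.001]
raw = [-1.0, 1.0, 6.0]
m, s = mean_and_sd(raw, probs)
atom = [(v - m) / s for v in raw]               # a: mean zero, unit variance

bound = 5.0                                     # stands in for log^beta n
truncated = [v if abs(v) <= bound else 0.0 for v in atom]   # a' = a * 1{|a| <= bound}
mu, sigma = mean_and_sd(truncated, probs)       # mu = E a', sigma = sd of a'

centered = [v - mu for v in truncated]          # a'' = a' - mu
normalized = [v / sigma for v in centered]      # a''' = a'' / sigma

final_mean, final_sd = mean_and_sd(normalized, probs)
print("mu:", mu, "sigma:", sigma)
print("mean and sd after normalization:", final_mean, final_sd)
```

In this toy example the rare value carries visible mass, so $\mu$ and $1-\sigma$ are small but not negligible; under condition C0 with $\beta$ large they are of size $n^{-\omega(1)}$, which is what makes the comparison of determinants work.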

where $N_{n}$ is the matrix formed by $\mu_{ij}$ .

Since $|\mu_{ij}|=n^{-\omega(1)}$ , by Hadamard’s bound $|\det N_{n}|^{1/n}\leq n^{-\omega(1)}$ . On the other hand, we have by Lemma .3 that ${\mathbf{P}}(|\det A_{n}^{\prime\prime}|^{1/n}\geq n^{-C})\geq 1-n^{-1}$ . It thus follows that

We can prove a matching lower bound by the same argument. From here, we conclude that if $|\det A_{n}^{\prime\prime}|$ satisfies the conclusion of Theorem 1.1, then so does $|\det A_{n}^{\prime}|$ .

To pass from $\det(A_{n}^{\prime\prime})$ to $\det(A_{n}^{\prime\prime\prime})$ , we apply the Brunn–Minkowski inequality again,

where $N_{n}^{\prime}$ is the matrix formed by $a_{ij}^{\prime\prime}(1-\sigma_{ij}^{-1})$. Noting that $|1-\sigma_{ij}^{-1}|\leq n^{-\omega(1)}$ and $|a_{ij}^{\prime\prime}|=\log^{O(1)}n$, we infer that $|\det(A_{n}^{\prime\prime})|$ and $|\det(A_{n}^{\prime\prime\prime})|$ are comparable with high probability.

Now we address the assumption that $A_{n}$ has full rank with probability one. Notice that this is usually not true when the $a_{ij}$ have discrete distribution (such as Bernoulli). However, we find the following simple trick that makes the assumption valid for our study.

Instead of the entry $a_{ij}$, consider $a_{ij}^{\prime}:=(1-\varepsilon^{2})^{1/2}a_{ij}+\varepsilon\xi_{0}$, where $\xi_{0}$ is uniform on a bounded interval and $\varepsilon$ is very small, say, $n^{-1000n}$. It is clear that the matrix $A_{n}^{\prime}$ formed by the $a_{ij}^{\prime}$ has full rank with probability one. On the other hand, it is easy to show that by the Brunn–Minkowski inequality and Hadamard’s bound

Furthermore, by Lemma .3, $|\det A_{n}|\geq n^{-Cn}$ with probability $1-n^{-1}$ , and so we can conclude as in the previous argument.