Hanson-Wright inequality and sub-gaussian concentration

Mark Rudelson, Roman Vershynin

Hanson-Wright inequality

Hanson-Wright inequality is a general concentration result for quadratic forms in sub-gaussian random variables. A version of this theorem was first proved in , however with one weak point mentioned in Remark 1.2. In this article we give a modern proof of Hanson-Wright inequality, which automatically fixes the original weak point. We then deduce a useful concentration inequality for sub-gaussian random vectors, and illustrate it with two applications.

One of the aims of this note is to give a simple and self-contained proof of the Hanson–Wright inequality using only the standard toolkit of the large deviation theory. Several partial results and alternative proofs are scattered in the literature.

Improving upon an earlier result on Hanson-Wright , Wright established a slightly weaker version of Theorem 1.1. Instead of $\|A\|=\|(a_{ij})\|$ , both papers had $\|(|a_{ij}|)\|$ in the right side. The latter norm can be much larger than the norm of $A$ , and it is often less easy to compute. This weak point went unnoticed in several later applications of Hanson-Wright inequality, however it was clear to experts that it could be fixed.

A proof for the case where $X_{1},\ldots,X_{n}$ are independent symmetric Bernoulli random variables appears in the lecture notes of Nelson . The moment inequality which essentially implies the result of can be also found in . A different approach to Hanson-Wright inequality, due to Rauhut and Tropp, can be found in [8, Proposition 8.13]. It is presented for diagonal-free matrices (however this assumption can be removed by treating the diagonal separately as is done below), and for independent symmetric Bernoulli random variables (but the proof can be extended to sub-gaussian random variables).

The paper contains an alternative short proof of Hanson–Wright inequality due to Latala for diagonal-free matrices. Like in the proof below, Latala’s argument uses decoupling of the order 2 chaos. However, unlike the current paper, which uses a simple decoupling argument of Bourgain , his proof uses a more general and more difficult decoupling theorem for U-statistics due to de la Peña and Montgomery-Smith . For an extensive discussion of modern decoupling methods see .

Large deviation inequalities for polynomials of higher degree, which extend the Hanson-Wright type inequalities, have been obtained by Latala and recently by Adamczak and Wolff .

By replacing $X$ with $X/K$ we can assume without loss of generality that $K=1$ . Let us first estimate

Let $A=(a_{ij})_{i,j=1}^{n}$ . By independence and zero mean of $X_{i}$ , we can represent

The problem reduces to estimating the diagonal and off-diagonal sums:

These standard bounds can be found in [18, Remark 5.18 and Lemma 5.14]. Then we can use a Bernstein-type inequality (see [18, Proposition 5.16]) and obtain

Step 2: decoupling. It remains to bound the off-diagonal sum

The argument will be based on estimating the moment generating function of $S$ by decoupling and reduction to normal random variables.

Let $\lambda>0$ be a parameter whose value we will determine later. By Chebyshev’s inequality, we have

where $E_{X,\delta}$ denotes expectation with respect to both $X$ and $\delta$ . Consider the set of indices $\Lambda_{\delta}=\{i\in[n]:\,\delta_{i}=1\}$ and express

Next, we use a standard estimate of the moment generating function of centered sub-gaussian random variables, see [18, Lemma 5.5]. It yields

Taking expectations of both sides with respect to $(X_{i})_{i\in\Lambda_{\delta}}$ , we obtain

Recall that this estimate holds for every fixed $\delta$ . It remains to estimate $E_{\delta}$ .

Step 3: reduction to normal random variables. Consider $g=(g_{1},\ldots,g_{n})$ where $g_{i}$ are independent $N(0,1)$ random variables. The rotation invariance of normal distribution implies that for each fixed $\delta$ and $X$ , we have

Rearranging the terms, we can write $Z=\sum_{i\in\Lambda_{\delta}}X_{i}\Big{(}\sum_{j\in\Lambda_{\delta}^{c}}a_{ij}g_{j}\Big{)}$ . Then we can bound the moment generating function of $Z$ in the same way we bounded the moment generating function of $S_{\delta}$ in Step 2, only now relying on the sub-gaussian properties of $X_{i}$ , $i\in\Lambda_{\delta}$ . We obtain

Recall that this bound holds for each fixed $\delta$ . We have removed the original random variables $X_{i}$ from the problem, so it now becomes a problem about normal random variables $g_{i}$ .

Step 4: calculation for normal random variables. By the rotation invariance of the distribution of $g$ , the random variable $\|A_{\delta}g\|_{2}^{2}$ is distributed identically with $\sum_{i}s_{i}^{2}g_{i}^{2}$ where $s_{i}$ denote the singular values of $A_{\delta}$ . Hence by independence,

Using the numeric inequality $(1-z)^{-1/2}\leq e^{z}$ which is valid for all $0\leq z\leq 1/2$ , we can simplify this as follows:

This is a uniform bound for all $\delta$ . Now we take expectation with respect to $\delta$ . Recalling (1.3) and (1.4), we obtain the following estimate on the moment generating function of $S$ :

Step 5: conclusion. Putting this estimate into the exponential Chebyshev’s inequality (1.2), we obtain

Optimizing over $\lambda$ , we conclude that

Now we combine with a similar estimate (1.1) for $p_{1}$ and obtain

Sub-gaussian concentration

Hanson-Wright inequality has a useful consequence, a concentration inequality for random vectors with independent sub-gaussian coordinates.

A few special cases of Theorem 2.1 can be easily deduced from classical concentration inequalities. For Gaussian random variables $X_{i}$ , this result is a standard consequence of Gaussian concentration, see e.g. . For bounded random variables $X_{i}$ , it can be deduced in a similar way from Talagrand’s concentration for convex Lipschitz functions , see [16, Theorem 2.1.13]. For more general random variables, one can find versions of Theorem 2.1 with varying degrees of generality scattered in the literature (e.g. the appendix of ). However, we were unable to find Theorem 2.1 in the existing literature.

Using a standard symmetrization argument, we can deduce from Theorem 2.1 some bounds on small ball probabilities. The following result is due to Latala et al. [12, Theorem 2.5].

The components of the random vector $X-X^{\prime}$ have mean zero, variances bounded below by $2$ and sub-gaussian norms bounded above by $2K$ . Thus we can apply Theorem 2.1 for $\frac{1}{\sqrt{2}}(X-X^{\prime})$ and conclude that

Suppose this event occurs. Then by triangle inequality, $\|AX-y\|_{2}\geq\|y\|_{2}-\|AX\|_{2}\geq\|y\|_{2}-\frac{3}{2}h$ . Combining this with the second inequality in (2.3), we obtain that

Two applications

Concentration results like Theorem 2.1 have many useful consequences. We include two applications in this article; the reader will certainly find more.

The first application is a concentration of distance from a random vector to a fixed subspace. For random vectors with bounded components, one can find a similar result in [16, Corollary 2.1.19], where it was deduced from Talagrand’s concentration inequality.

Our second application of Theorem 2.1 is for operator norms of random matrices. The result essentially states that an $m\times n$ matrix $BG$ obtained as a product of a deterministic matrix $B$ and a random matrix $G$ with independent sub-gaussian entries satisfies

with high probability. For random matrices with heavy-tailed rather than sub-gaussian components, this problem was studied in .

The last part of the proof is a standard covering argument. Let $\mathcal{N}$ be an $1/2$ -net of $S^{n-1}$ in the Euclidean metric. We can choose this net so that $|\mathcal{N}|\leq 5^{n}$ , see [18, Lemma 5.2]. By a union bound, with probability at least

The same holds if $B=P$ is an $r\times N$ matrix such that $PP^{\mathsf{T}}=I_{r}$ .

We formulated the results in Sections 2 and 3 for real matrices and real valued random variables. Using a standard complexification trick, one can easily obtain complex versions of these results. Let us show how to complexify Theorem 2.1; the other applications follow from it.

Suppose now that both $A$ and $X$ are complex. Let us assume that the components $X_{i}$ have independent real and imaginary parts, such that