Hanson-Wright inequality and sub-gaussian concentration

Mark Rudelson, Roman Vershynin

Hanson-Wright inequality

Hanson-Wright inequality is a general concentration result for quadratic forms in sub-gaussian random variables. A version of this theorem was first proved in , however with one weak point mentioned in Remark 1.2. In this article we give a modern proof of Hanson-Wright inequality, which automatically fixes the original weak point. We then deduce a useful concentration inequality for sub-gaussian random vectors, and illustrate it with two applications.

One of the aims of this note is to give a simple and self-contained proof of the Hanson–Wright inequality using only the standard toolkit of the large deviation theory. Several partial results and alternative proofs are scattered in the literature.

Improving upon an earlier result on Hanson-Wright , Wright established a slightly weaker version of Theorem 1.1. Instead of A=(aij)\|A\|=\|(a_{ij})\|, both papers had (aij)\|(|a_{ij}|)\| in the right side. The latter norm can be much larger than the norm of AA, and it is often less easy to compute. This weak point went unnoticed in several later applications of Hanson-Wright inequality, however it was clear to experts that it could be fixed.

A proof for the case where X1,,XnX_{1},\ldots,X_{n} are independent symmetric Bernoulli random variables appears in the lecture notes of Nelson . The moment inequality which essentially implies the result of can be also found in . A different approach to Hanson-Wright inequality, due to Rauhut and Tropp, can be found in [8, Proposition 8.13]. It is presented for diagonal-free matrices (however this assumption can be removed by treating the diagonal separately as is done below), and for independent symmetric Bernoulli random variables (but the proof can be extended to sub-gaussian random variables).

The paper contains an alternative short proof of Hanson–Wright inequality due to Latala for diagonal-free matrices. Like in the proof below, Latala’s argument uses decoupling of the order 2 chaos. However, unlike the current paper, which uses a simple decoupling argument of Bourgain , his proof uses a more general and more difficult decoupling theorem for U-statistics due to de la Peña and Montgomery-Smith . For an extensive discussion of modern decoupling methods see .

Large deviation inequalities for polynomials of higher degree, which extend the Hanson-Wright type inequalities, have been obtained by Latala and recently by Adamczak and Wolff .

By replacing XX with X/KX/K we can assume without loss of generality that K=1K=1. Let us first estimate

Let A=(aij)i,j=1nA=(a_{ij})_{i,j=1}^{n}. By independence and zero mean of XiX_{i}, we can represent

The problem reduces to estimating the diagonal and off-diagonal sums:

These standard bounds can be found in [18, Remark 5.18 and Lemma 5.14]. Then we can use a Bernstein-type inequality (see [18, Proposition 5.16]) and obtain

Step 2: decoupling. It remains to bound the off-diagonal sum

The argument will be based on estimating the moment generating function of SS by decoupling and reduction to normal random variables.

Let λ>0\lambda>0 be a parameter whose value we will determine later. By Chebyshev’s inequality, we have

where EX,δE_{X,\delta} denotes expectation with respect to both XX and δ\delta. Consider the set of indices Λδ={i[n]:δi=1}\Lambda_{\delta}=\{i\in[n]:\,\delta_{i}=1\} and express

Next, we use a standard estimate of the moment generating function of centered sub-gaussian random variables, see [18, Lemma 5.5]. It yields

Taking expectations of both sides with respect to (Xi)iΛδ(X_{i})_{i\in\Lambda_{\delta}}, we obtain

Recall that this estimate holds for every fixed δ\delta. It remains to estimate EδE_{\delta}.

Step 3: reduction to normal random variables. Consider g=(g1,,gn)g=(g_{1},\ldots,g_{n}) where gig_{i} are independent N(0,1)N(0,1) random variables. The rotation invariance of normal distribution implies that for each fixed δ\delta and XX, we have

Rearranging the terms, we can write Z=\sum_{i\in\Lambda_{\delta}}X_{i}\Big{(}\sum_{j\in\Lambda_{\delta}^{c}}a_{ij}g_{j}\Big{)}. Then we can bound the moment generating function of ZZ in the same way we bounded the moment generating function of SδS_{\delta} in Step 2, only now relying on the sub-gaussian properties of XiX_{i}, iΛδi\in\Lambda_{\delta}. We obtain

Recall that this bound holds for each fixed δ\delta. We have removed the original random variables XiX_{i} from the problem, so it now becomes a problem about normal random variables gig_{i}.

Step 4: calculation for normal random variables. By the rotation invariance of the distribution of gg, the random variable Aδg22\|A_{\delta}g\|_{2}^{2} is distributed identically with isi2gi2\sum_{i}s_{i}^{2}g_{i}^{2} where sis_{i} denote the singular values of AδA_{\delta}. Hence by independence,

Using the numeric inequality (1z)1/2ez(1-z)^{-1/2}\leq e^{z} which is valid for all 0z1/20\leq z\leq 1/2, we can simplify this as follows:

This is a uniform bound for all δ\delta. Now we take expectation with respect to δ\delta. Recalling (1.3) and (1.4), we obtain the following estimate on the moment generating function of SS:

Step 5: conclusion. Putting this estimate into the exponential Chebyshev’s inequality (1.2), we obtain

Optimizing over λ\lambda, we conclude that

Now we combine with a similar estimate (1.1) for p1p_{1} and obtain

Sub-gaussian concentration

Hanson-Wright inequality has a useful consequence, a concentration inequality for random vectors with independent sub-gaussian coordinates.

A few special cases of Theorem 2.1 can be easily deduced from classical concentration inequalities. For Gaussian random variables XiX_{i}, this result is a standard consequence of Gaussian concentration, see e.g. . For bounded random variables XiX_{i}, it can be deduced in a similar way from Talagrand’s concentration for convex Lipschitz functions , see [16, Theorem 2.1.13]. For more general random variables, one can find versions of Theorem 2.1 with varying degrees of generality scattered in the literature (e.g. the appendix of ). However, we were unable to find Theorem 2.1 in the existing literature.

Using a standard symmetrization argument, we can deduce from Theorem 2.1 some bounds on small ball probabilities. The following result is due to Latala et al. [12, Theorem 2.5].

The components of the random vector XXX-X^{\prime} have mean zero, variances bounded below by 22 and sub-gaussian norms bounded above by 2K2K. Thus we can apply Theorem 2.1 for 12(XX)\frac{1}{\sqrt{2}}(X-X^{\prime}) and conclude that

Suppose this event occurs. Then by triangle inequality, AXy2y2AX2y232h\|AX-y\|_{2}\geq\|y\|_{2}-\|AX\|_{2}\geq\|y\|_{2}-\frac{3}{2}h. Combining this with the second inequality in (2.3), we obtain that

Two applications

Concentration results like Theorem 2.1 have many useful consequences. We include two applications in this article; the reader will certainly find more.

The first application is a concentration of distance from a random vector to a fixed subspace. For random vectors with bounded components, one can find a similar result in [16, Corollary 2.1.19], where it was deduced from Talagrand’s concentration inequality.

Our second application of Theorem 2.1 is for operator norms of random matrices. The result essentially states that an m×nm\times n matrix BGBG obtained as a product of a deterministic matrix BB and a random matrix GG with independent sub-gaussian entries satisfies

with high probability. For random matrices with heavy-tailed rather than sub-gaussian components, this problem was studied in .

The last part of the proof is a standard covering argument. Let N\mathcal{N} be an 1/21/2-net of Sn1S^{n-1} in the Euclidean metric. We can choose this net so that N5n|\mathcal{N}|\leq 5^{n}, see [18, Lemma 5.2]. By a union bound, with probability at least

The same holds if B=PB=P is an r×Nr\times N matrix such that PPT=IrPP^{\mathsf{T}}=I_{r}.

We formulated the results in Sections 2 and 3 for real matrices and real valued random variables. Using a standard complexification trick, one can easily obtain complex versions of these results. Let us show how to complexify Theorem 2.1; the other applications follow from it.

Suppose now that both AA and XX are complex. Let us assume that the components XiX_{i} have independent real and imaginary parts, such that

References