A tail inequality for quadratic forms of subgaussian random vectors

Daniel Hsu, Sham M. Kakade, Tong Zhang

Introduction

Consider the special case where x1,,xnx_{1},\ldots,x_{n} are independent standard Gaussian random variables. The following proposition provides an (upper) tail bound for Ax2\|Ax\|^{2}.

The proof, given in Appendix A.2, is straightforward given the rotational invariance of the multivariate Gaussian distribution, together with a tail bound for linear combinations of χ2\chi^{2} random variables due to Laurent and Massart (2000). We note that a slightly weaker form of Proposition 1 can be proved directly using Gaussian concentration (Pisier, 1989).

We provide a sharp upper tail bound for this case analogous to one that holds in the Gaussian case (indeed, the same as Proposition 1 when σ=1\sigma=1).

One motivation for our main result comes from the following observations about sums of random vectors. Let a1,,ana_{1},\dotsc,a_{n} be vectors in a Euclidean space, and let A=[a1an]A=[a_{1}|\dotsb|a_{n}] be the matrix with aia_{i} as its iith column. Consider the squared norm of the random sum

falls off exponentially fast. This can be shown, for instance, using the following lemma by taking ui=aixiu_{i}=a_{i}x_{i} (the proof is standard, but we give it for completeness in Appendix A.1).

for all i=1,,ni=1,\dotsc,n, almost surely. For all t>0t>0,

After squaring the quantities in the stated probabilistic event, Proposition 2 gives the bound

with probability at least 1et1-e^{-t} when the xix_{i} are almost surely bounded by 11 (or any constant).

Unfortunately, this bound obtained from Proposition 2 can be suboptimal when the xix_{i} are subgaussian. For instance, if the xix_{i} are Rademacher random variables, so Pr[xi=+1]=Pr[xi=1]=1/2\Pr[x_{i}=+1]=\Pr[x_{i}=-1]=1/2, then it is known that

with probability at least 1et1-e^{-t}. A similar result holds for any subgaussian distribution on the xix_{i} (Hanson and Wright, 1971). This is an improvement over the previous bound because the deviation terms (i.e., those involving tt) can be significantly smaller, especially for large tt.

In this work, we give a simple proof of (2) with explicit constants that match the analogous bound when the xix_{i} are independent standard Gaussian random variables.

Positive semidefinite quadratic forms

Our main theorem, given below, is a generalization of (2).

Note that when μ=0\mu=0 and σ=1\sigma=1 we have:

Our proof actually establishes the following upper bounds on the moment generating function of Ax2\|Ax\|^{2} for 0η<1/(2σ2Σ)0\leq\eta<1/(2\sigma^{2}\|\varSigma\|):

where zz is a vector of mm independent standard Gaussian random variables.

Let USVUSV^{\top} be a singular value decomposition of AA; where UU and VV are, respectively, matrices of orthonormal left and right singular vectors; and S=diag(ρ1,,ρm)S=\operatorname{diag}(\sqrt{\rho_{1}},\dotsc,\sqrt{\rho_{m}}) is the diagonal matrix of corresponding singular values. Note that

By rotational invariance, y:=Uzy:=U^{\top}z is an isotropic multivariate Gaussian random vector with mean zero. Therefore Az2=zUS2Uz=ρ1y12++ρmym2\|A^{\top}z\|^{2}=z^{\top}US^{2}U^{\top}z=\rho_{1}y_{1}^{2}+\dotsb+\rho_{m}y_{m}^{2} and μAz=νy=ν1y1++νmym\mu^{\top}A^{\top}z=\nu^{\top}y=\nu_{1}y_{1}+\dotsb+\nu_{m}y_{m}, where ν:=SVμ\nu:=SV^{\top}\mu (note that ν2=SVμ2=Aμ2\|\nu\|^{2}=\|SV^{\top}\mu\|^{2}=\|A\mu\|^{2}). Let γ:=λ2σ2/2\gamma:=\lambda^{2}\sigma^{2}/2. By Lemma 1,

for 0γ<1/(2ρ)0\leq\gamma<1/(2\|\rho\|_{\infty}). Combining (4), (5), and (6) gives

for 0γ<1/(2ρ)0\leq\gamma<1/(2\|\rho\|_{\infty}) and ε0\varepsilon\geq 0. Choosing

where h1(a):=1+a1+2ah_{1}(a):=1+a-\sqrt{1+2a}, which has the inverse function h11(b)=2b+bh_{1}^{-1}(b)=\sqrt{2b}+b. The result follows by setting τ:=2ρ22t+2ρt=2tr(Σ2)t+2Σt\tau:=2\sqrt{\|\rho\|_{2}^{2}t}+2\|\rho\|_{\infty}t=2\sqrt{\operatorname{tr}(\varSigma^{2})t}+2\|\varSigma\|t. ∎

The following lemma is a standard estimate of the logarithmic moment generating function of a quadratic form in standard Gaussian random variables, proved much along the lines of the estimate due to Laurent and Massart (2000).

The right-hand side can be bounded using the inequalities

We give a simple application of Theorem 1 to fixed-design linear regression with the ordinary least squares estimator.

for independent subgaussian zero-mean noise variables ε1,,εn\varepsilon_{1},\dotsc,\varepsilon_{n}. Let Σ:=i=1nxixi/n\varSigma:=\sum_{i=1}^{n}x_{i}x_{i}^{\top}/n, which we assume is invertible without loss of generality. Let

be the coefficient vector of minimum expected squared error. The ordinary least squares estimator is given by

The excess loss R(β^)R(\hat{\beta}) of β^\hat{\beta} is the difference between the expected squared error of β^\hat{\beta} and that of β\beta:

so the tail inequality above is essentially tight when the yiy_{i} are independent Gaussian random variables.

References

Appendix A Standard tail inequalities

The following is a standard form of Bernstein’s inequality stated for martingale difference sequences.

The proof of Proposition 2, which is entirely standard, is an immediate consequence of the following two lemmas together with Jensen’s inequality.

Let u1,,unu_{1},\dotsc,u_{n} be random vectors such that

for all i=1,,ni=1,\dotsc,n, almost surely. For all t>0t>0,

Let sn:=u1++uns_{n}:=u_{1}+\dotsb+u_{n}. Define the Doob martingale

The claim now follows from Bernstein’s inequality (Lemma 2). ∎

Let si:=u1++uis_{i}:=u_{1}+\dotsb+u_{i} for i=1,,ni=1,\dotsc,n; we have

It is well-known that if zN(0,1)z\sim\mathcal{N}(0,1) is a standard Gaussian random variable, then z2z^{2} follows a χ2\chi^{2} distribution with one degree of freedom. The following inequality due to Laurent and Massart (2000) gives a bound on linear combinations of χ2\chi^{2} random variables.

Let VΛVV\Lambda V^{\top} be an eigen-decomposition of AAA^{\top}A, where VV is a matrix of orthonormal eigenvectors, and Λ:=diag(ρ1,,ρn)\Lambda:=\operatorname{diag}(\rho_{1},\dotsc,\rho_{n}) is the diagonal matrix of corresponding eigenvalues ρ1,,ρn\rho_{1},\dotsc,\rho_{n}. By the rotational invariance of the distribution, z:=Vxz:=V^{\top}x is an isotropic multivariate Gaussian random vector with mean zero. Thus, Ax2=zΛz=ρ1z12++ρnzn2\|Ax\|^{2}=z^{\top}\Lambda z=\rho_{1}z_{1}^{2}+\dotsb+\rho_{n}z_{n}^{2}, and the zi2z_{i}^{2} are independent χ2\chi^{2} random variables, each with one degree of freedom. The claim now follows from a tail bound for χ2\chi^{2} random variables (Lemma 5, due to Laurent and Massart, 2000). ∎