Sparse PCA via Covariance Thresholding

Yash Deshpande, Andrea Montanari

Introduction

The standard method of principal component analysis involves computing the sample covariance matrix G=n1i=1nxixiT{\mathbf{G}}=n^{-1}\sum_{i=1}^{n}{\mathbf{x}}_{i}{\mathbf{x}}_{i}^{{\sf T}} and estimates v=v1{\mathbf{v}}={\mathbf{v}}_{1} by its principal eigenvector v\mbox\scPC(G){\mathbf{v}}_{\mbox{\tiny{\sc PC}}}({\mathbf{G}}). It is a well-known fact that, in the high dimensional regime, this yields an inconsistent estimate (see [JL09]). Namely v\mbox\scPCv↛0\|{\mathbf{v}}_{\mbox{\tiny{\sc PC}}}-{\mathbf{v}}\|\not\to 0 unless p/n0p/n\to 0. Even worse, [BBAP05] and [Pau07] demonstrate the following phase transition phenomenon. Assuming that p/nα(0,)p/n\to\alpha\in(0,\infty), if β<α\beta<\sqrt{\alpha} the estimate is asymptotically orthogonal to the signal, i.e. v\mbox\scPC,v0\langle{\mathbf{v}}_{\mbox{\tiny{\sc PC}}},{\mathbf{v}}\rangle\to 0. On the other hand, for β>α\beta>\sqrt{\alpha}, v\mbox\scPC,v|\langle{\mathbf{v}}_{\mbox{\tiny{\sc PC}}},{\mathbf{v}}\rangle| remains bounded away from zero as n,pn,p\to\infty. This phase transition phenomenon has attracted considerable attention recently within random matrix theory [FP07, CDMF09, BGN11, KY13].

These inconsistency results motivated several efforts to exploit additional structural information on the signal v{\mathbf{v}}. In two influential papers, [JL04, JL09] considered the case of a signal v{\mathbf{v}} that is sparse in a suitable basis, e.g. in the wavelet domain. Without loss of generality, we will assume here that v{\mathbf{v}} is sparse in the canonical basis e1{\mathbf{e}}_{1}, …ep{\mathbf{e}}_{p}. In a nutshell, [JL09] propose the following:

Order the diagonal entries of the Gram matrix Gi(1),i(1)Gi(2),i(2)Gi(p),i(p){\mathbf{G}}_{i(1),i(1)}\geq{\mathbf{G}}_{i(2),i(2)}\geq\dots\geq{\mathbf{G}}_{i(p),i(p)}, and let J{i(1),i(2),,i(k)}J\equiv\{i(1),i(2),\dots,i(k)\} be the set of indices corresponding to the s0s_{0} largest entries.

Set to zero all the entries Gi,j{\mathbf{G}}_{i,j} of G{\mathbf{G}} unless i,jJi,j\in J, and estimate v{\mathbf{v}} with the principal eigenvector of the resulting matrix.

At the same time, this result is not as strong as might have been expected. By searching exhaustively over all possible supports of size s0s_{0} (a method that has complexity of order ps0p^{s_{0}}) the correct support can be identified with high probability as soon as ns0logpn\gtrsim s_{0}\log p. No method can succeed for much smaller nn, because of information theoretic obstructions. We refer the reader to [AW09] for more details.

Over the last ten years, a significant effort has been devoted to developing practical algorithms that outperform diagonal thresholding, see e.g. [MWA05, ZHT06, dEGJL07, dBG08, WTH09]. In particular, [dEGJL07] developed a promising M-estimator based on a semidefinite programming (SDP) relaxation. [AW09] also carried out an analysis of this method and proved that, ifThroughout the paper, we denote by KK constants that can depend on problem parameters rr and β\beta. We denote by upper case CC (lower case cc) generic absolute constants that are bigger (resp. smaller) than 1, but which might change from line to line. (i) nK(β)s0log(ps0)pn\geq K(\beta)\,s_{0}\log(p-s_{0})p, and (ii) the SDP solution has rank one, then the SDP relaxation provides a consistent estimator of the support of v{\mathbf{v}}.

At first sight, this appears as a satisfactory solution of the original problem. No procedure can estimate the support of v{\mathbf{v}} from less than s0logps_{0}\log p samples, and the SDP relaxation succeeds in doing it from –at most– a constant factor more samples. This picture was upset by a recent, remarkable result by [KNV13] who showed that the rank-one condition assumed by Amini and Wainwright does not hold for ns0(n/logp)\sqrt{n}\lesssim s_{0}\lesssim(n/\log p). This result is consistent with recent work of [BR13] demonstrating that sparse PCA cannot be performed in polynomial time in the regime s0ns_{0}\gtrsim\sqrt{n}, under a certain computational complexity conjecture for the so-called planted clique problem.

In summary, the sparse PCA problem demonstrates a fascinating interplay between computational and statistical barriers.

and disregarding computational considerations, the support of v{\mathbf{v}} can be estimated consistently if and only if s0n/logps_{0}\lesssim n/\log p. This can be done, for instance, by exhaustive search over all the (ps0)\binom{p}{s_{0}} possible supports of v{\mathbf{v}}. We refer to [VL12, CMW+13] for a minimax analysis.

the problem appears to be much more difficult. There is rigorous evidence [BR13, MW+15b, MW15a, WBS14] that no polynomial algorithm can reconstruct the support unless s0ns_{0}\lesssim\sqrt{n}. On the positive side, a very simple algorithm (Johnstone and Lu’s diagonal thresholding) succeeds for s0n/logps_{0}\lesssim\sqrt{n/\log p}.

Of course, several elements are still missing in this emerging picture. In the present paper we address one of them, providing an answer to the following question:

Is there a polynomial time algorithm that is guaranteed to solve the sparse PCA problem with high probability for n/logps0n\sqrt{n/\log p}\lesssim s_{0}\lesssim\sqrt{n}?

We answer this question positively by analyzing a covariance thresholding algorithm that proceeds, briefly, as follows. (A precise, general definition, with some technical changes is given in the next section.)

Form the empirical covariance matrix G{\mathbf{G}} and set to zero all its entries that are in modulus smaller than τ/n\tau/\sqrt{n}, for τ\tau a suitably chosen constant.

Compute the principal eigenvector v^1\mathbf{\widehat{v}}_{1} of this thresholded matrix.

Estimate the support of v{\mathbf{v}} by thresholding Gv^1{\mathbf{G}}\mathbf{\widehat{v}}_{1}.

Such a covariance thresholding approach was proposed in [KNV13], and is in turn related to earlier work by [BL08b, CZZ+10]. The formulation discussed in the next section presents some technical differences that have been introduced to simplify the analysis. Notice that, to simplify proofs, we assume s0s_{0} to be known: this issue is discussed in the next two sections.

The rest of the paper is organized as follows. In the next section we provide a detailed description of the algorithm and state our main results. The proof strategy for our results is explained in Section 3. Our theoretical results assume full knowledge of problem parameters for ease of proof. In light of this, in Section 4 we discuss a practical implementation of the same idea that does not require prior knowledge of problem parameters, and is data-driven. We also illustrate the method through simulations. The complete proofs are in Sections 5, 7 and 6 respectively.

A preliminary version of this paper appeared in [DM14]. This paper extends significantly the results in [DM14]. In particular, by following an analogous strategy, we improve greatly the bounds obtained by [DM14]. This significantly improves the regimes of (s0,p,n)(s_{0},p,n) on which we can obtain non-trivial results. The proofs follow a similar strategy but are, correspondingly, more careful.

Algorithm and main results

We provide a detailed description of the covariance thresholding algorithm for the general model (1) in Table 1. For notational convenience, we shall assume that 2n2n sample vectors are given (instead of nn): {xi}1i2n\{{\mathbf{x}}_{i}\}_{1\leq i\leq 2n}.

We start by splitting the data into two halves: (xi)1in({\mathbf{x}}_{i})_{1\leq i\leq n} and (xi)n<i2n({\mathbf{x}}_{i})_{n<i\leq 2n} and compute the respective sample covariance matrices G{\mathbf{G}} and G{\mathbf{G}}^{\prime} respectively. Define Σ{\mathbf{\Sigma}} to be the population covariance minus identity, i.e.

Throughout, we let Qq{\sf Q}_{q} and sqs_{q} denote the support of vq{\mathbf{v}}_{q} and its size respectively, for q{1,2,,r}q\in\{1,2,\dots,r\}. We further let Q=q=1rQq{\sf Q}=\cup_{q=1}^{r}{\sf Q}_{q} and s0=Qs_{0}=|{\sf Q}|. The matrix G{\mathbf{G}} is used, in steps 11 to 44 to obtain a good estimate η(Σ^)\eta({\mathbf{\widehat{\Sigma}}}) for the low rank part of the population covariance Σ{\mathbf{\Sigma}}. The algorithm first computes Σ^{\mathbf{\widehat{\Sigma}}}, a centered version of the empirical covariance as follows:

where G=n1inxixiT{\mathbf{G}}=n^{-1}\sum_{i\leq n}{\mathbf{x}}_{i}{\mathbf{x}}_{i}^{\sf T} is the sample covariance matrix.

In step 55 of the algorithm, this estimate is used to construct good estimates v^q\mathbf{\widehat{v}}_{q} of the eigenvectors vq{\mathbf{v}}_{q}. Finally, in step 66, these estimates are combined with the (independent) second half of the data G{\mathbf{G}}^{\prime} to construct estimators Q^q{\widehat{\sf Q}}_{q} for the support of the individual eigenvectors vq{\mathbf{v}}_{q}. In the first two subsections we will focus on the estimation of Σ{\mathbf{\Sigma}} and the individual principal components. Our results on support recovery are provided in the final subsection.

Our first result bounds the estimation error of the soft thresholding procedure in operator norm.

There exist numerical constants C1,C2,C>0C_{1},C_{2},C>0 such that the following happens. Assume n>Clogpn>C\log p, n>s02n>s_{0}^{2} and let τ=C1(β1)log(p/s02)\tau_{*}=C_{1}(\beta\vee 1)\sqrt{\log(p/s_{0}^{2})}. We keep the thresholding level τ\tau according to

[BL08a, BL08b, Kar08, CZZ+10, CL11] considered the operator norm error of thresholding estimators for structured covariance matrices. Specializing to our case of exact sparsity, the result of [BL08a] implies that, with high probability:

Theorem 1 ensures consistency under a weaker sparsity condition, viz. s02/n0s_{0}^{2}/n\to 0 is sufficient. Also, the resulting rate depends on log(p/s02)\log(p/s_{0}^{2}) instead of logp\log p. In other words, in order to achieve η(Σ^)Σop<ε\|\eta({\mathbf{\widehat{\Sigma}}})-{\mathbf{\Sigma}}\|_{op}<{\varepsilon} for a fixed ε{\varepsilon}, it is sufficient s0εns_{0}\lesssim{\varepsilon}\sqrt{n} as opposed to s0n/logps_{0}\lesssim\sqrt{n/\log p}.

Crucially, in this regime for s0=Θ(εn)s_{0}=\Theta({\varepsilon}\sqrt{n}), Theorem 1 suggests a threshold of order τ=Θ(log(1/ε))\tau=\Theta(\sqrt{\log(1/{\varepsilon})}) as opposed to τ=C1logp\tau=C_{1}\sqrt{\log p} which is used in [BL08a, CZ+12]. As we will see in Section 3, this regime mathematically more challenging than the one of [BL08a, CZ+12]. By setting τ=C1logp\tau=C_{1}\sqrt{\log p} for a large enough constant C1C_{1}, all the entries of Σ^{\mathbf{\widehat{\Sigma}}} outside the support of Σ{\mathbf{\Sigma}} are set to . In contrast, a large part of our proof is devoted to control the operator norm of the noise part of Σ^{\mathbf{\widehat{\Sigma}}}.

2 Estimating the principal components

We next turn to the question of estimating the principal components v1,vr{\mathbf{v}}_{1},\dots\mathbf{v}_{r}. Of course, these are not identifiable if there are degeneracies in the population eigenvalues β1,β2,,βr\beta_{1},\beta_{2},\dots,\beta_{r}. We thus introduce the following identifiability condition.

The spike strengths β1>β2>βr\beta_{1}>\beta_{2}>\dots\beta_{r} are all distinct. We denote by βmax(β1,,βr)\beta\equiv\max(\beta_{1},\dots,\beta_{r}) and βminminqq(β1β2,β2β3,,βr)\beta_{\rm min}\equiv\min_{q\neq q^{\prime}}(\beta_{1}-\beta_{2},\beta_{2}-\beta_{3},\dots,\beta_{r}). Namely, β\beta is the largest signal strength and βmin\beta_{\rm min} is the minimum gap.

Notice the minimization over the sign s{+1,1}s\in\{+1,-1\}. This is required because the principal components v1,,vr{\mathbf{v}}_{1},\dots,{\mathbf{v}}_{r} are only identifiable up to a sign. Analogous results can obtained for alternate loss functions such as the projection distance:

The theorem below is an immediate consequence of Theorem 1. In particular, it uses the guarantee of Theorem 1 to show that the corresponding principal components of η(Σ^)\eta({\mathbf{\widehat{\Sigma}}}) provide good estimates of the principal components vq,1qr{\mathbf{v}}_{q},1\leq q\leq r.

There exists a numerical constant CC such that the following holds. Suppose that Assumption A1 holds in addition to the conditions n>Clogpn>C\log p, s02<ns_{0}^{2}<n, and s02<p/es_{0}^{2}<p/e. Set τ\tau as according to Theorem 1, and let v^1,,v^r\mathbf{\widehat{v}}_{1},\dots,\mathbf{\widehat{v}}_{r} denote the rr principal eigenvectors of η(Σ^;τ/n)\eta({\mathbf{\widehat{\Sigma}}};\tau/\sqrt{n}). Then, with probability 1o(1)1-o(1)

Let Δη(Σ^;τ/n)Σ{\mathbf{\Delta}}\equiv\eta({\mathbf{\widehat{\Sigma}}};\tau/\sqrt{n})-{\mathbf{\Sigma}}. By Davis-Kahn sin-theta theorem [DK70], we have, for βmin>Δop\beta_{\rm min}>\|{\mathbf{\Delta}}\|_{op},

For βmin2>2C(s02(β21)/n)log(p/s02)\beta_{\rm min}^{2}>2C(s_{0}^{2}(\beta^{2}\vee 1)/n)\,\log(p/s_{0}^{2}), the claim follows by using Theorem 1. If βmin22C(s02(β21)/n)log(p/s02)\beta_{\rm min}^{2}\leq 2C(s_{0}^{2}(\beta^{2}\vee 1)/n)\,\log(p/s_{0}^{2}), the claim is obviously true since L(v^q,vq)1L(\mathbf{\widehat{v}}_{q},{\mathbf{v}}_{q})\leq 1 always. ∎

3 Support recovery

Finally, we consider the question of support recovery of the principal components vq{\mathbf{v}}_{q}. The second phase of our algorithm aims at estimating union of the supports Q=Q1Qr{\sf Q}={\sf Q}_{1}\cup\dots\cup{\sf Q}_{r} from the estimated principal components v^q\mathbf{\widehat{v}}_{q}. Note that, although v^q\mathbf{\widehat{v}}_{q} is not even expected to be sparse, it is easy to see that the largest entries of v^q\mathbf{\widehat{v}}_{q} should have significant overlap with supp(vq){\rm supp}({\mathbf{v}}_{q}). Step 6 of the algorithm exploit this property to construct a consistent estimator Q^q{\widehat{\sf Q}}_{q} of the support of the spike vq{\mathbf{v}}_{q}.

We will require the following assumption to ensure support recovery.

There exist constants θ,γ>0\theta,\gamma>0 such that the following holds. The non-zero entries of the spikes satisfy vq,iθ/s0|v_{q,i}|\geq\theta/\sqrt{s_{0}} for all iQqi\in{\sf Q}_{q}. Further, for any q,qq,q^{\prime} vq,i/vq,iγ\lvert{v_{q,i}/v_{q^{\prime},i}}\rvert\leq\gamma for every iQqQqi\in{\sf Q}_{q}\cap{\sf Q}_{q^{\prime}}. Without loss of generality, we will assume γ1\gamma\geq 1.

Assume the spiked covariance model of Eq. (1) satisfying assumptions A1 and A2, and further n>Clogpn>C\log p, s02<ns_{0}^{2}<n, and s02<p/es_{0}^{2}<p/e for CC a large enough numerical constant. Consider the Covariance Thresholding algorithm of Table 1, with τ\tau as in Theorem 1 ρ=βminθ/(2s0)\rho=\beta_{\rm min}\theta/(2\sqrt{s_{0}}).

Then there exists K0=K0(θ,γ,β,βmin)K_{0}=K_{0}(\theta,\gamma,\beta,\beta_{\rm min}) such that, if

then the algorithm recovers the union of supports of vq{\mathbf{v}}_{q} with probability 1o(1)1-o(1) (i.e. we have Q^=Q{\widehat{\sf Q}}={\sf Q}).

The proof in Section 7 also provides an explicit expression for the constant K0K_{0}.

In Assumption A2, the requirement on the minimum size of vq,i|v_{q,i}| is standard in support recovery literature [Wai09, MB06]. Additionally, however, we require that when the supports of vq,vq{\mathbf{v}}_{q},{\mathbf{v}}_{q^{\prime}} overlap, they are of the same order, quantified by the parameter γ\gamma. Relaxing this condition is a potential direction for future work.

Recovering the signed supports Qq,+={i[p]:vq,i>0}{\sf Q}_{q,+}=\{i\in[p]:v_{q,i}>0\} and Qq,={i[p]:vq,i<0}{\sf Q}_{q,-}=\{i\in[p]:v_{q,i}<0\}, up to a sign flip, is possible using the same technique as recovering the supports supp(vq){\rm supp}({\mathbf{v}}_{q}) above, and poses no additional difficulty.

Algorithm intuition and proof strategy

Recall that Σ^=n1XTXIp=GIp{\mathbf{\widehat{\Sigma}}}=n^{-1}{\mathbf{X}}^{\sf T}{\mathbf{X}}-{\rm I}_{p}={\mathbf{G}}-{\rm I}_{p}. For β>p/n\beta>\sqrt{p/n}, the principal eigenvector of G{\mathbf{G}}, and hence of Σ^{\mathbf{\widehat{\Sigma}}} is positively correlated with v{\mathbf{v}}, i.e. v^1(Σ^),v|\langle\mathbf{\widehat{v}}_{1}({\mathbf{\widehat{\Sigma}}}),{\mathbf{v}}\rangle| is bounded away from zero. However, for β<p/n\beta<\sqrt{p/n}, the noise component in Σ^{\mathbf{\widehat{\Sigma}}} dominates and the two vectors become asymptotically orthogonal, i.e. for instance limnv^1(Σ^),v=0\lim_{n\to\infty}|\langle\mathbf{\widehat{v}}_{1}({\mathbf{\widehat{\Sigma}}}),{\mathbf{v}}\rangle|=0. In order to reduce the noise level, we must exploit the sparsity of the spike v{\mathbf{v}}.

Now, letting ββu2/nβ\beta^{\prime}\equiv\beta\|{\mathbf{u}}\|^{2}/n\approx\beta, and wβZTu/n{\mathbf{w}}\equiv\sqrt{\beta}{\mathbf{Z}}^{{\sf T}}{\mathbf{u}}/n, we can rewrite Σ^{\mathbf{\widehat{\Sigma}}} as

Consider again the decomposition (16). Since the soft thresholding function η(z;τ/n)\eta(z;\tau/\sqrt{n}) is affine when zτ/n|z|\geq\tau/\sqrt{n}, we would expect that the following decomposition holds approximately (for instance, in operator norm):

Since ββ\beta^{\prime}\approx\beta and each entry of vvT{\mathbf{v}}{\mathbf{v}}^{\sf T} has magnitude at least θ2/s0\theta^{2}/s_{0}, the first term is still approximately rank one, with

This is straightforward to see since soft thresholding introduces a maximum bias of τ/n\tau/\sqrt{n} per entry of the matrix, while the factor s0s_{0} comes due to the support size of vvT{\mathbf{v}}{\mathbf{v}}^{\sf T} (see Proposition 6.2 below for a rigorous argument).

The main technical challenge now is to control the operator norm of the perturbation η(ZTZ/nIp)\eta({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}). We know that η(ZTZ/nIp)\eta({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}) has entries of variance δ(τ)/n\delta(\tau)/n, for δ(τ)exp(cτ2)\delta(\tau)\approx\exp(-c\tau^{2}). If entries were independent with mild tail conditions, this would imply –with high probability–

for some constant CC. Combining the bias bound from Eq. (18) and the heuristic decomposition of Eq. (19) with the decomposition (17) results in the bound

Our analysis formalizes this argument and shows that such a bound is essentially correct when pCnp\leq C\,n. A modified bound is proved for p>Cnp>C\,n (see Theorem 4 for a general statement).

The key technical challenge in our proof is the analysis of the operator norm of such matrices, when fnf_{n} is the soft-thresholding function, with threshold of order 1/n1/\sqrt{n}. Earlier results are not general enough to cover this case. [EK10a, EK10b] provide conditions under which the spectrum of fn(ZTZ/nIp)f_{n}({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}) is close to a rescaling of the spectrum of (ZTZ/nIp)({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}). We are interested instead in a different regime in which the spectrum of fn(ZTZ/nIp)f_{n}({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}) is very different from the one of (ZTZ/nIp)({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}). [CS13] consider nn-dependent kernels, but their results are asymptotic and concern the weak limit of the empirical spectral distribution of fn(ZTZ/nIp)f_{n}({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}). This does not yield an upper bound on the spectral norm of fn(ZTZ/nIp)f_{n}({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}). Finally, [FM15] consider the spectral norm of kernel random matrices for smooth kernels ff, only in the proportional regime n/pc(0,)n/p\to c\in(0,\infty).

Our approach to proving Theorem 1 follows instead the ε{\varepsilon}-net method: we develop high probability bounds on the maximum Rayleigh quotient:

Unfortunately, this turns out not to be true over the whole space of Z{\mathbf{Z}}, i.e. the Rayleigh quotient is not Lipschitz continuous in the underlying Gaussian variables Z{\mathbf{Z}}. Our approach, instead, shows that for typical values of Z{\mathbf{Z}}, we can control the gradient of y,η(ZTZ/nIp)y\langle{\mathbf{y}},\eta({\mathbf{Z}}^{\sf T}{\mathbf{Z}}/n-{\rm I}_{p}){\mathbf{y}}\rangle with respect to Z{\mathbf{Z}}, and extract the required concentration only from such local information of the function. This is formalized in our concentration lemma 5.4, which we apply extensively while proving Theorem 1. This lemma is a significantly improved version of the analogous result in [DM14].

Practical aspects and empirical results

Specializing to the rank one case, Theorems 2 and 3 show that Covariance Thresholding succeeds with high probability for a number of samples ns02n\gtrsim s_{0}^{2}, while Diagonal Thresholding requires ns02logpn\gtrsim s_{0}^{2}\log p. The reader might wonder whether eliminating the logp\log p factor has any practical relevance or is a purely conceptual improvement. Figure 1 presents simulations on synthetic data under the strictly sparse model, and the Covariance Thresholding algorithm of Table 1, used in the proof of Theorem 3. The objective is to check whether the logp\log p factor has an impact at moderate pp. We compare this with Diagonal Thresholding.

We plot the empirical success probability as a function of s0/ns_{0}/\sqrt{n} for several values of pp, with p=np=n. The empirical success probability was computed by using 100100 independent instances of the problem. A few observations are of interest: (i)(i) Covariance Thresholding appears to have a significantly larger success probability in the ‘difficult’ regime where Diagonal Thresholding starts to fail; (ii)(ii) The curves for Diagonal Thresholding appear to decrease monotonically with pp indicating that s0s_{0} proportional to n\sqrt{n} is not the right scaling for this algorithm (as is known from theory); (iii)(iii) In contrast, the curves for Covariance Thresholding become steeper for larger pp, and, in particular, the success probability increases with pp for s01.1ns_{0}\leq 1.1\sqrt{n}. This indicates a sharp threshold for s0=constns_{0}={\rm const}\cdot\sqrt{n}, as suggested by our theory.

In terms of practical applicability, our algorithm in Table 1 has the shortcomings of requiring knowledge of problem parameters s0,β,θs_{0},\beta,\theta. Furthermore, the thresholds ρ,τ\rho,\tau suggested by theory need not be optimal. We next describe a principled approach to estimating (where possible) the parameters of interest and running the algorithm in a purely data-dependent manner. Assume the following model, for i[n]i\in[n]

We let μ^=i=1nxi/n\widehat{\boldsymbol{\mu}}=\sum_{i=1}^{n}{\mathbf{x}}_{i}/n be the empirical mean estimate for μ\mu. Further letting X=X1μ^T\overline{\mathbf{X}}={\mathbf{X}}-\mathbf{1}\widehat{{\boldsymbol{\mu}}}^{\sf T} we see that pn(qkq)npnpn-(\sum_{q}k_{q})n\approx pn entries of X\overline{\mathbf{X}} are mean and variance σ2\sigma^{2}. We let σ^=MAD(X)/ν\widehat{\sigma}={{\rm MAD}(\overline{\mathbf{X}})}/{\nu} where MAD(){\rm MAD}(\,\cdot\,) denotes the median absolute deviation of the entries of the matrix in the argument, and ν\nu is a constant scale factor. Guided by the Gaussian case, we take ν=Φ1(3/4)0.6745\nu=\Phi^{-1}(3/4)\approx 0.6745.

We define Σ^=XTX/nσ^2Ip{\mathbf{\widehat{\Sigma}}}=\overline{\mathbf{X}}^{\sf T}\overline{\mathbf{X}}/n-\widehat{\sigma}^{2}{\rm I}_{p} and soft threshold it to get η(Σ^)\eta({\mathbf{\widehat{\Sigma}}}) using τ\tau as above. Our proof of Theorem 2 relies on the fact that η(Σ^)\eta({\mathbf{\widehat{\Sigma}}}) has rr eigenvalues that are separated from the bulk of the spectrum. Hence, we estimate rr using r^\widehat{r}: the number of eigenvalues separated from the bulk in η(Σ^)\eta({\mathbf{\widehat{\Sigma}}}). The edge of the spectrum can be computed numerically using the Stieltjes transform method as in [CS13].

Let v^q\mathbf{\widehat{v}}_{q} denote the qthq^{\text{th}} eigenvector of η(Σ^)\eta({\mathbf{\widehat{\Sigma}}}). Our theoretical analysis indicates that v^q\mathbf{\widehat{v}}_{q} is expected to be close to vq{\mathbf{v}}_{q}. In order to denoise v^q\mathbf{\widehat{v}}_{q}, we assume Σ^v^q(1δ)vq+εq{\mathbf{\widehat{\Sigma}}}\mathbf{\widehat{v}}_{q}\approx(1-\delta){\mathbf{v}}_{q}+{\boldsymbol{{\varepsilon}}}_{q}, where εq{\boldsymbol{{\varepsilon}}}_{q} is additive random noise (perhaps with some sparse corruptions). We then threshold Σ^vq{\mathbf{\widehat{\Sigma}}}{\mathbf{v}}_{q} ‘at the noise level’ to recover a better estimate of vq{\mathbf{v}}_{q}. To do this, we estimate the standard deviation of the “noise” ε{\boldsymbol{{\varepsilon}}} by σε^=MAD(v^q)/ν\widehat{\sigma_{{\boldsymbol{{\varepsilon}}}}}={{\rm MAD}(\mathbf{\widehat{v}}_{q})}/{\nu}. Here we set –again guided by the Gaussian heuristic– ν0.6745\nu\approx 0.6745. Since vq{\mathbf{v}}_{q} is sparse, this procedure returns a good estimate for the size of the noise deviation. We let v^q\mathbf{\widehat{v}}^{\prime}_{q} denote the vector obtained by hard thresholding v^q\mathbf{\widehat{v}}_{q}: set v^i=v^q,i if v^q,iνσ^εq\mathbf{\widehat{v}}^{\prime}_{i}=\mathbf{\widehat{v}}_{q,i}\text{ if }\lvert{\widehat{v}_{q,i}}\rvert\geq\nu^{\prime}\widehat{\sigma}_{{\boldsymbol{{\varepsilon}}}_{q}} and 0 otherwise.0\text{ otherwise.} We then let v^q=v^q/v^q\mathbf{\widehat{v}}^{*}_{q}=\mathbf{\widehat{v}}^{\prime}_{q}/\lVert{\mathbf{\widehat{v}}^{\prime}_{q}}\rVert and return v^q\mathbf{\widehat{v}}^{*}_{q} as our estimate for vq{\mathbf{v}}_{q}.

Note that –while different in several respects– this empirical approach shares the same philosophy of the algorithm in Table 1. On the other hand, the data-driven algorithm presented in this section is less straightforward to analyze, a task that we defer to future work.

Figure 1 also shows results of a support recovery experiment using the ‘data-driven’ version of this section. Covariance thresholding in this form also appears to work for supports of size s0constns_{0}\leq\text{const}\sqrt{n}. Figure 2 shows the performance of vanilla PCA, Diagonal Thresholding and Covariance Thresholding on the “Three Peak” example of [JL04]. This signal is sparse in the wavelet domain and the simulations employ the data-driven version of covariance thresholding. A similar experiment with the “box” example of Johnstone and Lu is provided in Figure 3. These experiments demonstrate that, while for large values of nn both Diagonal Thresholding and Covariance Thresholding perform well, the latter appears superior for smaller values of nn.

Proof preliminaries

In this section we review some notation and preliminary facts that we will use throughout the paper.

We let Qq{\sf Q}_{q} denotes the support of the qthq^{\text{th}} spike vq{\mathbf{v}}_{q}. Also, we denote the union of the supports of vq{\mathbf{v}}_{q} by Q=qQq{\sf Q}=\cup_{q}{\sf Q}_{q}. The complement of a set E[n]E\in[n] is denoted by EcE^{c}.

We write η(;)\eta(\cdot;\cdot) for the soft-thresholding function. By η(;τ)\partial\eta(\cdot;\tau) we denote the derivative of η(;τ)\eta(\cdot;\tau) with respect to the first argument, which exists Lebesgue almost everywhere. To simplify the notation, we omit the second argument when it is understood from context.

In the statements of our results, consider the limit of large pp and large nn with certain conditions on p,np,n (as in Theorem 1). This limit will be referred to either as “nn large enough” or “pp large enough” where the phrase “large enough” indicates dependence of pp (and thereby nn) on specific problem parameters.

The Gaussian distribution function will be denoted by Φ(x)=xet2/2dt/2π\Phi(x)=\int_{-\infty}^{x}e^{-t^{2}/2}\,{\rm d}t/\sqrt{2\pi}.

2 Preliminary facts

A subset Tε(X)XT^{\varepsilon}(X)\subseteq X is called an ε{\varepsilon}-net of XX if every point in XX may be approximated by one in Tε(X)T^{\varepsilon}(X) with error at most ε{\varepsilon}. More precisely:

The minimum cardinality of an ε{\varepsilon}-net of XX, if finite, is called its covering number.

The following two facts are useful while using ε{\varepsilon}-nets to bound the spectral norm of a matrix. For proofs, we refer the reader to [Ver12, Lemmas 5.2, 5.4].

Let Sn1S^{n-1} be the unit sphere in nn dimensions. Then there exists an ε{\varepsilon}-net of Sn1S^{n-1}, Tε(Sn1)T^{\varepsilon}(S^{n-1}) satisfying:

The lemma then follows as x,A(xx)x+xAxx2εA\lvert{\langle{\mathbf{x}},{\mathbf{A}}({\mathbf{x}}-{\mathbf{x}}_{*})\rangle}\rvert\leq\lVert{{\mathbf{x}}+{\mathbf{x}}_{*}}\rVert\lVert{{\mathbf{A}}}\rVert\lVert{{\mathbf{x}}-{\mathbf{x}}_{*}}\rVert\leq 2{\varepsilon}\lVert{{\mathbf{A}}}\rVert. ∎

We now state a general concentration lemma. This will be our basic tool to establish Theorem 2, and thereby Theorem 3.

Here z{\mathbf{z}}^{\prime} is an independent copy of z{\mathbf{z}}.

We first implement the symmetrization. Note that:

Here we use Jensen’s inequality with the monotonicity of ϕ()\phi(\cdot) to obtain (a)(a) and with the convexity of ϕ()\phi(\cdot) to obtain (b)(b).

Now we choose ϕ(z)=(za)+\phi(z)=(z-a)_{+}, for a=Δ2/2a=\Delta^{2}/2.

Here (a)(a) is Markov’s inequality, and (b)(b) is the symmetrization bound Eq. (33), where we use the fact that ϕ(z)=(za)+\phi(z)=(z-a)_{+} is non-decreasing and convex in zz.

Define the set Gθ={(z,z):maxsFs(z(θ))L}\mathcal{G}_{\theta}=\{({\mathbf{z}},{\mathbf{z}}^{\prime}):\max_{s}\lVert{{\nabla}F_{s}({\mathbf{z}}(\theta))}\rVert\leq L\}. Then:

Here (a)(a) follows as GθG\mathcal{G}_{\theta}\supseteq\mathcal{G}. Equality (b)(b) follows from noting that Gθ\mathcal{G}_{\theta} is measurable with respect to z(θ){\mathbf{z}}(\theta) and, hence, first integrating with respect to z˙(θ)=zcosθzsinθ{\dot{\mathbf{z}}}(\theta)={\mathbf{z}}\cos\theta-{\mathbf{z}}^{\prime}\sin\theta, a Gaussian random variable that is independent of z(θ){\mathbf{z}}(\theta). The final inequality (c)(c) follows by using the fact that Fs(z(θ))L\lVert{{\nabla}F_{s}({\mathbf{z}}(\theta))}\rVert\leq L on the set Gθ\mathcal{G}_{\theta}.

Since this bound is uniform over sSs\in S, we can use it in (41):

We can now set λ=4a/π2L2\lambda=4\sqrt{a}/\pi^{2}L^{2}, to obtain the exponent above as 2a/π2L2=Δ2/π2L2-2a/\pi^{2}L^{2}=-\Delta^{2}/\pi^{2}L^{2}. The prefactor (1+λa)λ2(1+\lambda\sqrt{a})\lambda^{-2} is bounded by CL2max(,L2/Δ2)CL^{2}\max(,L^{2}/\Delta^{2}) when a=Δ2/2a=\Delta^{2}/2. Therefore, as required, we obtain:

Combining this with Eq. (38) and the fact that ϕ(Δ2)1CΔ2\phi(\Delta^{2})^{-1}\leq C\Delta^{-2} gives Eq. (28) and, consequently, the lemma. ∎

By a simple application of Cauchy-Schwarz, this lemma implies the following.

The following two lemmas are well-known concentration of measure results. The forms below can be found in [Ver12, Corollary 5.35], [LM00, Lemma 1] respectively.

Let zN(0,IN){\mathbf{z}}\sim{\sf N}(0,{\rm I}_{N}). Then

Proof of Theorem 1

Since Σ^=XTX/nIp{\mathbf{\widehat{\Sigma}}}={\mathbf{X}}^{{\sf T}}{\mathbf{X}}/n-{\rm I}_{p}, we have:

We let D={(i,i):i[p]Q}{\sf D}=\{(i,i):i\in[p]\setminus{\sf Q}\} be the diagonal entries not included in any support. (Recall that Q=qQq{\sf Q}=\cup_{q}{\sf Q}_{q} denote the union of the supports.) Further let E=Q×Q{\sf E}={\sf Q}\times{\sf Q}, F=(Qc×Qc)\D{\sf F}=({\sf Q}^{c}\times{\sf Q}^{c})\backslash{\sf D}, and G=[p]×[p]\(DEF){\sf G}=[p]\times[p]\backslash({\sf D}\cup{\sf E}\cup{\sf F}), or, equivalently G=(Q×Qc)(Qc×Q){\sf G}=({\sf Q}\times{\sf Q}^{c})\cup({\sf Q}^{c}\times{\sf Q}). Since these are disjoint we have:

The first term corresponds to the ‘signal’ component, while the last three terms correspond to the ‘noise’ component.

Theorem 1 is a direct consequence of the next five propositions. The first demonstrates that, even for a low level of thresholding, viz. τ<logp/2\tau<\sqrt{\log p}/2, the term N\mathbf{N} has small operator norm. The second demonstrates that the soft thresholding operation preserves the signal in the term S\mathbf{S}. The next two propositions show that the cross and diagonal terms C\mathbf{C} and D\mathbf{D} are negligible as well. Finally, in the last proposition, we demonstrate that, for the regime of thresholding far above the noise level, i.e. τ>Clogp\tau>C\sqrt{\log p}, the noise terms N\mathbf{N} and C\mathbf{C} vanish entirely.

Let N{\mathbf{N}} denote the second term of Eq. (54). Since F=Qc×Qc\D{\sf F}={\sf Q}^{c}\times{\sf Q}^{c}\backslash{\sf D},

Then, there exists an absolute constant CC such that the following happens. Assuming that (i)(i) τ<logp/2\tau<\sqrt{\log p}/2 and (ii)(ii) n>Clogpn>C\log p, then with probability 1o(1)1-o(1)

Let S{\mathbf{S}} denote the first term in Eq. (54):

Assume that (i)(i) s0/n<1s_{0}/n<1 and (ii)n>Clogp(ii)n>C\log p: Then with probability 1o(1)1-o(1):

Let C\mathbf{C} denote the matrix corresponding to the third term of Eq. (54):

Assuming the conditions of Proposition 6.1 and, additionally, that s02ps_{0}^{2}\leq p, there exist constants C,cC,c such that with probability 1o(1)1-o(1)

Let D\mathbf{D} denote the matrix corresponding to the third term of Eq. (54):

With probability 1o(1)1-o(1) we have that DopCn1logp\lVert{\mathbf{D}}\rVert_{op}\leq C\sqrt{n^{-1}\log p}.

For some absolute constant C0C_{0}, we have for τC0(β1)logp\tau\geq C_{0}(\beta\vee 1)\sqrt{\log p} that, with probability 1o(1)1-o(1):

Therefore, Nop=0\lVert{\mathbf{N}}\rVert_{op}=0 and Cop=0\lVert{\mathbf{C}}\rVert_{op}=0.

At this point we remark that the probability 1o(1)1-o(1) can be made quantitative, for e.g. of the form 1exp(min(p,n)/C1)1-\exp(-\min(\sqrt{p},n)/C_{1}), for every nn large enough. For simplicity of exposition we do not pursue this in the paper.

We defer the proofs of Propositions 6.1, 6.2, 6.3, 6.4 and 6.5 to Sections 6.1, 6.2, 6.3, 6.4 and 6.5 respectively. By combining them for β=O(1)\beta=O(1), we immediately obtain the following bound.

There exist numerical constants C0,C1C_{0},C_{1} such that the following happens. Assume βC0\beta\leq C_{0}, n>C1logpn>C_{1}\log p and τlogp/2\tau\leq\sqrt{\log p}/2. Then with probability 1o(1)1-o(1):

The proof is obtained by adding the error terms from Propositions 6.1, 6.2, 6.3 and 6.4, and noting that β\beta is bounded. ∎

Using Propositions 6.1, 6.2, 6.3 and 6.4, together with a suitable choice of τ\tau, we obtain the proof of Theorem 1.

Note that in the case s02>p/es_{0}^{2}>p/e there is no thresholding and hence the result follows from the fact that Σ^ΣopCp/n\|{\mathbf{\widehat{\Sigma}}}-{\mathbf{\Sigma}}\|_{op}\leq C\sqrt{p/n} [Ver12, Remark 5.40].

We assume now that s02p/es_{0}^{2}\leq p/e and the case that τ=C1(β1)log(p/s02)logp/2\tau_{*}=C_{1}(\beta\vee 1)\sqrt{\log(p/s_{0}^{2})}\leq\sqrt{\log p}/2. In that case we set τ=τlogp/2\tau=\tau_{*}\leq\sqrt{\log p}/2. Below we will keep C1C_{1} a large enough constant, and check that each of the error terms in Propositions 6.1, 6.2, 6.3 and 6.4 is upper bounded by (a constant times) the right-hand side of Eq. (7). Throughout CC will denote a generic constant that can be made as large as we want, and can change from line to line.

where in the last step we used (es02/p),(s02/n)1(e\,s_{0}^{2}/p),(s_{0}^{2}/n)\leq 1.

From Proposition 6.3, we get, using the same argument as in Eq. (65)

Finally, the term of Proposition 6.4 is also bounded as desired using logps02log(p/s02)\log p\leq s_{0}^{2}\log(p/s_{0}^{2}) (dividing both sides by pp and using the fact that xxlog(1/x)x\mapsto x\log(1/x) is increasing).

The case of τlogp/2\tau_{*}\geq\sqrt{\log p}/2 is easier. In that case, we can keep τ=C2τ\tau=C_{2}\tau_{*} with C2C_{2} large enough so that τC0(β1)logp\tau\geq C_{0}(\beta\vee 1)\sqrt{\log p} for C0C_{0} of Proposition 6.5. Then, by Proposition 6.5, we know that N=0\mathbf{N}=0 and C=0\mathbf{C}=0. Therefore we only need consider the terms SΣ\mathbf{S}-{\mathbf{\Sigma}} and D\mathbf{D}. For these terms we can use Propositions 6.2 and 6.4 respectively and, arguing as in the earlier case τlogp\tau_{*}\leq\sqrt{\log p}, we obtain the desired result. ∎

Fix any A1A\geq 1. There exists an absolute constants C,cC,c such that:

Consider the good set G1\mathcal{G}_{1} given by:

which holds because of triangle inequality and the fact that t+1t2\sqrt{t}+\sqrt{1-t}\leq\sqrt{2}. We can now apply Lemma 5.4 to bound the RHS of Eq. (71) and get:

We can simplify the terms on the right-hand side to obtain the result of the lemma. With ε=1/4{\varepsilon}=1/4, Stirling’s approximation and Lemma 5.2 we have:

Assume that τlogp/2\tau\leq\sqrt{\log p}/2. Given A1A\geq 1 and a unit vector y{\mathbf{y}}, let S={i:yiA/p}{\sf S}=\{i:\lvert{y_{i}}\rvert\leq\sqrt{A/p}\} and yS,ySc{\mathbf{y}}_{{\sf S}},{\mathbf{y}}_{{\sf S}^{c}} denote the projections of y{\mathbf{y}} onto supports S,Sc{\sf S},{\sf S}^{c} respectively. We have:

whence, by triangle inequality and t(1t)<1\sqrt{t(1-t)}<1

We can use now Lemma 5.4, to get, for L1>0L_{1}>0 as defined above and any ΔL1\Delta\geq L_{1}:

where the last line follows by Cauchy-Schwarz, as in the proof of Lemma 6.7, and the fact that L1(np)C2L_{1}\geq(np)^{-C_{2}} using the upper bound τlogp/2\tau\leq\sqrt{\log p}/2.

By standard bounds h=4Φ(τ/22)2exp(τ2/16)h=4\Phi(-\tau/2\sqrt{2})\leq 2\exp(-\tau^{2}/16) and, as τlogp/2\tau\leq\sqrt{\log p}/2, h1/ph\geq 1/\sqrt{p}, we have

Combining this with Eq. (101) we now get:

Under the condition of G2\mathcal{G}_{2}, the gradient also satisfies, when evaluated at W=Z(t)=tZ+1tZ\mathbf{W}={\mathbf{Z}}(t)=\sqrt{t}{\mathbf{Z}}+\sqrt{1-t}{\mathbf{Z}}^{\prime}:

The rest of the proof is then the same as before. ∎

Given these lemmas, we can now establish Proposition 6.1.

With ε=1/4{\varepsilon}=1/4, the first term is controlled by Lemma 6.7 while the final two are controlled by Lemma 6.8. We choose ε=1/4{\varepsilon}=1/4 in Eq. (116), and

for large enough CC so that, using the bounds of Lemmas 6.7 and 6.8, we have:

This probability bound is o(1)o(1) provided AA is not too large: we choose A=0.25τexp(τ2/16)pA=0.25\sqrt{\tau\exp(\tau^{2}/16)}\ll\sqrt{p} which guarantees that the bound above is o(1)o(1) when n>Clogpn>C\log p for some CC large enough. This concludes the proposition. ∎

2 Proof of Proposition 6.2

We decompose the empirical covariance matrix (6) as

With a view to employing this inequality, we use Eq. (119) and the triangle inequality:

where the last line follows by noticing that the first term is supported on E{\sf E} of size s0×s0s_{0}\times s_{0} and then using bias bound Eq. (122) entry-wise. We next bound each of the three terns on the right hand side.

The last inequality follows from the Bai-Yin law on eigenvalues of Wishart matrices [Ver12, Corollary 5.35].

Consider the second term in Eq (125). By orthonormality of v1,,vr{\mathbf{v}}_{1},\dots,{\mathbf{v}}_{r}, the matrix Δ2{\mathbf{\Delta}}_{2} is orthogonally equivalent to BZQTU/n\mathbf{B}{\mathbf{Z}}_{{\sf Q}}^{{\sf T}}{\mathbf{U}}/n, where we recall that ZQ{\mathbf{Z}}_{{\sf Q}} denotes the submatrix of Z{\mathbf{Z}} formed by the columns in Q{\sf Q}. Denoting by PU{\mathbf{P}}_{U} the orthogonal projector onto the column space of U{\mathbf{U}}, we then have, with high probability,

Here the penultimate inequality follows by Lemma 5.6 noting that, by invariance under rotations (and since PU{\mathbf{P}}_{U} project onto a random subspace of rr dimensions independent of Z{\mathbf{Z}}), PUZQop\|{\mathbf{P}}_{U}{\mathbf{Z}}_{{\sf Q}}\|_{op} is distributed as the norm of a matrix with i.i.d. standard normal entries, with dimensions Q×r|{\sf Q}|\times r, Qs0|{\sf Q}|\leq s_{0}.

Finally, for the third term of Eq. (125) we use the Bai-Yin law of Wishart matrices [Ver12, Corollary 5.35] to obtain, with high probability:

Finally, substituting the above bounds in Eq. (125), we get

3 Proof of Proposition 6.3

Fix an A[1,p1/3]A\in[1,p^{1/3}], and let L=((β1)n+p)/n2L=\sqrt{((\beta\vee 1)n+p)/n^{2}}. Then, there exists an absolute constant C>0C>0 such that, for any Δ>0\Delta>0:

Let, as before, Tpε(S)T^{\varepsilon}_{p}({\sf S}) denote the ε{\varepsilon}-net of unit vectors supported on SQc{\sf S}\subset{\sf Q}^{c} of size at most p/Ap/A and let T=STpε(S)T=\cup_{{\sf S}}T^{\varepsilon}_{p}({\sf S}). Then, by Lemma 5.3, with ε=1/4{\varepsilon}=1/4:

Therefore, arguing as in proof of Proposition 6.1 (see Lemma 6.7):

Consider the good set G4\mathcal{G}_{4} of pairs \big{(}{\mathbf{Z}},{\mathbf{Z}}^{\prime}\big{)} satisfying:

For \big{(}({\mathbf{Z}},{\mathbf{Z}}^{\prime}\big{)}\in\mathcal{G}_{4}, and tt\in, define Z(t)=tZ+1tZ{\mathbf{Z}}(t)=\sqrt{t}{\mathbf{Z}}+\sqrt{1-t}{\mathbf{Z}}^{\prime}. Now Using Eqs. (140) and (142, the gradient w,Cˉy{\nabla}\langle\mathbf{w},\bar{\mathbf{C}}\mathbf{y}\rangle evaluated at Z(t){\mathbf{Z}}(t) satisfies:

Now applying Corollary 5.5, for L=C((β1)n+p)/n2L=C\sqrt{((\beta\vee 1)n+p)/n^{2}}:

Let ε=1/4{\varepsilon}=1/4, observing that TS:Sp/ATpε(S)T\subseteq\cup_{{\sf S}:|{\sf S}|\leq p/A}T^{\varepsilon}_{p}({\sf S}), we have the bound (using Lemma 5.2 and Stirling’s approximation):

Now we prove a similar lemma when y{\mathbf{y}} has entries that are “spread out”.

where L=Aexp(τ2/C(β1))(n(β1)+p)/n2L_{*}=\sqrt{A\exp(-\tau^{2}/C(\beta\vee 1))(n(\beta\vee 1)+p)/n^{2}}.

For simplicity of notation, it is convenient to introduce the vector y=yS{\mathbf{y}}^{\prime}={\mathbf{y}}_{{\sf S}}. Throughout the proof, we will use that y1\lVert{{\mathbf{y}}^{\prime}}\rVert\leq 1 and yA/p\lVert{{\mathbf{y}}^{\prime}}\rVert_{\infty}\leq\sqrt{A/p}. We compute the gradients as follows:

Combining the bounds in Eqs.(153), (156), we obtain

With K=Cβ1K=C\beta\vee 1, we define the good set G5\mathcal{G}_{5} of pairs (Z,Z)({\mathbf{Z}},{\mathbf{Z}}^{\prime}) satisfying

Define Z(t)=tZ+1tZ{\mathbf{Z}}(t)=\sqrt{t}{\mathbf{Z}}+\sqrt{1-t}{\mathbf{Z}}^{\prime} with (Z,Z)G5({\mathbf{Z}},{\mathbf{Z}}^{\prime})\in\mathcal{G}_{5}. By Eq. (158) the gradient evaluated at Z(t){\mathbf{Z}}(t) is bounded by

Hence on the good set G5\mathcal{G}_{5}, we have:

Therefore the gradient satisfies, on the good set:

where the last equality holds since UU{\mathbf{U}}\in{\mathcal{U}} and by tail bounds on chi-squared random variables. Further

By the above argument, the first term is at most exp(n/C)\exp(-n/C) and we turn to the second term. By the Chernoff bound

As before, we will let ε=1/4{\varepsilon}=1/4. The first term is controlled via Lemma 6.9, while the second is controlled by Lemma 6.10. We keep Δ=Δ\Delta=\Delta_{*} where

so that, via the bounds of Lemmas 6.9, 6.10 and that s02ps_{0}^{2}\leq p:

We now set A=\big{(}(\tau^{2}/K)\exp(\tau^{2}/K)\big{)}^{1/2} with K=C(β1)K=C(\beta\vee 1) for a suitable constant CC and, since τlogp/2\tau\leq\sqrt{\log p}/2, we get that Ap1/3A\leq p^{1/3}. Furthermore, it is straightforward to see that L(np)CL\geq(np)^{-C}, and this implies that

With this setting of AA, we get the form of Δ\Delta_{*} below, as required for the proposition.

4 Proof of Proposition 6.4

Since D\mathbf{D} is a diagonal matrix, its spectral norm is bounded by the maximum of its entries. This is easily done as, for every iQci\in{\sf Q}^{c}:

By the Chernoff bound for χ2\chi^{2}-squared random variables as in Lemma 5.7 followed by the union bound, with probability 1o(1)1-o(1):

for some absolute CC. Here we used the fact that (logp)/n<1(\log p)/n<1.

5 Proof of Proposition 6.5

It suffices to show that with probability 1o(1)1-o(1)

This is a standard argument [BL08b, Lemma A.3] where (following the dependence on β\beta) it suffices to take τC0(β1)logp\tau\geq C_{0}(\beta\vee 1)\sqrt{\log p} for C0C_{0} a sufficiently large absolute constant. We note here that the same can also be proved via the conditioning technique applied in the proofs of Propositions 6.1 and 6.3.

Proof of Theorems 3

We then have, for q{1,,r}q\in\{1,\dots,r\} and i{1,,p}i\in\{1,\dots,p\},

We next bound, with high probability, maxi,qTi,q(a)\max_{i,q}T^{(a)}_{i,q} for a{1,2,3}a\in\{1,2,3\}. Throughout we let εmaxq[r]v^qvq{\varepsilon}\equiv\max_{q\in[r]}\|\mathbf{\widehat{v}}_{q}-{\mathbf{v}}_{q}\|.

where in the last inequality we used q[r]qvq,v^q21vq,v^q2ε2/2\sum_{q^{\prime}\in[r]\setminus q}\langle{\mathbf{v}}_{q^{\prime}},\mathbf{\widehat{v}}_{q}\rangle^{2}\leq 1-\langle{\mathbf{v}}_{q},\mathbf{\widehat{v}}_{q}\rangle^{2}\leq{\varepsilon}^{2}/2.

Consider next the second term. Since ZijiidN(0,1)Z_{ij}\sim_{iid}{\sf N}(0,1), it follows that Ti,q(2)=Wi,qT^{(2)}_{i,q}=|W_{i,q}|, for Wi,qN(0,σi,q2)W_{i,q}\sim{\sf N}(0,\sigma^{2}_{i,q}) a Gaussian random variable with variance

By union bound over i[p]i\in[p], q[r]q\in[r] we obtain

Finally, consider the last term. By rotational invariance of Z{\mathbf{Z}}, the distribution of Ti,q(3)T^{(3)}_{i,q} only depends on the angle between ei{\mathbf{e}}_{i} and v^q\mathbf{\widehat{v}}_{q}. Calling this angle ϑ\vartheta, we have

Both of these terms have Bernstein-type tail bonds, whence

By putting together Eqs. (194), (199), (203), and using assumption A2, we get

Let Q^q={i[p]:  (Σ^v^q)iρ}{\widehat{\sf Q}}_{q}=\{i\in[p]:\;|({\mathbf{\widehat{\Sigma}}}^{\prime}\mathbf{\widehat{v}}_{q})_{i}|\geq\rho\}. We claim that the above implies that, with high probability, QqQ^qQ{\sf Q}_{q}\subseteq{\widehat{\sf Q}}_{q}\subseteq{\sf Q} for all qq.

where the last inequality follows from Eq. (14).

On the other hand, By Theorem 2 and using the assumption (14), we can guarantee

Hence for iQqi\in{\sf Q}_{q}, and considering –to be definite– vq,i>0v_{q,i}>0, we get

where, in the first inequality, we used vq,v^q1ε\langle{\mathbf{v}}_{q},\mathbf{\widehat{v}}_{q}\rangle\geq 1-{\varepsilon}.

This concludes the proof. Keeping track of the dependence on θ\theta, γ\gamma, β\beta, βmin\beta_{\rm min}, we get that the following conditions are sufficient for the theorem’s conclusion to hold (with CC a suitable numerical constant):

All of these conditions are implied by the assumptions of Theorem 3, namely Eq. (14). In particular, this is shown by using the fact that s0logps02log(p/s02)s_{0}\log p\leq s_{0}^{2}\log(p/s_{0}^{2}) for s0ps_{0}\leq\sqrt{p}.

Acknowledgements

We are grateful to David Donoho for his feedback on this manuscript. This work was partially supported by the NSF CAREER award CCF-0743978, the NSF grant CCF-1319979, and the grants AFOSR/DARPA FA9550-12-1-0411 and FA9550-13-1-0036.

References