Sum-of-Squares Lower Bounds for Sparse PCA

Tengyu Ma, Avi Wigderson

Introduction

We start with a general discussion of the tension between sample size and computational efficiency in statistical and learning problems. We then describe the concrete model and problem at hand: Sum-of-Squares algorithms and the Sparse-PCA problem. All are broad topics studied from different viewpoints, and the given references provide more information.

Modern machine learning and statistical inference problems are often high dimensional, and it is highly desirable to solve them using far less samples than the ambient dimension. Luckily, we often know, or assume, some underlying structure of the objects sought, which allows such savings in principle. Typical such assumption is that the number of real degrees of freedom is far smaller than the dimension; examples include sparsity constraints for vectors, and low rank for matrices and tensors. The main difficulty that occurs in nearly all these problems is that while information theoretically the sought answer is present (with high probability) in a small number of samples, actually computing (or even approximating) it from these many samples is a computationally hard problem. It is often expressed as a non-convex optimization program which is NP-hard in the worst case, and seemingly hard even on random instances.

Given this state of affairs, relaxed formulations of such non-convex programs were proposed, which can be solved efficiently, but sometimes to achieve accurate results seem to require far more samples than existential bounds provide. This phenomenon has been coined the “statistical versus computational trade-off” by Chandrasekaran and Jordan [CJ13], who motivate and formalize one framework to study it in which efficient algorithms come from the Sum-of-Squares family of convex relaxations (which we shall presently discuss). They further give a detailed study of this trade-off for the basic de-noising problem [Joh02, Don95, DJ98] in various settings (some exhibiting the trade-off and others that do not). This trade-off was observed in other practical machine learning problems, in particular for the Sparse PCA problem that will be our focus, by Berthet and Rigollet [BR13a].

As it turns out, the study of the same phenomenon was proposed even earlier in computational complexity, primarily from theoretical motivations. Decatur, Goldreich and Ron [DGR97] initiate the study of “computational sample complexity” to study statistical versus computation trade-offs in sample-size. In their framework efficient algorithms are arbitrary polynomial time ones, not restricted to any particular structure like convex relaxations. They point out for example that in the distribution-free PAC-learning framework of Vapnik-Chervonenkis and Valiant, there is often no such trade-off. The reason is that the number of samples is essentially determined (up to logarithmic factors, which we will mostly ignore here) by the VC-dimension of the given concept class learned, and moreover, an “Occam algorithm” (computing any consistent hypothesis) suffices for classification from these many samples. So, in the many cases where efficiently finding a hypothesis consistent with the data is possible, enough samples to learn are enough to do so efficiently! This paper also provide examples where this is not the case in PAC learning, and then turns to an extensive study of possible trade-offs for learning various concept classes under the uniform distribution. This direction was further developed by Servedio [Ser00].

The fast growth of Big Data research, the variety of problems successfully attacked by various heuristics and the attempts to find efficient algorithms with provable guarantees is a growing area of interaction between statisticians and machine learning researchers on the one hand, and optimization and computer scientists on the other. The trade-offs between sample size and computational complexity, which seems to be present for many such problems, reflects a curious “conflict” between these fields, as in the first more data is good news, as it allows more accurate inference and prediction, whereas in the second it is bad news, as a larger input size is a source of increased complexity and inefficiency. More importantly, understanding this phenomenon can serve as a guide to the design of better algorithms from both a statistical and computational viewpoints, especially for problems in which data acquisition itself is costly, and not just computation. A basic question is thus for which problems is such trade-off inherent, and to establish the limits of what is achievable by efficient methods.

Establishing a trade-off has two parts. One has to prove an existential, information theoretic upper bound on the number of samples needed when efficiency is not an issue, and then prove a computational lower bound on the number of samples for the class of efficient algorithms at hand. Needless to say, it is desirable that the lower bounds hold for as wide a class of algorithms as possible, and that it will match the best known upper bound achieved by algorithms from this class. The most general one, the computational complexity framework of [DGR97, Ser00] allows all polynomial-time algorithms. Here one cannot hope for unconditional lower bounds, and so existing lower bounds rely on computational assumptions, e.g.”cryptographic assumptions”, e.g. that factoring integers has no polynomial time algorithm, or other average case assumptions. For example, hardness of refuting random 3CNF was used for establishing the sample-computational tradeoff for learning halfspaces [DLS13], and hardness of finding planted clique in random graphs was used for tradeoff in sparse PCA [BR13a, GMZ14]. On the other hand, in frameworks such as [CJ13], where the class of efficient algorithms is more restricted (e.g. a family of convex relaxations), one can hope to prove unconditional lower bounds, which are called “integrality gaps” in the optimization and algorithms literature. Our main result is of this nature, adding to the small number of such lower bounds for machine learning problems.

We now turn to describe and motivate SoS convex relaxations algorithms, and then the Sparse PCA problem.

2 Sum-of-Squares convex relaxations

Sum-of-Squares algorithms (sometimes called the Lasserre hierarchy) encompasses perhaps the strongest known algorithmic technique for a diverse set of optimization problems. It is a family of convex relaxations introduced independently around the year 2000 by Lasserre [Las01], Parillo [Par00], and in the (equivalent) context of proof systems by Grigoriev [Gri01b]. These papers followed better and better understanding in real algebraic geometry [Art27, Kri64, Ste74, Sho87, Sch91, Put93, Nes00]of David Hilbert’s famous 17th problem on certifying the non-negativity of a polynomial by writing it as a sum of squares (which explains the name of this method). We only briefly describe this important class of algorithms; far more can be found in the book [Las15] and the excellent extensive survey [Lau09].

The SoS method provides a principled way of adding constraints to a linear or convex program in a way that obtains tighter and tighter convex sets containing all solutions of the original problem. This family of algorithms is parametrized by their degree dd (sometimes called the number of rounds); as dd gets larger, the approximation becomes better, but the running time becomes slower, specifically nO(d)n^{O(d)}. Thus in practice one hopes that small degree (ideally constant) would provide sufficiently good approximation, so that the algorithm would run in polynomial time. This method extends the standard semi-definite relaxation (SDP, sometimes called spectral), that is captured already by degree-2 SoS algorithms. Moreover, it is more powerful than two earlier families of relaxations: the Sherali-Adams [SA90] and Lovász-Scrijver [LS91] hierarchies.

The introduction of these algorithms has made a huge splash in the optimization community, and numerous applications of it to problems in diverse fields were found that greatly improve solution quality and time performance over all past methods. For large classes of problems they are considered the strongest algorithmic technique known. Relevant to us is the very recent growing set of applications of constant-degree SoS algorithms to machine learning problems, such as [BKS15, BKS14, BM15]. The survey [BS14] contains some of these exciting developments. Section 2.3 contains some self-contained material about the general framework SoS algorithms as well.

Given their power, it was natural to consider proving lower bounds on what SoS algorithms can do. There has been an impressive progress on SoS degree lower bounds (via beautiful techniques) for a variety of combinatorial optimization problems [Gri01a, Gri01b, Sch08, MPW15]. However, for machine learning problems relatively few such lower bounds (above SDP level) are known [BM15, WGL15] and follow via reductions to the above bounds. So it is interesting to enrich the set of techniques for proving such limits on the power of SoS for ML. The lower bound we prove indeed seem to follow a different route than previous such proofs.

3 Sparse PCA

Sparse principal component analysis, the version of the classical PCA problem which assumes that the direction of variance of the data has a sparse structure, is by now a central problem of high-diminsional statistical analysis. In this paper we focus on the single-spiked covariance model introduced by Johnstone [Joh01]. One observes nn samples from pp-dimensional Gaussian distribution with covariance

where (the planted vector) vv is assumed to be a unit-norm sparse vector with at most kk non-zero entries, and λ>0\lambda>0 represents the strength of the signal. The task is to find (or estimate) the sparse vector vv. More general versions of the problem allow several sparse directions/components and general covariance matrix [Ma13, VL13]. Sparse PCA and its variants have a wide variety of applications ranging from signal processing to biology: see, e.g., [ABN+99, JL09, Che11, JOB10].

The hardness of Sparse PCA, at least in the worst case, can be seen through its connection to the (NP-hard) Clique problem in graphs. Note that if Σ\Sigma is a {0,1}\{0,1\} adjacency matrix of a graph (with 1’s on the diagonal), then it has a kk-sparse eigenvector vv with eigenvalue kk if and only if the graph has a kk-clique. This connection between these two problems is actually deeper, and will appear again below, for our real, average case version above.

From a theoretical point of view, Sparse PCA is one of the simplest examples where we observe a gap between the number of samples needed information theoretically and the number of samples needed for a polynomial time estimator: It has been well understood [VL12, PJ12, BR13b] that information theoretically, given n=O(klogp)n=O(k\log p) samplesWe treat λ\lambda as a constant so that we omit the dependence on it for simplicity throughout the introduction section, one can estimate vv up to constant error (in euclidean norm), using a non-convex (therefore not polynomial time) optimization algorithm. On the other hand, all the existing provable polynomial time algorithms [JL09, AW09, VL13, DM14], which use either diagonal thresholding (for the single spiked model) or semidefinite programming (for general covariance), first introduced for this problem in [dGJL07], need at least quadratically many samples to solve the problem, namely n=O(k2)n=O(k^{2}). Moreover, Krauthgamer, Nadler and Vilenchik [KNV15] and Berthet and Rigollet [BR13b] have shown that for semi-definite programs (SDP) this bound is tight. Specifically, the natural SDP cannot even solve the detection problem: to distinguish the data in equation 1.1 above from the null hypothesis in which no sparse vector is planted, namely the nn samples are drawn from the Gaussian distribution with covariance matrix II.

Recall that the natural SDP for this problem (and many others) is just the first level of the SoS hierarchy, namely degree-2. Given the importance of the Sparse PCA, it is an intriguing question whether one can solve it efficiently with far fewer samples by allowing degree-dd SoS algorithms with larger dd. A very interesting conditional negative answer was suggested by Berthet and Rigollet [BR13b]. They gave an efficient reduction from Planted CliqueAn average case version of the Clique problem in which the input is a random graph in which a much larger than expected clique is planted. problem to Sparse PCA, which shows in particular that degree-dd SoS algorithms for Sparse PCA will imply similar ones for Planted Clique. Gao, Ma and Zhou [GMZ14] strengthen the result by establishing the hardness of the Gaussian single-spiked covariance model, which is an interesting subsetNote that lower bounds for special cases are stronger than those for general cases of models considered by [BR13a]. These are useful as nontrivial constant-degree SoS lower bounds for Planted Clique were recently proved by [MPW15, DM15] (see there for the precise description, history and motivation for Planted Clique). As [BR13b, GMZ14] argue, strong yet believed bounds, if true, would imply that the quadratic gap is tight for any constant dd. Before the submission of this paper, the known lower bounds above for planted clique were not strong enough yet to yield any lower bound for Sparse PCA beyond the minimax sample complexity. We also note that the recent progress [RS15, HKP15] that show the tight lower bounds for planted clique, together with the reductions of [BR13a, GMZ14], also imply the tight lower bounds for Sparse PCA, as shown in this paper.

4 Our contribution

We give a direct, unconditional lower bound proof for computing Sparse PCA using degree-4 SoS algorithms, showing that they too require n=Ω~(k2)n=\widetilde{\Omega}(k^{2}) samples to solve the detection problem (Theorem 3.1), which is tight up to polylogarithmic factors when the strength of the signal λ\lambda is a constant. Indeed the theorem gives a lower bound for every strength λ\lambda, which becomes weaker as λ\lambda gets larger. Our proof proceeds by constructing the necessary pseudo-moments for the SoS program that achieve too high an objective value (in the jargon of optimization, we prove an “integrality gap” for these programs). As usual in such proofs, there is tension between having the pseudo-moments satisfy the constraints of the program and keeping them positive semidefinite (PSD). Differing from past lower bound proofs, we construct two different PSD moments, each approximately satisfying one sets of constraints in the program and is negligible on the rest. Thus, their sum give PSD moments which approximately satisfy all constraints. We then perturb these moments to satisfy constraints exactly, and show that with high probability over the random data, this perturbation leaves the moments PSD.

We note several features of our lower bound proof which makes the result particularly strong and general. First, it applies not only for the Gaussian distribution, but also for Bernoulli and other distributions. Indeed, we give a set of natural (pseudorandomness) conditions on the sampled data vectors under which the SoS algorithm is “fooled”, and show that these conditions are satisfied with high probability under many similar distributions (possessing strong concentration of measure). Next, our lower bound holds even if the hidden sparse vector is discrete, namely its entries come from the set {0,±1k}\{0,\pm\frac{1}{\sqrt{k}}\}. We also extend the lower bound for the detection problem to apply also to the estimation problem, in the regime when the ambient dimension is linear in the number of samples, namely npBnn\leq p\leq Bn for constant BB (see Theorem 3.2).

Organization: Section 2 provides more backgrounds of sparse PCA and SoS algorithms. Then we state our main results in Section 3. In Section 4, we design the pseudo-moments and state their properties and then in Section 5 we prove our main theorems using these moments. Section 6 and 7 contain the analysis of the moments. Section 8 lists the tools that we heavily used for proving concentration inequalities in the analysis. Finally we conclude with a discussion of further directions of study in Section 9.

Formal description of the model and problem

1 Sparse PCA estimation and detection problems

It is also interesting to also consider the sparse component detection problem [BR13b, BR13a], which is the decision problem of distinguishing from random samples the following two distributions

H0H_{0}: data Xj=ξjX^{j}=\xi^{j} is purely random

HvH_{v}: data Xj=ξj+λgjvX^{j}=\xi^{j}+\sqrt{\lambda}g^{j}v contains a hidden sparse signal with strength λ\lambda.

Rigollet [MR14] observed that a polynomial time algorithm for estimation version of sparse PCA with constant error implies that an algorithm for the detection problem with twice number of the samples. Thus, for polynomial time lower bounds, it suffices to consider the detection problem.

We will use XX as a shorthand for the p×np\times n matrix [X1,,Xn]\left[X^{1},\dots,X^{n}\right]. We denote the rows of XX as X1T,,XpTX_{1}^{T},\dots,X_{p}^{T}, therefore XiX_{i}’s are nn-dimensional column vectors. The empirical covariance matrix is defined as Σ^=1nXXT\hat{\Sigma}=\frac{1}{n}XX^{T}.

2 Statistically optimal estimator/detector

It is well known that the following non-convex program achieves optimal statistical minimax rate for the estimation problem and the optimal sample complexity for the detection problem. Note that we scale the variables xx up by a factor of k\sqrt{k} for simplicity (the hidden vector now has entries from {0,±1}\{0,\pm 1\}).

The non-convex program (2.2) statistically optimally solves the sparse PCA problem when nCk/λ2logpn\geq Ck/\lambda^{2}\log p for some sufficiently large CC. Namely, the following hold with high probability. If XX is generated from HvH_{v}, then optimal solution xoptx_{\textrm{opt}} of program (2.2) satisfies 1kxoptxoptTvvT13\|\frac{1}{k}\cdot x_{\textrm{opt}}x_{\textrm{opt}}^{T}-vv^{T}\|\leq\frac{1}{3}, and the objective value λmaxk(Σ^)\lambda_{\max}^{k}(\hat{\Sigma}) is at least 1+2λ31+\frac{2\lambda}{3}. On the other hand, if XX is generated from null hypothesis H0H_{0}, then λmaxk(Σ^)\lambda_{\max}^{k}(\hat{\Sigma}) is at most 1+λ31+\frac{\lambda}{3} .

Therefore, for the detection problem, once can simply use the test λmaxk(Σ^)>1+λ2\lambda_{\max}^{k}(\hat{\Sigma})>1+\frac{\lambda}{2} to distinguish the case of H0H_{0} and HvH_{v}, with n=Ω~(k/λ2)n=\widetilde{\Omega}(k/\lambda^{2}) samples. However, this test is highly inefficient, as the best known ways for computing λmaxk(Σ^)\lambda_{\max}^{k}(\hat{\Sigma}) take exponential time! We now turn to consider efficient ways of solving this problem.

3 Sum of Squares (Lasserre) Relaxations

Here we will only briefly introduce the basic ideas of Sum-of-Squares (Lasserre) relaxation that will be used for this paper. We refer readers to the extensive [Las15, Lau09, BS14] for detailed discussions of sum of squares algorithms and proofs and their applications to algorithm design.

For a mutli-set S[n]S\subset[n], we use xSx^{S} to denote the monomial iSxi\prod_{i\in S}x_{i}. Since MM is a linear operator, it can be clearly described by all the values of MM on the monomial of degree dd, that is, all the values of M(xS)M(x^{S}) for mutli-set SS of size at most dd uniquely determines MM. Moreover, the nonnegativity constraint M(p(x)2)0M(p(x)^{2})\geq 0 is equivalent to the positive semidefiniteness of the matrix-form (as defined below), and therefore the set of all pseudo-moments is convex.

For an even integer dd and any degree-dd pseudo-moments MM, we define the matrix-form of MM as the trivial way of viewing all the values of MM on monomials as a matrix: we use mat(M)\textup{mat}(M) to denote the matrix that is indexed by multi-subset SS of [n][n] with size at most d/2d/2, and mat(M)S,T=M(xSxT)\textup{mat}(M)_{S,T}=M(x^{S}x^{T}).

Given polynomials p(x)p(x) and q1(x),,qm(x)q_{1}(x),\dots,q_{m}(x) of degree at most dd, and a polynomial program,

Note that (2.6) is a valid relaxation because for any solution xx_{*} of (2.5), if we define M(xS)M(x^{S}) to be M(xS)=xSM(x^{S})=x_{*}^{S}, then MM satisfies all the constraints and the objective value is p(x)p(x_{*}). Therefore it is guaranteed that the optimal value of (2.6) is always larger than that of (2.5).

Finally, the key point is that this program can be solved efficiently, in polynomial time in its size, namely in time nO(d)n^{O(d)}. As dd grows, the constraints added make the “pseudo-distribution” defined by the moments closer and closer to an actual distribution, thus providing a tighter relaxation, at the cost of a larger running time to solve it.

In the next section we apply this relaxation to the Sparse PCA problem and state our results.

Main Results

To exploit the sum of squares relaxation framework as described in Section 2.3], we first convert the statistically optimal estimator/detector (2.2) into the “polynomial” program version below.

Now we are ready to apply the sum-of-squares relaxation scheme described in Section 2.3) to the polynomial program above as . For degree-4 relaxation we obtain the following semidefinite program SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}), which we view as an algorithm for both detection and estimation problems. Note that the same objective function, with only the three constraints (C1), (C2), (C6) gives the degree-2 relaxation, which is precisely the standard SDP relaxation of Sparse PCA studied in [AW09, BR13b, KNV15]. So clearly SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}) subsumes the SDP relaxation.

Before stating the lower bounds for both detection and estimation in the next two subsections, we comment on the choices made for the outputs of the algorithm in both, as clearly other choices can be made that would be interesting to investigate. For detection, we pick the natural threshold (1+12λ)k(1+\frac{1}{2}\lambda)k from the statistically optimal detection algorithm of Section 2.2. Our lower bound of the objective under H0H_{0} is actually a large constant multiple of λk\lambda k, so we could have taken a higher threshold. To analyze even higher ones would require analyzing the behavior of SoS4\textrm{SoS}_{4} under the (planted) alternative distribution HvH_{v}. For estimation we output the maximizer M2M_{2}^{*} of the objective function, and prove that it is not too correlated with the rank-1 matrix vvTvv^{T} in the planted distribution HvH_{v}. This suggest, but does not prove, that the leading eigenvector of M2M_{2}^{*} (which is a natural estimator for vv) is not too correlated with vv. We finally note that Rigollet’s efficient reduction from detection to estimation is not in the SoS framework, and so our detection lower bound does not automatically imply the one for estimation.

For the detection problem, we prove that SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}) gives a large objective value on null hypothesis H0H_{0}.

There exists absolute constant CC and rr such that for 1λ<min{k1/4,n}1\leq\lambda<\min\{k^{1/4},\sqrt{n}\} and any pCλnp\geq C\lambda n, kCλ7/6nlogrpk\geq C\lambda^{7/6}\sqrt{n}\log^{r}p, the following holds. When the data XX is drawn from the null hypothesis H0H_{0}, then with high probability (1p101-p^{-10}), the objective value of degree-4 sum of squares relaxation SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}) is at least 10λk10\lambda k. Consequently, Algorithm 1 can’t solve the detection problem.

To parse the theorem and to understand its consequence, consider first the case when λ\lambda is a constant (which is also arguably the most interesting regime). Then the theorem says that when we have only nk2n\ll k^{2} samples, degree-4 SoS relaxation SoS4\textrm{SoS}_{4} still overfits heavily to the randomness of the data XX under the null hypothesis H0H_{0}. Therefore, using SoS4(Σ^)>(1+λ2)k\textrm{SoS}_{4}(\hat{\Sigma})>(1+\frac{\lambda}{2})k (or even 10λk10\lambda k) as a threshold will fail with high probability to distinguish H0H_{0} and HvH_{v}.

We note that for constant λ\lambda our result is essentially tight in terms of the dependencies between n,k,pn,k,p. The condition p=Ω~(n)p=\widetilde{\Omega}(n) is necessary since otherwise when p=o(n)p=o(n), even without the sum of squares relaxation, the objective value is controlled by (1+o(1))k(1+o(1))k since Σ^\hat{\Sigma} has maximum eigenvalue 1+o(1)1+o(1) in this regime. Furthermore, as mentioned in the introduction, kΩ~(n)k\geq\widetilde{\Omega}(\sqrt{n}) is also necessary (up to poly-logarithmic factors), since when nk2n\gg k^{2}, a simple diagonal thresholding algorithm works for this simple single-spike model.

When λ\lambda is not considered as a constant, the dependence of the lower bound on λ\lambda is not optimal, but close. Ideally one could expect that as long as kλnk\gg\lambda\sqrt{n}, and pλnp\geq\lambda n, the objective value on the null hypothesis is at least Ω(λk)\Omega(\lambda k). Tightening the λ1/6\lambda^{1/6} slack, and possibly extending the range of λ\lambda are left to future study.

2 Lower bounds for the estimation problem

For estimation problem, we prove that M2M_{2}^{*} output by Algorithm 1 is not too correlated with the desired rank-1 matrix vvTvv^{T}.

For any constant BB there exists absolute constants CC and rr such that for λB/2\lambda\leq B/2, Bnp2λnBn\geq p\geq 2\lambda n and o(p)kCnlogrpo(p)\geq k\geq C\sqrt{n}\log^{r}p, suppose the data XX is drawn hypothesis HvH_{v} (model (2.1)), then with high probability (1p10(1-p^{-10}) over the randomness of the data, Algorithm 1 will output M2M_{2}^{*} such that 1kM2vvT1/5\|\frac{1}{k}\cdot M_{2}^{*}-vv^{T}\|\geq 1/5.

We observe that the result is of the same nature (and arguably near-optimal for estimation problem) as [KNV15] achieve the for degree-2 SoS relaxation. The proof follows simply from combining our detection lower bound Theorem 3.1 and arguments similar to [KNV15]. Finally we address a threshold-like behavior of the estimation error. Note that while our Theorem proves that n=Ω~(k2)n=\widetilde{\Omega}(k^{2}) samples is necessary for efficient algorithms to get even constant estimation error, it is known [YZ13, Ma13, WLL14] that slightly more samples, n=O~(k2)n=\widetilde{O}(k^{2}), can already achieve in polynomial time a much smaller (and optimal) estimation error, namely O((klogp)/n)O(\sqrt{(k\log p)/n}).

Design of Pseudo-moments

We start with a sketch of our approach to the design of the moments MM at a very high level, highlighting aspects of their design which are different than in previous lower bounds. First, there are some natural choices to make. We define the degree-2 moments M~\widetilde{M} from the input as the empirical covariance matrix, as was done in the proof of the SDP lower bound. This already gives a large objective value (see Lemma 4.2). We also define taking odd moments (degree 1 and 3) to be 0. The difficult part is designing the degree-4 moments consistently with the constraints and M~\widetilde{M}. We do this in stages, first approximating the constraints (indeed even M~\widetilde{M} only approximately satisfies, in a way we will specify in Section 4.1, constraints (C1) and (C2)) and later in Section 4.2 correcting the moments to satisfy the constraints precisely. Moreover, we separately use different 4-moments for different constraints and then combine them, as follows. We define two different degree-4 PSD moments PP and QQ such that (with high probability) PP almost satisfies constraints (C3), (C5) and (C6), and negligible for constraint (C4) (see Lemma 4.4), whereas QQ almost satisfies constraints (C5), (C4) and (C6), and negligible for (C3) (Lemma 4.5). Therefore taking the sum P+QP+Q will almost satisfy all the constraints (Lemma 4.6), which completes the design of the approximate moments. Finally we “locally” adjust P+QP+Q so that the resulting moments MM exactly satisfy all the constraints (Theorem 4.7), and remain PSD with high probability.

All moments will be defined from the data matrix XX, to which we first apply a simple pre-processing step: we scale all its rows to have square norm nn (around which they are concentrated). We abuse notation and call the scaled matrix XX as well. Note that when the noise model in the null hypothesis H0H_{0} is Bernoulli, namely the entries of XX are chosen as unbiased independent ±1\pm 1 variables, the rows are automatically scaled, which motivates our abuse of notation. We suggest that the reader thinks of this distribution, even though the proof works for a much wider class of distributions.

The properties above of our moments will be proved under the assumption that the scaled matrix XX satisfies the “pseudo-randomness” condition below. This set-up allows us to encapsulate what we really need the data to satisfy, and thus prove our lower bound not only for Gaussian or Bernoulli noise, but actually for a larger family containing both. Namely, we later prove in Section 7, via a series of concentration inequalities, that when data is drawn from null hypothesis H0H_{0}, its scaling XX satisfies the pseudorandomness condition with very high probability under all these noise models. Note that this condition is actually a sequence of statements about deviation from the mean of various polynomials in the data - these will become natural once we define our moments.

Our constructions of the moments will only require the following pseudorandom conditions about the (scaled) data matrix XX,

In this section, we design a pseudo-moments M~\widetilde{M} that approximately satisfies the all the constraints. Then in the next subsection we will locally adjust it to obtain one that exactly satisfies all of the constraints.

where γ\gamma is a constant that to be tuned later according to the signal strength λ\lambda. Note that by design mat(M~)\textup{mat}(\widetilde{M}) is a PSD matrix. We can check straightforwardly that M~\widetilde{M} satisfies constraint (C2) and gives a large objective value (Obj).

There exists constant CC such that for pγnp\geq\gamma n and kCγnlogpk\geq C\gamma\sqrt{n}\log p, suppose XX satisfies Condition 4.1, then M~\widetilde{M} is a valid degree-2 pseudo-moments and satisfies the sparsity constraint (C2).

Moreover, we also have M~(xi2)=γknp2kp\widetilde{M}(x_{i}^{2})=\frac{\gamma kn}{p^{2}}\leq\frac{k}{p}, and M~(xixj)O~(γknp2)\widetilde{M}(x_{i}x_{j})\leq\widetilde{O}(\frac{\gamma k\sqrt{n}}{p^{2}}).

The proof follows simple calculation and concentration inequality. Since Xi2=n\|X_{i}\|^{2}=n for all ii and with high probability over the randomness of XX, for all iji\neq j, Xi,XjO~(n)|\langle X_{i},X_{j}\rangle|\leq\widetilde{O}(\sqrt{n}), we obtain that M~(xi2)=γknp2kp\widetilde{M}(x_{i}^{2})=\frac{\gamma kn}{p^{2}}\leq\frac{k}{p}, and M~(xixj)O~(γknp2)\widetilde{M}(x_{i}x_{j})\leq\widetilde{O}(\frac{\gamma k\sqrt{n}}{p^{2}}). Then to verify equation (4.2), we have

when kγnk\gg\gamma\sqrt{n}. Finally, we can verify the objective value is large

where we use the fact that XXTF2(1o(1))p2n\|XX^{T}\|_{F}^{2}\geq(1-o(1))p^{2}n (see property (P7) in Condition 4.1). ∎

Note that M~\widetilde{M} doesn’t satisfies constraint (C1) exactly. However, we could simply fix this by defining M(xixj)=M~(xixj)M^{\prime}(x_{i}x_{j})=\widetilde{M}(x_{i}x_{j}) for all iji\neq j and M(xi2)=k/pM^{\prime}(x_{i}^{2})=k/p. However, note that we will use a perturbation of MM^{\prime} in our final design in Section 4.2 so that it is consistent with the degree-4 moments.

There exists absolute constant CC such that for pγnp\geq\gamma n and kCγnlogpk\geq C\gamma\sqrt{n}\log p, there exists a degree-2 pseudo-moments MM^{\prime} that satisfies constraints (C1), (C2) and give objective value at least (1o(1))γk(1-o(1))\gamma k.

Now we define a degree-4 pseudo-moment that approximately satisfies all the constraints in SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}) and give a large objective value. We keep the current (approximate) design M~\widetilde{M} for degree-2 moments, since the degree-2 moments defined in previous section seems to be nearly optimal and enjoys many good properties. Then we define M~(S)=0\widetilde{M}(S)=0 for any multi-set SS of size 3, because apparently degree-3 moments don’t play any role the semidefinite relaxation.

The main difficulty is to define M~(S)\widetilde{M}(S) for SS of size 4. Here we have three constraints (C3), (C4), and (C5), and the PSDness constraint that implicitly compete with each other. We took the following approach. We let M~\widetilde{M} be a sum of two matrices matrix PP and QQ. We ensure that PP “almost” (as will be specified later) satisfies (C3) and (C5), and is negligible for constraints C4. In turn QQ is negligible for constraints (C3) and “almost” satisfies constraint (C4) and (C5). Therefore P+QP+Q will “almost” satisfies constraints (C3)) and (C4)), and satisfy the sparsity constraint (C5). Moreover, PP and QQ will be PSD by definition. Concretely, we define

We note that PP and QQ are well defined pseudo-moments because they are invariant to the permutation of indices and naturally PSD. To see the PSDness, note that PP is a sum of pp rank-1 PSD matrices. Moreover, QQ is also PSD: the part that correpsonds to Xs,XtXi,Xj\langle X_{s},X_{t}\rangle\langle X_{i},X_{j}\rangle is simply a rank-1 PSD matrix; Xi,XtXs,Xj\langle X_{i},X_{t}\rangle\langle X_{s},X_{j}\rangle can be written as XtXs,XiXj\langle X_{t}\otimes X_{s},X_{i}\otimes X_{j}\rangle and therefore it also contributes a PSD matrix to QQ. Similarly, Xj,XtXs,Xi\langle X_{j},X_{t}\rangle\langle X_{s},X_{i}\rangle can be written as XsXt,XiXj\langle X_{s}\otimes X_{t},X_{i}\otimes X_{j}\rangle, and it also contributes a PSD matrix.

In the next two lemmas (one for PP and one for QQ), we formalize the intuition above by showing that, the deviation from PP and QQ exactly satisfying the constraints is captured by error matrices E,F,G\mathcal{E},\mathcal{F},\mathcal{G}. We bound the magnitude of these error matrices and establish the PSDness of some of them so that later we can fix them for the exact satisfaction of the constraints.

There exists some absolute constant CC and rr such that for 1γmin{k1/4,n}1\leq\gamma\leq\min\{k^{1/4},\sqrt{n}\}, 1γn1\leq\gamma\leq n, p=1.1γnp=1.1\gamma n, and kCγ7/6nlogrpk\geq C\cdot\gamma^{7/6}\sqrt{n}\log^{r}p, suppose XX satisfies pseudorandomness condition 4.1, then PP almost satisfies constraint (C3)and (C5), naturally satisfies PSD constraint (C6), and is negligible for constraint (C4) in the sense that

where F\mathcal{F} and E\mathcal{E} are p×pp\times p matrices that satisfy

0FiiO~(γkpn)0\leq\mathcal{F}_{ii}\leq\widetilde{O}\left(\frac{\gamma k}{pn}\right), FijO~(γkpn1.5)|\mathcal{F}_{ij}|\leq\widetilde{O}\left(\frac{\gamma k}{pn^{1.5}}\right) for any ii and jij\neq i.

E\mathcal{E} is PSD with EssO~(γkn)|\mathcal{E}_{ss}|\leq\widetilde{O}\left(\frac{\gamma k}{n}\right), and EstO~(γknn)|\mathcal{E}_{st}|\leq\widetilde{O}(\frac{\gamma k}{n\sqrt{n}}) for any sts\neq t.

There exists some absolute constant CC and rr such that for 1γn1\leq\gamma\leq n, p=1.1γnp=1.1\gamma n and kCγnlogrpk\geq C\cdot\gamma\sqrt{n}\log^{r}p, suppose XX satisfies pseudorandomness condition 4.1, then QQ is negligible for constraint (C3)) and almost satisfies constraint (C4) and (C5) in the sense that,

where G\mathcal{G} is a p×pp\times p PSD matrix GssO~(γk2p2)|\mathcal{G}_{ss}|\leq\widetilde{O}\left(\frac{\gamma k^{2}}{p^{2}}\right) and GstO~(γk2p2n)|\mathcal{G}_{st}|\leq\widetilde{O}\left(\frac{\gamma k^{2}}{p^{2}\sqrt{n}}\right) for any sts\neq t.

Now we are ready to prove that M~=P+Q\widetilde{M}=P+Q almost satisfies all other constraints (C1)- (C6) approximately.

Define M~(xixjxsxt)=P(xixjxsxt)+Q(xixjxsxt)\widetilde{M}(x_{i}x_{j}x_{s}x_{t})=P(x_{i}x_{j}x_{s}x_{t})+Q(x_{i}x_{j}x_{s}x_{t}) for all i,j,s,t[p]i,j,s,t\in[p], then we have under the condition of Lemma 4.4,

where F\mathcal{F}^{\prime} and E\mathcal{E}^{\prime} are p×pp\times p matrices that satisfy

FiiO~(γk2np3)|\mathcal{F}_{ii}^{\prime}|\leq\widetilde{O}\left(\frac{\gamma k^{2}n}{p^{3}}\right) and FijO~(γk2np3)|\mathcal{F}^{\prime}_{ij}|\leq\widetilde{O}\left(\frac{\gamma k^{2}\sqrt{n}}{p^{3}}\right) for all iji\neq j.

E\mathcal{E}^{\prime} is a PSD matrix with EssO~(γkn)\mathcal{E}^{\prime}_{ss}\leq\widetilde{O}\left(\frac{\gamma k}{n}\right) and EstO~(γknn)|\mathcal{E}^{\prime}_{st}|\leq\widetilde{O}(\frac{\gamma k}{n\sqrt{n}}) for sts\neq t.

Note that by definition of M~\widetilde{M} and Lemma 4.4 and Lemma 4.5, we have Fij=Fij+3kpM~(xixj)\mathcal{F}^{\prime}_{ij}=\mathcal{F}_{ij}+\frac{3k}{p}\widetilde{M}(x_{i}x_{j}) and E=E+G\mathcal{E}^{\prime}=\mathcal{E}+\mathcal{G}. The bound for F\mathcal{F}^{\prime} follows the bound for F\mathcal{F} and the facts that M~(xi2)=γknp2\widetilde{M}(x_{i}^{2})=\frac{\gamma kn}{p^{2}} and M~(xixj)O~(γknp2)|\widetilde{M}(x_{i}x_{j})|\leq\widetilde{O}\left(\frac{\gamma k\sqrt{n}}{p^{2}}\right). The PSDness of E\mathcal{E}^{\prime} and the bounds for it follows straightforwardly from those of E\mathcal{E} and G\mathcal{G}. Equation (4.11) follows equation (4.4) and equation (4.8). ∎

2 Exact Pseudo-moments

Note that M~\widetilde{M} only satisfies the constraints approximately up to some additive errors (which are carefully bounded for the purpose of the next theorem). We fix this issue by defining the actual pseudo-moments MM based (on a carefully chosen) local adjustment of M~\widetilde{M}. Concretely, we define M(1)=1M(1)=1 and for all add degree monomial xαx^{\alpha}, M(xα)=0M(x^{\alpha})=0. For distinct i,j,s,ti,j,s,t, we define M(xixjxsxt)M~(xixjxsxt)M(x_{i}x_{j}x_{s}x_{t})\triangleq\widetilde{M}(x_{i}x_{j}x_{s}x_{t}) and M(xi2xsxt)M~(xi2xsxt)M(x_{i}^{2}x_{s}x_{t})\triangleq\widetilde{M}(x_{i}^{2}x_{s}x_{t}). For distinct s,ts,t, we define

and M(xs2xt2)M~(xs2xs2)+δM(x_{s}^{2}x_{t}^{2})\triangleq\widetilde{M}(x_{s}^{2}x_{s}^{2})+\delta where δ\delta a constant (will be proved to be nonnegative) such that

Therefore we can see by construction, it is almost obvious that MM satisfies all the linear constraints (C1), (C3), (C4) exactly. Moreover, since E\mathcal{E}^{\prime} and F\mathcal{F}^{\prime} are small error matrices, most of the entries M(xixjxsxt)M(x_{i}x_{j}x_{s}x_{t}) are equal or close to M~(xixjxsxt)\widetilde{M}(x_{i}x_{j}x_{s}x_{t}). Note that M~\widetilde{M} satisfies the rest of constraints (C2), (C5) and (C6) (even with some slackness). We will prove that the difference between MM and M~\widetilde{M} is small enough so that these constraints are still satisfied by MM.

Under the condition of Lemma 4.4, suppose XX satisfies pseudorandomness condition 4.1, then the pseudo-moments MM defined above satisfies all the constraint (C1)-(C6) of the semidefinite programming and has objective value larger than (1o(1))γk(1-o(1))\gamma k.

We prove that MM satisfies all the constraints in an order that is most convenient for the proof, and check the objective value at the end.

Constraint (C3): This is satisfied by the definition of MM.

Constraint (C4): By the definition, we can see that M(xs3xt)M(x_{s}^{3}x_{t}) is also a perturbation of M~(xs3xt)\widetilde{M}(x_{s}^{3}x_{t}):

where the second equality uses definition (4.12) and the third uses equation (4.9), and the fourth uses (4.10) and the last equality uses the definition (4.12) again.

Moreover, for the case when s=ts=t, we have that

where we used the definition (4.14) of M(xs4)M(x_{s}^{4}) and M(xs2)M(x_{s}^{2}). Therefore we verified that MM satisfies constraint (C4).

Constraint (C1): Using equation (4.13) and (4.14), we have

Next we check the PSDness of matrix mat(M)\textup{mat}(M). Note that mat(M)\textup{mat}(M) is indexed by all the mutli subset of [p][p] of size at most 2, and it consists of 3 blocks mat(M)=blkdiag(M4,M2,M0)\textup{mat}(M)=\textrm{blkdiag}(M_{4},M_{2},M_{0}), where

Therefore it suffices to check that M0M_{0}, M2M_{2} and M4M_{4} are all PSD. M0M_{0} is trivially PSD. We can write M2M_{2} in the following form

where Δ=M2(M~(xsxt))s,t[p]\Delta=M_{2}-\left(\widetilde{M}(x_{s}x_{t})\right)_{s,t\in[p]}. By equation (4.12), we have that for sts\neq t, Δst=1k2(Est2Fst)\Delta_{st}=\frac{1}{k-2}\left(\mathcal{E}^{\prime}_{st}-2\mathcal{F}^{\prime}_{st}\right) for all sts\neq t. Moreover, by definition of M(xs2)M(x_{s}^{2}) and M(xs2xt2)M(x_{s}^{2}x_{t}^{2}), we have that

where second line uses equation (4.10) and the third line uses (4.9), and therefore Δss=p1k1δ+1k1(EssFss)\Delta_{ss}=\frac{p-1}{k-1}\cdot\delta+\frac{1}{k-1}\left(\mathcal{E}_{ss}^{\prime}-\mathcal{F}_{ss}^{\prime}\right). We extract the PSD matrix 1k2E\frac{1}{k-2}\cdot\mathcal{E}^{\prime} form Δ\Delta and obtain Δ=Δ1k2E\Delta^{\prime}=\Delta-\frac{1}{k-2}\cdot\mathcal{E}^{\prime}. Then by this definition, Δss=p1k1δ+1k1(EssFss)1k2Ess\Delta^{\prime}_{ss}=\frac{p-1}{k-1}\cdot\delta+\frac{1}{k-1}\left(\mathcal{E}_{ss}^{\prime}-\mathcal{F}_{ss}^{\prime}\right)-\frac{1}{k-2}\mathcal{E}_{ss}^{\prime}, and Δst=2k2Fst\Delta^{\prime}_{st}=-\frac{2}{k-2}\mathcal{F}^{\prime}_{st}. We use Gershgorin Circle Theorem to establish the PSDness of Δ\Delta^{\prime}. By Lemma 4.6, we have FijO~(γk2np3)|\mathcal{F}_{ij}|\leq\widetilde{O}\left(\frac{\gamma k^{2}\sqrt{n}}{p^{3}}\right) Therefore

where we used the fact that n/p=o(1)\sqrt{n}/p=o(1) which follows form p=1.1γnp=1.1\gamma n and γn\gamma\leq\sqrt{n}.

Using equation (4.16) and constrain (C1) we have that

It follows that δ(1o(1))k(k1)12p(p1)\delta\geq(1-o(1))\cdot\frac{k(k-1)}{12p(p-1)}. Therefore we obtain that Δii=p1k1δ+1k1(EssFss)1k2Ess112(1o(1))kpO~(γn)O~(γknp3)=112(1o(1))kp\Delta_{ii}^{\prime}=\frac{p-1}{k-1}\cdot\delta+\frac{1}{k-1}\left(\mathcal{E}_{ss}^{\prime}-\mathcal{F}_{ss}^{\prime}\right)-\frac{1}{k-2}\mathcal{E}_{ss}^{\prime}\geq\frac{1}{12}(1-o(1))\frac{k}{p}-\widetilde{O}(\frac{\gamma}{n})-\widetilde{O}\left(\frac{\gamma kn}{p^{3}}\right)=\frac{1}{12}(1-o(1))\frac{k}{p}. Therefore we obtain Δiij:jiΔij\Delta_{ii}^{\prime}\geq\sum_{j:j\neq i}|\Delta_{ij}^{\prime}| and by Gershgorin Circle Theorem Δ\Delta^{\prime} is PSD.

Now we examine M4M_{4}. we write M4M_{4} as

where Γ=M4(mat(P)+mat(Q))\Gamma=M_{4}-\left(\textup{mat}(P)+\textup{mat}(Q)\right). One can observe that Γ\Gamma has only non-zero entries of the form

where the last equality uses equation (4.15).

Now we are ready to prove PSDness of Γ\Gamma. We further decompose Γ\Gamma as Γ=Γ+blkdiag(Λ,0)\Gamma=\Gamma^{\prime}+\textrm{blkdiag}(\Lambda^{\prime},0) where Λ\Lambda^{\prime} is the p×pp\times p matrix with Λ=δ11T\Lambda^{\prime}=\delta\mathbf{1}\mathbf{1}^{T} Λ\Lambda^{\prime} is index by iiii, i=1,,pi=1,\dots,p. Note that Λ\Lambda^{\prime} is a PSD matrix and therefore it suffices to prove that Γ=Γblkdiag(Λ,0)\Gamma^{\prime}=\Gamma-\textrm{blkdiag}(\Lambda^{\prime},0) is a PSD matrix.

Note that Γ\Gamma^{\prime} has ijij-th column the same as jiji-th column, and therefore it’s only of rank at most p+p(p1)/2p+p(p-1)/2. We define Γ\Gamma^{\prime\prime} be the p+p(p1)/2p+p(p-1)/2 by p+p(p1)/2p+p(p-1)/2 submatrix of Γ\Gamma^{\prime},that is indexed by subsets (i,i)(i,i) for i[p]i\in[p] and (i,j)(i,j) for i<ji<j. Therefore it suffices to prove that Γ\Gamma^{\prime\prime} is PSD. We prove it using Gershgorin Circle Theorem.

Note that by equation (4.19), we have that Γii,ii=Γii,iiΛii,ii=(pk)k1δ+1k1Eiikk1Fii\Gamma_{ii,ii}^{\prime\prime}=\Gamma_{ii,ii}^{\prime}-\Lambda_{ii,ii}^{\prime}=\frac{(p-k)}{k-1}\cdot\delta+\frac{1}{k-1}\mathcal{E}^{\prime}_{ii}-\frac{k}{k-1}\mathcal{F}_{ii}. Therefore by the lower bound for δ\delta and Lemma 4.6, we obtain, Γii,ii(1o(1))ϵkp\Gamma_{ii,ii}^{\prime\prime}\geq(1-o(1))\frac{\epsilon k}{p}. Moreover, Γii,ij=Γii,ij=Estk2kk2Fst\Gamma_{ii,ij}^{\prime\prime}=\Gamma_{ii,ij}=\frac{\mathcal{E}^{\prime}_{st}}{k-2}-\frac{k}{k-2}\mathcal{F}_{st} and therefore Γii,ijEstk2+kk2FstO~(γn1.5)+O~(γk2np3)O~(γn1.5).|\Gamma^{\prime\prime}_{ii,ij}|\leq|\frac{\mathcal{E}^{\prime}_{st}}{k-2}|+|\frac{k}{k-2}\mathcal{F}_{st}|\leq\widetilde{O}(\frac{\gamma}{n^{1.5}})+\widetilde{O}\left(\frac{\gamma k^{2}\sqrt{n}}{p^{3}}\right)\leq\widetilde{O}(\frac{\gamma}{n^{1.5}}). Furthermore, for i<ji<j, Γij,ij=Γij,ij=δ(1o(1))ϵk2p2\Gamma^{\prime\prime}_{ij,ij}=\Gamma_{ij,ij}=\delta\geq(1-o(1))\frac{\epsilon k^{2}}{p^{2}}. Finally observe that Γii,jj=0\Gamma^{\prime\prime}_{ii,jj}=0 by definition and all other entries of Γ\Gamma^{\prime\prime} are trivially 0 because the corresponding entries of Γ\Gamma and Λ\Lambda^{\prime} vanish. Therefore we are ready to use a variant of Gershogorin Circle Theorem (Lemma 8.8) to prove the PSDness of Γ\Gamma^{\prime}. Taking α=1/γ2\alpha=1/\gamma^{2}, we have for any ii,

where we used the fact that kγnk\gg\gamma\sqrt{n} and ϵ\epsilon is a constant.

where we used kγ4k\geq\gamma^{4} and kγnk\gg\gamma\sqrt{n}. Therefore by Lemma 8.8, we obtain that Γ\Gamma^{\prime\prime} is PSD.

Constraint (C2): Using Lemma 4.2 and equation (4.12), we have that

Constraint (C5): Finally, we check that MM satisfies the sparsity constraint (C5).

where we used (4.11) and the (trivial) facts that Γij,stO(k/p)\Gamma_{ij,st}\leq O(k/p) for any i,j,s,ti,j,s,t and there are only at most O(p2)O(p^{2}) nonzero entries in Γ\Gamma.

Objective value (Obj): Note that by constraint (C1) and Lemma 4.2, we have that iM(xi2)Σ^ii=kiM~(xi2)Σ^ii\sum_{i}M(x_{i}^{2})\hat{\Sigma}_{ii}=k\geq\sum_{i}\widetilde{M}(x_{i}^{2})\hat{\Sigma}_{ii}, then

where in the second inequality we used Lemma 4.2 and the facts that Σ^ij=1nXi,XjO~(1/n)\hat{\Sigma}_{ij}=\frac{1}{n}\langle X_{i},X_{j}\rangle\leq\widetilde{O}(1/\sqrt{n}) and Eij+FijO~(γn1.5)|\mathcal{E}^{\prime}|_{ij}+|\mathcal{F}^{\prime}_{ij}|\leq\widetilde{O}\left(\frac{\gamma}{n^{1.5}}\right), and the last line uses the fact taht γ4k\gamma^{4}\leq k.

Proof of Theorem 3.1 and Theorem 3.2

Now we are ready to prove our main Theorem 3.1. The idea is very simple: we normalize the data matrix XX so that the resulting matrix Xˉ\bar{X} satisfies the the pseudorandomness condition 4.1. Then we apply Theorem 4.7 and obtain a moment matrix which give large objective value with respect to Xˉ\bar{X}. Then we argue that the difference between Xˉ\bar{X} from XX is negligible so that the same moment matrix has also large objective value with respect to XX.

Using the observation above, we take p=1.1γnp=1.1\gamma n with γ=11λ\gamma=11\lambda, and we define Xˉ\bar{X} to matrix obtained by normalizing rows of XX to euclidean norm n\sqrt{n}. Then by Theorem 7.1 it satisfies the pseudorandomenss condition 4.1. Let Σ^=1nXˉXˉT\hat{\Sigma}^{\prime}=\frac{1}{n}\bar{X}\bar{X}^{T} be the covariance matrix defined by Xˉ\bar{X}. By Theorem 4.7 we have that SoS4(Σ^)(1o(1))γk11λk\textrm{SoS}_{4}(\hat{\Sigma}^{\prime})\geq(1-o(1))\gamma k\geq 11\lambda k. Moreover, let MM be the moment defined in Theorem 4.7, and M2M_{2} its restriction to degree-2 moments, that is, M2=(mat(M)S,T)S=T=1M_{2}=\left(\textup{mat}(M)_{S,T}\right)_{|S|=|T|=1}. We are going to show that the entry-wise difference between Σ^\hat{\Sigma} and Σ^\hat{\Sigma}^{\prime} are small enough so that M2,Σ^\langle M_{2},\hat{\Sigma}\rangle is close to M2,Σ^\langle M_{2},\hat{\Sigma}^{\prime}\rangle.

Note that since Xi2=n±O~(n)\|X_{i}\|^{2}=n\pm\widetilde{O}(\sqrt{n}), therefore for any iji\neq j, Σ^ij=Σ^ijXiXj=Σ^ij±O~(1n)Σ^ij=Σ^ij±O~(1n)\hat{\Sigma}^{\prime}_{ij}=\frac{\hat{\Sigma}_{ij}}{\|X_{i}\|\|X_{j}\|}=\hat{\Sigma}_{ij}\pm\widetilde{O}(\frac{1}{\sqrt{n}})|\hat{\Sigma}|_{ij}=\hat{\Sigma}_{ij}\pm\widetilde{O}(\frac{1}{n}). For i=ji=j, we have similarly that Σ^ii=Σ^ii±O~(1n)\hat{\Sigma}^{\prime}_{ii}=\hat{\Sigma}_{ii}\pm\widetilde{O}(\frac{1}{\sqrt{n}}). We bound the difference between M2,Σ^\langle M_{2},\hat{\Sigma}^{\prime}\rangle and M2,Σ^\langle M_{2},\hat{\Sigma}^{\prime}\rangle by the sum of the entry-wise differences:

Therefore M2,Σ^(1o(1))γko(k)=(1o(1))γk\langle M_{2},\hat{\Sigma}\rangle\geq(1-o(1))\gamma k-o(k)=(1-o(1))\gamma k. Therefore the moment MM gives objective value (1o(1))γk(1-o(1))\gamma k for data Σ^\hat{\Sigma}, and therefore SoS4(Σ^)(1o(1))γk10λk\textrm{SoS}_{4}(\hat{\Sigma})\geq(1-o(1))\gamma k\geq 10\lambda k. ∎

Then we prove that Theorem 3.1 together with the arguments in [KNV15] implies Theorem 3.2. The intuition behind is the following: Suppose M2M_{2}^{*} is very close to vvTvv^{T}, then it is close to rank-1 and its leading eigenvector is close to v^\hat{v}. However, since we prove that the objective value is large (which is true also in the planted vector case), M2M_{2}^{*} needs to be highly correlated with Σ^\hat{\Sigma}, which implies its leading eigenvector v^\hat{v} needs to be correlated with Σ^\hat{\Sigma}, which in turns implies that vv is correlated with Σ^\hat{\Sigma}. However, it turns out that vv is not correlated enough with Σ^\hat{\Sigma}, which leads to a contradiction.

We first prove that the optimal value of SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}) for hypothesis HvH_{v} is also at least 0.99kp/n0.99kp/n. Suppose vv has support SS of size kk. We consider the restriction of linear operator MM to the subset T=[p]\ST=[p]\backslash S, denoted by MTM_{T}. That is, we have that MT(xα)=0M_{T}(x^{\alpha})=0 for any monomial xαx^{\alpha} that contains a factor xix_{i} with iSi\in S, and otherwise MT(xα)=M(xα)M_{T}(x^{\alpha})=M(x^{\alpha}). We also consider the data matrix XTX_{T} obtained by restricting to the rows indexed by TT. Note that since XTX_{T} doesn’t contain the signal, and kn)k\gg\sqrt{n}), using Theorem 4.7 with γ=(pk)/(1.01n)\gamma=(p-k)/(1.01n), we have that there exists pseudo-moment MTM_{T}^{*} which gives objective value (1o(1))γk0.99pk/n\geq(1-o(1))\gamma k\geq 0.99pk/n with respective to covariance matrix Σ^T=1nXTXTT\hat{\Sigma}_{T}=\frac{1}{n}X_{T}X_{T}^{T}. Note that by Proposition 5.1, SoS4(Σ^)SoS4(Σ^T)\textrm{SoS}_{4}(\hat{\Sigma})\geq\textrm{SoS}_{4}(\hat{\Sigma}_{T}) and therefore we obtain that under hypothesis HvH_{v}, with high probability, SoS4(Σ^)0.99kp/n\textrm{SoS}_{4}(\hat{\Sigma})\geq 0.99kp/n.

Now suppose MM^{*} is the maximizer of SoS4(Σ^)\textrm{SoS}_{4}(\hat{\Sigma}), and M2=(M(xixj))i,j[p]M_{2}^{*}=\left(M^{*}(x_{i}x_{j})\right)_{i,j\in[p]}. For the sake of contradiction, we assume that 1kM2vvT1/5\|\frac{1}{k}M_{2}^{*}-vv^{T}\|\leq 1/5. We first show that this implies that M2M_{2}^{*} has an eigenvector v^\hat{v} that is close to vv and its eigenvalue is large. Indeed we have 1kM2vvT1/5=4/5\|\frac{1}{k}M_{2}\|\geq\|vv^{T}\|-1/5=4/5. Therefore the top eigenvector of 1kM2\frac{1}{k}M_{2}^{*} has eigenvalue larger than 4/54/5. Then we can decompose the difference between 1kM2\frac{1}{k}\cdot M_{2}^{*} and vvTvv^{T} into 1kM2vvT=45(v^v^TvvT)+((1kM245v^v^T)15vvT)\frac{1}{k}\cdot M_{2}^{*}-vv^{T}=\frac{4}{5}\cdot\left(\hat{v}\hat{v}^{T}-vv^{T}\right)+\left(\left(\frac{1}{k}M_{2}^{*}-\frac{4}{5}\hat{v}\hat{v}^{T}\right)-\frac{1}{5}vv^{T}\right). Note that since (1kM245v^v^T)\left(\frac{1}{k}M_{2}^{*}-\frac{4}{5}\hat{v}\hat{v}^{T}\right) is a PSD matrix with eigenvalue at most 1/51/5, we have ((1kM245v^v^T)15vvT)2/5\|\left(\left(\frac{1}{k}M_{2}^{*}-\frac{4}{5}\hat{v}\hat{v}^{T}\right)-\frac{1}{5}vv^{T}\right)\|\leq 2/5 by triangle inequality. Then by triangle inequality and our assumption again we obtain that

Therefore we obtain that vvTv^v^T3/5\|vv^{T}-\hat{v}\hat{v}^{T}\|\leq 3/5 and therefore vvTv^v^TF22vvTv^v^T21\|vv^{T}-\hat{v}\hat{v}^{T}\|_{F}^{2}\leq 2\|vv^{T}-\hat{v}\hat{v}^{T}\|^{2}\leq 1. It follow that v,v^2=112vvTv^v^TF21/2|\langle v,\hat{v}\rangle|^{2}=1-\frac{1}{2}\|vv^{T}-\hat{v}\hat{v}^{T}\|_{F}^{2}\geq 1/2.

Next we are going to show that it is impossible for M2M_{2}^{*} to have an eigenvector that is close to vv with a large eigenvalue and therefore we will get a contradiction. Let v^=αv+βs\hat{v}=\alpha v+\beta s where ss is orthogonal to vv and α2+β2=1\alpha^{2}+\beta^{2}=1 and α1/2\alpha\geq\sqrt{1/2}, and β1/2\beta\leq\sqrt{1/2}. Then using triangle inequality we have that v^Σ^αvΣ^+βsΣ^O(λ)α+Σ^β\|\hat{v}\|_{\hat{\Sigma}}\leq\|\alpha v\|_{\hat{\Sigma}}+\|\beta s\|_{\hat{\Sigma}}\leq\sqrt{O(\lambda)\alpha}+\sqrt{\|\hat{\Sigma}\|\beta}. Proposition 5.3 of [KNV15] implies that for sufficiently large CC and λ1\lambda\geq 1, when p/nCλp/n\geq C\lambda, Σ^1.01p/n\|\hat{\Sigma}\|\leq 1.01p/n. Therefore under our assumption we have that Σ^1.01p/n\|\hat{\Sigma}\|\leq 1.01p/n. It follows β1/2\beta\leq\sqrt{1/2} that v^Σ^O(λ)α+βp/nO(λ)+p/2n\|\hat{v}\|_{\hat{\Sigma}}\leq\sqrt{O(\lambda)\alpha}+\sqrt{\beta}\cdot\sqrt{p/n}\leq\sqrt{O(\lambda)}+\sqrt{p/2n}. Therefore, we have that

where in the third line we used the fact that v^Σ^O(λ)+p/2n\|\hat{v}\|_{\hat{\Sigma}}\leq\sqrt{O(\lambda)}+\sqrt{p/2n}, and the last line we used Σ^1.01p/n\|\hat{\Sigma}\|\leq 1.01p/n.Note that this is a contradiction since we assumed that p/nCλp/n\geq C\lambda for sufficiently large CC.

Analysis of matrices P𝑃P and Q𝑄Q

In this section we prove Lemma 4.4 and 4.5. They basically follow direct calculation and the pseudorandomness properties of data matrix XX listed in Condition 4.1.

Note that since p=1.1γnp=1.1\gamma n and 1γn1\leq\gamma\leq n, we have that O(n2)pnO(n^{2})\geq p\geq n. We verify equations (4.3), (4.4) and (4.5) and the bounds for F\mathcal{F} and E\mathcal{E} one by one.

For the case when i=ji=j, we verify P(xi4)P(x_{i}^{4}) using property (P1) and (P2),

where in the second equality we use equation (P3), and pnp\geq n.

Note that for distinct i,j,s,ti,j,s,t, by equation (P4), we have

Therefore taking the sum over all distinct i,j,s,ti,j,s,t we have

where we used kγ7/6nk\gg\gamma^{7/6}\sqrt{n}, which implies that k3γp2.5/nk^{3}\gg\gamma p^{2.5}/n.

By equation (4.2) and equation (4.3), we have that

where we used the fact that pn2p\leq n^{2} and kγnk\gg\gamma\sqrt{n}.

Therefore, combining equation (6.1), (6.2), (6.3), we obtain that

Finally it remains to bound E\mathcal{E}. Note that E\mathcal{E} is a sum of submatrices of PP and therefore it is PSD. Moreover,

where the last inequality uses equation (P2). Finally we bound Est\mathcal{E}_{st} using equation (P6)

Again we verify equation (4.6), (4.7) and (4.8) in order.

Equation (4.6): By definition we have that for any i,ji,j,

Equation (4.8): For the sparsity constraint, we note first that for distinct i,j,s,ti,j,s,t, using property (P2), we have

where we used the fact that k2=c2nk^{2}\gg=\gg c^{2}n. We bound other terms as follows:

There are only at most O(p3)O(p^{3}) different choices of i,j,s,ti,j,s,t such that i,j,s,ti,j,s,t are not distinct, therefore we have

where we used the fact that kγnk\gg\gamma\sqrt{n} and γ1\gamma\geq 1.

Combining equation (6.4) and (6.5), we obtain that

Therefore Gst=2γk2p3ni[p]Xi,XsXi,Xt\mathcal{G}_{st}=\frac{2\gamma k^{2}}{p^{3}n}\sum_{i\in[p]}\langle X_{i},X_{s}\rangle\langle X_{i},X_{t}\rangle forms a PSD matrix. Moreover, when sts\neq t, using equation (P5), we have that

Pseudo-randomness of X𝑋X

In this section, we prove that basically as long as the noise model is subgaussian and has variance 1(which generalizes the standard Bernoulli and Gaussian distributions), after normalizing the rows of the data matrix XH0X\sim H_{0}, it satisfies the pseudorandomness condition 4.1.

The proof of the Theorem relies on the following Proposition and Theorem 7.4. The proposition says that XiTXi\frac{X_{i}^{T}}{\|X_{i}\|} still satisfies good properties like symmetry and that each entries has a subgaussian tail, even though its entries are no longer independent due to normalization. These properties will be encapsulated in the definition of a good random variable following the proposition. Then we prove in Theorem 7.4 that these properties suffice for establishing the pseudorandomness Condition 4.1 with high probability. We will heavily use the ψα\psi_{\alpha}-Orlicz norm (denoted ψα\|\cdot\|_{\psi_{\alpha}}) of a random variable, defined in Definition 8.1, and its properties, summarized in the next (toolbox) section. Intuitively, ψ2\|\cdot\|_{\psi_{2}} norm is a succinct and convenient way to capture the “subgaussianity” of a random variable.

xψ2O~(1)\||x|_{\infty}\|_{\psi_{2}}\leq\widetilde{O}(1)

For simplicity, we call a random variable good if it satisfies the five properties listed in the proposition above. Goodness will be invoked in most statements below.

We will show a random matrix XX with good rows satisfies the pseudo-randomness Condition 4.1 with high probability.

Suppose independent nn-dimensional random vectors X1,,XpX_{1},\dots,X_{p} with pnp\geq n are all good, then X1,,XpX_{1},\dots,X_{p} satisfies the pseudorandomness condition 4.1 with high probability.

The general approach to prove the theorem is just to use the concentration of measure. The only caveat here is that in most of cases, the random variables that we are dealing with are not bounded a.s. so we can’t use Chernoff bound or Bernstein inequality directly. However, though these random variables are not bounded a.s., they typically have a light tail, that is, their ψα\psi_{\alpha} norms can be bounded. Then we are going to apply Theorem 8.4 of Ledoux and Talagrand’s, a extended version of Bernstein inequality with only ψα\psi_{\alpha} norm boundedness required. We will also use other known technical results listed in the toolbox Section 8.

Equation (P1) and (P2) follows the assumptions on XiX_{i}’s and union bound. Equation (P3) is proved in Lemma 7.5 by taking u=Xsu=X_{s} and v=Xtv=X_{t} and view the rest of XiX_{i};s as ZjZ_{j}’s in the statement of Lemma 7.5. Equation (P4) is proved in Lemma 7.6, (P5) in Lemma 7.8, (P6) in Lemma 7.10, and equation P7 is proved in Lemma 7.15. ∎

For any good random variable xx, we have that for fixed u,vu,v with u2=v2=n\|u\|^{2}=\|v\|^{2}=n , uO~(1)|u|_{\infty}\leq\widetilde{O}(1), vO~(1)|v|_{\infty}\leq\widetilde{O}(1), and u,vO~(n)\langle u,v\rangle\leq\widetilde{O}(\sqrt{n}),

and moreover, for pnp\geq n and a sequence of good independent random variables Z1,,ZpZ_{1},\dots,Z_{p}, we have that with high probability,

Therefore by our assumption on uu and vv we obtain that

Suppose pnp\geq n and X1,,XpX_{1},\dots,X_{p} are good independent random variables, then with high probability, for any distinct i,j,s,ti,j,s,t,

For any good random variable xx, and for fixed a,b,c,da,b,c,d such that max{a,b,c,d}=O~(1)\max\{|a|_{\infty},|b|_{\infty},|c|_{\infty},|d|_{\infty}\}=\widetilde{O}(1), and all the pair-wise inner products between a,b,c,da,b,c,d have magnitude at most O~(n)\widetilde{O}(\sqrt{n}), we have that

and moreover, for pnp\geq n and a sequence independent random variable Z1,,ZpZ_{1},\dots,Z_{p} such that each ZiZ_{i} satisfies the conclusion of proposition 7.2, we have that with high probability,

where we use {ijaibicjdjxi2xj2}\left\{\sum_{i\neq j}a_{i}b_{i}c_{j}d_{j}x_{i}^{2}x_{j}^{2}\right\} to denote the sum of aibicjdjxi2xj2a_{i}b_{i}c_{j}d_{j}x_{i}^{2}x_{j}^{2} and all its permutations with repect to a,b,c,da,b,c,d.

Since x,a\langle x,a\rangle has ψ2\psi_{2} norm n\sqrt{n} and similar for the other three terms, we have that by Lemma 8.5 that x,ax,bx,cx,dψ1/2O(n2)\|\langle x,a\rangle\langle x,b\rangle\langle x,c\rangle\langle x,d\rangle\|_{\psi_{1/2}}\leq O(n^{2}). Therefore using Theorem 8.4 we have that

Suppose pnp\geq n and X1,,XpX_{1},\dots,X_{p} are good independent random variables, then with high probability, for any distinct s,ts,t,

With high probability over the randomness of Xi,i[p]\{s,t}X_{i},i\in[p]\backslash\{s,t\},

where the last inequality is by Lemma 7.9. Taking union bound we complete the proof. ∎

For pnp\geq n and a sequence of good independent random variable Z1,,ZpZ_{1},\dots,Z_{p}, and any two fixed vectors u,vu,v with uO~(1)|u|_{\infty}\leq\widetilde{O}(1) and vO~(1)|v|_{\infty}\leq\widetilde{O}(1), and u,vO~(n)\langle u,v\rangle\leq\widetilde{O}(\sqrt{n}), we have that with high probability,

Suppose pnp\geq n and X1,,XpX_{1},\dots,X_{p} are good independent random variables, then with high probability, for any distinct s,ts,t,

Since Xs,XtO~(n)\langle X_{s},X_{t}\rangle\leq\widetilde{O}(\sqrt{n}) and i[p]Xi,Xs2=n2+isXi,Xs2O~(np)\sum_{i\in[p]}\langle X_{i},X_{s}\rangle^{2}=n^{2}+\sum_{i\neq s}\langle X_{i},X_{s}\rangle^{2}\leq\widetilde{O}(np), we have that

Hence combining the three equations above, taking union bound over all choices of s,ts,t, we obtain the desired result. ∎

For pnp\geq n and a sequence of good independent random variables Z1,,ZpZ_{1},\dots,Z_{p}, and any two fixed vectors u,vu,v with uO~(1)|u|_{\infty}\leq\widetilde{O}(1) and vO~(1)|v|_{\infty}\leq\widetilde{O}(1), and u,vO~(n)\langle u,v\rangle\leq\widetilde{O}(\sqrt{n}), we have that with high probability,

We first extract the consider those cases with i=ji=j separately by expanding

where the the last line uses Lemma 7.9. Let Y1,,YpY_{1},\dots,Y_{p} are independent random variables that have the same distribution as Z1,,ZpZ_{1},\dots,Z_{p}, respectively, then by Theorem 8.7, we can decouple the sum of functions of Zi,jZjZ_{i},jZ_{j} into a sum that of functions of ZiZ_{i} and YjY_{j},

Now we can invoke Lemma 7.12 which deals with RHS of the equation above, and obtain that with high probability

Then combine with equation (7.1) we obtain the desired result. ∎

For pnp\geq n and a sequence of good independent random variables Z1,,ZpZ_{1},\dots,Z_{p}, let Y1,,YpY_{1},\dots,Y_{p} be independent random variables which have the same distribution as Z1,,ZpZ_{1},\dots,Z_{p}, respectively, then for any two fixed vectors u,vu,v with uO~(1)|u|_{\infty}\leq\widetilde{O}(1) and vO~(1)|v|_{\infty}\leq\widetilde{O}(1), and u,vO~(n)\langle u,v\rangle\leq\widetilde{O}(\sqrt{n}), with high probability,

Let B=i[p]YiYiTB=\sum_{i\in[p]}Y_{i}Y_{i}^{T}. Therefore by Lemma 7.13, we have that with high probability over the randomness of YY, B2O~(p)\|B\|_{2}\leq\widetilde{O}(p), tr(B)=pn\textrm{tr}(B)=pn. Moreover, by Lemma 7.9, we have that with high probability, uTBvO~(pn)|u^{T}Bv|\leq\widetilde{O}(p\sqrt{n}). Note that these bounds only depend on the randomness of YY, and conditioning on all these bounds are true, we can still use the randomness of ZiZ_{i}’s for concentration. We invoke Lemma 7.14 and obtain that

For pnp\geq n and a sequence of good independent random variables Z1,,ZpZ_{1},\dots,Z_{p}, we have that with high probability,

Finally we observe that ZiZiTn\|Z_{i}Z_{i}^{T}\|\leq n. Thus applying matrix Bernstein inequality we obtain that with high probability,

Let W=xTBxx,ux,vW=x^{T}Bx\langle x,u\rangle\langle x,v\rangle, where xx is a random variable that satisfies the conclusion of Proposition 7.2. We first calculate the expectation of WW,

Observe that ZjTBZjO~(pn)Z_{j}^{T}BZ_{j}\leq\widetilde{O}(pn) a.s. (with respect to the randomness of ZjZ_{j}), and Zi,uZi,vψ1O(n)\|\langle Z_{i},u\rangle\langle Z_{i},v\rangle\|_{\psi_{1}}\leq O(n), therefore we have that ZjTBZjZi,uZi,vψ1O(pn2)\|Z_{j}^{T}BZ_{j}\langle Z_{i},u\rangle\langle Z_{i},v\rangle\|_{\psi_{1}}\leq O(pn^{2}). Using Theorem 8.4, we obtain that with high probability,

Suppose pnp\geq n and X1,,XpX_{1},\dots,X_{p} are good independent random variables, then with high probability,

Toolbox

This section contains a collection of known technical results which are useful in proving the concentration bounds of Section 7. We note that when the data matrix XX takes uniformly {±1}\{\pm 1\} entries, then XX satisfies Proposition 7.2 without any normalization and actually due to the independence of the entries, it’s much easier to prove that it satisfies Condition 4.1.

For 1α<1\leq\alpha<\infty, let ψα(x)=exp(xα)1\psi_{\alpha}(x)=\exp(x^{\alpha})-1. For 0<α<10<\alpha<1, let ψα(x)=xα1\psi_{\alpha}(x)=x^{\alpha}-1 for large enough xxαx\geq x_{\alpha}, and ψα\psi_{\alpha} is linear in [0,xα][0,x_{\alpha}]. The Orlicz norm ψα\psi_{\alpha} of a random variable XX is defined as

Note that by definition ψα\psi_{\alpha} is convex and increasing. The following Theorem of Ledoux and Talagrand’s is our main tool for proving concentration inequalities in Section 7.

There exists a constant KαK_{\alpha} depending on α\alpha such that for a sequence of independent mean zero random variables X1,,XnX_{1},\dots,X_{n} in LψαL_{\psi_{\alpha}}, if 0<α10<\alpha\leq 1,

The following convenient Lemma allows us to control the second part of RHS of (8.2) easily.

There exists absolute constant cc, such that for any real valued random variables X1,,XnX_{1},\dots,X_{n}, we have that

Using Lemma 8.3 and Theorem 8.2, we obtain straightforwardly the following theorem that will be used many times for proving concentration bounds in this paper.

For any 0<α10<\alpha\leq 1, there exists a constant KαK_{\alpha} such that for a sequence of independent random variables X1,,XnX_{1},\dots,X_{n},

which implies that with high probability over the randomness of XiX_{i}’s,

The following two lemmas are used to bound the Orlicz norms of random variables.

There exists constant DαD_{\alpha} depending on α\alpha such that, if two (possibly correlated) random variables XX, YY have ψα\psi_{\alpha} Orlicz norm bounded by Xψαa\|X\|_{\psi_{\alpha}}\leq a and Yψαb\|Y\|_{\psi_{\alpha}}\leq b then XYψα/2Dαab\|XY\|_{\psi_{\alpha/2}}\leq D_{\alpha}ab

Moreover, note that by definition of ψα\psi_{\alpha}, there exists constant CαC_{\alpha} and CαC^{\prime}_{\alpha} such that for x0x\geq 0, Cα(exp(xα)1)ψα(x)Cα(exp(xα)1)C^{\prime}_{\alpha}(\exp(x^{\alpha})-1)\geq\psi_{\alpha}(x)\geq C_{\alpha}(\exp(x^{\alpha})-1). Therefore we have that there exists a constant EαE_{\alpha} such that ψα/2(xy)Eα2(ψα(x)+ψα(y))\psi_{\alpha/2}(|xy|)\leq\frac{E_{\alpha}}{2}(\psi_{\alpha}(|x|)+\psi_{\alpha}(|y|)). Also note that for any constant cc, there exsits constant cc^{\prime} such that ψα(x/c)ψα(x)/c\psi_{\alpha}(x/c^{\prime})\leq\psi_{\alpha}(x)/c. Therefore, choosing DαD_{\alpha} such that ψα/2(x/Dα)ψα(x)/Eα\psi_{\alpha/2}(x/D_{\alpha})\leq\psi_{\alpha}(x)/E_{\alpha} for all x0x\geq 0 we obtain that

The following Theorem of [PMS95] is useful to decouple the randomness of a sum of correlated random variables into a form that is easier to control.

Let X1,,XnX_{1},\dots,X_{n}, Y1,,YnY_{1},\dots,Y_{n} are independent random variables on a measurable space over SS, where XiX_{i} and YiY_{i} has the same distribution for i=1,,ni=1,\dots,n. Let fij(,)f_{ij}(\cdot,\cdot) be a family of functions taking S×SS\times S to a Banach space (B,)(B,\|\cdot\|). Then there exists absolute constant CC, such that for all n2n\geq 2, t>0t>0,

The following lemma provides a simple way to prove the PSDness of a matrix that has large value on the diagonal and small off-diagonal values.

Suppose a matrix Γ\Gamma is of the form Γ=[ABCD]\Gamma=\left[\begin{matrix}A&B\\ C&D\end{matrix}\right] where A,DA,D are square diagonal matrices, and CC is of dimension n×mn\times m. Then Γ\Gamma is PSD if there exists α>0\alpha>0 such that the following holds: Aii1αj[n]Cij,[p]A_{ii}\geq\frac{1}{\alpha}\sum_{j\in[n]}|C_{ij}|,\forall\in[p] and Djjαi[m]Cij,j[p]D_{jj}\geq\alpha\sum_{i\in[m]}|C_{ij}|,\forall j\in[p].

Let vector u=(α1m,α11n)u=(\alpha\mathbf{1}_{m},\alpha^{-1}\mathbf{1}_{n}) and v=(α11m,α1n)v=(\alpha^{-1}\mathbf{1}_{m},\alpha\mathbf{1}_{n}), where 1n\mathbf{1}_{n} is nn-dimensional all 1’s vector. Then Γ\Gamma can be written as Γ=vvT(uuTΓ)\Gamma=vv^{T}\odot(uu^{T}\odot\Gamma), where \odot denotes the entries-wise product of two matrices (That is, ABA\odot B is a matrix with entry AijBijA_{ij}B_{ij}). Using the Gershgorin Circle Theorem and the conditions of the Lemma we obtain that uuTΓuu^{T}\odot\Gamma is PSD and therefore Γ\Gamma is PSD. ∎

Conclusions and future directions

In this paper we prove a lower bounds on the number of samples required to solve the Sparse PCA problem by degree-4 SoS algorithms. This extends the (spectral) degree-2 SoS lower bound for the problem, establishing the quadratic gap from the number of samples required by the (inefficient) information theoretic bound. It remains an interesting problem to extend our lower bounds to higher degree SoS algorithms (or even better, show that with some constant degree, one can solve the problem with fewer samples). One specific difficulty we encountered in trying to extend the lower bound to higher degree was the polynomial constraint xi3=xix_{i}^{3}=x_{i}, capturing the discreteness of the hidden sparse vector. The SoS formulation of the problem without this condition is interesting as well, and lower bound for it may be easier.

As mentioned, it is possible that the best way to prove strong SoS lower bounds for Sparse PCA is via the reduction of Berthet and Rigollet’s [BR13a], namely by improving existing lower bounds for the Planted Clique problem. However, we note that this approach is limited as well, as it seems that sparse PCA is significantly harder. Specifically, Planted Clique has a simple O(logn)O(\log n)-degree SoS algorithm (and thus a quasi-polynomial time) optimal solution, whereas for Sparse PCA we know of no better sample-optimal algorithm than one running in exponential pO(k)p^{O(k)} time. It is thus conceivable that one can even prove Ω(k)\Omega(k)-degree SoS lower bounds for this problem.

More generally, we believe that statistical and machine learning problems provide a new and challenging setting for testing the power and limits and SoS algorithms. While we have fairly strong techniques for proving optimal SoS lower bounds for combinatorial optimization problems, we lack similar ones for ML problems. In particular, many other problems besides Sparse PCA seem to exhibit the apparent trade-off between the number of samples required information theoretically versus via computationally efficient techniques, offering fertile ground for attempting SoS lower bounds establishing such trade-offs.

Finally it would be nice to see more reductions between problems of statistical and ML nature, as the one by [BR13a]. Efficient reductions have proved extremely powerful in computational complexity theory and optimization, enabling the framework of complexity classes and complete problems. Creating such a framework within machine learning will hopefully expose structure on the relative difficulty of problems in this vast area, highlighting some problems as more central to attack, and enabling both new algorithms and new lower bounds.

We would like to thank Sanjeev Arora, Boaz Barak, Philippe Rigollet and David Steurer for helpful discussions throughout various stages of this work.

References