Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors

Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, David Steurer

Introduction

The sum-of-squares (SoS) method (also known as the Lasserre hierarchy) [Sho87, Par00, Nes00, Las01] is a powerful, semidefinite-programming based meta-algorithm that applies to a wide-range of optimization problems. The method has been studied extensively for moderate-size polynomial optimization problems that arise for example in control theory and in the context of approximation algorithms for combinatorial optimization problems, especially constraint satisfaction and graph partitioning (see e.g. the survey [BS14]). For the latter, the SoS method captures and generalizes the best known approximation algorithms based on linear programming (LP), semidefinite programming (SDP), or spectral methods, and it is in many cases the most promising approach to obtain algorithms with better guarantees—especially in the context of Khot’s Unique Games Conjecture [BBH+12].

A sequence of recent works applies the sum-of-squares method to basic problems that arise in unsupervised machine learning: in particular, recovering sparse vectors in linear subspaces and decomposing tensors in a robust way [BKS14, BKS15, HSS15, BM15, GM15]. For a wide range of parameters of these problems, SoS achieves significantly stronger guarantees than other methods, in polynomial or quasi-polynomial time.

In this work, we introduce spectral algorithms for planted sparse vector, tensor decomposition, and tensor principal components analysis (PCA) that exploit the same high-degree information as the corresponding sum-of-squares algorithms without relying on semidefinite programming, and achieve the same (or close to the same) guarantees. The resulting algorithms are quite simple (a couple of lines of matlab code) and have considerably faster running times—quasi-linear or close to linear in the input size.

A surprising implication of our work is that for some problems, spectral algorithms can exploit information from larger values of the parameter dd without spending time nO(d)n^{O(d)}. For example, one of our algorithms runs in nearly-linear time in the input size, even though it uses properties that the sum-of-squares method can only use for degree parameter d4d\geqslant 4. (In particular, the guarantees that the algorithm achieves are strictly stronger than the guarantees that SoS achieves for values of d<4d<4.)

The initial successes of SoS in the machine learning setting gave hope that techniques developed in the theory of approximation algorithms, specifically the techniques of hierarchies of convex relaxations and rounding convex relaxations, could broadly impact the practice of machine learning. This hope was dampened by the fact that in general, algorithms that rely on solving large semidefinite programs are too slow to be practical for the large-scale problems that arise in machine learning. Our work brings this hope back into focus by demonstrating for the first time that with some care SoS algorithms can be made practical for large-scale problems.

In the following subsections we describe each of the problems that we consider, the prior best-known guarantee via the SoS hierarchy, and our results.

Several kinds of algorithms have been proposed for this problem based on linear programming (LP), basic semidefinite programming (SDP), sum-of-squares, and non-convex gradient descent (alternating directions method).

An inherent limitation of simpler convex methods (LP and basic SDP) [SWW12, dGJL04] is that they require the relative sparsity of the planted vector to be polynomial in the subspace dimension (less than n/dn/\sqrt{d} non-zero coordinates).

Sum-of-squares and non-convex methods do not share this limitation. They can recover planted vectors with constant relative sparsity even if the subspace has polynomial dimension (up to dimension O(n1/2)O(n^{1/2}) for sum-of-squares [BKS14] and up to O(n1/4)O(n^{1/4}) for non-convex methods [QSW14]).

Our algorithm runs in nearly linear time in the input size, and matches the best-known guarantees up to a polylogarithmic factor in the subspace dimension [BKS14].

We give a technical overview of the proof in Section 2, and a full proof in Section 4.

Previous work also showed how to recover the planted sparse vector exactly. The task of going from an approximate solution to an exact one is a special case of standard compressed sensing (see e.g. [BKS14]).

2 Overcomplete tensor decomposition

Tensors naturally represent multilinear relationships in data. Algorithms for tensor decompositions have long been studied as a tool for data analysis across a wide-range of disciplines (see the early work of Harshman [Har70] and the survey [KB09]). While the problem is NP-hard in the worst-case [Hås90, HL13], algorithms for special cases of tensor decomposition have recently led to new provable algorithmic results for several unsupervised learning problems [AGH+14, BCMV14, GVX14, AGHK14] including independent component analysis, learning mixtures of Gaussians [GHK15], Latent Dirichlet topic modeling [AFH+15] and dictionary learning [BKS15]. Some previous learning algorithms can also be reinterpreted in terms of tensor decomposition [Cha96, MR06, NR09].

A key algorithmic challenge for tensor decompositions is overcompleteness, when the number of components is larger than their dimension (i.e., the components are linearly dependent). Most algorithms that work in this regime require tensors of order 44 or higher [LCC07, BCMV14]. For example, the FOOBI algorithm of [LCC07] can recover up to Ω(d2)\Omega(d^{2}) components given an order-44 tensor in dimension dd under mild algebraic independence assumptions for the components—satisfied with high probability by random components. For overcomplete 33-tensors, which arise in many applications of tensor decompositions, such a result remains elusive.

where a1,,ana_{1},\ldots,a_{n} are random unit or Gaussian vectors, the goal is to approximately recover the components a1,,ana_{1},\ldots,a_{n}.

Given T\mathbf{T} sampled as above, find a unit vector bb that has correlation maxiai,b1η\max_{i}\langle a_{i},b\rangle\geqslant 1-\eta with one of the vectors aia_{i}.

Given T\mathbf{T} sampled as above, find a set of unit vectors {b1,,bn}\{b_{1},\ldots,b_{n}\} such that ai,bi1η\langle a_{i},b_{i}\rangle\geqslant 1-\eta for every i[n]i\in[n].

We give a technical overview of the proof in Section 2, and a full proof in Section 5.

We remark that the above algorithm only requires access to the input tensor with some fixed inverse polynomial accuracy because each of its four steps amplifies errors by at most a polynomial factor (see Algorithm 5.17). In this sense, the algorithm is robust.

3 Tensor principal component analysis

For this problem, our improvements over the previous results are more modest—we achieve signal-to-noise guarantees matching [HSS15], but with an algorithm that runs in linear time rather than near-linear time (time O(d3)O(d^{3}) rather than O(d3polylogd)O(d^{3}\operatorname{polylog}d), for an input of size d3d^{3}).

There is an algorithm which solves the tensor principal component analysis problem with accuracy η>0\eta>0 whenever the signal-to-noise ratio satisfies τO(n3/4/ηlog1/2n)\tau\geqslant O(n^{3/4}/\eta\cdot\log^{1/2}n). Furthermore, the algorithm runs in time O(d3)O(d^{3}).

Though for tensor PCA our improvement over previous work is modest, we include the results here as this problem is a pedagogically poignant illustration of our techniques. We give a technical overview of the proof in Section 2, and a full proof in Section 6.

4 Related work

Foremost, this work builds upon the SoS algorithms of [BKS14, BKS15, GM15, HSS15]. In each of these previous works, a machine learning decision problem is solved using an SDP relaxation for SoS. In these works, the SDP value is large in the yes case and small in the no case, and the SDP value can be bounded using the spectrum of a specific matrix. This was implicit in [BKS14, BKS15], and in [HSS15] it was used to obtain a fast algorithm as well. In our work, we design spectral algorithms which use smaller matrices, inspired by the SoS certificates in previous works, to solve these machine-learning problems much faster, with almost matching guarantees.

A key idea in our work is that given a large matrix with information encoded in the matrix’s spectral gap, one can often efficiently “compress” the matrix to a much smaller one without losing that information. This is particularly true for problems with planted solutions. In this way, we are able to improve running time by replacing an nO(d)n^{O(d)}-sized SDP with an eigenvector computation for an nk×nkn^{k}\times n^{k} matrix, for some k<dk<d.

The idea of speeding up LP and SDP hierarchies for specific problems has been investigated in a series of previous works [dlVK07, BRS11, GS12], which shows that with respect to local analyses of the sum-of-squares algorithm it is sometimes possible to improve the running time from nO(d)n^{O(d)} to 2O(d)nO(1)2^{O(d)}\cdot n^{O(1)}. However, the scopes and strategies of these works are completely different from ours. First, the notion of local analysis from these works does not apply to the problems considered here. Second, these works employ the ellipsoid method with a separation oracle inspired by rounding algorithms, whereas we reduce the problem to an ordinary eigenvector computation.

It would also be interesting to see if our methods can be used to speed up some of the other recent successful applications of SoS to machine-learning type problems, such as [BM15], or the application of [BKS14] to tensor decomposition with components that are well-separated (rather than random). Finally, we would be remiss not to mention that SoS lower bounds exist for several of these problems, specifically for tensor principal components analysis, tensor prediction, and sparse PCA [HSS15, BM15, MW15]. The lower bounds in the SoS framework are a good indication that we cannot expect spectral algorithms achieving better guarantees.

Techniques

Our approach for faster algorithms based on SoS algorithms is to construct specific matrices (polynomials) in this affine linear space, then compute their top eigenvectors. By designing our matrices carefully, we ensure that our algorithms have access to the same higher degree information that the sum-of-squares algorithm can access, and this information affords an advantage over the basic spectral methods for these problems. At the same time, our algorithms avoid searching for the best polynomial and matrix representation, which gives us faster running times since we avoid semidefinite programming. This approach is well suited to average-case problems where we avoid the problem of adversarial choice of input; in particular it is applicable to machine learning problems where noise and inputs are assumed to be random.

A serious limitation of the above approach is that the representation of a degree-dd polynomial requires size roughly ndn^{d}. Hence, even avoiding the use of semidefinite programming, improving upon running time O(nd)O(n^{d}) requires additional ideas.

In each of the problems that we consider, we have a large matrix (suggested by a SoS algorithm) with a “signal” planted in some amount of “noise”. We show that in some situations, this large matrix can be compressed significantly without loss in the signal by applying partial trace operations. In these situations, the partial trace yields a smaller matrix with the same signal-to-noise ratio as the large matrix suggested by the SoS algorithm, even in situations when lower degree sum-of-squares approaches are known to fail.

If AA is a Wigner matrix (e.g. a symmetric matrix with iid ±1\pm 1 entries), then both Tr(A),An\operatorname{Tr}(A),\|A\|\approx\sqrt{n}, and the above condition is indeed met. In our average case/machine learning settings the “noise” component is not as simple as ABA\otimes B with AA a Wigner matrix. Nonetheless, we are able to ensure that the noise displays a similar behavior under partial trace operations. In some cases, this requires additional algorithmic steps, such as random projection in the case of tensor decomposition, or centering the matrix eigenvalue distribution in the case of the planted sparse vector.

It is an interesting question if there are general theorems describing the behavior of spectral norms under partial trace operations. In the current work, we compute the partial traces explicitly and estimate their norms directly. Indeed, our analyses boil down to concentrations bounds for special matrix polynomials. A general theory for the concentration of matrix polynomials is a notorious open problem (see [MW13]).

Partial trace operations have previously been applied for rounding SoS relaxations. Specifically, the operation of reweighing and conditioning, used in rounding algorithms for sum-of-squares such as [BRS11, RT12, BKS14, BKS15, LR15], corresponds to applying a partial trace operation to the moments matrix returned by the sum-of-squares relaxation.

We now give a technical overview of our algorithmic approach for each problem, and some broad strokes of the analysis for each case. Our most substantial improvements in runtime are for the planted sparse vector and overcomplete tensor decomposition problems (Section 2.1 and Section 2.2 respectively). Our algorithm for tensor PCA is the simplest application of our techniques, and it may be instructive to skip ahead and read about tensor PCA first (Section 2.3).

1 Planted sparse vector in random linear subspace

It follows that in order to distinguish between a random subspace with a planted sparse vector (yes case) and a completely random subspace (no case), it is enough to compute the second-largest eigenvalue of a d2d^{2}-by-d2d^{2} matrix (representing the 44-norm polynomial over the subspace as above). This decision version of the problem, while strictly speaking easier than the search version above, is at the heart of the matter: one can show that the large eigenvalue for the yes case corresponds to an eigenvector which encodes the coefficients of the sparse planted vector in the basis.

The partial trace of our matrix M=i=1n(aiai)(aiai)M=\sum_{i=1}^{n}(a_{i}a_{i}^{\top})\otimes(a_{i}a_{i}^{\top}) is easy to compute directly,

In the yes case (random subspace with planted sparse vector), a direct computation shows that

Hence, a natural approach to distinguish between the yes case and no case (completely random subspace) is to upper bound the spectral norm of NN in the no case.

For linear sparsity k=εnk=\varepsilon\cdot n, this inequality is true so long as d(n/ε2)1/3d\ll(n/\varepsilon^{2})^{1/3}, which is somewhat worse than the bound n\sqrt{n} bound on the dimension that we are aiming for.

Recall that TrB=iλi(B)\operatorname{Tr}B=\sum_{i}\lambda_{i}(B) for a symmetric matrix BB. As discussed above, the partial trace approach works best when the noise behaves as the tensor of two Wigner matrices, in that there are cancellations when the eigenvalues of the noise are summed. In our case, the noise terms (aiai)(aiai)(a_{i}a_{i}^{\top})\otimes(a_{i}a_{i}^{\top}) do not have this property, as in fact Traiai=ai2d/n\operatorname{Tr}a_{i}a_{i}^{\top}=\|a_{i}\|^{2}\approx d/n. Thus, in order to improve the dimension bound, we will center the eigenvalue distribution of the noise part of the matrix. This will cause it to behave more like a Wigner matrix, in that the spectral norm of the noise will not increase after a partial trace. Consider the partial trace of a matrix of the form

for some constant α>0\alpha>0. The partial trace of this matrix is

We choose the constant αd/n\alpha\approx d/n such that our matrix NN^{\prime} has expectation in the no case, when the subspace is completely random. In the yes case, the Rayleigh quotient of NN^{\prime} at x0x_{0} simply shifts as compared to NN, and we have λ\textscyesx0,Nx0v0441/k\lambda_{\textsc{yes}}\geqslant\langle x_{0},N^{\prime}x_{0}\rangle\approx\lVert v_{0}\rVert_{4}^{4}\geqslant 1/k (see Lemma 4.5 and sublemmas). On the other hand, in the no case, this centering operation causes significant cancellations in the eigenvalues of the partial trace matrix (instead of just shifting the eigenvalues). In the no case, NN^{\prime} has spectral norm λ\textscnoO(d/n3/2)\lambda_{\textsc{no}}\leqslant O(d/n^{3/2}) for dnd\ll\sqrt{n} (using standard matrix concentration bounds; again see Lemma 4.5 and sublemmas). Therefore, the spectral norm of the matrix NN^{\prime} allows us to distinguish between the yes and no case as long as d/n3/21/kd/n^{3/2}\ll 1/k, which is satisfied as long as knk\ll n and dnd\ll\sqrt{n}. We give the full formal argument in Section 4.

2 Overcomplete tensor decomposition

The starting point of our algorithm is the polynomial f(x)=i=1nai,x3f(x)=\sum_{i=1}^{n}\langle a_{i},x\rangle^{3}. It turns out that for nd1.5n\ll d^{1.5} the (approximate) maximizers of this polynomial are close to the components a1,,ana_{1},\ldots,a_{n}, in the sense that f(x)1f(x)\approx 1 if and only if maxi[n]ai,x21\max_{i\in[n]}\langle a_{i},x\rangle^{2}\approx 1. Indeed, Ge and Ma [GM15] show that the sum-of-squares method already captures this fact at degree 12, which implies a quasipolynomial time algorithm for this tensor decomposition problem via a general rounding result of Barak, Kelner, and Steurer [BKS15].

The simplest approach to this problem is to consider the tensor representation of the polynomial T=i[n]ai3\mathbf{T}=\sum_{i\in[n]}a_{i}^{\otimes 3}, and flatten it, hoping the singular vectors of the flattening are correlated with the aia_{i}. However, this approach is doomed to failure for two reasons: firstly, the simple flattenings of T\mathbf{T} are d2×dd^{2}\times d matrices, and since ndn\gg d the ai2a_{i}^{\otimes 2} collide in the column space, so that it is impossible to determine Span{ai2}\operatorname{Span}\{a_{i}^{\otimes 2}\}. Secondly, even for ndn\leqslant d, because the aia_{i} are random vectors, their norms concentrate very closely about 11. This makes it difficult to distinguish any one particular aia_{i} even when the span is computable.

Of course, we do not have access to the higher-order tensor i[n]ai6\sum_{i\in[n]}a_{i}^{\otimes 6}. Instead, we can obtain a noisy version of this tensor. Our approach considers the following matrix representation of the polynomial f2f^{2},

To explain how the above algorithm is a speedup of SoS, we give an overview of the SoS algorithm of [GM15, BKS15]. There, the degree-tt SoS SDP program is used to obtain an order-tt tensor χt\chi_{t} (or a pseudodistribution). Informally speaking, we can understand χt\chi_{t} as a proxy for i[n]ait\sum_{i\in[n]}a_{i}^{\otimes t}, so that χt=i[n]ait+N\chi_{t}=\sum_{i\in[n]}a_{i}^{\otimes t}+N, where NN is a noise tensor. While the precise form of NN is unclear, we know that NN must obey a set of constraints imposed by the SoS hierarchy at degree tt. For a formal discussion of pseudodistributions, see [BKS15].

Our subquadratic-time algorithm can be viewed as a low-degree, spectral analogue of the [BKS15] SoS algorithm. However, rather than relying on an SDP to produce an object close to i[n]ait\sum_{i\in[n]}a_{i}^{\otimes t}, we manufacture one ourselves by taking the Kronecker square of our input tensor. We explicitly know the form of the deviation of T2\mathbf{T}^{\otimes 2} from i[n]ai6\sum_{i\in[n]}a_{i}^{\otimes 6}, unlike in [BKS15], where the deviation of the SDP certificate χt\chi_{t} from i[n]ait\sum_{i\in[n]}a_{i}^{\otimes t} is poorly understood. We are thus able to control this deviation (or “noise”) in a less computationally intensive way, by cleverly designing a partial trace operation which decreases the spectral norm of the deviation. Since the tensor handled by the algorithm is much smaller—order 66 rather than order logn\log n—this provides the desired speedup.

3 Tensor principal component analysis

A previous application of SoS techniques to this problem discussed several SoS or spectral algorithms, including one that runs in quasi-linear time [HSS15]. Here we apply the partial trace method to a subquadratic spectral SoS algorithm discussed in [HSS15] to achieve nearly the same signal-to-noise guarantee in only linear time.

Specifically, let AiA_{i} be the iith slice of A\mathbf{A}, so that x,Aix\langle x,A_{i}x\rangle is the quadratic form j,kAijkxjxk\sum_{j,k}\mathbf{A}_{ijk}x_{j}x_{k}. Then there is a SoS proof that T(x)\mathbf{T}(x) is bounded by T(x)τv,x3f(x)1/2x|\mathbf{T}(x)-\tau\cdot\langle v,x\rangle^{3}|\leqslant f(x)^{1/2}\cdot\|x\|, where f(x)f(x) is the degree-4 polynomial f(x)=ix,Aix2f(x)=\sum_{i}\langle x,A_{i}x\rangle^{2}. The polynomial ff has a convenient matrix representation: f(x)=x2,(iAiAi)x2f(x)=\langle x^{\otimes 2},(\sum_{i}A_{i}\otimes A_{i})x^{\otimes 2}\rangle: since this matrix is a sum of iid random matrices AiAiA_{i}\otimes A_{i}, it is easy to show that this matrix spectrally concentrates to its expectation. So with high probability one can show that the eigenvalues of iAiAi\sum_{i}A_{i}\otimes A_{i} are at most d3/2log(d)1/2\approx d^{3/2}\log(d)^{1/2} (except for a single spurious eigenvector), and it follows that degree-44 SoS solves tensor PCA so long as τd3/4log(d)1/4\tau\gg d^{3/4}\log(d)^{1/4}.

This leads the authors to consider a slight modification of f(x)f(x), given by g(x)=ix,Tix2g(x)=\sum_{i}\langle x,T_{i}x\rangle^{2}, where TiT_{i} is the iith slice of T\mathbf{T}. Like T\mathbf{T}, the function gg also contains information about vv, and the SoS bound on the noise term in T\mathbf{T} carries over as an analogous bound on the noise in gg. In particular, expanding TiTiT_{i}\otimes T_{i} and ignoring some negligible cross-terms yields

In this work we speed this up to a linear time algorithm via the partial trace approach. As we have seen, the heart of the matter is to show that taking the partial trace of τ2(vv)(vv)+iAiAi\tau^{2}\cdot(v\otimes v)(v\otimes v)^{\top}+\sum_{i}A_{i}\otimes A_{i} does not increase the spectral noise. That is, we require that

Notice that the AiA_{i} are essentially Wigner matrices, and so it is roughly true that Tr(Ai)Ai|\operatorname{Tr}(A_{i})|\approx\|A_{i}\|, and the situation is very similar to our toy example of the application of partial traces in Section 2.

Heuristically, because i[n]AiAi\sum_{i\in[n]}A_{i}\otimes A_{i} and i[n]Tr(Ai)Ai\sum_{i\in[n]}\operatorname{Tr}(A_{i})\cdot A_{i} are random matrices, we expect that their eigenvalues are all of roughly the same magnitude. This means that their spectral norm should be close to their Frobenius norm divided by the square root of the dimension, since for a matrix MM with eigenvalues λ1,,λn\lambda_{1},\ldots,\lambda_{n}, MF=i[n]λi2\|M\|_{F}=\sqrt{\sum_{i\in[n]}\lambda_{i}^{2}}. By estimating the sum of the squared entries, we expect that the Frobenious norm of iTr(Ai)Ai\sum_{i}\operatorname{Tr}(A_{i})\cdot A_{i} is less than that of iAiAi\sum_{i}A_{i}\otimes A_{i} by a factor of d\sqrt{d} after the partial trace, while the dimension decreases by a factor of dd, and so assuming that the eigenvalues are all of the same order, a typical eigenvalue should remain unchanged. We formalize these heuristic calculations using standard matrix concentration arguments in Section 6.

Preliminaries

For a vectors space VV, we may denote by L(V)\mathcal{L}(V) the space of linear operators from VV to VV. The space orthogonal to a vector vv is denoted vv^{\perp}.

For a matrix MM, we use M1M^{-1} to denote its inverse or its Moore-Penrose pseudoinverse; which one it is will be clear from context. For MM PSD, we write M1/2M^{-1/2} for the unique PSD matrix with (M1/2)2=M1(M^{-1/2})^{2}=M^{-1}.

Planted sparse vector in random linear subspace

In this section we give a nearly-linear-time algorithm to recover a sparse vector planted in a random subspace.

The leverage scores a12,,an2\|a_{1}\|^{2},\ldots,\|a_{n}\|^{2} are clearly computable in time O(nd)O(nd). In the course of proving correctness of the algorithm we will show that AA has constant spectral gap, so by a standard analysis O(logd)O(\log d) matrix-vector multiplies suffice to recover its top eigenvector. A single matrix-vector multiply AxAx requires computing ci:=(ai2dn)ai,xc_{i}\mathrel{\mathop{:}}=(\|a_{i}\|^{2}-\tfrac{d}{n})\langle a_{i},x\rangle for each ii (in time O(nd)O(nd)) and summing i[n]cixi\sum_{i\in[n]}c_{i}x_{i} (in time O(nd)O(nd)). Finally, computing SuSu requires summing dd vectors of dimension nn, again taking time O(nd)O(nd).

The following theorem expresses correctness of the algorithm.

When dn1/2/polylog(n)d\leqslant n^{1/2}/\operatorname{polylog}(n), for any sparsity ε>0\varepsilon>0, w.ov.p. the top eigenvector uu of i=1n(ai2dn)aiai\sum_{i=1}^{n}(\|a_{i}\|^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top} has Su,v021O(ε1/4)o(1)\langle Su,v_{0}\rangle^{2}\geqslant 1-O(\varepsilon^{1/4})-o(1).

where e1e_{1} is the first standard basis vector and MO(v043n1/4+v042n1/2+v04n3/4+n1)\|M\|\leqslant O(\|v_{0}\|_{4}^{3}\cdot n^{-1/4}+\|v_{0}\|_{4}^{2}\cdot n^{-1/2}+\|v_{0}\|_{4}\cdot n^{-3/4}+n^{-1}).

The second ingredient we need is that the algorithm is robust to exchanging this good basis for an arbitrary orthogonal basis.

Last, we will need the following fact, which follows from standard concentration. The proof is in Section B.

By Lemma 4.5 and Lemma 4.6, we can write the matrix A=i=1n(ai22dn)aiaiA=\sum_{i=1}^{n}(\|a_{i}\|_{2}^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top} as

We have assumed that v044(εn)1\|v_{0}\|_{4}^{4}\geqslant(\varepsilon n)^{-1}, and so since AA is an almost-rank-one matrix (Lemma A.3), the top eigenvector uu of AA has u,Qe121O(ε1/4)\langle u,Qe_{1}\rangle^{2}\geqslant 1-O(\varepsilon^{1/4}), so that Su,SQe121O(ε1/4)\langle Su,SQe_{1}\rangle^{2}\geqslant 1-O(\varepsilon^{1/4}) by column-orthogonality of SS.

We now prove Lemma 4.5. We decompose the matrix in question into a contribution from v044\|v_{0}\|_{4}^{4} and the rest: explicitly, the decomposition is (ai22dn)aiai=v(i)2aiai+(bi22dnaiai)\sum(\|a_{i}\|_{2}^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top}=\sum v(i)^{2}\cdot a_{i}a_{i}^{\top}+\sum(\|b_{i}\|_{2}^{2}-\tfrac{d}{n}\cdot a_{i}a_{i}^{\top}). This first lemma handles the contribution from v044\|v_{0}\|_{4}^{4}.

where MO(v43n1/4+v42n1/2)\|M^{\prime}\|\leqslant O(\|v\|_{4}^{3}n^{-1/4}+\|v\|_{4}^{2}n^{-1/2}) w.ov.p..

We first show an operator-norm bound on the principal submatrix i=1nv(i)2bibi\sum_{i=1}^{n}v(i)^{2}\cdot b_{i}b_{i}^{\top} using the truncated matrix Bernstein inequality Proposition A.7. First, the expected operator norm of each summand is bounded:

The operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.8 applies to truncate their tails in preparation for application of a Bernstein bound. We just have to calculate the variance of the sum, which is at most

for appropriate choice of dn1/2/polylog(n)d\leqslant n^{-1/2}/\operatorname{polylog}(n). Hence, by triangle inequality, i=1nv(i)2bibiv42n1/2\|\sum_{i=1}^{n}v(i)^{2}\cdot b_{i}b_{i}^{\top}\|\leqslant\|v\|_{4}^{2}n^{-1/2} w.ov.p..

We have already bounded i=1nqiqi=i=1nv(i)2bibi\sum_{i=1}^{n}q_{i}q_{i}^{\top}=\sum_{i=1}^{n}v(i)^{2}\cdot b_{i}b_{i}^{\top}. At the same time, i=1npipi=v44\|\sum_{i=1}^{n}p_{i}p_{i}^{\top}\|=\|v\|_{4}^{4}. By Lemma A.1, then,

w.ov.p.. A final application of triangle inquality gives the lemma. ∎

Our second lemma controls the contribution from the random part of the leverage scores.

Like in the proof of Lemma 4.8, i=1n(bi22dn)aiai\sum_{i=1}^{n}(\|b_{i}\|_{2}^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top} decomposes into a convenient block structure; we will bound each block separately.

The termwise operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.8 applies to truncate the tails of the summands in preparation for a Bernstein bound. We just have to compute the variance of the sum, which is small because we have centered the coefficients:

for appropriate choice of dn/polylog(n)d\leqslant n/\operatorname{polylog}(n).

We turn to the other blocks from (4.3). The upper-left block contains just the scalar i=1n(bi22dn)v(i)2\sum_{i=1}^{n}(\|b_{i}\|_{2}^{2}-\tfrac{d}{n})v(i)^{2}. By standard concentration each term is bounded: w.ov.p.,

It remains just to address the block i=1n(bi22dn)v(i)bi\sum_{i=1}^{n}(\|b_{i}\|_{2}^{2}-\tfrac{d}{n})v(i)\cdot b_{i}. Each term in the sum has expected operator norm at most

and once again the since the summands’ operator norms are bounded by constant-degree polynomials of Gaussian variables Lemma A.8 applies to truncate their tails in preparation to apply a Bernstein bound. The variance of the sum is at most v22O(d2/n3)\|v\|_{2}^{2}\cdot O(d^{2}/n^{3}), again by Fact A.6. Finally, Lemma A.8 and Proposition A.7 apply to give that w.ov.p.

for appropriate choice of dn1/2/polylog(n)d\leqslant n^{1/2}/\operatorname{polylog}(n). Putting it all together gives the lemma. ∎

We decompose ai22=v0(i)2+bi22\|a_{i}\|_{2}^{2}=v_{0}(i)^{2}+\|b_{i}\|_{2}^{2} and use Lemma 4.8 and Lemma 4.9.

Since v044(εn)1\|v_{0}\|_{4}^{4}\geqslant(\varepsilon n)^{-1}, we get v044/M1ε1/4\|v_{0}\|_{4}^{4}/\|M\|\geqslant\frac{1}{\varepsilon^{1/4}}, completing the proof. ∎

2 Closeness of input basis and good basis

We turn now to the proof of Lemma 4.6. We recall the setting. We have two matrices: MM, which the algorithm computes, and MM^{\prime}, which is induced by a basis for the subspace which reveals the underlying randomness and which we prefer for the analysis. MM^{\prime} differs from MM by a rotation and a basis orthogonalization step (the good basis is only almost orthogonal). The rotation is easily handled. The following lemma gives the critical fact about the orthogonalization step: orthogonalizing does not change the leverage scores too much. Strictly speaking the good basis does not have leverage scores since it is not orthogonal, but we can still talk about the norms of the rows of the matrix whose columns are the basis vectors.

The proof again uses standard concentration and matrix inversion formulas, and can be found in Section B. We are ready to prove Lemma 4.6.

Conjugating by QQ and multiplying by 1-1 does not change the operator norm, so that this is equivalent to

Finally, substituting ai=QA1/2aia_{i}^{\prime}=QA^{-1/2}a_{i}, and using the fact that QQ is a rotation, it will be enough to show

The first of these we observe has bounded operator norm w.ov.p.:

using in the last step Lemma 4.8 and standard concentration to bound i=1nbi22aiai\sum_{i=1}^{n}\|b_{i}\|_{2}^{2}\cdot a_{i}a_{i}^{\top} (Lemma 4.7). Thus, by triangle inequality applied to (4.4), we get

using Lemma 4.5 in the last step. For appropriate choice of dn1/2/polylog(n)d\leqslant n^{-1/2}/\operatorname{polylog}(n), this is at most O(n1)+o(v44)O(n^{-1})+o(\|v\|_{4}^{4}). ∎

Overcomplete tensor decomposition

In this section, we give a polynomial-time algorithm for the following problem when nd4/3/(polylogd)n\leqslant d^{4/3}/(\operatorname{polylog}d):

We give an algorithm that solves this problem, so long as the overcompleteness of the input tensor is bounded such that nd4/3/polylogdn\ll d^{4/3}/\operatorname{polylog}d.

However, instead of iai6\sum_{i}a_{i}^{\otimes 6}, we have only i,j(aiaj)3\sum_{i,j}(a_{i}\otimes a_{j})^{\otimes 3}. Unfortunately, running the same algorithm on the latter input will not succeed. To see why, consider the extra terms E:=ijg,aiaj(aiaj)2E^{\prime}\mathrel{\mathop{:}}=\sum_{i\neq j}\langle g,a_{i}\otimes a_{j}\rangle(a_{i}\otimes a_{j})^{\otimes 2}. Since g,aiaj1|\langle g,a_{i}\otimes a_{j}\rangle|\approx 1, it is straightforward to see that EFn\|E^{\prime}\|_{F}\approx n. Since the rank of EE^{\prime} is clearly d2d^{2}, even if we are lucky and all the eigenvalues have similar magnitudes, still a typical eigenvalue will be n/d1\approx n/d\gg 1, swallowing the iai6\sum_{i}a_{i}^{\otimes 6} term.

suggesting our algorithm will succed when n3/2d2n^{3/2}\ll d^{2}, which is to say nd4/3n\ll d^{4/3}.

The following theorem, which formalizes the intuition above, is at the heart of our tensor decomposition algorithm.

The rest of this section is organized as follows. In Section 5.1 we prove Theorem 5.3 using two core facts: the Gaussian vector gg is closer to some aia_{i} than to any other with good probability, and the noise term ijg,T(aiaj)(aiaj)(aiaj)\sum_{i\neq j}\langle g,T(a_{i}\otimes a_{j})\rangle(a_{i}\otimes a_{j})(a_{i}\otimes a_{j})^{\top} is bounded in spectral norm. In Section 5.2 we prove the first of these two facts, and in Section 5.3 we prove the second. In Section 5.4, we give the full details of our tensor decomposition algorithm, then prove Theorem 5.2 using Theorem 5.3. Finally, Section C contains proofs of elementary or long-winded lemmas we use along the way.

Using these two propositions we will conclude that the top eigenvector of RMRRMR is likely to be correlated with one of the vectors aj2a_{j}^{\otimes 2}. We also need two simple concentration bounds; we defer the proof to the appendix.

As a last technical tool we will need a simple claim about the fourth moment matrix of the multivariate Gaussian:

Finally, we give a bound on the spectral gap. We note that the second eigenvector ww has u,w=0\langle u,w\rangle=0, and therefore

Thus, using our above bound on R(Mεg,Taj2(ajaj)2)R\|R(M-\varepsilon\langle g,Ta_{j}^{\otimes 2}\rangle(a_{j}a_{j}^{\top})^{\otimes 2})R\| and the concentration bounds we have already applied for aj\|a_{j}\|, g,Taj2\langle g,Ta_{j}^{\otimes 2}\rangle, and Raj2\|Ra_{j}^{\otimes 2}\|, we have that

We conclude that the above events also imply that λ2(RMR)/λ1(RMR)1O(ε)\lambda_{2}(RMR)/\lambda_{1}(RMR)\leqslant 1-O(\varepsilon). ∎

2 Spectral gap for diagonal terms: proof of Proposition 5.4

We now prove that the signal matrix, when preconditioned by RR, has a noticeable spectral gap:

In [GM15, Lemma 5] a similar lemma to this one is proved in the context of the SoS proof system. However, since Ge and Ma leverage the full power of the SoS algorithm their proof goes via a spectral bound on a different (but related) matrix. Since our algorithm avoids solving an SDP we need a bound on this matrix in particular.

The proof of Lemma 5.9 proceeds by standard spectral concentration for tall matrices with independent columns (here the columns are Rai2Ra_{i}^{\otimes 2}). The arc of the proof is straightforward but it involves some bookkeeping; we have deferred it to Section C.0.2.

We also need the following lemma on the concentration of some scalar random variables involving RR; the proof is straightforward by finding the eigenbasis of RR and applying standard concentration, and it is deferred to the appendix.

Putting together (5.3) and (5.5) with our bounds on ηnorm\eta_{\text{norm}} and ηgap\eta_{\text{gap}} and recalling the conditions on (5.2), we have shown that

We now turn to proving that with reasonable probability, gg is closer to some Taj2Ta_{j}^{\otimes 2} than all others.

To avoid proliferation of indices, without loss of generality fix j=1j=1. We begin by expanding the expression g,Tai2\langle g,Ta_{i}^{\otimes 2}\rangle,

with overwhelming probability for all ii and choices of gg; this follows from a Bersntein bound, given in Lemma 5.6.

Let G1\mathcal{G}_{1} be the event that 2αlog1/2ng,a^1d1/4\sqrt{2\alpha}\log^{1/2}n\leqslant|\langle g,\hat{a}_{1}\rangle|\leqslant d^{1/4} for some αd1/2Ω(1)\alpha\leqslant d^{1/2-\Omega(1)} to be chosen later. We note that g,a^1\langle g,\hat{a}_{1}\rangle is distributed as a standard gaussian, and that gg is independent of a1,,ana_{1},\ldots,a_{n}. Thus, we can use standard tail estimates on univariate Gaussians (Lemma A.4) to conclude that

Now, we must obtain an estimate for the probability that all other inner products with gg are small. Let Gi>1\mathcal{G}_{i>1} be the event that g,a^i(2+ρ)log1/2n|\langle g,\hat{a}_{i}\rangle|\leqslant\sqrt{(2+\rho)}\log^{1/2}n for all i[n],i>1i\in[n],i>1 and for some ρ\rho to be chosen later. We will show that conditioned on G1\mathcal{G}_{1}, Gi>1\mathcal{G}_{i>1} occurs with probability 1O(n1(2+ρ)/2)1-O(n^{1-(2+\rho)/2}). Define g1:=g,a^1a1^g_{1}\mathrel{\mathop{:}}=\langle g,\hat{a}_{1}\rangle\hat{a_{1}} to be the component of gg parallel to a1a_{1}, let g:=gg1g_{\perp}\mathrel{\mathop{:}}=g-g_{1} be the component of gg orthogonal to a^1\hat{a}_{1}, and similarly let a^2,,a^n\hat{a}_{2}^{\perp},\ldots,\hat{a}_{n}^{\perp} be the components of a^2,,a^n\hat{a}_{2},\ldots,\hat{a}_{n} orthogonal to a1a_{1}. Because gg_{\perp} is independent of g1g_{1}, even conditioned on G1\mathcal{G}_{1} we may apply the standard tail bound for univariate Gaussians (Lemma A.4), concluding that for all i>1i>1,

On the other hand, let a^2,,a^n\hat{a}_{2}^{\parallel},\ldots,\hat{a}_{n}^{\parallel} be the components of the a^i\hat{a}_{i} parallel to a^1\hat{a}_{1}. We compute the projection of a^i\hat{a}_{i} onto a^1\hat{a}_{1}. With overwhelming probability,

3 Bound for cross terms: proof of Proposition 5.5

The proof will use two iterations of Matrix Rademacher bounds. The first step will be to employ a classical decoupling inequality that has previously been used in a tensor decomposition context [GM15].

Let {si},{ti}\{s_{i}\},\{t_{i}\} be independent iid sequences of random signs. Let {Mij}\{M_{ij}\} be a family of matrices. There is a universal constant CC so that for every t>0t>0,

Once the simplified cross terms are decoupled, we can use a matrix Rademacher bound on one set of signs.

Consider a finite sequence {Mi}\{M_{i}\} of fixed m×mm\times m Hermitian matrices. Let sis_{i} be a sequence of independent sign variables. Let σ2:=iMi2\sigma^{2}\mathrel{\mathop{:}}=\|\sum_{i}M_{i}^{2}\|. Then for every t0t\geqslant 0,

Let s1,,sns_{1},\ldots,s_{n} be independent signs in {1,1}\{-1,1\}. Let A1,,AnA_{1},\ldots,A_{n} and B1,,BnB_{1},\ldots,B_{n} be Hermetian matrices. Then w.ov.p.,

We use a matrix Rademacher bound and standard manipulations:

so that now we need to bound j[n]sjajajMj\sum_{j\in[n]}s_{j}a_{j}a_{j}^{\top}\otimes M_{j}. By Corollary 5.15,

Letting t1,,tnt_{1},\ldots,t_{n} and r1,,rnr_{1},\ldots,r_{n} be independent uniformly random signs, by Theorem 5.13, it will be enough to bound the spectral norm after replacing the second and third occurrences of sis_{i} for tit_{i} and rir_{i}. To this end, we define

We use a matrix Rademacher bound for the left-hand matrix,

For the right-hand matrix, we use the fact that the summands are PSD to conclude that

using the same concentration facts as earlier.

Finally, by triangle inequality and all our bounds thus far, w.ov.p.

4 Full algorithm and proof of Theorem 5.2

In this subsection we give the full details of our tensor decomposition algorithm. As discussed above, the algorithm proceeds by constructing a random matrix from the input tensor, then computing and post-processing its top eigenvector.

Theorem 5.3 gets us most of the way to the correctness of Algorithm 5.17, proving that the top eigenvector of the matrix RMRRMR is correlated with some ai2a_{i}^{\otimes 2} with reasonable probability. We need a few more ingredients to prove Theorem 5.2. First, we need to show a bound on the runtime of Algorithm 5.17.

To run the algorithm, we only require access to power iteration using the matrix RMRRMR. We first give a fast implementation for power iteration with the matrix MM, and handle the multiplications with RR separately.

We show that the matrix-vector multiply MvMv can be computed as a flattening of the following product:

So we have that MvMv is a flattening of the product TvTT_{v}T^{\top}, which we will compute as a proxy for computing MvMv via direct multiplication.

Performing the update RMRvRMRv a total of O(log2n)O(\log^{2}n) times is sufficient for convergence, as we have that with reasonable probability, the spectral gap λ2(RMR)/λ1(RMR)1O(1logn)\lambda_{2}(RMR)/\lambda_{1}(RMR)\leqslant 1-O(\tfrac{1}{\log n}), as a result of applying Theorem 5.3 with the choice of ε=O(1logn)\varepsilon=O(\tfrac{1}{\log n}).

Finally, checking the value of iai,x3\sum_{i}\langle a_{i},x\rangle^{3} requires O(d3)O(d^{3}) operations, and we do so a constant number of times, once for each of the signings of the top 2 left (or right) singular vectors of UU. ∎

Furthermore, for any 0<α<10<\alpha<1, there are a,ba^{\prime},b^{\prime} among the top 1αc2\lfloor\tfrac{1}{\alpha c^{2}}\rfloor singular vectors of UU with

If c12(1+η)c\geqslant\sqrt{\tfrac{1}{2}(1+\eta)} for some η>0\eta>0, then a,ba,b are amongst the top (1+η)ηc2\lfloor\frac{(1+\eta)}{\eta c^{2}}\rfloor singular vectors.

Since here c2=1o(1)c^{2}=1-o(1), we can choose η=1o(1)\eta=1-o(1) and check only the top 22 singular vectors.

Next, we must show how to choose the threshold c(n,d)c(n,d) so that a big enough value i[n]ai,uj3\sum_{i\in[n]}\langle a_{i},u_{j}\rangle^{3} is ensures that uju_{j} is close to a tensor component. The proof is at the end of this section. (A very similar fact appears in [GM15]. We need a somewhat different parameterization here, but we reuse many of their results in the proof.)

We start with the first claim. By [GM15, Lemma 2, (proof of) Lemma 8, Theorem 4.2], the following inequalities all hold w.ov.p..

Together with (5.8) this concludes the of the first claim.

We remark that Algorithm 5.17 may be used in conjunction with a local search algorithm to obtain much greater guarantees on the accuracy of the recovered vectors. Previous progress on the tensor decomposition problem has produced iterative algorithms that provide local convergence guarantees given a good enough initialization, but which leave the question of how to initialize the procedure up to future work, or up to the specifics of an implementation. In this context, our contribution can be seen as a general method of obtaining good initializations for these local iterative procedures.

In particular, Anandkumar et al. [AGJ15] give an algorithm that combines tensor power iteration and a form of coordinate descent, which when initialized with the output of Algorithm 5.17, achieves a linear convergence rate to the true decomposition within polynomial time.

then there exists a procedure which with overwhelming probability over T\mathbf{T} and for any ε>0\varepsilon>0, recovers a set of vectors {a^i}\{\hat{a}_{i}\} such that

in time O(poly(d)+nd3logε)O(\operatorname{poly}(d)+nd^{3}\log\varepsilon).

Theorem 1 of Anandkumar et al. is stated for random asymmetric tensors, but the adaptation to symmetric tensors is stated in equations (14) and (27) in the same paper.

The theorem of Anandkumar et al. allows for a perturbation tensor Φ\Phi, which is just the zero tensor in our setting. Additionally, the weight ratios specifying the weight of each rank-one component in the input tensor are wmax=wmin=1w_{max}=w_{min}=1. Lastly, the initialization conditions are given in terms of the distance between the intialization vectors and the true vectors xiai|x_{i}-a_{i}|, which is related to our measure of closeness xi,ai\langle x_{i},a_{i}\rangle by the equation xiai2=xi2+ai22xi,ai|x_{i}-a_{i}|^{2}=|x_{i}|^{2}+|a_{i}|^{2}-2\langle x_{i},a_{i}\rangle.

The linear convergence guarantee is stated in Lemma 12 of Anandkumar et al.

We repeatedly invoke Algorithm 5.17 until we obtain a full set of nn vectors as characterized by Theorem 5.2. Apply Theorem 5.21 to the recovered set of vectors until the desired accuracy is obtained. ∎

Tensor principal component analysis

The Tensor PCA problem in the spiked tensor model is similar to the setting of tensor decomposition, but here the goal is to recover a single large component with all smaller components of the tensor regarded as random noise.

Using the partial trace method, we give the first linear-time algorithm for this problem that recovers vv for signal-to-noise ratio τ=O(n3/4/polylogn)\tau=O(n^{3/4}/\operatorname{poly}\log n). In addition, the algorithm requires only O(n2)O(n^{2}) auxiliary space (compared to the input size of n3n^{3}) and uses only one non-adaptive pass over the input.

This spiked tensor model (for general order-kk tensors) was introduced by Montanari and Richard [RM14], who also obtained the first algorithms to solve the model with provable statistical guarantees. Subsequently, the SoS approach was applied to the model to improve the signal-to-noise ratio required for odd-order tensors [HSS15]; for 33-tensors reducing the requirement from τ=Ω(n)\tau=\Omega(n) to τ=Ω(n3/4log(n)1/4)\tau=\Omega(n^{3/4}\log(n)^{1/4}).

Using the linear-algebraic objects involved in the analysis of the SoS relaxation, the previous work has also described algorithms with guarantees similar to those of the SoS SDP relaxation, while requiring only nearly subquadratic or linear time [HSS15].

The algorithm here improves on the previous results by use of the partial trace method, simplifying the analysis and improving the runtime by a factor of logn\log n.

2 Linear-time algorithm

When A\mathbf{A} has iid standard Gaussian entries and τCn3/4log(n)1/2/ε\tau\geqslant Cn^{3/4}\log(n)^{1/2}/\varepsilon for some constant CC, Algorithm 6.2 recovers vv^{\prime} with v,v1O(ε)\langle v,v^{\prime}\rangle\geqslant 1-O(\varepsilon) with high probability over A\mathbf{A}.

Algorithm 6.2 can be implemented in linear time and sublinear space.

These theorems are proved by routine matrix concentration results, showing that in the partial trace matrix, the signal dominates the noise.

To implement the algorithm in linear time it is enough to show that this (sublinear-sized) matrix has constant spectral gap; then a standard application of the matrix power method computes the top eigenvector.

For any vv, with high probability over A\mathbf{A}, the following occur:

Applying Lemma 6.5 and the triangle inequality, we see that

and note that to construct the right-hand side it is enough to examine each entry of T\mathbf{T} just O(1)O(1) times and perform O(n3)O(n^{3}) additions. At no point do we need to store more than O(n2)O(n^{2}) matrix entries at the same time. ∎

Acknowledgements

We would like to thank Rong Ge for very helpful discussions. We also thank Jonah Brown Cohen, Pasin Manurangsi and Aviad Rubinstein for helpful comments in the preparation of this manuscript.

References

Appendix A Additional preliminaries

Here we provide some lemmas in linear algebra.

This first lemma is closely related to the sos Cauchy-Schwarz from [BKS14], and the proof is essentially the same.

In applications, we will have ipiqi\sum_{i}p_{i}q_{i} as a single block of a larger block matrix containing also the blocks ipipi\sum_{i}p_{i}p_{i}^{\top} and iqiqi\sum_{i}q_{i}q_{i}^{\top}.

To see this, just note that the right-hand side minus the left is exactly

The lemma follows now be applying this inequality to

Let A1,,Am,B1,,BmA_{1},\ldots,A_{m},B_{1},\ldots,B_{m} be real random matrices. Then

where the nontrivial inequalities follow from Cauchy-Schwarz for expectations, vectors and scalars, respectively. ∎

The followng lemma allows to argue about the top eigenvector of matrices with spectral gap.

Then, u,v2εv2\langle u,v\rangle^{2}\geqslant\varepsilon\cdot\lVert v\rVert^{2}.

We lower bound the quadratic form of MvvM-vv^{\top} evaluated at uu by

At the same time, this quadratic form evaluated at uu is upper bounded by Mεv2\lVert M\rVert-\varepsilon\cdot\lVert v\rVert^{2}. It follows that u,v2εv2\langle u,v\rangle^{2}\geqslant\varepsilon\cdot\lVert v\rVert^{2} as desired.

Furthermore, for any 0<α<10<\alpha<1, there are a,ba^{\prime},b^{\prime} among the top 1αc2\lfloor\tfrac{1}{\alpha c^{2}}\rfloor singular vectors of UU with

If c12(1+η)c\geqslant\sqrt{\tfrac{1}{2}(1+\eta)} for some η>0\eta>0, then a,ba,b are amongst the top (1+η)ηc2\lfloor\frac{(1+\eta)}{\eta c^{2}}\rfloor singular vectors.

Let v^=v/v\hat{v}=v/\|v\|. Let (σi,ai,bi)(\sigma_{i},a_{i},b_{i}) be the iith singular value, left and right (unit) singular vectors of UU respectively.

Furthermore, we observe that UF=MuMu\|U\|_{F}=\|Mu\|\leqslant\|M\|\cdot\|u\|, and that therefore UFu\|U\|_{F}\leqslant\|u\|. We thus have that,

where to obtain the last inequality we have used Cauchy-Schwarz and our bound on UF\|U\|_{F}. We may thus conclude that

where we have used the fact that the left singular values of UU are orthonormal. The argument is symmetric in the bib_{i}.

where we have applied Cauchy-Schwarz and the orthonormality of the bib_{i}. In particular,

On the other hand, let SS be the set of i[d]i\in[d] for which σi2αc2UF2\sigma_{i}^{2}\leqslant\alpha c^{2}\|U\|_{F}^{2}. By substitution,

where we have used the fact that the right singular vectors are orthonormal. The last two inequalities imply that S[d]S\neq[d]. Letting T=[d]ST=[d]\setminus S, it follows from subtraction that

so that maxiTv^,ai2(1α)c2\max_{i\in T}\langle\hat{v},a_{i}\rangle^{2}\geqslant(1-\alpha)c^{2}. Finally,

so that T1αc2|T|\leqslant\lfloor\frac{1}{\alpha c^{2}}\rfloor. Thus, one of the top 1αc2\lfloor\frac{1}{\alpha c^{2}}\rfloor right singular vectors aa has correlation v^,a(1α)c|\langle\hat{v},a\rangle|\geqslant\sqrt{(1-\alpha)}c. The same proof holds for the bb.

Furthermore, if c2>12(1+η)c^{2}>\tfrac{1}{2}(1+\eta) for some η>0\eta>0, and (1α)c2>12(1-\alpha)c^{2}>\tfrac{1}{2}, then by (A.1) it must be that maxiTv^,ai2=maxi[d]v^,ai2\max_{i\in T}\langle\hat{v},a_{i}\rangle^{2}=\max_{i\in[d]}\langle\hat{v},a_{i}\rangle^{2}, as v^\hat{v} cannot have square correlation larger than 12\tfrac{1}{2} with more than one left singular vector. Taking α=η1+η\alpha=\tfrac{\eta}{1+\eta} guarantees this. The conclusion follows. ∎

A.2 Concentration tools

We require a number of tools from the literature on concentration of measure.

We need the some concentration bounds for certain polynomials of Gaussian random variables.

The following lemma gives standard bounds on the tails of a standard gaussian variable—somewhat more precisely than other bounds in this paper. Though there are ample sources, we repeat the proof here for reference.

Let XN(0,1)X\sim\mathcal{N}(0,1). Then for t>0t>0,

To show the first statement, we apply an integration trick,

where in the third step we have used the fact that xtx\frac{x}{t}\leqslant x for txt\geqslant x. For the second statement, we integrate by parts and repeat the trick,

The following is a small modification of Theorem 6.7 from [Jan97] which follows from Remark 6.8 in the same.

In our concentration results, we will need to calculate the expectations of multivariate Gaussian polynomials, many of which share a common form. Below we give an expression for these expectations.

Let xx be a dd-dimensional vector with independent identically distributed gaussian entries with variance σ2\sigma^{2}. Let uu be a fixed unit vector. Then setting X=(x2c)px2mxxTX=(\|x\|^{2}-c)^{p}\|x\|^{2m}xx^{T}, and setting U=(x2c)px2muuTU=(\|x\|^{2}-c)^{p}\|x\|^{2m}uu^{T}, we have

A.2.2 For matrix-valued random variables

On several occasions we will need to apply a Matrix-Bernstein-like theorem to a sum of matrices with an unfortunate tail. To this end, we prove a “truncated Matrix Bernstein Inequality.” Our proof uses an standard matrix Bernstein inequality as a black box. The study of inequalities of this variety—on tails of sums of independent matrix-valued random variables— was initiated by Ahlswede and Winter [AW02]. The excellent survey of Tropp [Tro12] provides many results of this kind.

In applications of the following the operator norms of the summands X1,,XnX_{1},\ldots,X_{n} have well-behaved tails and so the truncation is a routine formality. Two corollaries following the proposition and its proof capture truncation for all the matrices we encounter in the present work.

Furthermore, suppose that for each XiX_{i},

Then for X=i[n]XiX=\sum_{i\in[n]}X_{i}, we have

Define Y=i[n]YiY=\sum_{i\in[n]}Y_{i}. We claim that

Now we are ready to apply the non-commutative Bernstein’s inequality to YY. We have

Putting everything together and setting α=tnq\alpha=t-nq,

The following lemma helps achieve the assumptions of Proposition A.7 easily for a useful class of thin-tailed random matrices.

where cc is a universal constant. Taking t=R=polylog(n)t=R=\operatorname{polylog}(n) gives us (A.4).

We now address (A.5). To this end, let p(t)p(t) and P(t)P(t) be the probability density function and cumulative density function of Xop\|X\|_{op}, respectively. We apply Jensen’s inequality and instead bound

Evaluating at R=polylognR=\operatorname{polylog}n for a sufficiently large polynomial in the log gives us

Appendix B Concentration bounds for planted sparse vector in random linear subspace

Let c:=i=1nv(i)bic\mathrel{\mathop{:}}=\sum_{i=1}^{n}v(i)b_{i}. The matrix in question has a nice block structure:

By [Ver10, Corollary 5.50] applied to the subgaussian vectors nbinb_{i}, w.ov.p.

These are proved easily via the identity (1+ε)1=k=1εk(1+\varepsilon)^{-1}=\sum_{k=1}^{\infty}\varepsilon^{k} and similar. ∎

We conclude that with overwhelming probability 1dω(1)1-d^{-\omega(1)},

Let Aj=ijaiaiA_{-j}=\sum_{i\neq j}a_{i}a_{i}^{\top}. By Sherman–Morrison,

Let AA be a block matrix where one of the diagonal blocks is the 1×11\times 1 identity; that is,

for some matrix BB and vector cc. Let xx be a vector which decomposes as x=(x(1)x)x=(x(1)\,\,x^{\prime}) where x(1)=x,e1x(1)=\langle x,e_{1}\rangle for e1e_{1} the first standard basis vector.

By the formula for block matrix inverses,

The result follows by Sherman-Morrison applied to (Bcc)1(B-cc^{\top})^{-1} and the definition of xx. ∎

where the first two use Cauchy-Schwarz and the last follows from the first two.

We turn now to the expansion of ai,A1ai\langle a_{i},A^{-1}a_{i}\rangle offered by Lemma B.3,

Addressing (B.1) first, by the above estimates and Lemma B.2 applied to bi,B1bi\langle b_{i},B^{-1}b_{i}\rangle,

w.ov.p.. For (B.2), we pull out the important factor of c\|c\| and separate v(i)v(i) from bib_{i}: w.ov.p.,

where the last inequality follows from our estimates above and Cauchy-Schwarz.

Appendix C Concentration bounds for overcomplete tensor decomposition

We require some facts about the concentration of certain scalar-and matrix-valued random variables, which generally follow from standard concentration arguments. We present proofs here for completeness.

The first lemma captures standard facts about random Gaussians.

Inner products ai,aj|\langle a_{i},a_{j}\rangle| are all 1/d\approx 1/\sqrt{d}:

We start with Item 1. Consider the quantity ai,aj2\langle a_{i},a_{j}\rangle^{2}. We calculate the expectation,

Since this is a degree-4 square polynomial in the entries of aia_{i} and aja_{j}, we may apply Lemma A.5 to conclude that

Applying this fact with t=polylog(n)t=\operatorname{polylog}(n) and taking a union bound over pairs i,j[n]i,j\in[n] gives us the desired result.

Next is Item 2. Consider the quantity ai22\|a_{i}\|_{2}^{2}. We will apply Lemma A.5 in order to obtain a tail bound for the value of the polynomial (ai221)2(\|a_{i}\|_{2}^{2}-1)^{2}. We have

and now applying Lemma A.5 with the square root of this expectation, we have

This gives both bounds for a single aia_{i}. The result now follows from taking a union bound over all ii.

and taking t=polylog(n)t=\operatorname{polylog}(n) the conclusion follows. ∎

We also use the fact that the covariance matrix of a sum of sufficiently many gaussian outer products concentrates about its expectation.

We apply a truncated matrix bernstein inequality. For convenience, A:=i[n]aiaiA:=\sum_{i\in[n]}a_{i}a_{i}^{\top} and let Ai:=aiaiA_{i}:=a_{i}a_{i}^{\top} be a single summand. To begin, we calculate the first and second moments of the summands,

The proof proceeds by truncated matrix Bernstein, since the summands are independent for fixed g,ajg,a_{j}. For this we need to compute the variance:

The norm of each term in the sum is bounded by a constant-degree polynomial of Gaussians. Straightforward calculations show that in expectation each term is O(1dg,ai)O(\tfrac{1}{d}\langle g,a_{i}\rangle) in norm; w.ov.p. this is O(σ)O(\sigma). So Lemma A.5 applies to establish the hypothesis of truncated Bernstein Proposition A.7. In turn, Proposition A.7 yields that w.ov.p.

We will derive Fact C.4 as a corollary of a more general claim about rotationally symmetric distributions.

Now we can prove some concentration claims we deferred:

And one direction of the second item, using Fact C.4 and Fact C.1 (the other direction is similar):

C.0.2 Proof of Lemma 5.9

To prove Lemma 5.9 we will begin by reducing to the case S=[n]S=[n] via the following.

It will be convenient to reduce concentration for matrices involving aiaia_{i}\otimes a_{i} to analogous matrices where the vectors aiaia_{i}\otimes a_{i} are replaced by isotropic vectors of constant norm. The following lemma shows how to do this.

We need the following bound on the deviation from expectation of a tall matrix with independent columns.

We are now prepared to handle the case of S=[n]S=[n] via spectral concentration for matrices with independent columns, Theorem C.9.

Appendix D Concentration bounds for tensor principal component analysis

For convenience, we restate Lemma 6.5 here.

For any vv, with high probability over A\mathbf{A}, the following occur:

For notational convenience, let AA be distributed like a generic AiA_{i}. By a union bound, we have both of the following:

The other matrices are easier. First of all, we note that the matrix iv(i)Ai\sum_{i}v(i)\cdot A_{i} has independent standard Gaussian entries, so it is standard that with high probability iv(i)AiO(nlogn)\|\sum_{i}v(i)\cdot A_{i}\|\leqslant O(\sqrt{n}\log n). Second, we have

The random variable Tr(Ai)\operatorname{Tr}(A_{i}) is a centered Gaussian with variance nn, and since vv is a unit vector, iv(i)Tr(Ai)\sum_{i}v(i)\operatorname{Tr}(A_{i}) is also a centered Gaussian with variance nn. So with high probability we get

by standard estimates. This completes the proof. ∎