Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors

Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, David Steurer

Introduction

The sum-of-squares (SoS) method (also known as the Lasserre hierarchy) [Sho87, Par00, Nes00, Las01] is a powerful, semidefinite-programming based meta-algorithm that applies to a wide-range of optimization problems. The method has been studied extensively for moderate-size polynomial optimization problems that arise for example in control theory and in the context of approximation algorithms for combinatorial optimization problems, especially constraint satisfaction and graph partitioning (see e.g. the survey [BS14]). For the latter, the SoS method captures and generalizes the best known approximation algorithms based on linear programming (LP), semidefinite programming (SDP), or spectral methods, and it is in many cases the most promising approach to obtain algorithms with better guarantees—especially in the context of Khot’s Unique Games Conjecture [BBH+12].

A sequence of recent works applies the sum-of-squares method to basic problems that arise in unsupervised machine learning: in particular, recovering sparse vectors in linear subspaces and decomposing tensors in a robust way [BKS14, BKS15, HSS15, BM15, GM15]. For a wide range of parameters of these problems, SoS achieves significantly stronger guarantees than other methods, in polynomial or quasi-polynomial time.

In this work, we introduce spectral algorithms for planted sparse vector, tensor decomposition, and tensor principal components analysis (PCA) that exploit the same high-degree information as the corresponding sum-of-squares algorithms without relying on semidefinite programming, and achieve the same (or close to the same) guarantees. The resulting algorithms are quite simple (a couple of lines of matlab code) and have considerably faster running times—quasi-linear or close to linear in the input size.

A surprising implication of our work is that for some problems, spectral algorithms can exploit information from larger values of the parameter $d$ without spending time $n^{O(d)}$ . For example, one of our algorithms runs in nearly-linear time in the input size, even though it uses properties that the sum-of-squares method can only use for degree parameter $d\geqslant 4$ . (In particular, the guarantees that the algorithm achieves are strictly stronger than the guarantees that SoS achieves for values of $d<4$ .)

The initial successes of SoS in the machine learning setting gave hope that techniques developed in the theory of approximation algorithms, specifically the techniques of hierarchies of convex relaxations and rounding convex relaxations, could broadly impact the practice of machine learning. This hope was dampened by the fact that in general, algorithms that rely on solving large semidefinite programs are too slow to be practical for the large-scale problems that arise in machine learning. Our work brings this hope back into focus by demonstrating for the first time that with some care SoS algorithms can be made practical for large-scale problems.

In the following subsections we describe each of the problems that we consider, the prior best-known guarantee via the SoS hierarchy, and our results.

Several kinds of algorithms have been proposed for this problem based on linear programming (LP), basic semidefinite programming (SDP), sum-of-squares, and non-convex gradient descent (alternating directions method).

An inherent limitation of simpler convex methods (LP and basic SDP) [SWW12, dGJL04] is that they require the relative sparsity of the planted vector to be polynomial in the subspace dimension (less than $n/\sqrt{d}$ non-zero coordinates).

Sum-of-squares and non-convex methods do not share this limitation. They can recover planted vectors with constant relative sparsity even if the subspace has polynomial dimension (up to dimension $O(n^{1/2})$ for sum-of-squares [BKS14] and up to $O(n^{1/4})$ for non-convex methods [QSW14]).

Our algorithm runs in nearly linear time in the input size, and matches the best-known guarantees up to a polylogarithmic factor in the subspace dimension [BKS14].

We give a technical overview of the proof in Section 2, and a full proof in Section 4.

Previous work also showed how to recover the planted sparse vector exactly. The task of going from an approximate solution to an exact one is a special case of standard compressed sensing (see e.g. [BKS14]).

2 Overcomplete tensor decomposition

Tensors naturally represent multilinear relationships in data. Algorithms for tensor decompositions have long been studied as a tool for data analysis across a wide-range of disciplines (see the early work of Harshman [Har70] and the survey [KB09]). While the problem is NP-hard in the worst-case [Hås90, HL13], algorithms for special cases of tensor decomposition have recently led to new provable algorithmic results for several unsupervised learning problems [AGH+14, BCMV14, GVX14, AGHK14] including independent component analysis, learning mixtures of Gaussians [GHK15], Latent Dirichlet topic modeling [AFH+15] and dictionary learning [BKS15]. Some previous learning algorithms can also be reinterpreted in terms of tensor decomposition [Cha96, MR06, NR09].

A key algorithmic challenge for tensor decompositions is overcompleteness, when the number of components is larger than their dimension (i.e., the components are linearly dependent). Most algorithms that work in this regime require tensors of order $4$ or higher [LCC07, BCMV14]. For example, the FOOBI algorithm of [LCC07] can recover up to $\Omega(d^{2})$ components given an order- $4$ tensor in dimension $d$ under mild algebraic independence assumptions for the components—satisfied with high probability by random components. For overcomplete $3$ -tensors, which arise in many applications of tensor decompositions, such a result remains elusive.

where $a_{1},\ldots,a_{n}$ are random unit or Gaussian vectors, the goal is to approximately recover the components $a_{1},\ldots,a_{n}$ .

Given $\mathbf{T}$ sampled as above, find a unit vector $b$ that has correlation $\max_{i}\langle a_{i},b\rangle\geqslant 1-\eta$ with one of the vectors $a_{i}$ .

Given $\mathbf{T}$ sampled as above, find a set of unit vectors $\{b_{1},\ldots,b_{n}\}$ such that $\langle a_{i},b_{i}\rangle\geqslant 1-\eta$ for every $i\in[n]$ .

We give a technical overview of the proof in Section 2, and a full proof in Section 5.

We remark that the above algorithm only requires access to the input tensor with some fixed inverse polynomial accuracy because each of its four steps amplifies errors by at most a polynomial factor (see Algorithm 5.17). In this sense, the algorithm is robust.

3 Tensor principal component analysis

For this problem, our improvements over the previous results are more modest—we achieve signal-to-noise guarantees matching [HSS15], but with an algorithm that runs in linear time rather than near-linear time (time $O(d^{3})$ rather than $O(d^{3}\operatorname{polylog}d)$ , for an input of size $d^{3}$ ).

There is an algorithm which solves the tensor principal component analysis problem with accuracy $\eta>0$ whenever the signal-to-noise ratio satisfies $\tau\geqslant O(n^{3/4}/\eta\cdot\log^{1/2}n)$ . Furthermore, the algorithm runs in time $O(d^{3})$ .

Though for tensor PCA our improvement over previous work is modest, we include the results here as this problem is a pedagogically poignant illustration of our techniques. We give a technical overview of the proof in Section 2, and a full proof in Section 6.

4 Related work

Foremost, this work builds upon the SoS algorithms of [BKS14, BKS15, GM15, HSS15]. In each of these previous works, a machine learning decision problem is solved using an SDP relaxation for SoS. In these works, the SDP value is large in the yes case and small in the no case, and the SDP value can be bounded using the spectrum of a specific matrix. This was implicit in [BKS14, BKS15], and in [HSS15] it was used to obtain a fast algorithm as well. In our work, we design spectral algorithms which use smaller matrices, inspired by the SoS certificates in previous works, to solve these machine-learning problems much faster, with almost matching guarantees.

A key idea in our work is that given a large matrix with information encoded in the matrix’s spectral gap, one can often efficiently “compress” the matrix to a much smaller one without losing that information. This is particularly true for problems with planted solutions. In this way, we are able to improve running time by replacing an $n^{O(d)}$ -sized SDP with an eigenvector computation for an $n^{k}\times n^{k}$ matrix, for some $k<d$ .

The idea of speeding up LP and SDP hierarchies for specific problems has been investigated in a series of previous works [dlVK07, BRS11, GS12], which shows that with respect to local analyses of the sum-of-squares algorithm it is sometimes possible to improve the running time from $n^{O(d)}$ to $2^{O(d)}\cdot n^{O(1)}$ . However, the scopes and strategies of these works are completely different from ours. First, the notion of local analysis from these works does not apply to the problems considered here. Second, these works employ the ellipsoid method with a separation oracle inspired by rounding algorithms, whereas we reduce the problem to an ordinary eigenvector computation.

It would also be interesting to see if our methods can be used to speed up some of the other recent successful applications of SoS to machine-learning type problems, such as [BM15], or the application of [BKS14] to tensor decomposition with components that are well-separated (rather than random). Finally, we would be remiss not to mention that SoS lower bounds exist for several of these problems, specifically for tensor principal components analysis, tensor prediction, and sparse PCA [HSS15, BM15, MW15]. The lower bounds in the SoS framework are a good indication that we cannot expect spectral algorithms achieving better guarantees.

Techniques

Our approach for faster algorithms based on SoS algorithms is to construct specific matrices (polynomials) in this affine linear space, then compute their top eigenvectors. By designing our matrices carefully, we ensure that our algorithms have access to the same higher degree information that the sum-of-squares algorithm can access, and this information affords an advantage over the basic spectral methods for these problems. At the same time, our algorithms avoid searching for the best polynomial and matrix representation, which gives us faster running times since we avoid semidefinite programming. This approach is well suited to average-case problems where we avoid the problem of adversarial choice of input; in particular it is applicable to machine learning problems where noise and inputs are assumed to be random.

A serious limitation of the above approach is that the representation of a degree- $d$ polynomial requires size roughly $n^{d}$ . Hence, even avoiding the use of semidefinite programming, improving upon running time $O(n^{d})$ requires additional ideas.

In each of the problems that we consider, we have a large matrix (suggested by a SoS algorithm) with a “signal” planted in some amount of “noise”. We show that in some situations, this large matrix can be compressed significantly without loss in the signal by applying partial trace operations. In these situations, the partial trace yields a smaller matrix with the same signal-to-noise ratio as the large matrix suggested by the SoS algorithm, even in situations when lower degree sum-of-squares approaches are known to fail.

If $A$ is a Wigner matrix (e.g. a symmetric matrix with iid $\pm 1$ entries), then both $\operatorname{Tr}(A),\|A\|\approx\sqrt{n}$ , and the above condition is indeed met. In our average case/machine learning settings the “noise” component is not as simple as $A\otimes B$ with $A$ a Wigner matrix. Nonetheless, we are able to ensure that the noise displays a similar behavior under partial trace operations. In some cases, this requires additional algorithmic steps, such as random projection in the case of tensor decomposition, or centering the matrix eigenvalue distribution in the case of the planted sparse vector.

It is an interesting question if there are general theorems describing the behavior of spectral norms under partial trace operations. In the current work, we compute the partial traces explicitly and estimate their norms directly. Indeed, our analyses boil down to concentrations bounds for special matrix polynomials. A general theory for the concentration of matrix polynomials is a notorious open problem (see [MW13]).

Partial trace operations have previously been applied for rounding SoS relaxations. Specifically, the operation of reweighing and conditioning, used in rounding algorithms for sum-of-squares such as [BRS11, RT12, BKS14, BKS15, LR15], corresponds to applying a partial trace operation to the moments matrix returned by the sum-of-squares relaxation.

We now give a technical overview of our algorithmic approach for each problem, and some broad strokes of the analysis for each case. Our most substantial improvements in runtime are for the planted sparse vector and overcomplete tensor decomposition problems (Section 2.1 and Section 2.2 respectively). Our algorithm for tensor PCA is the simplest application of our techniques, and it may be instructive to skip ahead and read about tensor PCA first (Section 2.3).

1 Planted sparse vector in random linear subspace

It follows that in order to distinguish between a random subspace with a planted sparse vector (yes case) and a completely random subspace (no case), it is enough to compute the second-largest eigenvalue of a $d^{2}$ -by- $d^{2}$ matrix (representing the $4$ -norm polynomial over the subspace as above). This decision version of the problem, while strictly speaking easier than the search version above, is at the heart of the matter: one can show that the large eigenvalue for the yes case corresponds to an eigenvector which encodes the coefficients of the sparse planted vector in the basis.

The partial trace of our matrix $M=\sum_{i=1}^{n}(a_{i}a_{i}^{\top})\otimes(a_{i}a_{i}^{\top})$ is easy to compute directly,

In the yes case (random subspace with planted sparse vector), a direct computation shows that

Hence, a natural approach to distinguish between the yes case and no case (completely random subspace) is to upper bound the spectral norm of $N$ in the no case.

For linear sparsity $k=\varepsilon\cdot n$ , this inequality is true so long as $d\ll(n/\varepsilon^{2})^{1/3}$ , which is somewhat worse than the bound $\sqrt{n}$ bound on the dimension that we are aiming for.

Recall that $\operatorname{Tr}B=\sum_{i}\lambda_{i}(B)$ for a symmetric matrix $B$ . As discussed above, the partial trace approach works best when the noise behaves as the tensor of two Wigner matrices, in that there are cancellations when the eigenvalues of the noise are summed. In our case, the noise terms $(a_{i}a_{i}^{\top})\otimes(a_{i}a_{i}^{\top})$ do not have this property, as in fact $\operatorname{Tr}a_{i}a_{i}^{\top}=\|a_{i}\|^{2}\approx d/n$ . Thus, in order to improve the dimension bound, we will center the eigenvalue distribution of the noise part of the matrix. This will cause it to behave more like a Wigner matrix, in that the spectral norm of the noise will not increase after a partial trace. Consider the partial trace of a matrix of the form

for some constant $\alpha>0$ . The partial trace of this matrix is

We choose the constant $\alpha\approx d/n$ such that our matrix $N^{\prime}$ has expectation in the no case, when the subspace is completely random. In the yes case, the Rayleigh quotient of $N^{\prime}$ at $x_{0}$ simply shifts as compared to $N$ , and we have $\lambda_{\textsc{yes}}\geqslant\langle x_{0},N^{\prime}x_{0}\rangle\approx\lVert v_{0}\rVert_{4}^{4}\geqslant 1/k$ (see Lemma 4.5 and sublemmas). On the other hand, in the no case, this centering operation causes significant cancellations in the eigenvalues of the partial trace matrix (instead of just shifting the eigenvalues). In the no case, $N^{\prime}$ has spectral norm $\lambda_{\textsc{no}}\leqslant O(d/n^{3/2})$ for $d\ll\sqrt{n}$ (using standard matrix concentration bounds; again see Lemma 4.5 and sublemmas). Therefore, the spectral norm of the matrix $N^{\prime}$ allows us to distinguish between the yes and no case as long as $d/n^{3/2}\ll 1/k$ , which is satisfied as long as $k\ll n$ and $d\ll\sqrt{n}$ . We give the full formal argument in Section 4.

2 Overcomplete tensor decomposition

The starting point of our algorithm is the polynomial $f(x)=\sum_{i=1}^{n}\langle a_{i},x\rangle^{3}$ . It turns out that for $n\ll d^{1.5}$ the (approximate) maximizers of this polynomial are close to the components $a_{1},\ldots,a_{n}$ , in the sense that $f(x)\approx 1$ if and only if $\max_{i\in[n]}\langle a_{i},x\rangle^{2}\approx 1$ . Indeed, Ge and Ma [GM15] show that the sum-of-squares method already captures this fact at degree 12, which implies a quasipolynomial time algorithm for this tensor decomposition problem via a general rounding result of Barak, Kelner, and Steurer [BKS15].

The simplest approach to this problem is to consider the tensor representation of the polynomial $\mathbf{T}=\sum_{i\in[n]}a_{i}^{\otimes 3}$ , and flatten it, hoping the singular vectors of the flattening are correlated with the $a_{i}$ . However, this approach is doomed to failure for two reasons: firstly, the simple flattenings of $\mathbf{T}$ are $d^{2}\times d$ matrices, and since $n\gg d$ the $a_{i}^{\otimes 2}$ collide in the column space, so that it is impossible to determine $\operatorname{Span}\{a_{i}^{\otimes 2}\}$ . Secondly, even for $n\leqslant d$ , because the $a_{i}$ are random vectors, their norms concentrate very closely about $1$ . This makes it difficult to distinguish any one particular $a_{i}$ even when the span is computable.

Of course, we do not have access to the higher-order tensor $\sum_{i\in[n]}a_{i}^{\otimes 6}$ . Instead, we can obtain a noisy version of this tensor. Our approach considers the following matrix representation of the polynomial $f^{2}$ ,

To explain how the above algorithm is a speedup of SoS, we give an overview of the SoS algorithm of [GM15, BKS15]. There, the degree- $t$ SoS SDP program is used to obtain an order- $t$ tensor $\chi_{t}$ (or a pseudodistribution). Informally speaking, we can understand $\chi_{t}$ as a proxy for $\sum_{i\in[n]}a_{i}^{\otimes t}$ , so that $\chi_{t}=\sum_{i\in[n]}a_{i}^{\otimes t}+N$ , where $N$ is a noise tensor. While the precise form of $N$ is unclear, we know that $N$ must obey a set of constraints imposed by the SoS hierarchy at degree $t$ . For a formal discussion of pseudodistributions, see [BKS15].

Our subquadratic-time algorithm can be viewed as a low-degree, spectral analogue of the [BKS15] SoS algorithm. However, rather than relying on an SDP to produce an object close to $\sum_{i\in[n]}a_{i}^{\otimes t}$ , we manufacture one ourselves by taking the Kronecker square of our input tensor. We explicitly know the form of the deviation of $\mathbf{T}^{\otimes 2}$ from $\sum_{i\in[n]}a_{i}^{\otimes 6}$ , unlike in [BKS15], where the deviation of the SDP certificate $\chi_{t}$ from $\sum_{i\in[n]}a_{i}^{\otimes t}$ is poorly understood. We are thus able to control this deviation (or “noise”) in a less computationally intensive way, by cleverly designing a partial trace operation which decreases the spectral norm of the deviation. Since the tensor handled by the algorithm is much smaller—order $6$ rather than order $\log n$ —this provides the desired speedup.

3 Tensor principal component analysis

A previous application of SoS techniques to this problem discussed several SoS or spectral algorithms, including one that runs in quasi-linear time [HSS15]. Here we apply the partial trace method to a subquadratic spectral SoS algorithm discussed in [HSS15] to achieve nearly the same signal-to-noise guarantee in only linear time.

Specifically, let $A_{i}$ be the $i$ th slice of $\mathbf{A}$ , so that $\langle x,A_{i}x\rangle$ is the quadratic form $\sum_{j,k}\mathbf{A}_{ijk}x_{j}x_{k}$ . Then there is a SoS proof that $\mathbf{T}(x)$ is bounded by $|\mathbf{T}(x)-\tau\cdot\langle v,x\rangle^{3}|\leqslant f(x)^{1/2}\cdot\|x\|$ , where $f(x)$ is the degree-4 polynomial $f(x)=\sum_{i}\langle x,A_{i}x\rangle^{2}$ . The polynomial $f$ has a convenient matrix representation: $f(x)=\langle x^{\otimes 2},(\sum_{i}A_{i}\otimes A_{i})x^{\otimes 2}\rangle$ : since this matrix is a sum of iid random matrices $A_{i}\otimes A_{i}$ , it is easy to show that this matrix spectrally concentrates to its expectation. So with high probability one can show that the eigenvalues of $\sum_{i}A_{i}\otimes A_{i}$ are at most $\approx d^{3/2}\log(d)^{1/2}$ (except for a single spurious eigenvector), and it follows that degree- $4$ SoS solves tensor PCA so long as $\tau\gg d^{3/4}\log(d)^{1/4}$ .

This leads the authors to consider a slight modification of $f(x)$ , given by $g(x)=\sum_{i}\langle x,T_{i}x\rangle^{2}$ , where $T_{i}$ is the $i$ th slice of $\mathbf{T}$ . Like $\mathbf{T}$ , the function $g$ also contains information about $v$ , and the SoS bound on the noise term in $\mathbf{T}$ carries over as an analogous bound on the noise in $g$ . In particular, expanding $T_{i}\otimes T_{i}$ and ignoring some negligible cross-terms yields

In this work we speed this up to a linear time algorithm via the partial trace approach. As we have seen, the heart of the matter is to show that taking the partial trace of $\tau^{2}\cdot(v\otimes v)(v\otimes v)^{\top}+\sum_{i}A_{i}\otimes A_{i}$ does not increase the spectral noise. That is, we require that

Notice that the $A_{i}$ are essentially Wigner matrices, and so it is roughly true that $|\operatorname{Tr}(A_{i})|\approx\|A_{i}\|$ , and the situation is very similar to our toy example of the application of partial traces in Section 2.

Heuristically, because $\sum_{i\in[n]}A_{i}\otimes A_{i}$ and $\sum_{i\in[n]}\operatorname{Tr}(A_{i})\cdot A_{i}$ are random matrices, we expect that their eigenvalues are all of roughly the same magnitude. This means that their spectral norm should be close to their Frobenius norm divided by the square root of the dimension, since for a matrix $M$ with eigenvalues $\lambda_{1},\ldots,\lambda_{n}$ , $\|M\|_{F}=\sqrt{\sum_{i\in[n]}\lambda_{i}^{2}}$ . By estimating the sum of the squared entries, we expect that the Frobenious norm of $\sum_{i}\operatorname{Tr}(A_{i})\cdot A_{i}$ is less than that of $\sum_{i}A_{i}\otimes A_{i}$ by a factor of $\sqrt{d}$ after the partial trace, while the dimension decreases by a factor of $d$ , and so assuming that the eigenvalues are all of the same order, a typical eigenvalue should remain unchanged. We formalize these heuristic calculations using standard matrix concentration arguments in Section 6.

Preliminaries

For a vectors space $V$ , we may denote by $\mathcal{L}(V)$ the space of linear operators from $V$ to $V$ . The space orthogonal to a vector $v$ is denoted $v^{\perp}$ .

For a matrix $M$ , we use $M^{-1}$ to denote its inverse or its Moore-Penrose pseudoinverse; which one it is will be clear from context. For $M$ PSD, we write $M^{-1/2}$ for the unique PSD matrix with $(M^{-1/2})^{2}=M^{-1}$ .

Planted sparse vector in random linear subspace

In this section we give a nearly-linear-time algorithm to recover a sparse vector planted in a random subspace.

The leverage scores $\|a_{1}\|^{2},\ldots,\|a_{n}\|^{2}$ are clearly computable in time $O(nd)$ . In the course of proving correctness of the algorithm we will show that $A$ has constant spectral gap, so by a standard analysis $O(\log d)$ matrix-vector multiplies suffice to recover its top eigenvector. A single matrix-vector multiply $Ax$ requires computing $c_{i}\mathrel{\mathop{:}}=(\|a_{i}\|^{2}-\tfrac{d}{n})\langle a_{i},x\rangle$ for each $i$ (in time $O(nd)$ ) and summing $\sum_{i\in[n]}c_{i}x_{i}$ (in time $O(nd)$ ). Finally, computing $Su$ requires summing $d$ vectors of dimension $n$ , again taking time $O(nd)$ .

The following theorem expresses correctness of the algorithm.

When $d\leqslant n^{1/2}/\operatorname{polylog}(n)$ , for any sparsity $\varepsilon>0$ , w.ov.p. the top eigenvector $u$ of $\sum_{i=1}^{n}(\|a_{i}\|^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top}$ has $\langle Su,v_{0}\rangle^{2}\geqslant 1-O(\varepsilon^{1/4})-o(1)$ .

where $e_{1}$ is the first standard basis vector and $\|M\|\leqslant O(\|v_{0}\|_{4}^{3}\cdot n^{-1/4}+\|v_{0}\|_{4}^{2}\cdot n^{-1/2}+\|v_{0}\|_{4}\cdot n^{-3/4}+n^{-1})$ .

The second ingredient we need is that the algorithm is robust to exchanging this good basis for an arbitrary orthogonal basis.

Last, we will need the following fact, which follows from standard concentration. The proof is in Section B.

By Lemma 4.5 and Lemma 4.6, we can write the matrix $A=\sum_{i=1}^{n}(\|a_{i}\|_{2}^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top}$ as

We have assumed that $\|v_{0}\|_{4}^{4}\geqslant(\varepsilon n)^{-1}$ , and so since $A$ is an almost-rank-one matrix (Lemma A.3), the top eigenvector $u$ of $A$ has $\langle u,Qe_{1}\rangle^{2}\geqslant 1-O(\varepsilon^{1/4})$ , so that $\langle Su,SQe_{1}\rangle^{2}\geqslant 1-O(\varepsilon^{1/4})$ by column-orthogonality of $S$ .

We now prove Lemma 4.5. We decompose the matrix in question into a contribution from $\|v_{0}\|_{4}^{4}$ and the rest: explicitly, the decomposition is $\sum(\|a_{i}\|_{2}^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top}=\sum v(i)^{2}\cdot a_{i}a_{i}^{\top}+\sum(\|b_{i}\|_{2}^{2}-\tfrac{d}{n}\cdot a_{i}a_{i}^{\top})$ . This first lemma handles the contribution from $\|v_{0}\|_{4}^{4}$ .

where $\|M^{\prime}\|\leqslant O(\|v\|_{4}^{3}n^{-1/4}+\|v\|_{4}^{2}n^{-1/2})$ w.ov.p..

We first show an operator-norm bound on the principal submatrix $\sum_{i=1}^{n}v(i)^{2}\cdot b_{i}b_{i}^{\top}$ using the truncated matrix Bernstein inequality Proposition A.7. First, the expected operator norm of each summand is bounded:

The operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.8 applies to truncate their tails in preparation for application of a Bernstein bound. We just have to calculate the variance of the sum, which is at most

for appropriate choice of $d\leqslant n^{-1/2}/\operatorname{polylog}(n)$ . Hence, by triangle inequality, $\|\sum_{i=1}^{n}v(i)^{2}\cdot b_{i}b_{i}^{\top}\|\leqslant\|v\|_{4}^{2}n^{-1/2}$ w.ov.p..

We have already bounded $\sum_{i=1}^{n}q_{i}q_{i}^{\top}=\sum_{i=1}^{n}v(i)^{2}\cdot b_{i}b_{i}^{\top}$ . At the same time, $\|\sum_{i=1}^{n}p_{i}p_{i}^{\top}\|=\|v\|_{4}^{4}$ . By Lemma A.1, then,

w.ov.p.. A final application of triangle inquality gives the lemma. ∎

Our second lemma controls the contribution from the random part of the leverage scores.

Like in the proof of Lemma 4.8, $\sum_{i=1}^{n}(\|b_{i}\|_{2}^{2}-\tfrac{d}{n})\cdot a_{i}a_{i}^{\top}$ decomposes into a convenient block structure; we will bound each block separately.

The termwise operator norms are bounded by constant-degree polynomials in Gaussian variables, so Lemma A.8 applies to truncate the tails of the summands in preparation for a Bernstein bound. We just have to compute the variance of the sum, which is small because we have centered the coefficients:

for appropriate choice of $d\leqslant n/\operatorname{polylog}(n)$ .

We turn to the other blocks from (4.3). The upper-left block contains just the scalar $\sum_{i=1}^{n}(\|b_{i}\|_{2}^{2}-\tfrac{d}{n})v(i)^{2}$ . By standard concentration each term is bounded: w.ov.p.,

It remains just to address the block $\sum_{i=1}^{n}(\|b_{i}\|_{2}^{2}-\tfrac{d}{n})v(i)\cdot b_{i}$ . Each term in the sum has expected operator norm at most

and once again the since the summands’ operator norms are bounded by constant-degree polynomials of Gaussian variables Lemma A.8 applies to truncate their tails in preparation to apply a Bernstein bound. The variance of the sum is at most $\|v\|_{2}^{2}\cdot O(d^{2}/n^{3})$ , again by Fact A.6. Finally, Lemma A.8 and Proposition A.7 apply to give that w.ov.p.

for appropriate choice of $d\leqslant n^{1/2}/\operatorname{polylog}(n)$ . Putting it all together gives the lemma. ∎

We decompose $\|a_{i}\|_{2}^{2}=v_{0}(i)^{2}+\|b_{i}\|_{2}^{2}$ and use Lemma 4.8 and Lemma 4.9.

Since $\|v_{0}\|_{4}^{4}\geqslant(\varepsilon n)^{-1}$ , we get $\|v_{0}\|_{4}^{4}/\|M\|\geqslant\frac{1}{\varepsilon^{1/4}}$ , completing the proof. ∎

2 Closeness of input basis and good basis

We turn now to the proof of Lemma 4.6. We recall the setting. We have two matrices: $M$ , which the algorithm computes, and $M^{\prime}$ , which is induced by a basis for the subspace which reveals the underlying randomness and which we prefer for the analysis. $M^{\prime}$ differs from $M$ by a rotation and a basis orthogonalization step (the good basis is only almost orthogonal). The rotation is easily handled. The following lemma gives the critical fact about the orthogonalization step: orthogonalizing does not change the leverage scores too much. Strictly speaking the good basis does not have leverage scores since it is not orthogonal, but we can still talk about the norms of the rows of the matrix whose columns are the basis vectors.

The proof again uses standard concentration and matrix inversion formulas, and can be found in Section B. We are ready to prove Lemma 4.6.

Conjugating by $Q$ and multiplying by $-1$ does not change the operator norm, so that this is equivalent to

Finally, substituting $a_{i}^{\prime}=QA^{-1/2}a_{i}$ , and using the fact that $Q$ is a rotation, it will be enough to show

The first of these we observe has bounded operator norm w.ov.p.:

using in the last step Lemma 4.8 and standard concentration to bound $\sum_{i=1}^{n}\|b_{i}\|_{2}^{2}\cdot a_{i}a_{i}^{\top}$ (Lemma 4.7). Thus, by triangle inequality applied to (4.4), we get

using Lemma 4.5 in the last step. For appropriate choice of $d\leqslant n^{-1/2}/\operatorname{polylog}(n)$ , this is at most $O(n^{-1})+o(\|v\|_{4}^{4})$ . ∎

Overcomplete tensor decomposition

In this section, we give a polynomial-time algorithm for the following problem when $n\leqslant d^{4/3}/(\operatorname{polylog}d)$ :

We give an algorithm that solves this problem, so long as the overcompleteness of the input tensor is bounded such that $n\ll d^{4/3}/\operatorname{polylog}d$ .

However, instead of $\sum_{i}a_{i}^{\otimes 6}$ , we have only $\sum_{i,j}(a_{i}\otimes a_{j})^{\otimes 3}$ . Unfortunately, running the same algorithm on the latter input will not succeed. To see why, consider the extra terms $E^{\prime}\mathrel{\mathop{:}}=\sum_{i\neq j}\langle g,a_{i}\otimes a_{j}\rangle(a_{i}\otimes a_{j})^{\otimes 2}$ . Since $|\langle g,a_{i}\otimes a_{j}\rangle|\approx 1$ , it is straightforward to see that $\|E^{\prime}\|_{F}\approx n$ . Since the rank of $E^{\prime}$ is clearly $d^{2}$ , even if we are lucky and all the eigenvalues have similar magnitudes, still a typical eigenvalue will be $\approx n/d\gg 1$ , swallowing the $\sum_{i}a_{i}^{\otimes 6}$ term.

suggesting our algorithm will succed when $n^{3/2}\ll d^{2}$ , which is to say $n\ll d^{4/3}$ .

The following theorem, which formalizes the intuition above, is at the heart of our tensor decomposition algorithm.

The rest of this section is organized as follows. In Section 5.1 we prove Theorem 5.3 using two core facts: the Gaussian vector $g$ is closer to some $a_{i}$ than to any other with good probability, and the noise term $\sum_{i\neq j}\langle g,T(a_{i}\otimes a_{j})\rangle(a_{i}\otimes a_{j})(a_{i}\otimes a_{j})^{\top}$ is bounded in spectral norm. In Section 5.2 we prove the first of these two facts, and in Section 5.3 we prove the second. In Section 5.4, we give the full details of our tensor decomposition algorithm, then prove Theorem 5.2 using Theorem 5.3. Finally, Section C contains proofs of elementary or long-winded lemmas we use along the way.

Using these two propositions we will conclude that the top eigenvector of $RMR$ is likely to be correlated with one of the vectors $a_{j}^{\otimes 2}$ . We also need two simple concentration bounds; we defer the proof to the appendix.

As a last technical tool we will need a simple claim about the fourth moment matrix of the multivariate Gaussian:

Finally, we give a bound on the spectral gap. We note that the second eigenvector $w$ has $\langle u,w\rangle=0$ , and therefore

Thus, using our above bound on $\|R(M-\varepsilon\langle g,Ta_{j}^{\otimes 2}\rangle(a_{j}a_{j}^{\top})^{\otimes 2})R\|$ and the concentration bounds we have already applied for $\|a_{j}\|$ , $\langle g,Ta_{j}^{\otimes 2}\rangle$ , and $\|Ra_{j}^{\otimes 2}\|$ , we have that

We conclude that the above events also imply that $\lambda_{2}(RMR)/\lambda_{1}(RMR)\leqslant 1-O(\varepsilon)$ . ∎

2 Spectral gap for diagonal terms: proof of Proposition 5.4

We now prove that the signal matrix, when preconditioned by $R$ , has a noticeable spectral gap:

In [GM15, Lemma 5] a similar lemma to this one is proved in the context of the SoS proof system. However, since Ge and Ma leverage the full power of the SoS algorithm their proof goes via a spectral bound on a different (but related) matrix. Since our algorithm avoids solving an SDP we need a bound on this matrix in particular.

The proof of Lemma 5.9 proceeds by standard spectral concentration for tall matrices with independent columns (here the columns are $Ra_{i}^{\otimes 2}$ ). The arc of the proof is straightforward but it involves some bookkeeping; we have deferred it to Section C.0.2.

We also need the following lemma on the concentration of some scalar random variables involving $R$ ; the proof is straightforward by finding the eigenbasis of $R$ and applying standard concentration, and it is deferred to the appendix.

Putting together (5.3) and (5.5) with our bounds on $\eta_{\text{norm}}$ and $\eta_{\text{gap}}$ and recalling the conditions on (5.2), we have shown that

We now turn to proving that with reasonable probability, $g$ is closer to some $Ta_{j}^{\otimes 2}$ than all others.

To avoid proliferation of indices, without loss of generality fix $j=1$ . We begin by expanding the expression $\langle g,Ta_{i}^{\otimes 2}\rangle$ ,

with overwhelming probability for all $i$ and choices of $g$ ; this follows from a Bersntein bound, given in Lemma 5.6.

Let $\mathcal{G}_{1}$ be the event that $\sqrt{2\alpha}\log^{1/2}n\leqslant|\langle g,\hat{a}_{1}\rangle|\leqslant d^{1/4}$ for some $\alpha\leqslant d^{1/2-\Omega(1)}$ to be chosen later. We note that $\langle g,\hat{a}_{1}\rangle$ is distributed as a standard gaussian, and that $g$ is independent of $a_{1},\ldots,a_{n}$ . Thus, we can use standard tail estimates on univariate Gaussians (Lemma A.4) to conclude that

Now, we must obtain an estimate for the probability that all other inner products with $g$ are small. Let $\mathcal{G}_{i>1}$ be the event that $|\langle g,\hat{a}_{i}\rangle|\leqslant\sqrt{(2+\rho)}\log^{1/2}n$ for all $i\in[n],i>1$ and for some $\rho$ to be chosen later. We will show that conditioned on $\mathcal{G}_{1}$ , $\mathcal{G}_{i>1}$ occurs with probability $1-O(n^{1-(2+\rho)/2})$ . Define $g_{1}\mathrel{\mathop{:}}=\langle g,\hat{a}_{1}\rangle\hat{a_{1}}$ to be the component of $g$ parallel to $a_{1}$ , let $g_{\perp}\mathrel{\mathop{:}}=g-g_{1}$ be the component of $g$ orthogonal to $\hat{a}_{1}$ , and similarly let $\hat{a}_{2}^{\perp},\ldots,\hat{a}_{n}^{\perp}$ be the components of $\hat{a}_{2},\ldots,\hat{a}_{n}$ orthogonal to $a_{1}$ . Because $g_{\perp}$ is independent of $g_{1}$ , even conditioned on $\mathcal{G}_{1}$ we may apply the standard tail bound for univariate Gaussians (Lemma A.4), concluding that for all $i>1$ ,

On the other hand, let $\hat{a}_{2}^{\parallel},\ldots,\hat{a}_{n}^{\parallel}$ be the components of the $\hat{a}_{i}$ parallel to $\hat{a}_{1}$ . We compute the projection of $\hat{a}_{i}$ onto $\hat{a}_{1}$ . With overwhelming probability,

3 Bound for cross terms: proof of Proposition 5.5

The proof will use two iterations of Matrix Rademacher bounds. The first step will be to employ a classical decoupling inequality that has previously been used in a tensor decomposition context [GM15].

Let $\{s_{i}\},\{t_{i}\}$ be independent iid sequences of random signs. Let $\{M_{ij}\}$ be a family of matrices. There is a universal constant $C$ so that for every $t>0$ ,

Once the simplified cross terms are decoupled, we can use a matrix Rademacher bound on one set of signs.

Consider a finite sequence $\{M_{i}\}$ of fixed $m\times m$ Hermitian matrices. Let $s_{i}$ be a sequence of independent sign variables. Let $\sigma^{2}\mathrel{\mathop{:}}=\|\sum_{i}M_{i}^{2}\|$ . Then for every $t\geqslant 0$ ,

Let $s_{1},\ldots,s_{n}$ be independent signs in $\{-1,1\}$ . Let $A_{1},\ldots,A_{n}$ and $B_{1},\ldots,B_{n}$ be Hermetian matrices. Then w.ov.p.,

We use a matrix Rademacher bound and standard manipulations:

so that now we need to bound $\sum_{j\in[n]}s_{j}a_{j}a_{j}^{\top}\otimes M_{j}$ . By Corollary 5.15,

Letting $t_{1},\ldots,t_{n}$ and $r_{1},\ldots,r_{n}$ be independent uniformly random signs, by Theorem 5.13, it will be enough to bound the spectral norm after replacing the second and third occurrences of $s_{i}$ for $t_{i}$ and $r_{i}$ . To this end, we define

We use a matrix Rademacher bound for the left-hand matrix,

For the right-hand matrix, we use the fact that the summands are PSD to conclude that

using the same concentration facts as earlier.

Finally, by triangle inequality and all our bounds thus far, w.ov.p.

4 Full algorithm and proof of Theorem 5.2

In this subsection we give the full details of our tensor decomposition algorithm. As discussed above, the algorithm proceeds by constructing a random matrix from the input tensor, then computing and post-processing its top eigenvector.

Theorem 5.3 gets us most of the way to the correctness of Algorithm 5.17, proving that the top eigenvector of the matrix $RMR$ is correlated with some $a_{i}^{\otimes 2}$ with reasonable probability. We need a few more ingredients to prove Theorem 5.2. First, we need to show a bound on the runtime of Algorithm 5.17.

To run the algorithm, we only require access to power iteration using the matrix $RMR$ . We first give a fast implementation for power iteration with the matrix $M$ , and handle the multiplications with $R$ separately.

We show that the matrix-vector multiply $Mv$ can be computed as a flattening of the following product:

So we have that $Mv$ is a flattening of the product $T_{v}T^{\top}$ , which we will compute as a proxy for computing $Mv$ via direct multiplication.

Performing the update $RMRv$ a total of $O(\log^{2}n)$ times is sufficient for convergence, as we have that with reasonable probability, the spectral gap $\lambda_{2}(RMR)/\lambda_{1}(RMR)\leqslant 1-O(\tfrac{1}{\log n})$ , as a result of applying Theorem 5.3 with the choice of $\varepsilon=O(\tfrac{1}{\log n})$ .

Finally, checking the value of $\sum_{i}\langle a_{i},x\rangle^{3}$ requires $O(d^{3})$ operations, and we do so a constant number of times, once for each of the signings of the top 2 left (or right) singular vectors of $U$ . ∎

Furthermore, for any $0<\alpha<1$ , there are $a^{\prime},b^{\prime}$ among the top $\lfloor\tfrac{1}{\alpha c^{2}}\rfloor$ singular vectors of $U$ with

If $c\geqslant\sqrt{\tfrac{1}{2}(1+\eta)}$ for some $\eta>0$ , then $a,b$ are amongst the top $\lfloor\frac{(1+\eta)}{\eta c^{2}}\rfloor$ singular vectors.

Since here $c^{2}=1-o(1)$ , we can choose $\eta=1-o(1)$ and check only the top $2$ singular vectors.

Next, we must show how to choose the threshold $c(n,d)$ so that a big enough value $\sum_{i\in[n]}\langle a_{i},u_{j}\rangle^{3}$ is ensures that $u_{j}$ is close to a tensor component. The proof is at the end of this section. (A very similar fact appears in [GM15]. We need a somewhat different parameterization here, but we reuse many of their results in the proof.)

We start with the first claim. By [GM15, Lemma 2, (proof of) Lemma 8, Theorem 4.2], the following inequalities all hold w.ov.p..

Together with (5.8) this concludes the of the first claim.

We remark that Algorithm 5.17 may be used in conjunction with a local search algorithm to obtain much greater guarantees on the accuracy of the recovered vectors. Previous progress on the tensor decomposition problem has produced iterative algorithms that provide local convergence guarantees given a good enough initialization, but which leave the question of how to initialize the procedure up to future work, or up to the specifics of an implementation. In this context, our contribution can be seen as a general method of obtaining good initializations for these local iterative procedures.

In particular, Anandkumar et al. [AGJ15] give an algorithm that combines tensor power iteration and a form of coordinate descent, which when initialized with the output of Algorithm 5.17, achieves a linear convergence rate to the true decomposition within polynomial time.

then there exists a procedure which with overwhelming probability over $\mathbf{T}$ and for any $\varepsilon>0$ , recovers a set of vectors $\{\hat{a}_{i}\}$ such that

in time $O(\operatorname{poly}(d)+nd^{3}\log\varepsilon)$ .

Theorem 1 of Anandkumar et al. is stated for random asymmetric tensors, but the adaptation to symmetric tensors is stated in equations (14) and (27) in the same paper.

The theorem of Anandkumar et al. allows for a perturbation tensor $\Phi$ , which is just the zero tensor in our setting. Additionally, the weight ratios specifying the weight of each rank-one component in the input tensor are $w_{max}=w_{min}=1$ . Lastly, the initialization conditions are given in terms of the distance between the intialization vectors and the true vectors $|x_{i}-a_{i}|$ , which is related to our measure of closeness $\langle x_{i},a_{i}\rangle$ by the equation $|x_{i}-a_{i}|^{2}=|x_{i}|^{2}+|a_{i}|^{2}-2\langle x_{i},a_{i}\rangle$ .

The linear convergence guarantee is stated in Lemma 12 of Anandkumar et al.

We repeatedly invoke Algorithm 5.17 until we obtain a full set of $n$ vectors as characterized by Theorem 5.2. Apply Theorem 5.21 to the recovered set of vectors until the desired accuracy is obtained. ∎

Tensor principal component analysis

The Tensor PCA problem in the spiked tensor model is similar to the setting of tensor decomposition, but here the goal is to recover a single large component with all smaller components of the tensor regarded as random noise.

Using the partial trace method, we give the first linear-time algorithm for this problem that recovers $v$ for signal-to-noise ratio $\tau=O(n^{3/4}/\operatorname{poly}\log n)$ . In addition, the algorithm requires only $O(n^{2})$ auxiliary space (compared to the input size of $n^{3}$ ) and uses only one non-adaptive pass over the input.

This spiked tensor model (for general order- $k$ tensors) was introduced by Montanari and Richard [RM14], who also obtained the first algorithms to solve the model with provable statistical guarantees. Subsequently, the SoS approach was applied to the model to improve the signal-to-noise ratio required for odd-order tensors [HSS15]; for $3$ -tensors reducing the requirement from $\tau=\Omega(n)$ to $\tau=\Omega(n^{3/4}\log(n)^{1/4})$ .

Using the linear-algebraic objects involved in the analysis of the SoS relaxation, the previous work has also described algorithms with guarantees similar to those of the SoS SDP relaxation, while requiring only nearly subquadratic or linear time [HSS15].

The algorithm here improves on the previous results by use of the partial trace method, simplifying the analysis and improving the runtime by a factor of $\log n$ .

2 Linear-time algorithm

When $\mathbf{A}$ has iid standard Gaussian entries and $\tau\geqslant Cn^{3/4}\log(n)^{1/2}/\varepsilon$ for some constant $C$ , Algorithm 6.2 recovers $v^{\prime}$ with $\langle v,v^{\prime}\rangle\geqslant 1-O(\varepsilon)$ with high probability over $\mathbf{A}$ .

Algorithm 6.2 can be implemented in linear time and sublinear space.

These theorems are proved by routine matrix concentration results, showing that in the partial trace matrix, the signal dominates the noise.

To implement the algorithm in linear time it is enough to show that this (sublinear-sized) matrix has constant spectral gap; then a standard application of the matrix power method computes the top eigenvector.

For any $v$ , with high probability over $\mathbf{A}$ , the following occur:

Applying Lemma 6.5 and the triangle inequality, we see that

and note that to construct the right-hand side it is enough to examine each entry of $\mathbf{T}$ just $O(1)$ times and perform $O(n^{3})$ additions. At no point do we need to store more than $O(n^{2})$ matrix entries at the same time. ∎

Acknowledgements

We would like to thank Rong Ge for very helpful discussions. We also thank Jonah Brown Cohen, Pasin Manurangsi and Aviad Rubinstein for helpful comments in the preparation of this manuscript.

References

Appendix A Additional preliminaries

Here we provide some lemmas in linear algebra.

This first lemma is closely related to the sos Cauchy-Schwarz from [BKS14], and the proof is essentially the same.

In applications, we will have $\sum_{i}p_{i}q_{i}$ as a single block of a larger block matrix containing also the blocks $\sum_{i}p_{i}p_{i}^{\top}$ and $\sum_{i}q_{i}q_{i}^{\top}$ .

To see this, just note that the right-hand side minus the left is exactly

The lemma follows now be applying this inequality to

Let $A_{1},\ldots,A_{m},B_{1},\ldots,B_{m}$ be real random matrices. Then

where the nontrivial inequalities follow from Cauchy-Schwarz for expectations, vectors and scalars, respectively. ∎

The followng lemma allows to argue about the top eigenvector of matrices with spectral gap.

Then, $\langle u,v\rangle^{2}\geqslant\varepsilon\cdot\lVert v\rVert^{2}$ .

We lower bound the quadratic form of $M-vv^{\top}$ evaluated at $u$ by

At the same time, this quadratic form evaluated at $u$ is upper bounded by $\lVert M\rVert-\varepsilon\cdot\lVert v\rVert^{2}$ . It follows that $\langle u,v\rangle^{2}\geqslant\varepsilon\cdot\lVert v\rVert^{2}$ as desired.

Furthermore, for any $0<\alpha<1$ , there are $a^{\prime},b^{\prime}$ among the top $\lfloor\tfrac{1}{\alpha c^{2}}\rfloor$ singular vectors of $U$ with

If $c\geqslant\sqrt{\tfrac{1}{2}(1+\eta)}$ for some $\eta>0$ , then $a,b$ are amongst the top $\lfloor\frac{(1+\eta)}{\eta c^{2}}\rfloor$ singular vectors.

Let $\hat{v}=v/\|v\|$ . Let $(\sigma_{i},a_{i},b_{i})$ be the $i$ th singular value, left and right (unit) singular vectors of $U$ respectively.

Furthermore, we observe that $\|U\|_{F}=\|Mu\|\leqslant\|M\|\cdot\|u\|$ , and that therefore $\|U\|_{F}\leqslant\|u\|$ . We thus have that,

where to obtain the last inequality we have used Cauchy-Schwarz and our bound on $\|U\|_{F}$ . We may thus conclude that

where we have used the fact that the left singular values of $U$ are orthonormal. The argument is symmetric in the $b_{i}$ .

where we have applied Cauchy-Schwarz and the orthonormality of the $b_{i}$ . In particular,

On the other hand, let $S$ be the set of $i\in[d]$ for which $\sigma_{i}^{2}\leqslant\alpha c^{2}\|U\|_{F}^{2}$ . By substitution,

where we have used the fact that the right singular vectors are orthonormal. The last two inequalities imply that $S\neq[d]$ . Letting $T=[d]\setminus S$ , it follows from subtraction that

so that $\max_{i\in T}\langle\hat{v},a_{i}\rangle^{2}\geqslant(1-\alpha)c^{2}$ . Finally,

so that $|T|\leqslant\lfloor\frac{1}{\alpha c^{2}}\rfloor$ . Thus, one of the top $\lfloor\frac{1}{\alpha c^{2}}\rfloor$ right singular vectors $a$ has correlation $|\langle\hat{v},a\rangle|\geqslant\sqrt{(1-\alpha)}c$ . The same proof holds for the $b$ .

Furthermore, if $c^{2}>\tfrac{1}{2}(1+\eta)$ for some $\eta>0$ , and $(1-\alpha)c^{2}>\tfrac{1}{2}$ , then by (A.1) it must be that $\max_{i\in T}\langle\hat{v},a_{i}\rangle^{2}=\max_{i\in[d]}\langle\hat{v},a_{i}\rangle^{2}$ , as $\hat{v}$ cannot have square correlation larger than $\tfrac{1}{2}$ with more than one left singular vector. Taking $\alpha=\tfrac{\eta}{1+\eta}$ guarantees this. The conclusion follows. ∎

A.2 Concentration tools

We require a number of tools from the literature on concentration of measure.

We need the some concentration bounds for certain polynomials of Gaussian random variables.

The following lemma gives standard bounds on the tails of a standard gaussian variable—somewhat more precisely than other bounds in this paper. Though there are ample sources, we repeat the proof here for reference.

Let $X\sim\mathcal{N}(0,1)$ . Then for $t>0$ ,

To show the first statement, we apply an integration trick,

where in the third step we have used the fact that $\frac{x}{t}\leqslant x$ for $t\geqslant x$ . For the second statement, we integrate by parts and repeat the trick,

The following is a small modification of Theorem 6.7 from [Jan97] which follows from Remark 6.8 in the same.

In our concentration results, we will need to calculate the expectations of multivariate Gaussian polynomials, many of which share a common form. Below we give an expression for these expectations.

Let $x$ be a $d$ -dimensional vector with independent identically distributed gaussian entries with variance $\sigma^{2}$ . Let $u$ be a fixed unit vector. Then setting $X=(\|x\|^{2}-c)^{p}\|x\|^{2m}xx^{T}$ , and setting $U=(\|x\|^{2}-c)^{p}\|x\|^{2m}uu^{T}$ , we have

A.2.2 For matrix-valued random variables

On several occasions we will need to apply a Matrix-Bernstein-like theorem to a sum of matrices with an unfortunate tail. To this end, we prove a “truncated Matrix Bernstein Inequality.” Our proof uses an standard matrix Bernstein inequality as a black box. The study of inequalities of this variety—on tails of sums of independent matrix-valued random variables— was initiated by Ahlswede and Winter [AW02]. The excellent survey of Tropp [Tro12] provides many results of this kind.

In applications of the following the operator norms of the summands $X_{1},\ldots,X_{n}$ have well-behaved tails and so the truncation is a routine formality. Two corollaries following the proposition and its proof capture truncation for all the matrices we encounter in the present work.

Furthermore, suppose that for each $X_{i}$ ,

Then for $X=\sum_{i\in[n]}X_{i}$ , we have

Define $Y=\sum_{i\in[n]}Y_{i}$ . We claim that

Now we are ready to apply the non-commutative Bernstein’s inequality to $Y$ . We have

Putting everything together and setting $\alpha=t-nq$ ,

The following lemma helps achieve the assumptions of Proposition A.7 easily for a useful class of thin-tailed random matrices.

where $c$ is a universal constant. Taking $t=R=\operatorname{polylog}(n)$ gives us (A.4).

We now address (A.5). To this end, let $p(t)$ and $P(t)$ be the probability density function and cumulative density function of $\|X\|_{op}$ , respectively. We apply Jensen’s inequality and instead bound

Evaluating at $R=\operatorname{polylog}n$ for a sufficiently large polynomial in the log gives us

Appendix B Concentration bounds for planted sparse vector in random linear subspace

Let $c\mathrel{\mathop{:}}=\sum_{i=1}^{n}v(i)b_{i}$ . The matrix in question has a nice block structure:

By [Ver10, Corollary 5.50] applied to the subgaussian vectors $nb_{i}$ , w.ov.p.

These are proved easily via the identity $(1+\varepsilon)^{-1}=\sum_{k=1}^{\infty}\varepsilon^{k}$ and similar. ∎

We conclude that with overwhelming probability $1-d^{-\omega(1)}$ ,

Let $A_{-j}=\sum_{i\neq j}a_{i}a_{i}^{\top}$ . By Sherman–Morrison,

Let $A$ be a block matrix where one of the diagonal blocks is the $1\times 1$ identity; that is,

for some matrix $B$ and vector $c$ . Let $x$ be a vector which decomposes as $x=(x(1)\,\,x^{\prime})$ where $x(1)=\langle x,e_{1}\rangle$ for $e_{1}$ the first standard basis vector.

By the formula for block matrix inverses,

The result follows by Sherman-Morrison applied to $(B-cc^{\top})^{-1}$ and the definition of $x$ . ∎

where the first two use Cauchy-Schwarz and the last follows from the first two.

We turn now to the expansion of $\langle a_{i},A^{-1}a_{i}\rangle$ offered by Lemma B.3,

Addressing (B.1) first, by the above estimates and Lemma B.2 applied to $\langle b_{i},B^{-1}b_{i}\rangle$ ,

w.ov.p.. For (B.2), we pull out the important factor of $\|c\|$ and separate $v(i)$ from $b_{i}$ : w.ov.p.,

where the last inequality follows from our estimates above and Cauchy-Schwarz.

Appendix C Concentration bounds for overcomplete tensor decomposition

We require some facts about the concentration of certain scalar-and matrix-valued random variables, which generally follow from standard concentration arguments. We present proofs here for completeness.

The first lemma captures standard facts about random Gaussians.

Inner products $|\langle a_{i},a_{j}\rangle|$ are all $\approx 1/\sqrt{d}$ :

We start with Item 1. Consider the quantity $\langle a_{i},a_{j}\rangle^{2}$ . We calculate the expectation,

Since this is a degree-4 square polynomial in the entries of $a_{i}$ and $a_{j}$ , we may apply Lemma A.5 to conclude that

Applying this fact with $t=\operatorname{polylog}(n)$ and taking a union bound over pairs $i,j\in[n]$ gives us the desired result.

Next is Item 2. Consider the quantity $\|a_{i}\|_{2}^{2}$ . We will apply Lemma A.5 in order to obtain a tail bound for the value of the polynomial $(\|a_{i}\|_{2}^{2}-1)^{2}$ . We have

and now applying Lemma A.5 with the square root of this expectation, we have

This gives both bounds for a single $a_{i}$ . The result now follows from taking a union bound over all $i$ .

and taking $t=\operatorname{polylog}(n)$ the conclusion follows. ∎

We also use the fact that the covariance matrix of a sum of sufficiently many gaussian outer products concentrates about its expectation.

We apply a truncated matrix bernstein inequality. For convenience, $A:=\sum_{i\in[n]}a_{i}a_{i}^{\top}$ and let $A_{i}:=a_{i}a_{i}^{\top}$ be a single summand. To begin, we calculate the first and second moments of the summands,

The proof proceeds by truncated matrix Bernstein, since the summands are independent for fixed $g,a_{j}$ . For this we need to compute the variance:

The norm of each term in the sum is bounded by a constant-degree polynomial of Gaussians. Straightforward calculations show that in expectation each term is $O(\tfrac{1}{d}\langle g,a_{i}\rangle)$ in norm; w.ov.p. this is $O(\sigma)$ . So Lemma A.5 applies to establish the hypothesis of truncated Bernstein Proposition A.7. In turn, Proposition A.7 yields that w.ov.p.

We will derive Fact C.4 as a corollary of a more general claim about rotationally symmetric distributions.

Now we can prove some concentration claims we deferred:

And one direction of the second item, using Fact C.4 and Fact C.1 (the other direction is similar):

C.0.2 Proof of Lemma 5.9

To prove Lemma 5.9 we will begin by reducing to the case $S=[n]$ via the following.

It will be convenient to reduce concentration for matrices involving $a_{i}\otimes a_{i}$ to analogous matrices where the vectors $a_{i}\otimes a_{i}$ are replaced by isotropic vectors of constant norm. The following lemma shows how to do this.

We need the following bound on the deviation from expectation of a tall matrix with independent columns.

We are now prepared to handle the case of $S=[n]$ via spectral concentration for matrices with independent columns, Theorem C.9.

Appendix D Concentration bounds for tensor principal component analysis

For convenience, we restate Lemma 6.5 here.

For any $v$ , with high probability over $\mathbf{A}$ , the following occur:

For notational convenience, let $A$ be distributed like a generic $A_{i}$ . By a union bound, we have both of the following:

The other matrices are easier. First of all, we note that the matrix $\sum_{i}v(i)\cdot A_{i}$ has independent standard Gaussian entries, so it is standard that with high probability $\|\sum_{i}v(i)\cdot A_{i}\|\leqslant O(\sqrt{n}\log n)$ . Second, we have

The random variable $\operatorname{Tr}(A_{i})$ is a centered Gaussian with variance $n$ , and since $v$ is a unit vector, $\sum_{i}v(i)\operatorname{Tr}(A_{i})$ is also a centered Gaussian with variance $n$ . So with high probability we get

by standard estimates. This completes the proof. ∎