Fast and robust tensor decomposition with applications to dictionary learning

Tselil Schramm, David Steurer

Introduction

Tensor decomposition is studied extensively across many disciplines including machine learning and signal processing. It is a powerful primitive for solving a wide range of other inverse / learning problems, for example: blind source separation / independent component analysis [LCC07], learning phylogenetic trees and hidden Markov models [MR05], mixtures of Gaussians [HK13], topic models [AFH+12], dictionary learning [BKS15, MSS16], and noisy-or Bayes nets [AGMR16].

A classical algorithm based on simultaneous diagonalization [Har70, LMV96] (often attributed to R. Jennrich) can decompose the input tensor (1.1) when the components are linearly independent, there is no error, and the order of the tensor is at least 33. Current research on algorithms for tensor decomposition aims to improve over the guarantees of this classical algorithm in two important ways:

What conditions allow us to decompose tensors when components are linearly dependent?

What kind of errors can efficient decomposition algorithms tolerate? Can we tolerate errors EE with “magnitude” comparable to the low-rank part i=1naik\sum_{i=1}^{n}a_{i}^{\otimes k}?

The focus of this work is on robustness. There are two ways in which errors arise in applications of tensor decomposition. The first is due to finite samples. For example, in some applications TT is the empirical kk-th moment of some distribution and the error EE accounts for the difference between the empirical moment and actual moment (“population moment”). Errors of this kind can be made smaller at the expense of requiring a larger number of samples from the distribution. Therefore, robustness of decomposition algorithms helps with reducing sample complexity.

Another way in which errors arise is from modeling errors (“systematic errors”). These kinds of errors are more severe because they cannot be reduced by taking larger samples. Two important applications of tensor decomposition with such errors are learning Noisy-or Bayes networks [AGMR16] and sparsely-used dictionaries [BKS15]. For noisy-or networks, the errors arise due to non-linearities in the model. For sparsely-used dictionaries, the errors arise due to unknown correlations in the distribution of sparse feature representations. These examples show that robust tensor decomposition allows us to capture a wider range of models.

Robustness guarantees for tensor decomposition algorithms have been studied extensively (e.g., the work on tensor power iteration [AGH+14]). The polynomial-time algorithm with the best known robustness guarantees for tensor decomposition [BKS15, MSS16] are based on the sum-of-squares (SOS) method, a powerful meta-algorithm for polynomial optimization problems based on semidefinite programming relaxations. Unfortunately, these algorithms are far from practical and have polynomial running times with large exponents. The goal of this work is to develop practical tensor decomposition algorithms with robustness guarantees close to those of SOS-based algorithms.

In this work, we develop an easy-to-implement, randomized spectral algorithm to decompose 4-tensors with orthonormal components even when the error tensor EE has small but constant spectral norm as a d2d^{2}-by-d2d^{2} matrix. This robustness guarantee is qualitatively optimal with respect to this norm in the sense that an error tensor EE with constant spectral norm (as a d2d^{2}-by-d2d^{2} matrix) could change each component by a constant proportion of its norm. To the best of our knowledge, the only previous algorithms with this kind of robustness guarantee are based on SOS.We remark that the aforementioned analysis [MSS16] of Jennrich’s algorithm can tolerate errors EE if its spectral norm as a non-square d3d^{3}-by-dd matrix is constant. However, this norm of EE can be larger by a d\sqrt{d} factor than its spectral norm as a square matrix. Our algorithm runs in time d2+ωd4.373d^{2+\omega}\leqslant d^{4.373} using fast matrix multiplication. Even without fast matrix multiplication, our running time of d5d^{5} is close to linear in the size of the input d4d^{4} and significantly faster than the running time of SOS. As we will discuss later, an extension of this algorithm allows us to solve instances of dictionary learning that previously could provably be solved only by SOS.

A related previous work [HSSS16] also studied the question how to achieve similar guarantees for tensor decomposition as SOS using just spectral algorithms. Our algorithms follow the same general strategy as the algorithms in this prior work: In an algorithm based on the SOS semidefinite program, one solves a convex programming relaxation to obtain a “proxy” for a true, integral solution (such as a component of the tensor). Because the program is a relaxation, one must then process or “round” the relaxation or proxy into a true solution. In our algorithms, as in [HSSS16], instead of finding solutions to SOS semidefinite programs, the algorithms find “proxy objects” that behave in similar ways with respect to the rounding procedures used by SOS-based algorithms. Since these rounding procedures tend to be quite simple, there is hope that generating proxy objects that “fool” these procedures is computationally more efficient than solving general semidefinite programs.

However, our algorithmic techniques for finding these “proxy objects” differ significantly from those in prior work. The reason is that many of the techniques in [HSSS16], e.g., concentration inequalities for matrix-valued polynomials, are tailored to average-case problems and therefore do not apply in our setting because we do not make distributional assumption about the errors EE.

Dictionary learning, also known as sparse coding, is studied extensively in neuroscience [OF97], machine learning [EP07, MRBL07], and computer vision [EA06, MLB+08, YWHM08]. Most algorithms for this problem used in practice do not come with strong provable guarantees. In recent years, several algorithms with provable guarantees have been developed for this problem [AAJ+14, ABGM14, AGMM15, BKS15, MSS16, HM16].

For the case that the coordinates of the distribution {x}\{x\} are independent (and non-Gaussian) there is a well-known reductionThe reduction only requires kk-wise independence where kk is the order to tensor the reduction produces. of this problem to tensor decomposition where the components are the columns of AA and the error EE can be made inverse polynomially small by taking sufficiently many samples. (In the case of independent coordinates, dictionary learning becomes a special case of independent component analysis / blind source separation, where this reduction originated.) Variants of Jennrich’s spectral tensor decomposition algorithm (e.g., [BCMV14, MSS16]) imply strong provable guarantees for dictionary learning in the case that {x}\{x\} has independent coordinates even in the “overcomplete” regime when ndn\gg d (using a polynomial number of samples).

More challenging is the case that {x}\{x\} has non-independent coordinates, especially if those correlations are unknown. We consider a model proposed in [BKS15] (similar to a model in [ABGM14]): We say that the distribution {x}\{x\} is τ\tau-nice if

An extension of our aforementioned algorithm for orthogonal tensor decomposition with spectral norm error allows us to learn orthonormal dictionaries from τ\tau-nice distributions. To the best of our knowledge, the only previous algorithms to provably solve this problem use sum-of-squares relaxations [BKS15, MSS16], which have large polynomial running time. Our algorithm recovers a 0.990.99 fraction of the columns of AA up to error τ\tau from O~(n3){\widetilde{O}}(n^{3}) samples, and runs in time n3+O(τ)d4n^{3+O(\tau)}d^{4}. By a standard reduction, our algorithm also works for non-singular dictionaries and the running time increases by a factor polynomial in the condition number of AA.

1 Results

For tensor decomposition, we give an algorithm with close to linear running time that recovers the rank-1 components of a tensor with orthonormal components, so long as the spectral norm of the square unfoldings of the error tensor is small.

Furthermore, if εO(loglogn/log3n)\varepsilon\leqslant O(\log\log n/\log^{3}n), the algorithm can recover all components up to error O(ε)O(\varepsilon) in time O~(d2+ω+nd4){\widetilde{O}}(d^{2+\omega}+nd^{4}) with high probability. Here, ω2.373\omega\leqslant 2.373 is the matrix multiplication exponent.

The orthogonality condition may at first seem restrictive, but for most applications it is possible to take a tensor with linearly independent components and transform it to a tensor with orthogonal components, as we will do for our dictionary learning result below. Furthermore, our algorithm works as-is if the components are sufficiently close to orthonormal:

These robustness guarantees are comparable to those of the sum-of-squares-based algorithms in [BKS15, MSS16] for the undercomplete case, which are the best known. Meanwhile, the sum-of-squares based algorithms require solving large semidefinite programs, while the running time of our algorithms is close to linear in the size of the input, and our algorithms are composed of simple matrix-vector multiplications.

On the other hand, our algorithms fail to work in the overcomplete case, when the rank grows above nn, and the components are no longer linearly independent. One interesting open question is whether the techniques used in this paper can be extended to the overcomplete case.

Dictionary learning

Using our tensor decomposition algorithm as a primitive, we give an algorithm for dictionary learning when the sample distribution is τ\tau-nice.

The total runtime is thus O~(n3d4){\widetilde{O}}(n^{3}d^{4})—in the theorem statement, we write it in terms of the number of samples mm in order to separate the time spent processing the samples from the learning phase. We note that the sample complexity bound that we have, m=O~(n3)m={\widetilde{O}}(n^{3}), may very well be sub-optimal; we suspect that m=O~(n2)m={\widetilde{O}}(n^{2}) is closer to the truth, which would yield a better runtime.

We are also able to apply standard whitening operations (as in e.g. [AGHK13]) to extend our algorithm to dictionaries with linearly-independent, but non-orthonormal, columns, at the cost of polynomially many additional samples.

To our knowledge, our algorithms are the only remotely efficient dictionary learning algorithms with provable guarantees that permit τ\tau-nice distributions in which the coordinates of xx may be correlated by constant factors, the only other ones being the sum-of-squares semidefinite programming based algorithms of [BKS15, MSS16].

Preliminaries

We will also make frequent use of the following lemma, which states that the distance between two points cannot increase when both are projected onto a closed, convex set.

This lemma is well-known (see e.g. [Roc76]), but we will prove it for completeness in Appendix A.

Techniques

In this section we give a high-level overview of the algorithms in our paper, and of their analyses. We begin with the tensor decomposition algorithm, after which we’ll explain the (non-trivial) extension to the dictionary learning application. At the very end, we will discuss the relationship between our algorithms and sum-of-squares relaxations.

where TijT_{ij} is the i,ji,jth d×dd\times d matrix slice of the tensor T\mathbf{T}. Since the aia_{i} are orthogonal, the coefficients g,ai2\langle g,a_{i}^{\otimes 2}\rangle are independent, and so we find ourselves in an ideal situation—MgM_{g} is a sum of the orthogonal components we want to recover with independent Gaussian coefficients. A simple eigendecomposition will recover all of the aia_{i}.

On the other hand, when we have a nonzero noise tensor E\mathbf{E}, a random contraction along modes {1,2}\{1,2\} results in the matrix

where the EijE_{ij} are d×dd\times d slices of the tensor E\mathbf{E}. The last term, composed of the error, complicates things. Standard facts about Gaussian matrix series assert that the spectral norm of the error term behaves like E{1,2,3}{4}\|E_{\{1,2,3\}\{4\}}\|, the spectral norm of a d3×dd^{3}\times d reshaping of EE, whereas we only have control over square reshapings such as E{1,2}{3,4}\|E_{\{1,2\}\{3,4\}}\|.In fact, this observation was crucial in the analysis of [MSS16]—in that work, semidefinite programming constraints are used to control the spectral norm of the rectangular reshapings. These can be off by polynomial factors. If the Frobenius norm of E\mathbf{E} is EF2ε2d2\|\mathbf{E}\|_{F}^{2}\approx\varepsilon^{2}d^{2}, which is the magnitude one would expect from a tensor whose square reshapings are full-rank matrices with spectral norm ε\varepsilon, then we have that necessarily

since there are at most dd nonzero singular values of rectangular reshapings of E\mathbf{E}. In this case, unless ε1/d\varepsilon\ll 1/\sqrt{d}, the components aiaia_{i}a_{i}^{\top} are completely drowned out by the contribution of the noise, and so the robustness guarantees leave something to be desired.

Basic idea

The above suggests that, as long as we allow the error E\mathbf{E} to have large Frobenius norm, an approach based on random contraction will not succeed. Our basic idea is to take T\mathbf{T}, whose error has small spectral norm, and transform it into a tensor T\mathbf{T}^{\prime} whose error has small Frobenius norm.

Because we do not know the decomposition of T\mathbf{T}, we cannot access the error E\mathbf{E} directly. However, we do know that for any d2×d2d^{2}\times d^{2} reshaping TT of T\mathbf{T},

where S=i=1nai4S=\sum_{i=1}^{n}a_{i}^{\otimes 4}, and Eε\|E\|\leqslant\varepsilon. The rank of SS is ndn\leqslant d, and all eigenvalues of SS are 11. Thus, if we perform the operation

where ()+(\cdot)_{+} denotes projection to the cone of positive semidefinite matrices, we expect that the signal term SS will survive, while the noise term EE will be dampened. More formally, we know that TT has nn eigenvalues of magnitude 1±ε1\pm\varepsilon, and d2nd^{2}-n eigenvalues of magnitude at most ε\varepsilon, and therefore rank(T>ε)n\operatorname{rank}(T^{>\varepsilon})\leqslant n. Also by definition, TT>εε\|T-T^{>\varepsilon}\|\leqslant\varepsilon. Therefore, we have that

with Eε\|E^{\prime}\|\leqslant\varepsilon, and thus

Since S,T>εS,T^{>\varepsilon} are both of rank at most nn, and E,EE^{\prime},E have spectral norm bounded by ε\varepsilon, we have that

Finally, to eliminate these large singular values, we will project T>ε\mathbf{T}^{>\varepsilon} into the set of matrices whose rectangular reshapings have singular values at most 11—because SS is a member of this convex set, the projection can only decrease the Frobenius norm. After this, we will apply the random contraction algorithm, as originally suggested.

Variance of Gaussian matrix series

Recall that we wanted to sample a random d2d^{2}-dimensional Gaussian vector gg, and the perform the contraction

The error term on the right is a matrix Gaussian series. The following lemma describes the behavior of the spectra of matrix Gaussian series:

It turns out that for us it suffices to perform two projections in sequence—we first reshape T>ε\mathbf{T}^{>\varepsilon} to the matrix T{123}{4}>ε\mathbf{T}^{>\varepsilon}_{\{123\}\{4\}},project it to the set of matrices with singular values at most 11, and then reshape the result along modes {124}{3}\{124\}\{3\}, and project to the same set again. As mentioned before, because projection to a convex set containing SS cannot increase the distance from SS, the Frobenius norm of the new error can only decrease. What is less obvious is that performing the second projection will not destroy the property that the reshaping along modes {123}{4}\{123\}\{4\} has spectral norm at most 11. By showing that each projection corresponds to either left- or right- multiplication of T{123}{4}>ε\mathbf{T}^{>\varepsilon}_{\{123\}\{4\}} and T{124}{3}>ε\mathbf{T}^{>\varepsilon}_{\{124\}\{3\}} by matrices of spectral norm at most 11, we are able to show that the second projection does not create large singular values for the first flattening, and so two projections are indeed enough. Call the resulting tensor (T>ε)1(\mathbf{T}^{>\varepsilon})^{\leqslant 1}.

Now, if we perform a random contraction in the modes 1,21,2, we will have

where the spectral norm of E^glogn\|\hat{E}_{g}\|\leqslant\sqrt{\log n} with good probability. So, ignoring for the moment dependencies between ai2,g\langle a_{i}^{\otimes 2},g\rangle and EgE_{g}, maxig,ai2>1.1Eg\max_{i}|\langle g,a_{i}^{\otimes 2}\rangle|>1.1\cdot\|E_{g}\| with probability at least n.2n^{-.2}, which will give aia_{i} correlation 0.90.9 with the top eigenvector of MgM_{g} with good probability.

Improving accuracy of components

The algorithm described thus far will recover components bib_{i} that are 0.90.9-correlated with the aia_{i}, in the sense that bi,ai20.9\langle b_{i},a_{i}\rangle^{2}\geqslant 0.9. To boost the accuracy of the recovered components, we use a simple method which resembles a single step of tensor power iteration.

We’ll use the closeness of our original tensor T\mathbf{T} to iai4\sum_{i}a_{i}^{\otimes 4} in spectral norm. We let TT be a d2×d2d^{2}\times d^{2} flattening of T\mathbf{T}, and compute the vector

Now, when the vector vv is reshaped to a d×dd\times d matrix VV, the term E(bibi)E(b_{i}\otimes b_{i}) is a matrix of Frobenius norm (and thus spectral norm) at most ε\varepsilon. By the orthonormality of the aja_{j}, the sum of the coefficients in the second term is at most 0.1, and so aia_{i} is ε\varepsilon-close to the top eigenvector of VV.

Recovering every component

Because the Frobenius norm of the error in (T>ε)1(\mathbf{T}^{>\varepsilon})^{\leqslant 1} is O(εn)O(\varepsilon\sqrt{n}), there may be a small fraction of the components ai4a_{i}^{\otimes 4} that are “canceled out” by the error—for instance, we can imagine that the error term is E^=i=1ε2nai4\hat{\mathbf{E}}=-\sum_{i=1}^{\varepsilon^{2}n}a_{i}^{\otimes 4}. So while only a constant fraction of the components ai4a_{i}^{\otimes 4} can be more than ε\varepsilon-correlated with the error, we may still be unable to recover some fixed ε2\varepsilon^{2}-fraction of the aia_{i} via random contractions.

To recover all components, we must subtract the components that we have found already and run the algorithm iteratively—if we have found m=0.99nm=0.99n components, then if we could perfectly subtract them from T\mathbf{T}, we would end up with an even lower-rank signal tensor, and thus be able to make progress by truncating all but 0.1n0.1n eigenvalues in the first step.

The challenge is that we have recovered b1,,bmb_{1},\ldots,b_{m} that are only (1ε)(1-\varepsilon)-correlated with the aia_{i}, and so naively subtracting Ti=1mbi4\mathbf{T}-\sum_{i=1}^{m}b_{i}^{\otimes 4} can result in a Frobenius and spectral norm error of magnitude εm\varepsilon\sqrt{m}—thus the total error is still proportional to n\sqrt{n} rather than 0.1n\sqrt{0.1n}.

Dictionary learning

We can use our tensor decomposition algorithm to learn the dictionary AA, as long as the columns of AA are linearly independent. For the sake of this overview, assume instead that the columns of AA are orthonormal. Then given samples y(1),,y(m)y^{(1)},\ldots,y^{(m)} for m=poly(n)m=\operatorname{poly}(n), we can compute the 44th moment tensor to accuracy ε\varepsilon in the spectral norm,

In the sum-of-squares relaxation, this issue is overcome easily: because of the symmetries required of the SDP solution matrix XX, X,aiajaiaj=X,aiaiajaj\langle X,a_{i}a_{j}^{\top}\otimes a_{i}a_{j}^{\top}\rangle=\langle X,a_{i}a_{i}^{\top}\otimes a_{j}a_{j}^{\top}\rangle, so by linearity this low-rank error term cannot influence the objective function any more than the aiaiajaja_{i}a_{i}^{\top}\otimes a_{j}a_{j}^{\top} term.

This removes the spectrum in the direction of the “nice” error terms, corresponding to aiaiajaja_{i}a_{i}^{\top}\otimes a_{j}a_{j}^{\top} and aiajajaia_{i}a_{j}^{\top}\otimes a_{j}a_{i}^{\top}. The fact that the rest of the matrix is low-rank means that we can apply an analysis similar to the analysis in the first step of our tensor decomposition to argue that M>εiai4FMiai4F\|M^{>\varepsilon}-\sum_{i}a_{i}^{\otimes 4}\|_{F}\leqslant\|M-\sum_{i}a_{i}^{\otimes 4}\|_{F}.

Now, we re-shape M>εM^{>\varepsilon}, so that if initially we had the flattening MM{1,2}{3,4}\mathbf{M}\to M_{\{1,2\}\{3,4\}}, we look at the flattening M{1,3}{2,4}>εM^{>\varepsilon}_{\{1,3\}\{2,4\}}. In this flattening, the term aiajaiaja_{i}a_{j}^{\top}\otimes a_{i}a_{j}^{\top} from M>εM^{>\varepsilon} is transformed to aiaiajaja_{i}a_{i}^{\top}\otimes a_{j}a_{j}^{\top}, and so the problematic error term from the original flattening has spectral norm ε\varepsilon in this flattening! Applying the projection

eliminates the problematic term, and brings us again closer to the target matrix iai4\sum_{i}a_{i}^{\otimes 4}. Thus, we end up with a tensor that is close to iai4\sum_{i}a_{i}^{\otimes 4} in Frobenius norm, and we can apply our tensor decomposition algorithm.

Connection to sum-of-squares algorithms

We take a moment to draw parallels between our algorithm and tensor decomposition algorithms in the sum-of-squares hierarchy. In noisy orthogonal tensor decomposition, we want to solve the non-convex program

The (somewhat simplified) corresponding sum-of-squares relaxation is the semidefinite program

To solve this semidefinite program, one should project T\mathbf{T} into the intersection of all of the convex feasible regions of the constraints. However, projecting to the intersection is an expensive operation in terms of runtime. Instead, we choose a subset of these constraints, and project T\mathbf{T} into the set of points satisfying each constraint sequentially, rather than simultaneously. These are not equivalent projection operations, but because we select our operations carefully, we are able to show that our T\mathbf{T} is close to the SDP optimum in a sense that is sufficient for successfully running Jennrich’s algorithm.

Decomposing orthogonal 444-tensors

Truncate to 0 all eigenvalues of that have magnitude less than ε\varepsilon:

where we have used (M)+(M)_{+} to denote projection to the PSD cone.

To compute T>εT^{>\varepsilon}, we can compute the top nn eigenvectors of TT. Since O(ε)λnλn+1O(\varepsilon)\cdot\lambda_{n}\geqslant\lambda_{n+1}, where λn,λn+1\lambda_{n},\lambda_{n+1} are the nnth and (n+1)(n+1)st eigenvalues, we can compute this in time O~(min{nd4,d2+ω}){\widetilde{O}}(\min\{nd^{4},d^{2+\omega}\}) via subspace power iteration (see for example [HP14]). ∎

Project T\mathbf{T} to the set of tensors whose rectangular reshapings along modes {1,2,3},{4}\{1,2,3\},\{4\} have spectral norm at most 11 (obtaining a new tensor T^\hat{\mathbf{T}} with T^{1,2,3}{4}1\|\hat{\mathbf{T}}_{\{1,2,3\}\{4\}}\|\leqslant 1).

Project T^\hat{\mathbf{T}} to the set of tensors whose rectangular reshapings along modes {1,2,4},{3}\{1,2,4\},\{3\} have spectral norm at most 11, obtaining a new tensor T1=S+E1\mathbf{T}^{\leqslant 1}=\mathbf{S}+\mathbf{E}^{\leqslant 1}.

Output: uLu_{L} and uRu_{R}, the top left- and right- unit singular vectors of MgM_{g}.

Under the appropriate conditions on T\mathbf{T}, with probability O~(nε){\widetilde{O}}(n^{-\varepsilon}), Algorithm 4.3 will output a vector that is 0.90.9-correlated with aj2a_{j}^{\otimes 2} for some j[n]j\in[n].

Recovering one component requires time O~(d2+ωnO(η)){\widetilde{O}}(d^{2+\omega}n^{O(\eta)}), and recovering mm components requires time O~(max{mnO(η)d4,d2+ω}){\widetilde{O}}(\max\{mn^{O(\eta)}d^{4},d^{2+\omega}\}).

We can then post-process the vectors to obtain a vector that has correlation 1ε1-\varepsilon with aja_{j}—the details are given in Section 4.3 below.

We will now prove the correctness of Algorithm 4.3 step-by-step, tying details together at the end of this subsection. First, we argue that in step 1, the truncation of the large eigenvalues cannot increase the Frobenius norm of the error.

Suppose that we define T1\mathbf{T}^{\leqslant 1} to be the result of projecting T=S+E\mathbf{T}=\mathbf{S}+\mathbf{E} to the set of tensors whose rectangular reshapings along modes {1,2,3},{4}\{1,2,3\},\{4\} have spectral norm at most 11, then projecting the result to the set of tensors whose rectangular reshapings along modes {1,2,3},{4}\{1,2,3\},\{4\} have spectral norm at most 11. Then T1SFEF\|\mathbf{T}^{\leqslant 1}-\mathbf{S}\|_{F}\leqslant\|E\|_{F}, and

This operation requires time O~(d2+ω){\widetilde{O}}(d^{2+\omega}).

To establish the first claim, we note that the tensor T1\mathbf{T}^{\leqslant 1} was obtained by two projections of different rectangular reshapings of the matrix S+ES+E to the set of rectangular matrices with singular value at most 11. This set is closed, convex, and contains SS, and so the error can only decrease in Frobenius norm (see Lemma A.2 in Appendix A for a proof),

Finally, each reshaping step takes O(d4)O(d^{4}) time. Since we are only interested in truncating large singular values, it suffices for us to compute the SVD corresponding to singular values between n\sqrt{n} and 11. This can be done via subspace power iteration, which here involves the multiplication of a d3×dd^{3}\times d matrix and a d×dd\times d matrix (with intermediate orthogonalization steps for the d×dd\times d matrix, see [HP14]), which requires time O~(d2+ω){\widetilde{O}}(d^{2+\omega}), where ω\omega is the matrix multiplication constant. Going forward the representation of the matrix will be as the original matrix, with the subtracted SVD corresponding to large singular values. ∎

We will need to argue that if the Frobenius norm of the error matrix is small, this is a sufficient condition under which we succeed. For this, we will use the following two lemmas. The first tells us that with probability Ω~(nO(ε)){\widetilde{\Omega}}(n^{-O(\varepsilon)}) over the choice of gg, we will have for some i[n]i\in[n] that

where cN|c|\geqslant\|N\| and furthermore Nai\|Na_{i}\| and Nai\|N^{\top}a_{i}\| are small:

Then for a 13δ1-3\delta fraction of j[n]j\in[n],

In particular, if δ=Ω(1)\delta=\Omega(1), β=O(ε)\beta=O(\varepsilon), β<1\beta<1, then this probability is Ω~(n(1+O(ε))){\widetilde{\Omega}}(n^{-(1+O(\varepsilon))}).

The proof consists primarily of the application of concentration inequalities, and we provide it below in Section 4.4.

The second lemma states that if MgM_{g} indeed has the form above, the top singular vectors of MgM_{g} must be close to the component aia_{i}.

with c(1+β)N|c|\geqslant(1+\beta)\|N\| for β>0\beta>0, and Na1,Na1εc\|Na_{1}\|,\|N^{\top}a_{1}\|\leqslant\varepsilon|c| so that the relationship 2ε(1+β)β<0.01\frac{2\varepsilon(1+\beta)}{\beta}<0.01 holds. Then letting uu be a top singular vector of MgM_{g}, it follows that

The proof requires some careful calculations, but is not complicated, and we will prove it below in Section 4.4.

Finally, we are ready to stitch these arguments together and prove that Algorithm 4.3 works.

After reshaping and truncating T\mathbf{T} in step 11 of the algorithm, by Lemma 4.5 the matrix T1=iai4+ET^{\leqslant 1}=\sum_{i}a_{i}^{\otimes 4}+E has the properties that

and also that still EFηn\|E\|_{F}\leqslant\eta\sqrt{n}.

We can now apply Lemma 4.6 with δ=1300\delta=\frac{1}{300} and β=400η/δ=O(η)\beta=400\eta/\delta=O(\eta) to conclude that for at least a 0.990.99-fraction of the i[n]i\in[n], with probability at least O~(n1O(η)){\widetilde{O}}(n^{-1-O(\eta)}), we will have

where N(1+β)c\|N\|\leqslant(1+\beta)c and Nai,Nai48ηc\|Na_{i}\|,\|N^{\top}a_{i}\|\leqslant 48\eta\cdot|c|. Applying Lemma 4.7, we have that either the left- or right- top unit singular vector uu of MgM_{g} has correlation at least

For runtime, by our arguments in Lemma 4.5 step 1 takes time O~(d2+ω){\widetilde{O}}(d^{2+\omega}). After this, with either representation of our matrix T1T^{\leqslant 1} (whether we compute the full truncated SVD or have the original matrix minus the subtracted SVD), performing power iteration to find the top eigenvector with the flattening MgM_{g} takes time O~(d4){\widetilde{O}}(d^{4}), and finding a single component takes O~(nO(η)){\widetilde{O}}(n^{-O(\eta)}) samples of random contractions. Since we can reuse T1T^{\leqslant 1} with new random contractions, the total runtime for recovering one component is O~(d2+ωnO(η)){\widetilde{O}}(d^{2+\omega}n^{-O(\eta)}), and by the independence of the runs recovering mm components requires O~(d2+ω)+mnO(η)O~(d4){\widetilde{O}}(d^{2+\omega})+mn^{-O(\eta)}\cdot{\widetilde{O}}(d^{4}) time. ∎

With the core of our algorithm in place, we now take care of the remaining technical issues: recovery precision, working with near-orthonormal vectors, and recovering the full set of component vectors.

Because the precision of recovery will be important in not amplifying the error, we begin with our precision-amplifying postprocessing algorithm.

Output: If for one of v{vL,vR}v\in\{v_{L},v_{R}\}, (v2)Tv2(13ε)2ε(v^{\otimes 2})^{\top}Tv^{\otimes 2}\geqslant(1-3\varepsilon)^{2}-\varepsilon, output vv.

Suppose that vv is a unit vector with v,ai20.99\langle v,a_{i}\rangle^{2}\geqslant 0.99, and T=iai4+ET=\sum_{i}a_{i}^{\otimes 4}+E for Eε\|E\|\leqslant\varepsilon and a1,,ana_{1},\ldots,a_{n} orthonormal. Then if we let AA be the reshaping of T(vv)T(v\otimes v) to a d×dd\times d matrix, and if we let uL,uRu_{L},u_{R} be the top left- and right- unit singular vectors of MM, then

In other words, Algorithm 4.8 succeeds. Further, the time required is O~(d4){\widetilde{O}}(d^{4}).

Therefore, defining MM to be the n×nn\times n reshaping of T(vv)T(v\otimes v) and defining NN to be the n×nn\times n reshaping of E(vv)E(v\otimes v),

where NFε\|N\|_{F}\leqslant\varepsilon, since Eε\|E\|\leqslant\varepsilon.

and that if we choose η\eta so that 1η2α1/501-\eta\geqslant 2\alpha\geqslant 1/50,

Thus, Mηa1a1Mη+2ε\|M-\eta a_{1}a_{1}^{\top}\|\leqslant\|M\|-\eta+2\varepsilon, and we have by Fact A.3 (see Appendix A for a proof) that the top unit eigenvector uu of MM is such that

Choosing η=49/50\eta=49/50, the result follows. ∎

2 Near-orthonormal components

We’ll now dispense with the discrepancy between the orthonormal and near-orthonormal cases.

3 Full Recovery

Now we give the full algorithm, which will remove the components we find in each step from the tensor without amplifying the spectral norm of the error too much.

Initialize the set of known components K=K=\emptyset, and the set of components under inspection B=B=\emptyset.

Initialize a working copy of T\mathbf{T}, Twork(0)\mathbf{T}^{(0)}_{work}, and keep a clean copy of T\mathbf{T} called Tclean\mathbf{T}_{clean}.

Preprocess Twork(t)\mathbf{T}^{(t)}_{work} with Algorithm 4.1, then run Algorithm 4.3 with Twork(t)\mathbf{T}_{work}^{(t)} O~(n){\widetilde{O}}(n) times, then postprocess with Algorithm 4.8 using Tclean\mathbf{T}_{clean} and error parameter ε\varepsilon. If this produces an output vector vv, add vv to BB (unless KK already contains a vector that is 1ε1-\varepsilon correlated with vv).

Given T=i=1nai4+E\mathbf{T}=\sum_{i=1}^{n}a_{i}^{\otimes 4}+E where the aia_{i} are orthonormal and E{1,2}{3,4}ε\|E_{\{1,2\}\{3,4\}}\|\leqslant\varepsilon, then if ε<O(η2/log2n)\varepsilon<O(\eta^{2}/\log^{2}n), with probability 1o(1)1-o(1), Algorithm 4.11 recovers orthonormal vectors b1,,bnb_{1},\ldots,b_{n} so that there exists a permutation π:[n][n]\pi:[n]\to[n] such that for each i[n]i\in[n],

Furthermore, this requires runtime O~(n1+O(η)d2+ω){\widetilde{O}}(n^{1+O(\eta)}d^{2+\omega}).

First, we prove that if we have an orthonormal basis that approximates a1,,aka_{1},\ldots,a_{k}, we can subtract it without introducing a large spectral norm error—this motivates and justifies steps 3(b)–3(e).

So it suffices for us to bound UVU+V\|U-V\|\cdot\|U+V\|.

By the subadditivity of the norm, U+VU+V=2\|U+V\|\leqslant\|U\|+\|V\|=2. Meanwhile, the singular values of UVU-V are the square roots of the eigenvalues of (UV)(UV)\|(U-V)^{\top}(U-V)\|, and so we bound

where the second line follows because UU and VV have orthonormal columns. Now, by assumption we know that

where for iji\neq j, Eij=bi,aj2E_{ij}=\langle b_{i},a_{j}\rangle^{2} and Eii=bi,ai2(1ε)E_{ii}=\langle b_{i},a_{i}\rangle^{2}-(1-\varepsilon), and by the orthonormality of the aja_{j},

So the 11-norm of the rows of EE is at most ε\varepsilon. By the orthonormality of the bjb_{j}, the same holds for the 11-norm of the columns, iEij\sum_{i}|E_{ij}|. It follows that Eε\|E\|\leqslant\varepsilon. Therefore,

where E^2ε\|\hat{E}\|\leqslant 2\varepsilon. Returning to (4.1), we can conclude that (UV)2ε\|(U-V)\|\leqslant 2\sqrt{\varepsilon}, and we have our result. ∎

Where we have used that the spectral norm and nuclear norm are dual. Therefore,

Finally, we are ready to prove that Algorithm 4.11 works.

We claim that in the ttth iteration of step 33, with high probability we have at most 0.45tn0.45^{t}n components remaining to be found, and that Twork(t)=T+F(t)\mathbf{T}_{work}^{(t)}=\mathbf{T}+F^{(t)} where F(t)8tε\|F^{(t)}\|\leqslant 8t\sqrt{\varepsilon}. For t=0t=0, this is easily true.

Now assume this holds for tt, and we will prove it for t+1t+1. Since by assumption tεlogn100εlognO(η)t\sqrt{\varepsilon}\log n\leqslant 100\sqrt{\varepsilon}\log n\leqslant O(\eta), applying Lemma 4.2 and Theorem 4.4 to the running of preprocessing Algorithm 4.1 and the main step Algorithm 4.3 with Twork(t)\mathbf{T}_{work}^{(t)} and Lemma 4.9 to the running of the postprocessing Algorithm 4.8 with Tclean\mathbf{T}_{clean}, in step 3(a)3(a) with high probability we will find m0.9ntm\geqslant 0.9n_{t} vectors b1,,bmb_{1},\ldots,b_{m} so that bi,ai213ε\langle b_{i},a_{i}\rangle^{2}\geqslant 1-3\varepsilon. Furthermore, this takes a total of O~(mnO(η)d2+ω){\widetilde{O}}(mn^{O(\eta)}d^{2+\omega}) time.

for a matrix FF with F8ε\|F\|\leqslant 8\sqrt{\varepsilon}. By induction, this implies that Twork(t+1)=T+F(t+1)\mathbf{T}_{work}^{(t+1)}=\mathbf{T}+F^{(t+1)} where F(t+1)F+F(t)(t+1)8ε\|F^{(t+1)}\|\leqslant\|F\|+\|F^{(t)}\|\leqslant(t+1)8\sqrt{\varepsilon}.

Taking a union bound over the high-probability success of Algorithm 4.3, we have that after t=O(logn)t=O(\log n) steps we have found all of the components. We have spent a total of O~(n1+O(η)d2+ω){\widetilde{O}}(n^{1+O(\eta)}d^{2+\omega}) time in step 3(a). Finally, steps 3(b)-3(e) of Algorithm 4.11 require no more than O~(d3){\widetilde{O}}(d^{3}) time, and since the entire loop runs O~(1){\widetilde{O}}(1) times, we have our result. ∎

4 Supporting Lemmas

Now we circle back and prove the omitted supporting lemmas.

For convenience, fix j:=1j:=1. Let g(1)g^{(1)} be the component of gg in the direction a12a_{1}^{\otimes 2}, and let g(>1)g^{(>1)} be the component of gg orthogonal to a12a_{1}^{\otimes 2}. Notice that g(1),g(>1)g^{(1)},g^{(>1)} are independent.

By the orthogonality of the aia_{i}, our matrix MgM_{g} can be written as

where TjT_{j} is the jjth matrix slice of TT. For convenience, we can refer to the two sums on the right as

First, we get a lower bound on the probability that the coefficient of a1a1a_{1}a_{1}^{\top} is large. Let G1(α)\mathcal{G}_{1}(\alpha) be the event that g(1),a12=g(1)2αlogn|\langle g^{(1)},a_{1}^{\otimes 2}\rangle|=\|g^{(1)}\|\geqslant\sqrt{2\alpha\log n}. By standard tail estimates on univariate Gaussians, we have that

Now, we bound N\|N\|. Define the event E>1(ρ)\mathcal{E}_{>1}(\rho) to be the even that

To bound N\|N\|, it thus remains to understand the term

and then by an application of Markov’s inequality. ∎

Thus, combining the above we have a two-part upper bound on N\|N\|.

Finally, define the event Ea1,E(θ)\mathcal{E}_{a_{1},E}(\theta) to be the event that

We can apply the same arguments to (jgj(>1)Ej)a1\left(\sum_{j}g_{j}^{(>1)}E_{j}\right)^{\top}a_{1}, and we conclude that

Now, by the union bound E>1(ρ)\mathcal{E}_{>1}(\rho) and Ea1,E\mathcal{E}_{a_{1},E} both occur with probability at least 1dρ2dθ1-d^{-\rho}-2d^{-\theta}. Also, we notice that E>1Ea1,E\mathcal{E}_{>1}\cup\mathcal{E}_{a_{1},E} and G1\mathcal{G}_{1} are independent. Therefore, for ρ,θloglogn/logn\rho,\theta\geqslant\log\log n/\log n,

Conditioning on E>1\mathcal{E}_{>1} and G1\mathcal{G}_{1},

where c2αlogn|c|\geqslant\sqrt{2\alpha\log n}, and NN is a matrix of norm at most (ε/δ)c+2(1+ρ)logd(\varepsilon/\delta)c+\sqrt{2(1+\rho)\log d}, such that Na1,Na1(ε/δ)(c+2(1+θ))\|Na_{1}\|,\|N^{\top}a_{1}\|\leqslant(\varepsilon/\delta)(c+\sqrt{2(1+\theta)}).

We now set α\alpha so that cβN|c|\geqslant\beta\|N\|. This occurs when

Choosing β=1+β\beta=1+\beta^{\prime}, ρ,θ=loglogn/logn\rho,\theta=\log\log n/\log n, we have our conclusion for a1a_{1}, and by symmetry for all other aia_{i} in the 13δ1-3\delta fraction of i[n]i\in[n] for which the Frobenius norms of the contractions are small. ∎

Now expanding the vNNvv^{\top}N^{\top}Nv term along the components of vv,

It is easy to see that when c/(1+β)<cκc/(1+\beta)<c-\kappa, this quantity is maximized at α=1\alpha=1, and so by our choice of κ\kappa we have that

and thus Mgκa1a1(1+ε)cκ\|M_{g}-\kappa a_{1}a_{1}^{\top}\|\leqslant(1+\varepsilon)c-\kappa.

Where we have applied the Cauchy-Schwarz inequality, and the assumption that Na1εc\|Na_{1}\|\leqslant\varepsilon c. It follows that

Finally applying Fact A.3, we can conclude that for either the left- or right-singular unit vector uu of MgM_{g},

Choosing δ=2ε(1+β)β\delta=\frac{2\varepsilon(1+\beta)}{\beta} as small as possible, we have our result. ∎

Learning Orthonormal Dictionaries

Here, we show how to use our tensor decomposition algorithm to learn dictionaries with orthonormal basis vectors.

Below, in Section 5.3, we will prove that O~(n3){\widetilde{O}}(n^{3}) samples suffice to estimate the 44th moment tensor within o(1)o(1) spectral norm error. Computing this matrix from O~(n3){\widetilde{O}}(n^{3}) samples takes O~(n3d4){\widetilde{O}}(n^{3}d^{4}) time. Thus we can equivalently formulate the problem as follows:

Finally, given access to a sufficiently large number of samples, we can reduce to the case where the columns of AA are orthogonal:

The proof is straightforward, involving a transformation by the empirical covariance matrix, and we give it below in Section 5.2.

Reshape T\mathbf{T} to T{1,2}{3,4}T_{\{1,2\}\{3,4\}}, and perform the eigenvalue truncation

where ++ denotes projection to the PSD cone.

Compute the truncation of a different reshaping of T>εT^{>\varepsilon},

It follows that when we perform the eigenvalue truncation in step 1 of Algorithm 5.4,

We now remark that σ\sigma maps NN to one of the bounded-norm matrices from Claim 5.6,

Furthermore, because the positive semidefinite cone is a closed convex set, and because projection to closed convex sets can only decrease distances (see Lemma A.2),

Now we prove that the spectral norms of the symmetrizations of the tensor have small spectral norm.

The first matrix that we are interested in is PSD, and can dominated by a tensor power of the identity:

Without loss of generality, let i:=1i:=1 so that b,ai2=1η0.99\langle b,a_{i}\rangle^{2}=1-\eta\geqslant 0.99 (henceforth, we use ii as an ordinary index). We have that

An identical proof, up to signs, gives us a lower bound of 2pα2p\alpha.

For a matrix EE with E4α\|E\|\leqslant 4\alpha.

It remains to argue that the top eigenvector of MuM_{u} is a1a_{1}. We have that

where the last line follows because ww is a unit vector, and we chose ε\varepsilon so that 1ηε>η1-\eta-\varepsilon>\eta. Therefore

Applying Fact A.3, we conclude that if vv is the top eigenvector of MuM_{u}, then v,a12ε8αε18αε\langle v,a_{1}\rangle^{2}\geqslant\frac{\varepsilon-8\alpha}{\varepsilon}\geqslant 1-\frac{8\alpha}{\varepsilon}. Since we can choose ε=12\varepsilon=\frac{1}{2} and still have that ε<0.98<12η\varepsilon<0.98<1-2\eta, the conclusion follows. ∎

2 From independent columns to orthonormal columns

We now use standard techniques to prove that one can reduce from a dictionary with independent columns to a dictionary with orthogonal columns, given sufficiently many samples.

Note that the expected covariance matrix of the samples is equal to a scaled version of the covariance matrix,

So given sufficiently many samples, we can compute a good spectral approximation of Σ1/2\Sigma^{-1/2}, Σ^1/2\hat{\Sigma}^{-1/2} with

Now, define Sdiff=((Σ^1/2)2(Σ1/2)2)S_{diff}=\left((\hat{\Sigma}^{-1/2})^{\otimes 2}-(\Sigma^{-1/2})^{\otimes 2}\right) and Ssum=((Σ^1/2)2+(Σ1/2)2)S_{sum}=\left((\hat{\Sigma}^{-1/2})^{\otimes 2}+(\Sigma^{-1/2})^{\otimes 2}\right). We can factor the difference

Since we can choose the number of samples so as to make this last quantity as small as we would like, as a function of the condition number, and then appealing to Fact 4.10, the reduction is complete. ∎

3 Sample complexity bounds

Below is our bound on the sample complexity of approximating the 44th moment tensor, which we believe may be loose.

This is because of the orthonormality of the aia_{i}, which guarantees that terms in the product MMMM^{\top} in which we have an inner product between two non-identical vectors drop out to .

References

Appendix A Useful Tools

We compute the expectation and variance of our matrix,

The two variance terms correspond to A{αβ}{γ}2\|A_{\{\alpha\beta\}\{\gamma\}}\|^{2} and A{αγ}{β}2\|A_{\{\alpha\gamma\}\{\beta\}}\|^{2} respectively. We can now apply concentration results for matrix Gaussian series to conclude the proof [Oli10]. ∎

The following lemma states that distances can only decrease under projections to a convex set, and is well-known (see e.g. [Roc76]).

If we let Dx=xΠ(x)D_{x}=x-\Pi(x), Dy=yΠ(y)D_{y}=y-\Pi(y),

Now the conclusion will follow from the fact that

where the second inequality is the triangle inequality. Rearranging, the conclusion follows. ∎