Tensor decompositions for learning latent variable models

Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, Matus Telgarsky

Introduction

The method of moments is a classical parameter estimation technique (Pearson, 1894) from statistics which has proved invaluable in a number of application domains. The basic paradigm is simple and intuitive: (i) compute certain statistics of the data—often empirical moments such as means and correlations—and (ii) find model parameters that give rise to (nearly) the same corresponding population quantities. In a number of cases, the method of moments leads to consistent estimators which can be efficiently computed; this is especially relevant in the context of latent variable models, where standard maximum likelihood approaches are typically computationally prohibitive, and heuristic methods can be unreliable and difficult to validate with high-dimensional data. Furthermore, the method of moments can be viewed as complementary to the maximum likelihood approach; simply taking a single step of Newton-Raphson on the likelihood function starting from the moment based estimator (Le Cam, 1986) often leads to the best of both worlds: a computationally efficient estimator that is (asymptotically) statistically optimal.

The primary difficulty in learning latent variable models is that the latent (hidden) state of the data is not directly observed; rather only observed variables correlated with the hidden state are observed. As such, it is not evident the method of moments should fare any better than maximum likelihood in terms of computational performance: matching the model parameters to the observed moments may involve solving computationally intractable systems of multivariate polynomial equations. Fortunately, for many classes of latent variable models, there is rich structure in low-order moments (typically second- and third-order) which allow for this inverse moment problem to be solved efficiently (Cattell, 1944; Cardoso, 1991; Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2012b; Anandkumar et al., 2012c, a; Hsu and Kakade, 2013). What is more is that these decomposition problems are often amenable to simple and efficient iterative methods, such as gradient descent and the power iteration method.

In this work, we observe that a number of important and well-studied latent variable models—including Gaussian mixture models, hidden Markov models, and Latent Dirichlet allocation—share a certain structure in their low-order moments, and this permits certain tensor decomposition approaches to parameter estimation. In particular, this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices.

While much of this (or similar) structure was implicit in several previous works (Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2012b; Anandkumar et al., 2012c, a; Hsu and Kakade, 2013), here we make the decomposition explicit under a unified framework. Specifically, we express the observable moments as sums of rank-one terms, and reduce the parameter estimation task to the problem of extracting a symmetric orthogonal decomposition of a symmetric tensor derived from these observable moments. The problem can then be solved by a variety of approaches, including fixed-point and variational methods.

One approach for obtaining the orthogonal decomposition is the tensor power method of Lathauwer et al. (2000, Remark 3). We provide a convergence analysis of this method for orthogonally decomposable symmetric tensors, as well as a detailed perturbation analysis for a robust (and a computationally tractable) variant (Theorem 5.1). This perturbation analysis can be viewed as an analogue of Wedin’s perturbation theorem for singular vectors of matrices (Wedin, 1972), providing a bound on the error of the recovered decomposition in terms of the operator norm of the tensor perturbation. This analysis is subtle in at least two ways. First, unlike for matrices (where every matrix has a singular value decomposition), an orthogonal decomposition need not exist for the perturbed tensor. Our robust variant uses random restarts and deflation to extract an approximate decomposition in a computationally tractable manner. Second, the analysis of the deflation steps is non-trivial; a naïve argument would entail error accumulation in each deflation step, which we show can in fact be avoided. When this method is applied for parameter estimation in latent variable models previously discussed, improved sample complexity bounds (over previous work) can be obtained using this perturbation analysis.

Finally, we also address computational issues that arise when applying the tensor decomposition approaches to estimating latent variable models. Specifically, we show that the basic operations of simple iterative approaches (such as the tensor power method) can be efficiently executed in time linear in the dimension of the observations and the size of the training data. For instance, in a topic modeling application, the proposed methods require time linear in the number of words in the vocabulary and in the number of non-zero entries of the term-document matrix. The combination of this computational efficiency and the robustness of the tensor decomposition techniques makes the overall framework a promising approach to parameter estimation for latent variable models.

2 Related Work

The connection between tensor decompositions and latent variable models has a long history across many scientific and mathematical disciplines. We review some of the key works that are most closely related to ours.

The role of tensor decompositions in the context of latent variable models dates back to early uses in psychometrics (Cattell, 1944). These ideas later gained popularity in chemometrics, and more recently in numerous science and engineering disciplines, including neuroscience, phylogenetics, signal processing, data mining, and computer vision. A thorough survey of these techniques and applications is given by Kolda and Bader (2009). Below, we discuss a few specific connections to two applications in machine learning and statistics, independent component analysis and latent variable models (between which there is also significant overlap).

Tensor decompositions have been used in signal processing and computational neuroscience for blind source separation and independent component analysis (ICA) (Comon and Jutten, 2010). Here, statistically independent non-Gaussian sources are linearly mixed in the observed signal, and the goal is to recover the mixing matrix (and ultimately, the original source signals). A typical solution is to locate projections of the observed signals that correspond to local extrema of the so-called “contrast functions” which distinguish Gaussian variables from non-Gaussian variables. This method can be effectively implemented using fast descent algorithms (Hyvarinen, 1999). When using the excess kurtosis (i.e., fourth-order cumulant) as the contrast function, this method reduces to a generalization of the power method for symmetric tensors (Lathauwer et al., 2000; Zhang and Golub, 2001; Kofidis and Regalia, 2002). This case is particularly important, since all local extrema of the kurtosis objective correspond to the true sources (under the assumed statistical model) (Delfosse and Loubaton, 1995); the descent methods can therefore be rigorously analyzed, and their computational and statistical complexity can be bounded (Frieze et al., 1996; Nguyen and Regev, 2009; Arora et al., 2012b).

Higher-order tensor decompositions have also been used to develop estimators for commonly used mixture models, hidden Markov models, and other related latent variable models, often using the the algebraic procedure of R. Jennrich (as reported in the article of Harshman, 1970), which is based on a simultaneous diagonalization of different ways of flattening a tensor to matrices. Jennrich’s procedure was employed for parameter estimation of discrete Markov models by Chang (1996) via pair-wise and triple-wise probability tables; and it was later used for other latent variable models such as hidden Markov models (HMMs), latent trees, Gaussian mixture models, and topic models such as latent Dirichlet allocation (LDA) by many others (Mossel and Roch, 2006; Hsu et al., 2012b; Anandkumar et al., 2012c, a; Hsu and Kakade, 2013). In these contexts, it is often also possible to establish strong identifiability results, without giving an explicit estimators, by invoking the non-constructive identifiability argument of Kruskal (1977)—see the article by Allman et al. (2009) for several examples.

Related simultaneous diagonalization approaches have also been used for blind source separation and ICA (as discussed above), and a number of efficient algorithms have been developed for this problem (Bunse-Gerstner et al., 1993; Cardoso and Souloumiac, 1993; Cardoso, 1994; Cardoso and Comon, 1996; Corless et al., 1997; Ziehe et al., 2004). A rather different technique that uses tensor flattening and matrix eigenvalue decomposition has been developed by Cardoso (1991) and later by De Lathauwer et al. (2007). A significant advantage of this technique is that it can be used to estimate overcomplete mixtures, where the number of sources is larger than the observed dimension.

The relevance of tensor analysis to latent variable modeling has been long recognized in the field of algebraic statistics (Pachter and Sturmfels, 2005), and many works characterize the algebraic varieties corresponding to the moments of various classes of latent variable models (Drton et al., 2007; Sturmfels and Zwiernik, 2013). These works typically do not address computational or finite sample issues, but rather are concerned with basic questions of identifiability.

The specific tensor structure considered in the present work is the symmetric orthogonal decomposition. This decomposition expresses a tensor as a linear combination of simple tensor forms; each form is the tensor product of a vector (i.e., a rank-11 tensor), and the collection of vectors form an orthonormal basis. An important property of tensors with such decompositions is that they have eigenvectors corresponding to these basis vectors. Although the concepts of eigenvalues and eigenvectors of tensors is generally significantly more complicated than their matrix counterpart—both algebraically (Qi, 2005; Cartwright and Sturmfels, 2013; Lim, 2005) and computationally (Hillar and Lim, 2013; Kofidis and Regalia, 2002)—the special symmetric orthogonal structure we consider permits simple algorithms to efficiently and stably recover the desired decomposition. In particular, a generalization of the matrix power method to symmetric tensors, introduced by Lathauwer et al. (2000, Remark 3) and analyzed by Kofidis and Regalia (2002), provides such a decomposition. This is in fact implied by the characterization of Zhang and Golub (2001), which shows that iteratively obtaining the best rank-11 approximation of such orthogonally decomposable tensors also yields the exact decomposition. We note that in general, obtaining such approximations for general (symmetric) tensors is NP-hard (Hillar and Lim, 2013).

2.2 Latent Variable Models

This work focuses on the particular application of tensor decomposition methods to estimating latent variable models, a significant departure from many previous approaches in the machine learning and statistics literature. By far the most popular heuristic for parameter estimation for such models is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Redner and Walker, 1984). Although EM has a number of merits, it may suffer from slow convergence and poor quality local optima (Redner and Walker, 1984), requiring practitioners to employ many additional heuristics to obtain good solutions. For some models such as latent trees (Roch, 2006) and topic models (Arora et al., 2012a), maximum likelihood estimation is NP-hard, which suggests that other estimation approaches may be more attractive. More recently, algorithms from theoretical computer science and machine learning have addressed computational and sample complexity issues related to estimating certain latent variable models such as Gaussian mixture models and HMMs (Dasgupta, 1999; Arora and Kannan, 2005; Dasgupta and Schulman, 2007; Vempala and Wang, 2004; Kannan et al., 2008; Achlioptas and McSherry, 2005; Chaudhuri and Rao, 2008; Brubaker and Vempala, 2008; Kalai et al., 2010; Belkin and Sinha, 2010; Moitra and Valiant, 2010; Hsu and Kakade, 2013; Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2012b; Anandkumar et al., 2012c; Arora et al., 2012a; Anandkumar et al., 2012a). See the works by Anandkumar et al. (2012c) and Hsu and Kakade (2013) for a discussion of these methods, together with the computational and statistical hardness barriers that they face. The present work reviews a broad range of latent variables where a mild non-degeneracy condition implies the symmetric orthogonal decomposition structure in the tensors of low-order observable moments.

Notably, another class of methods, based on subspace identification (Overschee and Moor, 1996) and observable operator models/multiplicity automata (Schützenberger, 1961; Jaeger, 2000; Littman et al., 2001), have been proposed for a number of latent variable models. These methods were successfully developed for HMMs by Hsu et al. (2012b), and subsequently generalized and extended for a number of related sequential and tree Markov models models (Siddiqi et al., 2010; Bailly, 2011; Boots et al., 2010; Parikh et al., 2011; Rodu et al., 2013; Balle et al., 2012; Balle and Mohri, 2012), as well as certain classes of parse tree models (Luque et al., 2012; Cohen et al., 2012; Dhillon et al., 2012). These methods use low-order moments to learn an “operator” representation of the distribution, which can be used for density estimation and belief state updates. While finite sample bounds can be given to establish the learnability of these models (Hsu et al., 2012b), the algorithms do not actually give parameter estimates (e.g., of the emission or transition matrices in the case of HMMs).

3 Organization

The rest of the paper is organized as follows. Section 2 reviews some basic definitions of tensors. Section 3 provides examples of a number of latent variable models which, after appropriate manipulations of their low order moments, share a certain natural tensor structure. Section 4 reduces the problem of parameter estimation to that of extracting a certain (symmetric orthogonal) decomposition of a tensor. We then provide a detailed analysis of a robust tensor power method and establish an analogue of Wedin’s perturbation theorem for the singular vectors of matrices. The discussion in Section 6 addresses a number of practical concerns that arise when dealing with moment matrices and tensors.

Preliminaries

Note that if AA is a matrix (p=2p=2), then

where II is the n×nn\times n identity matrix. As a final example of this notation, observe

The notion of tensor (symmetric) rank is considerably more delicate than matrix (symmetric) rank. For instance, it is not clear a priori that the symmetric rank of a tensor should even be finite (Comon et al., 2008). In addition, removal of the best rank-11 approximation of a (general) tensor may increase the tensor rank of the residual (Stegeman and Comon, 2010).

Throughout, we use v=(ivi2)1/2\|v\|=(\sum_{i}v_{i}^{2})^{1/2} to denote the Euclidean norm of a vector vv, and M\|M\| to denote the spectral (operator) norm of a matrix. We also use T\|T\| to denote the operator norm of a tensor, which we define later.

Tensor Structure in Latent Variable Models

In this section, we give several examples of latent variable models whose low-order moments can be written as symmetric tensors of low symmetric rank; some of these examples can be deduced using the techniques developed in the text by McCullagh (1987). The basic form is demonstrated in Theorem 3.1 for the first example, and the general pattern will emerge from subsequent examples.

One advantage of this encoding of words is that the (cross) moments of these random vectors correspond to joint probabilities over words. For instance, observe that

The second advantage of the vector encoding of words is that the conditional expectation of xtx_{t} given h=jh=j is simply μj\mu_{j}, the vector of word probabilities for topic jj:

(where [μj]i[\mu_{j}]_{i} is the ii-th entry in the vector μj\mu_{j}). Because the words are conditionally independent given the topic, we can use this same property with conditional cross moments, say, of x1x_{1} and x2x_{2}:

This and similar calculations lead one to the following theorem.

As we will see in Section 4.3, the structure of M2M_{2} and M3M_{3} revealed in Theorem 3.1 implies that the topic vectors μ1,μ2,,μk\mu_{1},\mu_{2},\dotsc,\mu_{k} can be estimated by computing a certain symmetric tensor decomposition. Moreover, due to exchangeability, all triples (resp., pairs) of words in a document—and not just the first three (resp., two) words—can be used in forming M3M_{3} (resp., M2M_{2}); see Section 6.1.

2 Beyond Raw Moments

In the single topic model above, the raw (cross) moments of the observed words directly yield the desired symmetric tensor structure. In some other models, the raw moments do not explicitly have this form. Here, we show that the desired tensor structure can be found through various manipulations of different moments.

We now consider a mixture of kk Gaussian distributions with spherical covariances. We start with the simpler case where all of the covariances are identical; this probabilistic model is closely related to the (non-probabilistic) kk-means clustering problem (MacQueen, 1967).

2.2 Spherical Gaussian Mixtures: Differing Covariances

2.3 Independent Component Analysis (ICA)

Let μi\mu_{i} denote the ii-th column of the mixing matrix AA.

where TT is the fourth-order tensor with

Note that κi\kappa_{i} corresponds to the excess kurtosis, a measure of non-Gaussianity as κi=0\kappa_{i}=0 if hih_{i} is a standard normal random variable. Furthermore, note that AA is not identifiable if hh is a multivariate Gaussian.

We may derive forms similar to that of M2M_{2} and M3M_{3} from Theorem 3.1 using M4M_{4} by observing that

2.4 Latent Dirichlet Allocation (LDA)

The parameter α0\alpha_{0} (the sum of the “pseudo-counts”) characterizes the concentration of the distribution. As α00\alpha_{0}\rightarrow 0, the distribution degenerates to a single topic model (i.e., the limiting density has, with probability 11, exactly one entry of hh being 11 and the rest are ). At the other extreme, if α=(c,c,,c)\alpha=(c,c,\dotsc,c) for some scalar c>0c>0, then as α0=ck\alpha_{0}=ck\to\infty, the distribution of hh becomes peaked around the uniform vector (1/k,1/k,,1/k)(1/k,1/k,\dotsc,1/k) (furthermore, the distribution behaves like a product distribution). We are typically interested in the case where α0\alpha_{0} is small (e.g., a constant independent of kk), whereupon hh typically has only a few large entries. This corresponds to the setting where the documents are mainly comprised of just a few topics.

Note that α0\alpha_{0} needs to be known to form M2M_{2} and M3M_{3} from the raw moments. This, however, is a much weaker than assuming that the entire distribution of hh is known (i.e., knowledge of the whole parameter vector α\alpha).

3 Multi-View Models

We first note the form for the raw (cross) moments.

The cross moments do not possess a symmetric tensor form when the conditional distributions are different. Nevertheless, the moments can be “symmetrized” via a simple linear transformation of x1x_{1} and x2x_{2} (roughly speaking, this relates x1x_{1} and x2x_{2} to x3x_{3}); this leads to an expression from which the conditional means of x3x_{3} (i.e., μ3,1,μ3,2,,μ3,k\mu_{3,1},\mu_{3,2},\dotsc,\mu_{3,k}) can be recovered. For simplicity, we assume d1=d2=d3=kd_{1}=d_{2}=d_{3}=k; the general case (with dtkd_{t}\geq k) is easily handled using low-rank singular value decompositions.

Assume that {μv,1,μv,2,,μv,k}\{\mu_{v,1},\mu_{v,2},\dotsc,\mu_{v,k}\} are linearly independent for each v{1,2,3}v\in\{1,2,3\}. Define

We now discuss three examples (taken mostly from Anandkumar et al., 2012c) where the above observations can be applied. The first two concern mixtures of product distributions, and the last one is the time-homogeneous hidden Markov model.

For a mixture of product distributions, any partitioning of the dimensions [n][n] into three groups creates three (possibly asymmetric) “views” which are conditionally independent once the mixture component is selected. However, recall that Theorem 3.6 requires that for each view, the kk conditional means be linearly independent. In general, this may not be achievable; consider, for instance, the case μi=ei\mu_{i}=e_{i} for each i[k]i\in[k]. Such cases, where the component means are very aligned with the coordinate basis, are precluded by the incoherence condition.

Define coherence(A):=maxi[n]{eiΠAei}\operatorname{coherence}(A):=\max_{i\in[n]}\{e_{i}^{\scriptscriptstyle\top}\Pi_{A}e_{i}\} to be the largest diagonal entry of the orthogonal projector to the range of AA, and assume AA has rank kk. The coherence lies between k/nk/n and 11; it is largest when the range of AA is spanned by the coordinate axes, and it is k/nk/n when the range is spanned by a subset of the Hadamard basis of cardinality kk. The incoherence condition requires, for some ε,δ(0,1)\varepsilon,\delta\in(0,1), coherence(A)(ε2/6)/ln(3k/δ)\operatorname{coherence}(A)\leq(\varepsilon^{2}/6)/\ln(3k/\delta). Essentially, this condition ensures that the non-degeneracy of the component means is not isolated in just a few of the nn dimensions. Operationally, it implies the following.

for some ε,δ(0,1)\varepsilon,\delta\in(0,1). With probability at least 1δ1-\delta, a random partitioning of the dimensions [n][n] into three groups (for each i[n]i\in[n], independently pick t{1,2,3}t\in\{1,2,3\} uniformly at random and put ii in group tt) has the following property. For each t{1,2,3}t\in\{1,2,3\} and j[k]j\in[k], let μt,j\mu_{t,j} be the entries of μj\mu_{j} put into group tt, and let At:=[μt,1μt,2μt,k]A_{t}:=[\mu_{t,1}|\mu_{t,2}|\dotsb|\mu_{t,k}]. Then for each t{1,2,3}t\in\{1,2,3\}, AtA_{t} has full column rank, and the kk-th largest singular value of AtA_{t} is at least (1ε)/3\sqrt{(1-\varepsilon)/3} times that of AA.

Therefore, three asymmetric views can be created by randomly partitioning the observed random vector xx into x1x_{1}, x2x_{2}, and x3x_{3}, such that the resulting component means for each view satisfy the conditions of Theorem 3.6.

3.2 Spherical Gaussian Mixtures, Revisited

Consider again the case of spherical Gaussian mixtures (cf. Section 3.2). As we shall see in Section 4.3, the previous techniques (based on Theorem 3.2 and Theorem 3.3) lead to estimation procedures when the dimension of xx is kk or greater (and when the kk component means are linearly independent). We now show that when the dimension is slightly larger, say greater than 3k3k, a different (and simpler) technique based on the multi-view structure can be used to extract the relevant structure.

3.3 Hidden Markov Models

Define h:=y2h:=y_{2}, where y2y_{2} is the second hidden state in the Markov chain. Then

x1,x2,x3x_{1},x_{2},x_{3} are conditionally independent given hh;

the distribution of hh is given by the vector w:=TπΔk1w:=T\pi\in\Delta^{k-1};

Note the matrix of conditional means of xtx_{t} has full column rank, for each t{1,2,3}t\in\{1,2,3\}, provided that: (i) OO has full column rank, (ii) TT is invertible, and (iii) π\pi and TπT\pi have positive entries.

Orthogonal Tensor Decompositions

We now show how recovering the μi\mu_{i}’s in our aforementioned problems reduces to the problem of finding a certain orthogonal tensor decomposition of a symmetric tensor. We start by reviewing the spectral decomposition of symmetric matrices, and then discuss a generalization to the higher-order tensor case. Finally, we show how orthogonal tensor decompositions can be used for estimating the latent variable models from the previous section.

Such a decomposition is guaranteed to exist for every symmetric matrix.

Recovery of the viv_{i}’s and λi\lambda_{i}’s can be viewed at least two ways. First, each viv_{i} is fixed under the mapping uMuu\mapsto Mu, up to a scaling factor λi\lambda_{i}:

as vjvi=0v_{j}^{\scriptscriptstyle\top}v_{i}=0 for all jij\neq i by orthogonality. The viv_{i}’s are not necessarily the only such fixed points. For instance, with the multiplicity λ1=λ2=λ\lambda_{1}=\lambda_{2}=\lambda, then any linear combination of v1v_{1} and v2v_{2} is similarly fixed under MM. However, in this case, the decomposition in (1) is not unique, as λ1v1v1+λ2v2v2\lambda_{1}v_{1}v_{1}^{\scriptscriptstyle\top}+\lambda_{2}v_{2}v_{2}^{\scriptscriptstyle\top} is equal to λ(u1u1+u2u2)\lambda(u_{1}u_{1}^{\scriptscriptstyle\top}+u_{2}u_{2}^{\scriptscriptstyle\top}) for any pair of orthonormal vectors, u1u_{1} and u2u_{2} spanning the same subspace as v1v_{1} and v2v_{2}. Nevertheless, the decomposition is unique when λ1,λ2,,λk\lambda_{1},\lambda_{2},\dotsc,\lambda_{k} are distinct, whereupon the vjv_{j}’s are the only directions fixed under uMuu\mapsto Mu up to non-trivial scaling.

The second view of recovery is via the variational characterization of the eigenvalues. Assume λ1>λ2>>λk\lambda_{1}>\lambda_{2}>\dotsb>\lambda_{k}; the case of repeated eigenvalues again leads to similar non-uniqueness as discussed above. Then the Rayleigh quotient

is maximized over non-zero vectors by v1v_{1}. Furthermore, for any s[k]s\in[k], the maximizer of the Rayleigh quotient, subject to being orthogonal to v1,v2,,vs1v_{1},v_{2},\dotsc,v_{s-1}, is vsv_{s}. Another way of obtaining this second statement is to consider the deflated Rayleigh quotient

and observe that vsv_{s} is the maximizer.

Efficient algorithms for finding these matrix decompositions are well studied (Golub and van Loan, 1996, Section 8.2.3), and iterative power methods are one effective class of algorithms.

We remark that in our multilinear tensor notation, we may write the maps uMuu\mapsto Mu and uuMu/u22u\mapsto u^{\scriptscriptstyle\top}Mu/\|u\|_{2}^{2} as

2 The Tensor Case

Decomposing general tensors is a delicate issue; tensors may not even have unique decompositions. Fortunately, the orthogonal tensors that arise in the aforementioned models have a structure which permits a unique decomposition under a mild non-degeneracy condition. We focus our attention to the case p=3p=3, i.e., a third order tensor; the ideas extend to general pp with minor modifications.

Note that since we are focusing on odd-order tensors (p=3p=3), we have added the requirement that the λi\lambda_{i} be positive. This convention can be followed without loss of generality since λivip=λi(vi)p-\lambda_{i}v_{i}^{\otimes p}=\lambda_{i}(-v_{i})^{\otimes p} whenever pp is odd. Also, it should be noted that orthogonal decompositions do not necessarily exist for every symmetric tensor.

In analogy to the matrix setting, we consider two ways to view this decomposition: a fixed-point characterization and a variational characterization. Related characterizations based on optimal rank-11 approximations are given by Zhang and Golub (2001).

For a tensor TT, consider the vector-valued map

which is the third-order generalization of (2). This can be explicitly written as

Observe that (5) is not a linear map, which is a key difference compared to the matrix case.

(To simplify the discussion, we assume throughout that eigenvectors have unit norm; otherwise, for scaling reasons, we replace the above equation with T(I,u,u)=λuuT(I,u,u)=\lambda\|u\|u.) This concept was originally introduced by Lim (2005) and Qi (2005). For orthogonally decomposable tensors T=i=1kλivi3T=\sum_{i=1}^{k}\lambda_{i}v_{i}^{\otimes 3},

By the orthogonality of the viv_{i}, it is clear that T(I,vi,vi)=λiviT(I,v_{i},v_{i})=\lambda_{i}v_{i} for all i[k]i\in[k]. Therefore each (vi,λi)(v_{i},\lambda_{i}) is an eigenvector/eigenvalue pair.

There are a number of subtle differences compared to the matrix case that arise as a result of the non-linearity of (5). First, even with the multiplicity λ1=λ2=λ\lambda_{1}=\lambda_{2}=\lambda, a linear combination u:=c1v1+c2v2u:=c_{1}v_{1}+c_{2}v_{2} may not be an eigenvector. In particular,

may not be a multiple of c1v1+c2v2c_{1}v_{1}+c_{2}v_{2}. This indicates that the issue of repeated eigenvalues does not have the same status as in the matrix case. Second, even if all the eigenvalues are distinct, it turns out that the viv_{i}’s are not the only eigenvectors. For example, set u:=(1/λ1)v1+(1/λ2)v2u:=(1/\lambda_{1})v_{1}+(1/\lambda_{2})v_{2}. Then,

so u/uu/\|u\| is an eigenvector. More generally, for any subset S[k]S\subseteq[k], the vector

starting from θ\theta converges to uu. Note that the map (6) rescales the output to have unit Euclidean norm. Robust eigenvectors are also called attracting fixed points of (6) (see, e.g., Kolda and Mayo, 2011).

The following theorem implies that if TT has an orthogonal decomposition as given in (4), then the set of robust eigenvectors of TT are precisely the set {v1,v2,vk}\{v_{1},v_{2},\ldots v_{k}\}, implying that the orthogonal decomposition is unique. (For even order tensors, the uniqueness is true up to sign-flips of the viv_{i}.)

Let TT have an orthogonal decomposition as given in (4).

The set of robust eigenvectors of TT is equal to {v1,v2,,vk}\{v_{1},v_{2},\dotsc,v_{k}\}.

The proof of Theorem 4.1 is given in Appendix A.1, and follows readily from simple orthogonality considerations. Note that every viv_{i} in the orthogonal tensor decomposition is robust, whereas for a symmetric matrix MM, for almost all initial points, the map θˉMθˉMθˉ\bar{\theta}\mapsto\frac{M\bar{\theta}}{\|M\bar{\theta}\|} converges only to an eigenvector corresponding to the largest magnitude eigenvalue. Also, since the tensor order is odd, the signs of the robust eigenvectors are fixed, as each vi-v_{i} is mapped to viv_{i} under (6).

2.2 Variational Characterization

We now discuss a variational characterization of the orthogonal decomposition. The generalized Rayleigh quotient (Zhang and Golub, 2001) for a third-order tensor is

Let TT have an orthogonal decomposition as given in (4), and consider the optimization problem

The stationary points are eigenvectors of TT.

A stationary point uu is an isolated local maximizer if and only if u=viu=v_{i} for some i[k]i\in[k].

The proof of Theorem 4.2 is given in Appendix A.2. It is similar to local optimality analysis for ICA methods using fourth-order cumulants (e.g., Delfosse and Loubaton, 1995; Frieze et al., 1996).

Again, we see similar distinctions to the matrix case. In the matrix case, the only local maximizers of the Rayleigh quotient are the eigenvectors with the largest eigenvalue (and these maximizers take on the globally optimal value). For the case of orthogonal tensor forms, the robust eigenvectors are precisely the isolated local maximizers.

An important implication of the two characterizations is that, for orthogonally decomposable tensors TT, (i) the local maximizers of the objective function uT(u,u,u)/(uu)3/2u\mapsto T(u,u,u)/(u^{\scriptscriptstyle\top}u)^{3/2} correspond precisely to the vectors viv_{i} in the decomposition, and (ii) these local maximizers can be reliably identified using a simple fixed-point iteration (i.e., the tensor analogue of the matrix power method). Moreover, a second-derivative test based on T(I,I,u)T(I,I,u) can be employed to test for local optimality and rule out other stationary points.

3 Estimation via Orthogonal Tensor Decompositions

We now demonstrate how the moment tensors obtained for various latent variable models in Section 3 can be reduced to an orthogonal form. For concreteness, we take the specific form from the exchangeable single topic model (Theorem 3.1):

(The more general case allows the weights wiw_{i} in M2M_{2} to differ in M3M_{3}, but for simplicity we keep them the same in the following discussion.) We now show how to reduce these forms to an orthogonally decomposable tensor from which the wiw_{i} and μi\mu_{i} can be recovered. See Appendix D for a discussion as to how previous approaches (Mossel and Roch, 2006; Anandkumar et al., 2012c, a; Hsu and Kakade, 2013) achieved this decomposition through a certain simultaneous diagonalization method.

Throughout, we assume the following non-degeneracy condition.

Observe that Condition 4.1 implies that M20M_{2}\succeq 0 is positive semidefinite and has rank kk. This is often a mild condition in applications. When this condition is not met, learning is conjectured to be generally hard for both computational (Mossel and Roch, 2006) and information-theoretic reasons (Moitra and Valiant, 2010). As discussed by Hsu et al. (2012b) and Hsu and Kakade (2013), when the non-degeneracy condition does not hold, it is often possible to combine multiple observations using tensor products to increase the rank of the relevant matrices. Indeed, this observation has been rigorously formulated in very recent works of Bhaskara et al. (2014) and Anderson et al. (2014) using the framework of smoothed analysis (Spielman and Teng, 2009).

As the following theorem shows, the orthogonal decomposition of M~3\widetilde{M}_{3} can be obtained by identifying its robust eigenvectors, upon which the original parameters wiw_{i} and μi\mu_{i} can be recovered. For simplicity, we only state the result in terms of robust eigenvector/eigenvalue pairs; one may also easily state everything in variational form using Theorem 4.2.

Assume Condition 4.1 and take M~3\widetilde{M}_{3} as defined above.

The theorem follows by combining the above discussion with the robust eigenvector characterization of Theorem 4.1. Recall that we have taken as convention that eigenvectors have unit norm, so the μi\mu_{i} are exactly determined from the robust eigenvector/eigenvalue pairs of M~3\widetilde{M}_{3} (together with the pseudoinverse of WW^{\scriptscriptstyle\top}); in particular, the scale of each μi\mu_{i} is correctly identified (along with the corresponding wiw_{i}). Relative to previous works on moment-based estimators for latent variable models (e.g., Anandkumar et al., 2012c, a; Hsu and Kakade, 2013), Theorem 4.3 emphasizes the role of the special tensor structure, which in turn makes transparent the applicability of methods for orthogonal tensor decomposition.

3.2 Local Maximizers of (Cross Moment) Skewness

The variational characterization provides an interesting perspective on the robust eigenvectors for these latent variable models. Consider the exchangeable single topic models (Theorem 3.1), and the objective function

In this case, every local maximizer uu^{*} satisfies M2(I,u)=wiμiM_{2}(I,u^{*})=\sqrt{w_{i}}\mu_{i} for some i[k]i\in[k]. The objective function can be interpreted as the (cross moment) skewness of the random vectors x1,x2,x3x_{1},x_{2},x_{3} along direction uu.

Tensor Power Method

In this section, we consider the tensor power method of Lathauwer et al. (2000, Remark 3) for orthogonal tensor decomposition. We first state a simple convergence analysis for an orthogonally decomposable tensor TT.

When only an approximation T^\hat{T} to an orthogonally decomposable tensor TT is available (e.g., when empirical moments are used to estimate population moments), an orthogonal decomposition need not exist for this perturbed tensor (unlike for the case of matrices), and a more robust approach is required to extract the approximate decomposition. Here, we propose such a variant in Algorithm 1 and provide a detailed perturbation analysis. We note that alternative approaches such as simultaneous diagonalization can also be employed (see Appendix D).

The following lemma establishes the quadratic convergence of the tensor power method—i.e., repeated iteration of (6)—for extracting a single component of the orthogonal decomposition. Note that the initial vector θ0\theta_{0} determines which robust eigenvector will be the convergent point. Computation of subsequent eigenvectors can be computed with deflation, i.e., by subtracting appropriate terms from TT.

That is, repeated iteration of (6) starting from θ0\theta_{0} converges to v1v_{1} at a quadratic rate.

To obtain all eigenvectors, we may simply proceed iteratively using deflation, executing the power method on Tjλjvj3T-\sum_{j}\lambda_{j}v_{j}^{\otimes 3} after having obtained robust eigenvector / eigenvalue pairs {(vj,λj)}\{(v_{j},\lambda_{j})\}.

Proof Let θ0,θ1,θ2,\overline{\theta}_{0},\overline{\theta}_{1},\overline{\theta}_{2},\dotsc be the sequence given by θ0:=θ0\overline{\theta}_{0}:=\theta_{0} and θt:=T(I,θt1,θt1)\overline{\theta}_{t}:=T(I,\theta_{t-1},\theta_{t-1}) for t1t\geq 1. Let ci:=viθ0c_{i}:=v_{i}^{\scriptscriptstyle\top}\theta_{0} for all i[k]i\in[k]. It is easy to check that (i) θt=θt/θt\theta_{t}=\overline{\theta}_{t}/\|\overline{\theta}_{t}\|, and (ii) θt=i=1kλi2t1ci2tvi\overline{\theta}_{t}=\sum_{i=1}^{k}\lambda_{i}^{2^{t}-1}c_{i}^{2^{t}}v_{i}. (Indeed, θt+1=i=1kλi(viθt)2vi=i=1kλi(λi2t1ci2t)2vi=i=1kλi2t+11ci2t+1vi\overline{\theta}_{t+1}=\sum_{i=1}^{k}\lambda_{i}(v_{i}^{\scriptscriptstyle\top}\overline{\theta}_{t})^{2}v_{i}=\sum_{i=1}^{k}\lambda_{i}(\lambda_{i}^{2^{t}-1}c_{i}^{2^{t}})^{2}v_{i}=\sum_{i=1}^{k}\lambda_{i}^{2^{t+1}-1}c_{i}^{2^{t+1}}v_{i}.) Then

Since λ1>0\lambda_{1}>0, we have v1θt>0v_{1}^{\scriptscriptstyle\top}\theta_{t}>0 and hence v1θt2=2(1v1θt)2(1(v1θt)2)\|v_{1}-\theta_{t}\|^{2}=2(1-v_{1}^{\scriptscriptstyle\top}\theta_{t})\leq 2(1-(v_{1}^{\scriptscriptstyle\top}\theta_{t})^{2}) as required.

2 Perturbation Analysis of a Robust Tensor Power Method

Now we consider the case where we have an approximation T^\hat{T} to an orthogonally decomposable tensor TT. Here, a more robust approach is required to extract an approximate decomposition. We propose such an algorithm in Algorithm 1, and provide a detailed perturbation analysis. For simplicity, we assume the tensor T^\hat{T} is of size k×k×kk\times k\times k as per the reduction from Section 4.3. In some applications, it may be preferable to work directly with a n×n×nn\times n\times n tensor of rank knk\leq n (as in Lemma 5.1); our results apply in that setting with little modification.

In our latent variable model applications, T^\hat{T} is the tensor formed by using empirical moments, while TT is the orthogonally decomposable tensor derived from the population moments for the given model. In the context of parameter estimation (as in Section 4.3), EE must account for any error amplification throughout the reduction, such as in the whitening step (see, e.g., Hsu and Kakade, 2013, for such an analysis).

The following theorem is similar to Wedin’s perturbation theorem for singular vectors of matrices (Wedin, 1972) in that it bounds the error of the (approximate) decomposition returned by Algorithm 1 on input T^\hat{T} in terms of the size of the perturbation, provided that the perturbation is small enough.

(Note that the condition on LL holds with L=poly(k)log(1/η)L=\operatorname{poly}(k)\log(1/\eta).) Suppose that Algorithm 1 is iteratively called kk times, where the input tensor is T^\hat{T} in the first call, and in each subsequent call, the input tensor is the deflated tensor returned by the previous call. Let (v^1,λ^1),(v^2,λ^2),,(v^k,λ^k)(\hat{v}_{1},\hat{\lambda}_{1}),(\hat{v}_{2},\hat{\lambda}_{2}),\dotsc,(\hat{v}_{k},\hat{\lambda}_{k}) be the sequence of estimated eigenvector/eigenvalue pairs returned in these kk calls. With probability at least 1η1-\eta, there exists a permutation π\pi on [k][k] such that

The proof of Theorem 5.1 is given in Appendix B.

One important difference from Wedin’s theorem is that this is an algorithm dependent perturbation analysis, specific to Algorithm 1 (since the perturbed tensor need not have an orthogonal decomposition). Furthermore, note that Algorithm 1 uses multiple restarts to ensure (approximate) convergence—the intuition is that by restarting at multiple points, we eventually start at a point in which the initial contraction towards some eigenvector dominates the error EE in our tensor. The proof shows that we find such a point with high probability within L=poly(k)L=\operatorname{poly}(k) trials. It should be noted that for large kk, the required bound on LL is very close to linear in kk.

In general, it is possible, when run on a general symmetric tensor (e.g., T^\hat{T}), for the tensor power method to exhibit oscillatory behavior (Kofidis and Regalia, 2002, Example 1). This is not in conflict with Theorem 5.1, which effectively bounds the amplitude of these oscillations; in particular, if T^=T+E\hat{T}=T+E is a tensor built from empirical moments, the error term EE (and thus the amplitude of the oscillations) can be driven down by drawing more samples. The practical value of addressing these oscillations and perhaps stabilizing the algorithm is an interesting direction for future research (Kolda and Mayo, 2011).

A final consideration is that for specific applications, it may be possible to use domain knowledge to choose better initialization points. For instance, in the topic modeling applications (cf. Section 3.1), the eigenvectors are related to the topic word distributions, and many documents may be primarily composed of words from just single topic. Therefore, good initialization points can be derived from these single-topic documents themselves, as these points would already be close to one of the eigenvectors.

Discussion

In this section, we discuss some practical and application-oriented issues related to the tensor decomposition approach to learning latent variable models.

A number of practical concerns arise when dealing with moment matrices and tensors. Below, we address two issues that are especially pertinent to topic modeling applications (Anandkumar et al., 2012c, a) or other settings where the observations are sparse.

It can be checked that this quantity is equal to

where the sum is over all ordered word triples in the document. A similar expression is easily derived for the contribution of the document to the empirical second-order moment matrix:

Note that the word count vector cc is generally a sparse vector, so this representation allows for efficient multiplication by the moment matrices and tensors in time linear in the size of the document corpus (i.e., the number of non-zero entries in the term-document matrix).

1.2 Dimensionality Reduction

Another serious concern regarding the use of tensor forms of moments is the need to operate on multidimensional arrays with Ω(d3)\Omega(d^{3}) values (it is typically not exactly d3d^{3} due to symmetry). When dd is large (e.g., when it is the size of the vocabulary in natural language applications), even storing a third-order tensor in memory can be prohibitive. Sparsity is one factor that alleviates this problem. Another approach is to use efficient linear dimensionality reduction. When this is combined with efficient techniques for matrix and tensor multiplication that avoid explicitly constructing the moment matrices and tensors (such as the procedure described above), it is possible to avoid any computational scaling more than linear in the dimension dd and the training sample size.

2 Computational Complexity

It is worth noting that the running times differ by roughly a factor of Θ(k1+δ)\Theta(k^{1+\delta}), which can be accounted for by the random restarts. This gap can potentially be alleviated or removed by using a more clever method for initialization. Moreover, using special structure in the problem (as discussed above) can also improve the running time of the tensor power method.

3 Sample Complexity Bounds

Previous work on using linear algebraic methods for estimating latent variable models crucially rely on matrix perturbation analysis for deriving sample complexity bounds (Mossel and Roch, 2006; Hsu et al., 2012b; Anandkumar et al., 2012c, a; Hsu and Kakade, 2013). The learning algorithms in these works are plug-in estimators that use empirical moments in place of the population moments, and then follow algebraic manipulations that result in the desired parameter estimates. As long as these manipulations can tolerate small perturbations of the population moments, a sample complexity bound can be obtained by exploiting the convergence of the empirical moments to the population moments via the law of large numbers. As discussed in Appendix D, these approaches do not directly lead to practical algorithms due to a certain amplification of the error (a polynomial factor of kk, which is observed in practice).

Using the perturbation analysis for the tensor power method, improved sample complexity bounds can be obtained for all of the examples discussed in Section 3. The underlying analysis remains the same as in previous works (e.g., Anandkumar et al., 2012a; Hsu and Kakade, 2013), the main difference being the accuracy of the orthogonal tensor decomposition obtained via the tensor power method. Relative to the previously cited works, the sample complexity bound will be considerably improved in its dependence on the rank parameter kk, as Theorem 5.1 implies that the tensor estimation error (e.g., error in estimating M~3\widetilde{M}_{3} from Section 4.3) is not amplified by any factor explicitly depending on kk (there is a requirement that the error be smaller than some factor depending on kk, but this only contributes to a lower-order term in the sample complexity bound). See Appendix D for further discussion regarding the stability of the techniques from these previous works.

4 Other Perspectives

The tensor power method is simply one approach for extracting the orthogonal decomposition needed in parameter estimation. The characterizations from Section 4.2 suggest that a number of fixed point and variational techniques may be possible (and Appendix D provides yet another perspective based on simultaneous diagonalization). One important consideration is that the model is often misspecified, and therefore approaches with more robust guarantees (e.g., for convergence) are desirable. Our own experience with the tensor power method (as applied to exchangeable topic modeling) is that while model misspecification does indeed affect convergence, the results can be very reasonable even after just a dozen or so iterations (Anandkumar et al., 2012a). Nevertheless, robustness is likely more important in other applications, and thus the stabilization approaches (Kofidis and Regalia, 2002; Regalia and Kofidis, 2003; Erdogan, 2009; Kolda and Mayo, 2011) may be advantageous.

We thank Boaz Barak, Dean Foster, Jon Kelner, and Greg Valiant for helpful discussions. We are also grateful to Hanzhang Hu, Drew Bagnell, and Martial Hebert for alerting us of an issue with Theorem 4.2 and suggesting a simple fix. This work was completed while DH was a postdoctoral researcher at Microsoft Research New England, and partly while AA, RG, and MT were visiting the same lab. AA is supported in part by the NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310 and the ARO Award W911NF-12-1-0404.

A Fixed-Point and Variational Characterizations of Orthogonal Tensor Decompositions

We give detailed proofs of Theorems 4.1 and 4.2 in this section for completeness.

Let TT have an orthogonal decomposition as given in (4).

The set of robust eigenvectors of TT is {v1,v2,,vk}\{v_{1},v_{2},\dotsc,v_{k}\}.

We now prove the second claim. First, we show that every viv_{i} is a robust eigenvector. Pick any i[k]i\in[k], and note that for a sufficiently small ball around viv_{i}, we have that for all θ\theta in this ball, λiviθ\lambda_{i}v_{i}^{\scriptscriptstyle\top}\theta is strictly greater than λjvjθ\lambda_{j}v_{j}^{\scriptscriptstyle\top}\theta for j[k]{i}j\in[k]\setminus\{i\}. Thus by Lemma 5.1, viv_{i} is a robust eigenvector. Now we show that the viv_{i} are the only robust eigenvectors. Suppose there exists some robust eigenvector uu not equal to viv_{i} for any i[k]i\in[k]. Then there exists a positive measure set around uu such that all points in this set converge to uu under repeated iteration of (6). This contradicts the first claim.

A.2 Proof of Theorem 4.2

Let TT have an orthogonal decomposition as given in (4), and consider the optimization problem

The stationary points are eigenvectors of TT.

A stationary point uu is an isolated local maximizer if and only if u=viu=v_{i} for some i[k]i\in[k].

Now we characterize the isolated local maximizers. Observe that if u0u\neq 0 and T(I,u,u)=λuT(I,u,u)=\lambda u for λ<0\lambda<0, then T(u,u,u)<0T(u,u,u)<0. Therefore u=(1δ)uu^{\prime}=(1-\delta)u for any δ(0,1)\delta\in(0,1) satisfies T(u,u,u)=(1δ)3T(u,u,u)>T(u,u,u)T(u^{\prime},u^{\prime},u^{\prime})=(1-\delta)^{3}T(u,u,u)>T(u,u,u). So such a uu cannot be a local maximizer. Moreover, if u<1\|u\|<1 and T(I,u,u)=λuT(I,u,u)=\lambda u for λ>0\lambda>0, then u=(1+δ)uu^{\prime}=(1+\delta)u for a small enough δ(0,1)\delta\in(0,1) satisfies u1\|u^{\prime}\|\leq 1 and T(u,u,u)=(1+δ)3T(u,u,u)>T(u,u,u)T(u^{\prime},u^{\prime},u^{\prime})=(1+\delta)^{3}T(u,u,u)>T(u,u,u). Therefore a local maximizer must have T(I,u,u)=λuT(I,u,u)=\lambda u for some λ0\lambda\geq 0, and u=1\|u\|=1 whenever λ>0\lambda>0.

The point uu is an isolated local maximum if the above quantity is strictly negative for all unit vectors ww orthogonal to uu. We now consider three cases depending on the cardinality of Ω\Omega and the sign of λ\lambda.

Case 1: Ω=1|\Omega|=1 and λ>0\lambda>0. This means u=viu=v_{i} for some i[k]i\in[k] (as u=viu=-v_{i} implies λ=λi<0\lambda=-\lambda_{i}<0). In this case,

Case 2: Ω2|\Omega|\geq 2 and λ>0\lambda>0. Since Ω2|\Omega|\geq 2, we may pick a strict non-empty subset SΩS\subsetneq\Omega and set

Since ϵδ\epsilon\leq\delta, for small enough δ\delta, the RHS is strictly greater than λ\lambda. This implies that uu is not an isolated local maximizer.

for sufficiently small δ\delta. Thus uu is not an isolated local maximizer.

From these exhaustive cases, we conclude that a stationary point uu is an isolated local maximizer if and only if u=viu=v_{i} for some i[k]i\in[k].

We are grateful to Hanzhang Hu, Drew Bagnell, and Martial Hebert for alerting us of an issue with our original statement of Theorem 4.2 and its proof, and for suggesting a simple fix. The original statement used the optimization constraint u=1\|u\|=1 (rather than u1\|u\|\leq 1), but the characterization of the decomposition with this constraint is then only given by isolated local maximizers uu with the additional constraint that T(u,u,u)>0T(u,u,u)>0—that is, there can be isolated local maximizers with T(u,u,u)0T(u,u,u)\leq 0 that are not vectors in the decomposition. The suggested fix of Hu, Bagnell, and Herbert is to relax to u1\|u\|\leq 1, which eliminates isolated local maximizers with T(u,u,u)0T(u,u,u)\leq 0; this way, the characterization of the decomposition is simply the isolated local maximizers under the relaxed constraint.

B Analysis of Robust Power Method

In this section, we prove Theorem 5.1. The proof is structured as follows. In Appendix B.1, we show that with high probability, at least one out of LL random vectors will be a good initializer for the tensor power iterations. An initializer is good if its projection onto an eigenvector is noticeably larger than its projection onto other eigenvectors. We then analyze in Appendix B.2 the convergence behavior of the tensor power iterations. Relative to the proof of Lemma 5.1, this analysis is complicated by the tensor perturbation. We show that there is an initial slow convergence phase (linear rate rather than quadratic), but as soon as the projection of the iterate onto an eigenvector is large enough, it enters the quadratic convergence regime until the perturbation dominates. Finally, we show how errors accrue due to deflation in Appendix B.3, which is rather subtle and different from deflation with matrix eigendecompositions. This is because when some initial set of eigenvectors and eigenvalues are accurately recovered, the additional errors due to deflation are effectively only lower-order terms. These three pieces are assembled in Appendix B.4 to complete the proof of Theorem 5.1.

There exists an absolute constant c>0c>0 such that if positive integer L2L\geq 2 satisfies

It suffices to show that with probability at least 1/21/2, there is a column j[L]j^{*}\in[L] such that

Since maxj[L]Z1,j\max_{j\in[L]}|Z_{1,j}| is a 11-Lipschitz function of LL independent N(0,1)\mathcal{N}(0,1) random variables, it follows that

Observe that the cumulative distribution function of maxj[L]Z1,j\max_{j\in[L]}Z_{1,j} is given by F(z)=Φ(z)LF(z)=\Phi(z)^{L}, where Φ\Phi is the standard Gaussian CDF. Since F(m)=1/2F(m)=1/2, it follows that m=Φ1(21/L)m=\Phi^{-1}(2^{-1/{L}}). It can be checked that

for some absolute constant c>0c>0. Also, let j:=argmaxj[L]Z1,jj^{*}:=\arg\max_{j\in[L]}|Z_{1,j}|.

Now for each j[L]j\in[L], let Z2:k,j:=max{Z2,j,Z3,j,,Zk,j}|Z_{2:k,j}|:=\max\{|Z_{2,j}|,|Z_{3,j}|,\dotsc,|Z_{k,j}|\}. Again, since Z2:k,j|Z_{2:k,j}| is a 11-Lipschitz function of k1k-1 independent N(0,1)\mathcal{N}(0,1) random variables, it follows that

Since Z2:k,j|Z_{2:k,j}| is independent of Z1,j|Z_{1,j}| for all j[L]j\in[L], it follows that the previous two displayed inequalities also hold with jj replaced by jj^{*}.

Therefore we conclude with a union bound that with probability at least 1/21/2,

Since LL satisfies (10) by assumption, in this event, the jj^{*}-th random vector is γ\gamma-separated.

B.2 Tensor Power Iterations

where {v1,v2,,vk}\{v_{1},v_{2},\dotsc,v_{k}\} is an orthonormal basis, and, without loss of generality,

for all i[k]i\in[k]. Combining (17) and (18) gives

Moreover, by the triangle inequality and Hölder’s inequality,

In terms of Rt+1R_{t+1}, RtR_{t}, γt\gamma_{t}, and δt\delta_{t}, this reads

where the last inequality follows from Proposition B.1.

and γt>2(1+2κρ2)δt\gamma_{t}>2(1+2\kappa\rho^{2})\delta_{t}.

If ri,t22ρ2r_{i,t}^{2}\leq 2\rho^{2}, then r_{i,t+1}\geq|r_{i,t}|\bigl{(}1+\frac{\gamma_{t}}{2}\bigr{)}.

If ρ2<ri,t2\rho^{2}<r_{i,t}^{2}, then ri,t+1min{ri,t2/ρ, 1δt1/ρκδt}r_{i,t+1}\geq\min\{r_{i,t}^{2}/\rho,\ \frac{1-\delta_{t}-1/\rho}{\kappa\delta_{t}}\}.

γt+1min{γt,11/ρ}\gamma_{t+1}\geq\min\{\gamma_{t},1-1/\rho\}.

If Rt1+2κρ2R_{t}\leq 1+2\kappa\rho^{2}, then R_{t+1}\geq R_{t}\bigl{(}1+\frac{\gamma_{t}}{3}\bigr{)}, θ1,t+12θ1,t2\theta_{1,t+1}^{2}\geq\theta_{1,t}^{2}, and δt+1δt\delta_{t+1}\leq\delta_{t}.

Proof Consider two (overlapping) cases depending on ri,t2r_{i,t}^{2}.

Case 1: ri,t22ρ2r_{i,t}^{2}\leq 2\rho^{2}. By (15) from Proposition B.2,

where the last inequality uses the assumption γt>2(1+2κρ2)δt\gamma_{t}>2(1+2\kappa\rho^{2})\delta_{t}. This proves the first claim.

Case 2: ρ2<ri,t2\rho^{2}<r_{i,t}^{2}. We split into two sub-cases. Suppose ri,t2(ρ(1δt)1)/(κδt)r_{i,t}^{2}\leq(\rho(1-\delta_{t})-1)/(\kappa\delta_{t}). Then, by (15),

Now suppose instead ri,t2>(ρ(1δt)1)/(κδt)r_{i,t}^{2}>(\rho(1-\delta_{t})-1)/(\kappa\delta_{t}). Then

Observe that if mini1ri,t2(ρ(1δt)1)/(κδt)\min_{i\neq 1}r_{i,t}^{2}\leq(\rho(1-\delta_{t})-1)/(\kappa\delta_{t}), then ri,t+1ri,tr_{i,t+1}\geq|r_{i,t}| for all i[k]i\in[k], and hence γt+1γt\gamma_{t+1}\geq\gamma_{t}. Otherwise we have γt+1>1κδt1δt1/ρ>11/ρ\gamma_{t+1}>1-\frac{\kappa\delta_{t}}{1-\delta_{t}-1/\rho}>1-1/\rho. This proves the third claim.

If mini1ri,t2>(ρ(1δt)1)/(κδt)\min_{i\neq 1}r_{i,t}^{2}>(\rho(1-\delta_{t})-1)/(\kappa\delta_{t}), then we may apply the inequality (20) from the second sub-case of Case 2 above to get

Finally, for the last claim, if Rt1+2κρ2R_{t}\leq 1+2\kappa\rho^{2}, then by (16) from Proposition B.2 and the assumption γt>2(1+2κρ2)δt\gamma_{t}>2(1+2\kappa\rho^{2})\delta_{t},

This in turn implies that θ1,t+12θ1,t2\theta_{1,t+1}^{2}\geq\theta_{1,t}^{2} via Proposition B.1, and thus δt+1δt\delta_{t+1}\leq\delta_{t}.

Assume 0δt<1/20\leq\delta_{t}<1/2 and γt>0\gamma_{t}>0. Pick any β>α>0\beta>\alpha>0 such that

If Rt1/αR_{t}\geq 1/\alpha, then Rt+11/αR_{t+1}\geq 1/\alpha.

If 1/α>Rt1/β1/\alpha>R_{t}\geq 1/\beta, then Rt+1min{Rt2/(2κ), 1/α}R_{t+1}\geq\min\{R_{t}^{2}/(2\kappa),\ 1/\alpha\}.

Now consider the following cases depending on RtR_{t}.

Case 1: Rt1/αR_{t}\geq 1/\alpha. In this case, we have

by (21) (with c=αc=\alpha) and the condition on α\alpha. Combining this with (16) from Proposition B.2 gives

Case 2: 1/βRt<1/α1/\beta\leq R_{t}<1/\alpha. In this case, we have

by (21) (with c=βc=\beta) and the conditions on α\alpha and β\beta. If δt1/(2+Rt2/κ)\delta_{t}\geq 1/(2+R_{t}^{2}/\kappa), then (16) implies

If instead δt<1/(2+Rt2/κ)\delta_{t}<1/(2+R_{t}^{2}/\kappa), then (16) implies

Proof Assume without loss of generality that i=1i^{*}=1. We consider three phases: (i) iterations before the first time tt such that Rt>1+2κρ2=1+8κR_{t}>1+2\kappa\rho^{2}=1+8\kappa (using ρ:=2\rho:=2), (ii) the subsequent iterations before the first time tt such that Rt1/αR_{t}\geq 1/\alpha (where α\alpha will be defined below), and finally (iii) the remaining iterations.

and hence the preconditions on δt\delta_{t} and γt\gamma_{t} of Lemma B.2 hold for t=0t=0. For all tT1t\in T_{1} satisfying the preconditions, Lemma B.2 implies that δt+1δt\delta_{t+1}\leq\delta_{t} and γt+1min{γt,11/ρ}\gamma_{t+1}\geq\min\{\gamma_{t},1-1/\rho\}, so the next iteration also satisfies the preconditions. Hence by induction, the preconditions hold for all iterations in T1T_{1}. Moreover, for all i[k]i\in[k], we have

iterations in T1T_{1}. As soon as mini1ri,t2>1δt1/ρκδt\min_{i\neq 1}r_{i,t}^{2}>\frac{1-\delta_{t}-1/\rho}{\kappa\delta_{t}}, we have that in the next iteration,

and all the while RtR_{t} is growing at a linear rate (given in Lemma B.2, claim 5). Therefore, there are at most an additional

iterations in T1T_{1} over that counted in (22). Therefore, by combining the counts in (22) and (23), we have that the number of iterations in the first phase satisfies

We now analyze the second phase, i.e., the iterates in T2:={t0:tT1, Rt<1/α}T_{2}:=\{t\geq 0:t\notin T_{1},\ R_{t}<1/\alpha\}. Define

Therefore the total number of iterations before Rt1/αR_{t}\geq 1/\alpha is

After Rt1/αR_{t^{\prime\prime}}\geq 1/\alpha (for t:=max(T1T2)+1t^{\prime\prime}:=\max(T_{1}\cup T_{2})+1), we have

we also replace α\alpha and β\beta with α\overline{\alpha} and β\overline{\beta}, which we set to

Therefore, the preconditions of Lemma B.3 are satisfied for the initial iteration tt^{\prime\prime} in this final phase, and by the same arguments as before, the preconditions hold for all subsequent iterations ttt\geq t^{\prime\prime}. Initially, we have Rt1/α1/βR_{t^{\prime\prime}}\geq 1/\alpha\geq 1/\overline{\beta}, and by Lemma B.3, we have that RtR_{t} increases at a quadratic rate in this final phase until Rt1/αR_{t}\geq 1/\overline{\alpha}. So the number of iterations before Rt1/αR_{t}\geq 1/\overline{\alpha} can be bounded as

Once Rt1/αR_{t}\geq 1/\overline{\alpha}, we have

Since sign(θ1,t)=r1,tr1,t12(1δt1)/(1+κδt1r1,t12)=(1δt1)/(1+κδt1)>0\operatorname{sign}(\theta_{1,t})=r_{1,t}\geq r_{1,t-1}^{2}\cdot(1-\overline{\delta}_{t-1})/(1+\kappa\overline{\delta}_{t-1}r_{1,t-1}^{2})=(1-\overline{\delta}_{t-1})/(1+\kappa\overline{\delta}_{t-1})>0 by Proposition B.2, we have θ1,t>0\theta_{1,t}>0. Therefore we can conclude that

B.3 Deflation

for all i[t]i\in[t], then for any unit vector uSk1u\in S^{k-1},

Proof For any unit vector uu and i[t]i\in[t], the error term

lives in span{vi,v^i}\operatorname{span}\{v_{i},\hat{v}_{i}\}; this space is the same as span{vi,v^i}\operatorname{span}\{v_{i},\hat{v}_{i}^{\perp}\}, where

is the projection of v^i\hat{v}_{i} onto the subspace orthogonal to viv_{i}. Since v^ivi2=2(1viv^i)\|\hat{v}_{i}-v_{i}\|^{2}=2(1-v_{i}^{\scriptscriptstyle\top}\hat{v}_{i}), it follows that

(the inequality follows from the assumption v^ivi2\|\hat{v}_{i}-v_{i}\|\leq\sqrt{2}, which in turn implies 0ci10\leq c_{i}\leq 1). By the Pythagorean theorem and the above inequality for cic_{i},

Later, we will also need the following bound, which is easily derived from the above inequalities and the triangle inequality:

We now express Ei(I,u,u)\mathcal{E}_{i}(I,u,u) in terms of the coordinate system defined by viv_{i} and v^i\hat{v}_{i}^{\perp}, depicted below. Define

(Note that the part of uu living in span{vi,v^i}\operatorname{span}\{v_{i},\hat{v}_{i}^{\perp}\}^{\perp} is irrelevant for analyzing Ei(I,u,u)\mathcal{E}_{i}(I,u,u).) We have

The overall error can also be expressed in terms of the AiA_{i} and BiB_{i}:

where the first inequality uses the fact (x+y)22(x2+y2)(x+y)^{2}\leq 2(x^{2}+y^{2}) and the triangle inequality, and the second inequality uses the orthonormality of the viv_{i} and the triangle inequality.

and therefore (via (x+y)22(x2+y2)(x+y)^{2}\leq 2(x^{2}+y^{2}))

The second term, Bi|B_{i}|, is bounded similarly:

Therefore, using the inequality from (25) and again (x+y)22(x2+y2)(x+y)^{2}\leq 2(x^{2}+y^{2}),

B.4 Proof of the Main Theorem

(Note that the condition on LL holds with L=poly(k)log(1/η)L=\operatorname{poly}(k)\log(1/\eta).) Suppose that Algorithm 1 is iteratively called kk times, where the input tensor is T^\hat{T} in the first call, and in each subsequent call, the input tensor is the deflated tensor returned by the previous call. Let (v^1,λ^1),(v^2,λ^2),,(v^k,λ^k)(\hat{v}_{1},\hat{\lambda}_{1}),(\hat{v}_{2},\hat{\lambda}_{2}),\dotsc,(\hat{v}_{k},\hat{\lambda}_{k}) be the sequence of estimated eigenvector/eigenvalue pairs returned in these kk calls. With probability at least 1η1-\eta, there exists a permutation π\pi on [k][k] such that

Proof We prove by induction that for each i[k]i\in[k] (corresponding to the ii-th call to Algorithm 1), with probability at least 1iη/k1-i\eta/k, there exists a permutation π\pi on [k][k] such that the following assertions hold.

For all jij\leq i, vπ(j)v^j8ϵ/λπ(j)\|v_{\pi(j)}-\hat{v}_{j}\|\leq 8\epsilon/\lambda_{\pi(j)} and λπ(j)λ^j12ϵ|\lambda_{\pi(j)}-\hat{\lambda}_{j}|\leq 12\epsilon.

We actually take i=0i=0 as the base case, so we can ignore the first assertion, and just observe that for i=0i=0,

Now fix some i[k]i\in[k], and assume as the inductive hypothesis that, with probability at least 1(i1)η/k1-(i-1)\eta/k, there exists a permutation π\pi such that two assertions above hold for i1i-1 (call this Eventi1\mathsf{Event}_{i-1}). The ii-th call to Algorithm 1 takes as input

which is intended to be an approximation to

By Lemma B.1, with conditional probability at least 1η/k1-\eta/k given Eventi1\mathsf{Event}_{i-1}, at least one of θ0(τ)\theta_{0}^{(\tau)} for τ[L]\tau\in[L] is γ\gamma-separated relative to π(jmax)\pi(j_{\max}), where jmax:=argmaxjiλπ(j)j_{\max}:=\arg\max_{j\geq i}\lambda_{\pi(j)}, (for γ=0.01\gamma=0.01; call this Eventi\mathsf{Event}_{i}^{\prime}; note that the application of Lemma B.1 determines C3C_{3}). Therefore Pr[Eventi1Eventi]=Pr[EventiEventi1]Pr[Eventi1](1η/k)(1(i1)η/k)1iη/k\Pr[\mathsf{Event}_{i-1}\cap\mathsf{Event}_{i}^{\prime}]=\Pr[\mathsf{Event}_{i}^{\prime}|\mathsf{Event}_{i-1}]\Pr[\mathsf{Event}_{i-1}]\geq(1-\eta/k)(1-(i-1)\eta/k)\geq 1-i\eta/k. It remains to show that Eventi1EventiEventi\mathsf{Event}_{i-1}\cap\mathsf{Event}_{i}^{\prime}\subseteq\mathsf{Event}_{i}; so henceforth we condition on Eventi1Eventi\mathsf{Event}_{i-1}\cap\mathsf{Event}_{i}^{\prime}.

On the other hand, by the triangle inequality,

where j:=argmaxjiλπ(j)θπ(j),Nj^{*}:=\arg\max_{j\geq i}\lambda_{\pi(j)}|\theta_{\pi(j),N}|. Therefore

Squaring both sides and using the fact that θπ(j),N2+θπ(j),N21\theta_{\pi(j^{*}),N}^{2}+\theta_{\pi(j),N}^{2}\leq 1 for any jjj\neq j^{*},

This means that θN\theta_{N} is (1/4)(1/4)-separated relative to π(j)\pi(j^{*}). Also, observe that

Since v^i=θ^\hat{v}_{i}=\hat{\theta} and λ^i=λ^\hat{\lambda}_{i}=\hat{\lambda}, the first assertion of the inductive hypothesis is satisfied, as we can modify the permutation π\pi by swapping π(i)\pi(i) and π(j)\pi(j^{*}) without affecting the values of {π(j):ji1}\{\pi(j):j\leq i-1\} (recall jij^{*}\geq i).

To prove that (27) holds, pick any unit vector uSk1u\in S^{k-1} such that there exists ji+1j^{\prime}\geq i+1 with (uvπ(j))21(168ϵ/λπ(j))2(u^{\scriptscriptstyle\top}v_{\pi(j^{\prime})})^{2}\geq 1-(168\epsilon/\lambda_{\pi(j^{\prime})})^{2}. We have, via the second bound on C1C_{1} in (28) and the corresponding assumed bound ϵC1λmink\epsilon\leq C_{1}\cdot\frac{\lambda_{\min}}{k},

From the last induction step (i=ki=k), it is also clear from (29) that Tj=1kλ^jv^j355ϵ\|T-\sum_{j=1}^{k}\hat{\lambda}_{j}\hat{v}_{j}^{\otimes 3}\|\leq 55\epsilon (in Eventk1Eventk\mathsf{Event}_{k-1}\cap\mathsf{Event}_{k}^{\prime}). This completes the proof of the theorem.

C Variant of Robust Power Method that uses a Stopping Condition

In this section we analyze a variant of Algorithm 1 that uses a stopping condition. The variant is described in Algorithm 2. The key difference is that the inner for-loop is repeated until a stopping condition is satisfied (rather than explicitly LL times). The stopping condition ensures that the power iteration is converging to an eigenvector, and it will be satisfied within poly(k)\operatorname{poly}(k) random restarts with high probability. The condition depends on one new quantity, rr, which should be set to r:=k# deflation steps so farr:=k-\text{\# deflation steps so far} (i.e., the first call to Algorithm 2 uses r=kr=k, the second call uses r=k1r=k-1, and so on).

For a matrix AA, we use AF:=(i,jAi,j2)1/2\|A\|_{F}:=(\sum_{i,j}A_{i,j}^{2})^{1/2} to denote its Frobenius norm. For a third-order tensor AA, we use AF:=(iA(I,I,ei)F2)1/2=(iA(I,I,vi)F2)1/2\|A\|_{F}:=(\sum_{i}\|A(I,I,e_{i})\|_{F}^{2})^{1/2}=(\sum_{i}\|A(I,I,v_{i})\|_{F}^{2})^{1/2}.

Combining these bounds on ϕ1|\phi_{1}| gives

which in turn implies (for α(0,1/20)\alpha\in(0,1/20))

Now we prove the final claim. This is done by (i) showing that θ\theta has a large projection onto u1u_{1}, (ii) using an SVD perturbation argument to show that ±u1\pm u_{1} is close to v1v_{1}, and (iii) concluding that θ\theta has a large projection onto v1v_{1}.

so we may apply Wedin’s theorem to obtain

It remains to show that θ1=v1θ\theta_{1}=v_{1}^{\scriptscriptstyle\top}\theta is large. Indeed, by the triangle inequality, Cauchy-Schwarz, and the above inequalities on (u1v1)2(u_{1}^{\scriptscriptstyle\top}v_{1})^{2} and (u1θ)2(u_{1}^{\scriptscriptstyle\top}\theta)^{2},

so sign(θ1)>1\operatorname{sign}(\theta_{1})>-1, meaning θ1>0\theta_{1}>0. Therefore θ1=θ112α\theta_{1}=|\theta_{1}|\geq 1-2\alpha. This proves the final claim.

To the conclusion of Lemma B.4, it can be added that the stopping condition (31) is satisfied by θ=θt\theta=\theta_{t}.

Proof Without loss of generality, assume i=1i^{*}=1. By the triangle inequality and Cauchy-Schwarz,

Using the definition of the tensor Frobenius norm, we have

Combining this with the above inequality implies

Therefore the stopping condition (31) is satisfied.

C.2 Sketch of Analysis of Algorithm 2

The analysis of Algorithm 2 is very similar to the proof of Theorem 5.1 for Algorithm 1, so here we just sketch the essential differences.

First, the guarantee afforded to Algorithm 2 is somewhat different than Theorem 5.1. Specifically, it is of the following form: (i) under appropriate conditions, upon termination, the algorithm returns an accurate decomposition, and (ii) the algorithm terminates after poly(k)\operatorname{poly}(k) random restarts with high probability.

The conditions on ϵ\epsilon and NN are the same (but for possibly different universal constants C1,C2C_{1},C_{2}). In Lemma C.1 and Lemma C.2, there is reference to a condition on the Frobenius norm of EE, but we may use the inequality EFkEkϵ\|E\|_{F}\leq k\|E\|\leq k\epsilon so that the condition is subsumed by the ϵ\epsilon condition.

Now we outline the differences relative to the proof of Theorem 5.1. The basic structure of the induction argument is the same. In the induction step, we argue that (i) if the stopping condition is satisfied, then by Lemma C.1 (with α=0.05\alpha=0.05 and β=1/2\beta=1/2), we have a vector θN\theta_{N} such that, for some jij^{*}\geq i,

λπ(j)λπ(jmax)/(4k)\lambda_{\pi(j^{*})}\geq\lambda_{\pi(j_{\max})}/(4\sqrt{k});

θN\theta_{N} is (1/4)(1/4)-separated relative to π(j)\pi(j^{*});

and (ii) the stopping condition is satisfied within poly(k)\operatorname{poly}(k) random restarts (via Lemma B.1 and Lemma C.2) with high probability. We now invoke Lemma B.4 to argue that executing another NN power iterations starting from θN\theta_{N} gives a vector θ^\hat{\theta} that satisfies

The main difference here, relative to the proof of Theorem 5.1, is that we use κ:=4k\kappa:=4\sqrt{k} (rather than κ=O(1)\kappa=O(1)), but this ultimately leads to the same guarantee after taking into consideration the condition ϵC1λmin/k\epsilon\leq C_{1}\lambda_{\min}/k. The remainder of the analysis is essentially the same as the proof of Theorem 5.1.

D Simultaneous Diagonalization for Tensor Decomposition

As discussed in the introduction, another standard approach to certain tensor decomposition problems is to simultaneously diagonalize a collection of similar matrices obtained from the given tensor. We now examine this approach in the context of our latent variable models, where

Let V:=[μ1μ2μk]V:=[\mu_{1}|\mu_{2}|\dotsb|\mu_{k}] and D(η):=diag(μ1η,μ2η,,μkη)D(\eta):=\operatorname{diag}(\mu_{1}^{\scriptscriptstyle\top}\eta,\mu_{2}^{\scriptscriptstyle\top}\eta,\dotsc,\mu_{k}^{\scriptscriptstyle\top}\eta), so

Thus, the problem of determining the μi\mu_{i} can be cast as a simultaneous diagonalization problem: find a matrix XX such that XM2XX^{\scriptscriptstyle\top}M_{2}X and XM3(I,I,η)XX^{\scriptscriptstyle\top}M_{3}(I,I,\eta)X (for all η\eta) are diagonal. It is easy to see that if the μi\mu_{i} are linearly independent, then the solution X=VX^{\scriptscriptstyle\top}=V^{{\dagger}} is unique up to permutation and rescaling of the columns.

With exact moments, a simple approach is as follows. Assume for simplicity that d=kd=k, and define

It should be noted, however, that the use of a single random choice of η\eta is quite restrictive, and it is easy to see that a simultaneous diagonalization of M(η)M(\eta) for several choices of η\eta can be beneficial. While the uniqueness of the eigendecomposition of M(η)M(\eta) is only guaranteed when the diagonal entries of D(η)D(\eta) are distinct, the simultaneous diagonalization of M(η(1)),M(η(2)),,M(η(m))M(\eta^{(1)}),M(\eta^{(2)}),\dotsc,M(\eta^{(m)}) for vectors η(1),η(2),,η(m)\eta^{(1)},\eta^{(2)},\dotsc,\eta^{(m)} is unique as long as the columns of

are distinct (i.e., for each pair of column indices i,ji,j, there exists a row index rr such that the (r,i)(r,i)-th and (r,j)(r,j)-th entries are distinct). This is a much weaker requirement for uniqueness, and therefore may translate to an improved perturbation analysis. In fact, using the techniques discussed in Section 4.3, we may even reduce the problem to an orthogonal simultaneous diagonalization, which may be easier to obtain. Furthermore, a number of robust numerical methods for (approximately) simultaneously diagonalizing collections of matrices have been proposed and used successfully in the literature (e.g., Bunse-Gerstner et al., 1993; Cardoso and Souloumiac, 1993; Cardoso, 1994; Cardoso and Comon, 1996; Ziehe et al., 2004). Another alternative and a more stable approach compared to full diagonalization is a Schur-like method which finds a unitary matrix UU which simultaneously triangularizes the respective matrices (Corless et al., 1997). It is an interesting open question whether these techniques can yield similar improved learnability results and also enjoy the attractive computational properties of the tensor power method.

References