Learning mixtures of spherical Gaussians: moment methods and spectral decompositions

Daniel Hsu, Sham M. Kakade

Introduction

The Gaussian mixture model (Pearson, 1894; Titterington et al., 1985) is one of the most well-studied and widely-used models in applied statistics and machine learning. An important special case of this model (the primary focus of this work) restricts the Gaussian components to have spherical covariance matrices; this probabilistic model is closely related to the (non-probabilistic) kk-means clustering problem (MacQueen, 1967).

where hh is the discrete random variable with Pr(h=i)=wi\Pr(h=i)=w_{i} for i[k]i\in[k], and zz is a random vector whose conditional distribution given h=ih=i (for some i[k]i\in[k]) is the multivariate Gaussian N(0,σi2I)\mathcal{N}(0,\sigma_{i}^{2}I) with mean zero and covariance σi2I\sigma_{i}^{2}I.

The estimation task is to accurately recover the model parameters (component means, variances, and mixing weights) {(μi,σi2,wi):i[k]}\{(\mu_{i},\sigma_{i}^{2},w_{i}):i\in[k]\} from independent copies of xx.

This work gives a procedure for efficiently and exactly recovering the parameters using a simple spectral decomposition of low-order moments of xx, under the following condition.

The component means span a kk-dimensional subspace (i.e., the matrix AA has column rank kk), and the vector ww has strictly positive entries.

The proposed estimator is based on a spectral decomposition technique (Chang, 1996; Mossel and Roch, 2006; Anandkumar et al., 2012b), and is easily stated in terms of exact population moments of the observed xx. With finite samples, one can use a plug-in estimator based on empirical moments of xx in place of exact moments. These empirical moments converge to the exact moments at a rate of O(n1/2)O(n^{-1/2}), where nn is the sample size. As discussed in Section 3, sample complexity bounds for accurate parameter estimation can be derived using matrix perturbation arguments (Anandkumar et al., 2012b). Since only low-order moments are required by the plug-in estimator, the sample complexity is polynomial in the relevant parameters of the estimation problem.

The first estimators for the Gaussian mixture models were based on the method-of-moments, as introduced by Pearson (1894) (see also Lindsay and Basak, 1993, and the references therein). Roughly speaking, these estimators are based on finding parameters under which the Gaussian mixture distribution has moments approximately matching the observed empirical moments. Finding these parameters typically involves solving systems of multivariate polynomial equations, which is typically computationally challenging. Besides this, the order of the moments of some of the early moment-based estimators were either growing with the dimension dd or the number of components kk, which is undesirable because the empirical estimates of such high-order moments may only be reliable when the sample size is exponential in dd or kk. Both the computational and sample complexity issues have been addressed in recent years, at least under various restrictions. For instance, several distance-based estimators require that the component means be well-separated in Euclidean space, by at least some large factor times the directional standard deviation of the individual component distributions (Dasgupta, 1999; Arora and Kannan, 2001; Dasgupta and Schulman, 2007; Vempala and Wang, 2002; Chaudhuri and Rao, 2008), but otherwise have polynomial computational and sample complexity. Some recent moment-based estimators avoid the minimum separation condition of distance-based estimators by requiring either computational or data resources exponential in the number of mixing components kk (but not the dimension dd) (Belkin and Sinha, 2010; Kalai et al., 2010; Moitra and Valiant, 2010) or by making a non-degenerate multi-view assumption (Anandkumar et al., 2012b).

By contrast, the moment-based estimator described in this work does not require a minimum separation condition, exponential computational or data resources, or non-degenerate multiple views. Instead, it relies only on the non-degeneracy condition discussed above together with a spherical noise condition. The non-degeneracy condition is much weaker than an explicit minimum separation condition because the parameters can be arbitrarily close to being degenerate, as long as the sample size grows polynomially with a natural quantity measuring this closeness to degeneracy (akin to a condition number). Like other moment-based estimators, the proposed estimator is based on solving multivariate polynomial equations, although these solutions can be found efficiently because the problems are cast as eigenvalue decompositions of symmetric matrices, which are efficient to compute.

Recent work of Moitra and Valiant (2010) demonstrates an information-theoretic barrier to estimation for general Gaussian mixture models. More precisely, they construct a pair of one-dimensional mixtures of Gaussians (with separated component means) such that the statistical distance between the two mixture distributions is exponentially small in the number of components. This implies that in the worst case, the sample size required to obtain accurate parameter estimates must grow exponentially with the number of components, even when the component distributions are non-negligibly separated. A consequence of the present work is that natural non-degeneracy conditions preclude these worst case scenarios. The non-degeneracy condition in this work is similar to one used for bypassing computational (cryptographic) barriers to estimation for hidden Markov models (Chang, 1996; Mossel and Roch, 2006; Hsu et al., 2012a; Anandkumar et al., 2012b).

Finally, it is interesting to note that similar algebraic techniques have been developed for certain models in independent component analysis (ICA) (Comon, 1994; Cardoso and Comon, 1996; Hyvärinen and Oja, 2000; Comon and Jutten, 2010; Arora et al., 2012) and other closely related problems (Frieze et al., 1996; Nguyen and Regev, 2009). In contrast to the ICA setting, handling non-spherical Gaussian noise for mixture models appears to be a more delicate issue. These connections and open problems are further discussed in Section 3.

Moment-based estimation

This section describes a method-of-moments estimator for the spherical Gaussian mixture model.

The following theorem is the main structural result that relates the model parameters to observable moments.

We note that in the special case where σ12=σ22==σk2=σ2\sigma_{1}^{2}=\sigma_{2}^{2}=\dotsb=\sigma_{k}^{2}=\sigma^{2} (i.e., the mixture components share a common spherical covariance matrix), the average variance σˉ2\bar{\sigma}^{2} is simply σ2\sigma^{2}, and M3M_{3} has a simpler form:

There is no need to refer to the eigenvectors of the covariance matrix or M1M_{1}.

Since the vectors μiμˉ\mu_{i}-\bar{\mu} for i[k]i\in[k] are linearly dependent (i=1kwi(μiμˉ)=0\sum_{i=1}^{k}w_{i}(\mu_{i}-\bar{\mu})=0), the positive semidefinite matrix i=1kwi(μiμˉ)(μiμˉ)\sum_{i=1}^{k}w_{i}(\mu_{i}-\bar{\mu})\otimes(\mu_{i}-\bar{\mu}) has rank rk1r\leq k-1. Thus, the drd-r smallest eigenvalues are exactly σˉ2\bar{\sigma}^{2}, while all other eigenvalues are strictly larger than σˉ2\bar{\sigma}^{2}. The strict separation of eigenvalues implies that every eigenvector corresponding to σˉ2\bar{\sigma}^{2} is in the null space of i=1kwi(μiμˉ)(μiμˉ)\sum_{i=1}^{k}w_{i}(\mu_{i}-\bar{\mu})\otimes(\mu_{i}-\bar{\mu}); thus v(μiμˉ)=0v^{\scriptscriptstyle\top}(\mu_{i}-\bar{\mu})=0 for all i[k]i\in[k].

Now we can express M1M_{1}, M2M_{2}, and M3M_{3} in terms of the parameters wiw_{i}, μi\mu_{i}, and σi2\sigma_{i}^{2}. First,

Finally, for M3M_{3}, we first observe that

is diagonalizable (where † denotes the Moore-Penrose pseudoinverse); its non-zero eigenvalue / eigenvector pairs (λ1,v1),(λ2,v2),,(λk,vk)(\lambda_{1},v_{1}),(\lambda_{2},v_{2}),\dotsc,(\lambda_{k},v_{k}) satisfy λi=ημπ(i)\lambda_{i}=\eta^{\scriptscriptstyle\top}\mu_{\pi(i)} and M21/2vi=siwπ(i)μπ(i)M_{2}^{1/2}v_{i}=s_{i}\sqrt{w_{\pi(i)}}\mu_{\pi(i)} for some permutation π\pi on [k][k] and signs s1,s2,,sk{±1}s_{1},s_{2},\dotsc,s_{k}\in\{\pm 1\}. The μi\mu_{i}, σi2\sigma_{i}^{2}, and wiw_{i} are recovered (up to permutation) with

where D1(η):=diag(ημ1,ημ2,,ημk)D_{1}(\eta):=\operatorname{diag}(\eta^{\scriptscriptstyle\top}\mu_{1},\eta^{\scriptscriptstyle\top}\mu_{2},\dotsc,\eta^{\scriptscriptstyle\top}\mu_{k}).

An efficiently computable plug-in estimator can be derived from Theorem 2. We state one such algorithm (called LearnGMM) in Appendix C; for simplicity, we restrict to the case where the components share the same common spherical covariance, i.e., σ12=σ22==σk2=σ2\sigma_{1}^{2}=\sigma_{2}^{2}=\dotsb=\sigma_{k}^{2}=\sigma^{2}. The following theorem provides a sample complexity bound for accurate estimation of the component means. Since only low-order moments are used, the sample complexity is polynomial in the relevant parameters of the estimation problem (in particular, the dimension dd and the number of mixing components kk). It is worth noting that the polynomial is quadratic in the inverse accuracy parameter 1/ε1/\varepsilon; this owes to the fact that the empirical moments converge to the population moments at the usual n1/2n^{-1/2} rate as per the central limit theorem.

There exists a polynomial poly()\operatorname{poly}(\cdot) such that the following holds. Let M2M_{2} be the matrix defined in Theorem 2, and ςt[M2]\varsigma_{t}[M_{2}] be its tt-th largest singular value (for t[k]t\in[k]). Let bmax:=maxi[k]μi2b_{\max}:=\max_{i\in[k]}\|\mu_{i}\|_{2} and wmin:=mini[k]wiw_{\min}:=\min_{i\in[k]}w_{i}. Pick any ε,δ(0,1)\varepsilon,\delta\in(0,1). Suppose the sample size nn satisfies

Then with probability at least 1δ1-\delta over the random sample and the internal randomness of the algorithm, there exists a permutation π\pi on [k][k] such that the {μ^i:i[k]}\{\hat{\mu}_{i}:i\in[k]\} returned by LearnGMM satisfy

The proof of Theorem 3 is given in Appendix C. It is also easy to obtain accuracy guarantees for estimating σ2\sigma^{2} and ww. The role of Condition 1 enters by observing that ςk[M2]=0\varsigma_{k}[M_{2}]=0 if either rank(A)<k\operatorname{rank}(A)<k or wmin=0w_{\min}=0, as M2=Adiag(w)AM_{2}=A\operatorname{diag}(w)A^{\scriptscriptstyle\top}. The sample complexity bound then becomes trivial in this case, as the bound grows with 1/ςk[M2]1/\varsigma_{k}[M_{2}] and 1/wmin1/w_{\min}. Finally, we also note that LearnGMM is just one (easy to state) way to obtain an efficient algorithm based on the structure in Theorem 1. It is also possible to use, for instance, simultaneous diagonalization techniques (Bunse-Gerstner et al., 1993) or orthogonal tensor decompositions (Anandkumar et al., 2012a) to extract the parameters from (estimates of) M2M_{2} and M3M_{3}; these alternative methods are more robust to sampling error, and are therefore recommended for practical implementation.

Discussion

Spectral decomposition approaches for ICA.

In contrast to this ICA model, the spherical Gaussian mixture model is one where hh would take values in {e1,e2,,ek}\{e_{1},e_{2},\dotsc,e_{k}\}, and the covariance of zz (given hh) is spherical.

is diagonalizable; the eigenvalues are (ϕμ1)2(ψμ1)2,(ϕμ2)2(ψμ2)2,,(ϕμk)2(ψμk)2\frac{(\phi^{\scriptscriptstyle\top}\mu_{1})^{2}}{(\psi^{\scriptscriptstyle\top}\mu_{1})^{2}},\frac{(\phi^{\scriptscriptstyle\top}\mu_{2})^{2}}{(\psi^{\scriptscriptstyle\top}\mu_{2})^{2}},\dotsc,\frac{(\phi^{\scriptscriptstyle\top}\mu_{k})^{2}}{(\psi^{\scriptscriptstyle\top}\mu_{k})^{2}} and each have geometric multiplicity one, and the corresponding eigenvectors are μ1,μ2,,μk\mu_{1},\mu_{2},\dotsc,\mu_{k} (up to scaling and permutation).

Again, choosing ϕ\phi and ψ\psi as random unit vectors ensures the distinctness assumption is satisfied almost surely, and a finite sample analysis can be given using standard matrix perturbation techniques (Anandkumar et al., 2012b). A number of related deterministic algorithms based on algebraic techniques are discussed in the text of Comon and Jutten (2010). Recent work of Arora et al. (2012) provides a finite sample complexity analysis for an efficient estimator based on local search.

Non-degeneracy.

The non-degeneracy assumption (Condition 1) is quite natural, and its has the virtue of permitting tractable and consistent estimators. Although previous work has typically tied it with additional assumptions, this work shows that they are largely unnecessary.

One drawback of Condition 1 is that it prevents the straightforward application of these techniques to certain problem domains (e.g., automatic speech recognition (ASR), where the number of mixture components is typically enormous, but the dimension of observations is relatively small; alternatively, the span of the means has dimension <k<k). To compensate, one may require multiple views, which are granted by a number of models, including hidden Markov models used in ASR (Hsu et al., 2012a; Anandkumar et al., 2012b), and combining these views in a tensor product fashion (Allman et al., 2009). This increases the complexity of the estimator, but that may be inevitable as estimation for certain non-singular models is conjectured to be computationally intractable (Mossel and Roch, 2006).

Acknowledgements

We thank Dean Foster and Anima Anandkumar for helpful insights. We also thank Rong Ge and Sanjeev Arora for discussions regarding their recent work on ICA.

References

Appendix A Connection to independent component analysis

By assumption, the diagonal entries of D2(ϕ)D2(ψ)1D_{2}(\phi)D_{2}(\psi)^{-1} are distinct, and therefore

is diagonalizable, and every eigenvalue has geometric multiplicity one. ∎

Appendix B Incoherence and random rotations

Anandkumar et al. (2012b) show that if AA has rank kk and also satisfies a mild incoherence condition, then a random partitioning guarantees that each AtA_{t} has rank kk, and lower-bounds the kk-th largest singular value of each AtA_{t} by that of AA. The condition is similar to the spreading condition of Chaudhuri and Rao (2008).

Define coherence(A):=maxi[d]{eiΠAei}\operatorname{coherence}(A):=\max_{i\in[d]}\{e_{i}^{\scriptscriptstyle\top}\Pi_{A}e_{i}\} to be the largest diagonal entry of the ortho-projector ΠA\Pi_{A} to the range of AA. When AA has rank kk, we have coherence(A)[k/d,1]\operatorname{coherence}(A)\in[k/d,1]; it is maximized when range(A)=span{e1,e2,,ek}\operatorname{range}(A)=\operatorname{span}\{e_{1},e_{2},\dotsc,e_{k}\} and minimized when the range is spanned by a subset of the Hadamard basis of cardinality kk. Roughly speaking, if the matrix of conditional means has low coherence, then its full-rank property is witnessed by many partitions of [d][d]; this is made formal in the following lemma.

Assume AA has rank kk and that coherence(A)(ε2/6)/ln(3k/δ)\operatorname{coherence}(A)\leq(\varepsilon^{2}/6)/\ln(3k/\delta) for some ε,δ(0,1)\varepsilon,\delta\in(0,1). With probability at least 1δ1-\delta, a random partitioning of the dimensions [d][d] into three groups (for each i[d]i\in[d], independently pick t{1,2,3}t\in\{1,2,3\} uniformly at random and put ii in group tt) has the following property. For each t{1,2,3}t\in\{1,2,3\}, the matrix AtA_{t} obtained by selecting the rows of AA in group tt has full column rank, and the kk-th largest singular value of AtA_{t} is at least (1ε)/3\sqrt{(1-\varepsilon)/3} times that of AA.

Take η\eta from Lemma 2 and ε,δ\varepsilon,\delta from Lemma 1 to be constants. Then the incoherence condition of Lemma 1 is satsified provided that dc(klogk)d\geq c\cdot(k\log k) for some positive constant cc.

Appendix C Learning algorithm and finite sample analysis

In this section, we state and analyze a learning algorithm based on the estimator from Theorem 2, which assumed availability of exact moments of xx. The proposed algorithm only uses a finite sample to estimate moments, and also explicitly deals with the eigenvalue separation condition assumed in Theorem 2 via internal randomization.

C.2 Algorithm

The proposed algorithm, called LearnGMM, is described in Figure 1. The algorithm essentially implements the decomposition strategy in Theorem 2 using plug-in moments. To simplify the analysis, we split our sample (say, initially of size 2n2n) in two: we use the first half for empirical moments (μ^\hat{\mu} and M^2\widehat{\mathcal{M}}_{2}) used in constructing σ^2\hat{\sigma}^{2}, M^2\widehat{M}_{2}, W^\widehat{W}, and B^\widehat{B}; and we use the second half for empirical moments (W^μ^\widehat{W}^{\scriptscriptstyle\top}\underline{\hat{\mu}} and M^3[W^,W^,W^]\underline{\widehat{\mathcal{M}}_{3}}[\widehat{W},\widehat{W},\widehat{W}] used in constructing M^3[W^,W^,W^]\widehat{M}_{3}[\widehat{W},\widehat{W},\widehat{W}]. Observe that this ensures M^3\widehat{M}_{3} is independent of W^\widehat{W}.

Let {(xi,hi):i[n]}\{(x_{i},h_{i}):i\in[n]\} be nn i.i.d. copies of (x,h)(x,h), and write S:={x1,x2,,xn}\mathcal{S}:=\{x_{1},x_{2},\dotsc,x_{n}\}. Let S\underline{\mathcal{S}} be an independent copy of S\mathcal{S}. Furthermore, define the following moments and empirical moments:

So S\mathcal{S} represents the first half of the sample, and S\underline{\mathcal{S}} represents the second half of the sample.

C.3 Structure of the moments

We first recall the basic structure of the moments μ\mu, M2\mathcal{M}_{2}, and M3\mathcal{M}_{3} as established in Theorem 2; for simplicity, we restrict to the special case where σ12=σ22==σk2=σ2\sigma_{1}^{2}=\sigma_{2}^{2}=\dotsb=\sigma_{k}^{2}=\sigma^{2}.

C.4 Concentration behavior of empirical quantities

In this subsection, we prove concentration properties of empirical quantities based on S\mathcal{S}; clearly the same properties hold for S\underline{\mathcal{S}}.

Let Si:={xjS:hj=i}\mathcal{S}_{i}:=\{x_{j}\in\mathcal{S}:h_{j}=i\} and w^i:=Si/S\hat{w}_{i}:=|\mathcal{S}_{i}|/|\mathcal{S}| for i[k]i\in[k]. Also, define the following (empirical) conditional moments:

Pick any δ(0,1/2)\delta\in(0,1/2). With probability at least 12δ1-2\delta,

The first inequality follows from Bernstein’s inequality and a union bound. The second inequality follows from a simple application of McDiarmid’s inequality (see Hsu et al., 2012a, Proposition 19). ∎

First-order moments: with probability at least 1δ1-\delta,

Second-order moments: with probability at least 1δ1-\delta,

Third-order moments: with probability at least 1δ1-\delta,

First-order moments. Observe that (w^in/σ2)U(μ^iμi)22(\hat{w}_{i}n/\sigma^{2})\|U^{\scriptscriptstyle\top}(\hat{\mu}_{i}-\mu_{i})\|_{2}^{2} is distributed as the sum of rr independent χ2\chi^{2} random variables, each with one degree of freedom. Thus, Lemma 18 and union bounds imply

Second-order moments. Since M2,i=σ2I+μiμi\mathcal{M}_{2,i}=\sigma^{2}I+\mu_{i}\mu_{i}^{\scriptscriptstyle\top}, it follows by the triangle and Cauchy-Schwarz inequalities that

A tail bound for the first term follows from Lemma 19, combined with a union bound:

Third-order moments. It can be checked that

Therefore, by the triangle and Cauchy-Schwarz inequalities,

A tail bound for the first term is given by Lemma 21, combined with a union bound:

The other terms are handled as per above. ∎

We just show the third claimed inequalitiy, as the others are similar. Write as shorthand T^i:=M^3,i[R,R,R]\widehat{T}_{i}:=\widehat{\mathcal{M}}_{3,i}[R,R,R] and Ti:=M3,i[R,R,R]T_{i}:=\mathcal{M}_{3,i}[R,R,R]. Then

where the first and second steps use the triangle inequality, and the second step uses Hölder’s inequality. ∎

The covariance matrix can be written as M2μμ\mathcal{M}_{2}-\mu\mu^{\scriptscriptstyle\top}, and the empirical covariance matrix can be written as M^2μ^μ^\widehat{\mathcal{M}}_{2}-\hat{\mu}\hat{\mu}^{\scriptscriptstyle\top}. Recall that the estimate of σ2\sigma^{2}, denoted by σ^2\hat{\sigma}^{2}, is given by the kk-th largest eigenvalue of the empirical covariance matrix M^2μ^μ^\widehat{\mathcal{M}}_{2}-\hat{\mu}\hat{\mu}^{\scriptscriptstyle\top}; and that the estimate of M2M_{2}, denoted by M^2\widehat{M}_{2}, is the best rank-kk approximation to M^2σ^2I\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I. Of course, the singular values of a positive semi-definite matrix are the same as its eigenvalues; in particular, σ^2=ςk[M^2μ^μ^]\hat{\sigma}^{2}=\varsigma_{k}[\widehat{\mathcal{M}}_{2}-\hat{\mu}\hat{\mu}^{\scriptscriptstyle\top}].

σ^2σ2M^2M22+2μ2μ^μ2+μ^μ22|\hat{\sigma}^{2}-\sigma^{2}|\leq\|\widehat{\mathcal{M}}_{2}-\mathcal{M}_{2}\|_{2}+2\|\mu\|_{2}\|\hat{\mu}-\mu\|_{2}+\|\hat{\mu}-\mu\|_{2}^{2}.

M^2M224M^2M22+4μ2μ^μ2+2μ^μ22\|\widehat{M}_{2}-M_{2}\|_{2}\leq 4\|\widehat{\mathcal{M}}_{2}-\mathcal{M}_{2}\|_{2}+4\|\mu\|_{2}\|\hat{\mu}-\mu\|_{2}+2\|\hat{\mu}-\mu\|_{2}^{2}.

Using Weyl’s inequality (Stewart and Sun, 1990, Theorem 4.11, p. 204), we obtain ςk[M^2μ^μ^]ςk[M2μμ](M^2μ^μ^)(M2μμ)2M^2M22+2μ2μ^μ2+μ^μ22|\varsigma_{k}[\widehat{\mathcal{M}}_{2}-\hat{\mu}\hat{\mu}^{\scriptscriptstyle\top}]-\varsigma_{k}[\mathcal{M}_{2}-\mu\mu^{\scriptscriptstyle\top}]|\leq\|(\widehat{\mathcal{M}}_{2}-\hat{\mu}\hat{\mu}^{\scriptscriptstyle\top})-(\mathcal{M}_{2}-\mu\mu^{\scriptscriptstyle\top})\|_{2}\leq\|\widehat{\mathcal{M}}_{2}-\mathcal{M}_{2}\|_{2}+2\|\mu\|_{2}\|\hat{\mu}-\mu\|_{2}+\|\hat{\mu}-\mu\|_{2}^{2}. The first claim then follows by observing that ςk[M2μμ]=σ2\varsigma_{k}[\mathcal{M}_{2}-\mu\mu^{\scriptscriptstyle\top}]=\sigma^{2} as per Theorem 1.

For the second claim, observe that ςk+1(M2σ2I)=0\varsigma_{k+1}(\mathcal{M}_{2}-\sigma^{2}I)=0 as M2σ2I\mathcal{M}_{2}-\sigma^{2}I has rank kk. Therefore ςk+1(M^2σ^2I)=ςk+1(M^2σ^2I)ςk+1(M2σ2I)(M^2σ^2I)(M2σ2I)2\varsigma_{k+1}(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)=|\varsigma_{k+1}(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)-\varsigma_{k+1}(\mathcal{M}_{2}-\sigma^{2}I)|\leq\|(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)-(\mathcal{M}_{2}-\sigma^{2}I)\|_{2}, again using Weyl’s inequality. Since M^2\widehat{M}_{2} is the best rank-kk approximation to M^2σ^2I\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I, it follows that M^2(M^2σ^2I)2ςk+1(M^2σ^2I)(M^2σ^2I)(M2σ2I)2\|\widehat{M}_{2}-(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)\|_{2}\leq\varsigma_{k+1}(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)\leq\|(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)-(\mathcal{M}_{2}-\sigma^{2}I)\|_{2}. Therefore M^2M22(M^2σ^2I)(M2σ2I)2+M^2(M^2σ^2I)22(M^2σ^2I)(M2σ2I)22M^2M22+2σ^2σ24M^2M22+4μ2μ^μ2+2μ^μ22\|\widehat{M}_{2}-M_{2}\|_{2}\leq\|(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)-(\mathcal{M}_{2}-\sigma^{2}I)\|_{2}+\|\widehat{M}_{2}-(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)\|_{2}\leq 2\|(\widehat{\mathcal{M}}_{2}-\hat{\sigma}^{2}I)-(\mathcal{M}_{2}-\sigma^{2}I)\|_{2}\leq 2\|\widehat{\mathcal{M}}_{2}-\mathcal{M}_{2}\|_{2}+2|\hat{\sigma}^{2}-\sigma^{2}|\leq 4\|\widehat{\mathcal{M}}_{2}-\mathcal{M}_{2}\|_{2}+4\|\mu\|_{2}\|\hat{\mu}-\mu\|_{2}+2\|\hat{\mu}-\mu\|_{2}^{2}. ∎

Recall that the estimate of M3M_{3}, denoted by M^3\widehat{M}_{3}, is given by M^3σ^2i=1d(μ^eiei+eiμ^ei+eieiμ^)\underline{\widehat{\mathcal{M}}_{3}}-\hat{\sigma}^{2}\sum_{i=1}^{d}(\underline{\hat{\mu}}\otimes e_{i}\otimes e_{i}+e_{i}\otimes\underline{\hat{\mu}}\otimes e_{i}+e_{i}\otimes e_{i}\otimes\underline{\hat{\mu}}).

Let G^:=i=1d(μ^eiei+eiμ^ei+eieiμ^\widehat{G}:=\sum_{i=1}^{d}(\underline{\hat{\mu}}\otimes e_{i}\otimes e_{i}+e_{i}\otimes\underline{\hat{\mu}}\otimes e_{i}+e_{i}\otimes e_{i}\otimes\underline{\hat{\mu}} and G:=i=1d(μeiei+eiμei+eieiμ)G:=\sum_{i=1}^{d}(\mu\otimes e_{i}\otimes e_{i}+e_{i}\otimes\mu\otimes e_{i}+e_{i}\otimes e_{i}\otimes\mu). Then

Observe that by the triangle inequality, G[R,R,R]23R22Rμ2\|G[R,R,R]\|_{2}\leq 3\|R\|_{2}^{2}\|R^{\scriptscriptstyle\top}\mu\|_{2} and (G^G)[R,R,R]23R22R(μ^μ)2\|(\widehat{G}-G)[R,R,R]\|_{2}\leq 3\|R\|_{2}^{2}\|R^{\scriptscriptstyle\top}(\underline{\hat{\mu}}-\mu)\|_{2}. Furthermore, by Lemma 7, we have σ^2σ2M^2M22+2μ2μ^μ2+μ^μ22|\hat{\sigma}^{2}-\sigma^{2}|\leq\|\widehat{\mathcal{M}}_{2}-\mathcal{M}_{2}\|_{2}+2\|\mu\|_{2}\|\underline{\hat{\mu}}-\mu\|_{2}+\|\underline{\hat{\mu}}-\mu\|_{2}^{2}. Therefore the claim follows. ∎

C.6 Properties of projection and whitening operators

Define EM2:=M^2M22/ςk[M2]\mathcal{E}_{M_{2}}:=\|\widehat{M}_{2}-M_{2}\|_{2}/\varsigma_{k}[M_{2}].

Assume EM21/3\mathcal{E}_{M_{2}}\leq 1/3. Then

(1+EM2)SU^M^2U^=S^(1EM2)S0(1+\mathcal{E}_{M_{2}})S\succeq\widehat{U}^{\scriptscriptstyle\top}\widehat{M}_{2}\widehat{U}=\widehat{S}\succeq(1-\mathcal{E}_{M_{2}})S\succ 0.

ςk[U^U]1(9/4)EM22>0\varsigma_{k}[\widehat{U}^{\scriptscriptstyle\top}U]\geq\sqrt{1-(9/4)\mathcal{E}_{M_{2}}^{2}}>0.

ςk[U^M2U^](1(9/4)EM22)ςk[M2]>0\varsigma_{k}[\widehat{U}^{\scriptscriptstyle\top}M_{2}\widehat{U}]\geq(1-(9/4)\mathcal{E}_{M_{2}}^{2})\varsigma_{k}[M_{2}]>0.

(IU^U^)UU2(3/2)EM2\|(I-\widehat{U}\widehat{U}^{\scriptscriptstyle\top})UU^{\scriptscriptstyle\top}\|_{2}\leq(3/2)\mathcal{E}_{M_{2}}.

By the assumptions that EM21/3\mathcal{E}_{M_{2}}\leq 1/3 and M2M_{2} is symmetric positive definite, we have

by Weyl’s inequality. Therefore M^2\widehat{M}_{2} is symmetric positive definite, and

Recall that W^=U^(U^M^2U^)1/2\widehat{W}=\widehat{U}(\widehat{U}^{\scriptscriptstyle\top}\widehat{M}_{2}\widehat{U})^{{\dagger}1/2}.

Define W:=W^(W^M2W^)1/2W:=\widehat{W}(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{{\dagger}1/2}. Assume EM21/3\mathcal{E}_{M_{2}}\leq 1/3. Then

W^M2W^\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W} is symmetric positive definite, WM2W=IW^{\scriptscriptstyle\top}M_{2}W=I, and WAdiag(w)1/2W^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2} is orthogonal.

W^21(1EM2)ςk[M2]\|\widehat{W}\|_{2}\leq\frac{1}{\sqrt{(1-\mathcal{E}_{M_{2}})\varsigma_{k}[M_{2}]}}.

(W^M2W^)1/2I2(3/2)EM2\|(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{1/2}-I\|_{2}\leq(3/2)\mathcal{E}_{M_{2}}, (W^M2W^)1/2I2(3/2)EM2\|(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{-1/2}-I\|_{2}\leq(3/2)\mathcal{E}_{M_{2}}, W^Adiag(w)1/221+(3/2)EM2\|\widehat{W}^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2}\|_{2}\leq\sqrt{1+(3/2)\mathcal{E}_{M_{2}}}, (W^W)Adiag(w)1/22(3/2)1+(3/2)EM2EM2\|(\widehat{W}-W)^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2}\|_{2}\leq(3/2)\sqrt{1+(3/2)\mathcal{E}_{M_{2}}}\mathcal{E}_{M_{2}}.

By Lemma 9 (first and third claims), the matrices U^M^2U^\widehat{U}^{\scriptscriptstyle\top}\widehat{M}_{2}\widehat{U} and U^M2U^\widehat{U}^{\scriptscriptstyle\top}M_{2}\widehat{U} are symmetric positive definite. Therefore

Since M2=Adiag(w)AM_{2}=A\operatorname{diag}(w)A^{\scriptscriptstyle\top}, it follows from the third equation above that WAdiag(w)1/2W^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2} is orthogonal. Thus the first claim is established.

where the last inequality follows from Lemma 9 (first claim).

To show the third claim, we first bound W^M2W^I2\|\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W}-I\|_{2} as

where the second inequality follows from the first claim. This implies that every eigenvalue of W^M2W^\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W} is contained in the interval of radius (3/2)EM2(3/2)\mathcal{E}_{M_{2}} around 11. Because (1+x)1/21x|(1+x)^{-1/2}-1|\leq|x| for all x1/2|x|\leq 1/2, the same is true of the eigenvalues of (W^M2W^)1/2(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{-1/2}:

where the vectors {WAdiag(w)1/2ei:i[k]}\{W^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2}e_{i}:i\in[k]\} are orthonormal. Furthermore, the eigenvectors of T[u]T[u] are {WAdiag(w)1/2ei:i[k]}\{W^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2}e_{i}:i\in[k]\} and the corresponding eigenvalues are {uWμi:i[k]}\{u^{\scriptscriptstyle\top}W^{\scriptscriptstyle\top}\mu_{i}:i\in[k]\}.

The structure of TT follows from Lemma 3, and the orthogonality of {WAdiag(w)1/2ei:i[k]}\{W^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2}e_{i}:i\in[k]\} follows from Lemma 10 (first claim). The eigendecomposition of T[u]T[u] is then readily seen from the structure of TT. ∎

Assume EM21/3\mathcal{E}_{M_{2}}\leq 1/3. Then

By Lemma 11, T=i=1kwi1/2viviviT=\sum_{i=1}^{k}w_{i}^{-1/2}v_{i}\otimes v_{i}\otimes v_{i} for some orthonormal vectors {vi:i[k]}\{v_{i}:i\in[k]\}, so T21/wmin\|T\|_{2}\leq 1/\sqrt{w_{\min}}. By Lemma 10 (first and third claims), W^=W(W^M2W^)1/2\widehat{W}=W(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{1/2} and (W^M2W^)1/2I2(3/2)EM2\|(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{1/2}-I\|_{2}\leq(3/2)\mathcal{E}_{M_{2}}. Therefore

Thus we can bound T^T2\|\widehat{T}-T\|_{2} using the triangle inequality and the above bound:

C.7 Eigendecomposition analysis

where the probability is taken with respect to the distribution of θ\theta.

By Lemma 17, with probability at least 1/21/2,

Now fix any iji\neq j. By Lemma 10 (first claim), WAdiag(w)1/2W^{\scriptscriptstyle\top}A\operatorname{diag}(w)^{1/2} is orthogonal, so WA(eiej)2=diag(w)1/2(eiej)2=ei/wiej/wj2=1/wi+1/wj\|W^{\scriptscriptstyle\top}A(e_{i}-e_{j})\|_{2}=\|\operatorname{diag}(w)^{-1/2}(e_{i}-e_{j})\|_{2}=\|e_{i}/\sqrt{w_{i}}-e_{j}/\sqrt{w_{j}}\|_{2}=\sqrt{1/w_{i}+1/w_{j}}. Similarly, for any i[k]i\in[k], WAei2=1/wi\|W^{\scriptscriptstyle\top}Ae_{i}\|_{2}=\sqrt{1/w_{i}}. Therefore minqQWAq2=mini[k]1/wi\min_{q\in Q}\|W^{\scriptscriptstyle\top}Aq\|_{2}=\min_{i\in[k]}\sqrt{1/w_{i}}. ∎

Pick any δ(0,1)\delta\in(0,1). If tlog2(1/δ)t\geq\log_{2}(1/\delta), then with probability at least 1δ1-\delta, the trial τ^:=argmaxt[t]Δ^(t)\hat{\tau}:=\arg\max_{t^{\prime}\in[t]}\widehat{\Delta}(t^{\prime}) satisfies

For each t[t]t^{\prime}\in[t], the eigenvalues of λ^1,λ^2,,λ^k\hat{\lambda}_{1},\hat{\lambda}_{2},\dotsb,\hat{\lambda}_{k} of T^[θt]\widehat{T}[\theta_{t^{\prime}}] (arranged in non-increasing order) satisfy

where the second inequality follows from Weyl’s inequality. Similarly,

By Lemma 11, the eigenvalues of T[v]T[v] are vWAeiv^{\scriptscriptstyle\top}W^{\scriptscriptstyle\top}Ae_{i} for i[k]i\in[k]. Thus, by Lemma 13, the probability that some τ[t]\tau\in[t] has Δ(τ)>γ\Delta(\tau)>\gamma is at least 1δ1-\delta. In this event, (2) implies that trial τ\tau has Δ^(τ)Δ(τ)2ETγ\widehat{\Delta}(\tau)\geq\Delta(\tau)-2\mathcal{E}_{T}\gamma, and hence Δ^(τ^)=maxt[t]Δ^(t)γ2ETγ\widehat{\Delta}(\hat{\tau})=\max_{t^{\prime}\in[t]}\widehat{\Delta}(t^{\prime})\geq\gamma-2\mathcal{E}_{T}\gamma. ∎

We now just consider the trial τ^\hat{\tau} retained by the algorithm. Let {(vi,λi):i[k]}\{(v_{i},\lambda_{i}):i\in[k]\} be the eigenvector/eigenvalue pairs of T[θτ^]T[\theta_{\hat{\tau}}], and let {(v^i,λ^i):i[k]}\{(\hat{v}_{i},\hat{\lambda}_{i}):i\in[k]\} be the eigenvector/eigenvalue pairs of T^[θτ^]\widehat{T}[\theta_{\hat{\tau}}].

Assume the 1δ1-\delta probability event in Lemma 14 holds, and also assume that ET1/4\mathcal{E}_{T}\leq 1/4. Then there exists a permutation π\pi on [k][k] and signs s1,s2,,sk{±1}s_{1},s_{2},\dotsc,s_{k}\in\{\pm 1\} such that, for all i[k]i\in[k],

To simplify notation, assume the eigenvalues of T^[θ]\widehat{T}[\theta] and T[θ]T[\theta] are already sorted in non-increasing order. Observe that for all iji\neq j,

where the second-to-last inequality follows by the assumption Δ^(τ^)γ2ETγ\widehat{\Delta}(\hat{\tau})\geq\gamma-2\mathcal{E}_{T}\gamma and by Weyl’s inequality. Therefore, the interval of radius γ/4\gamma/4 surronding each eigenvalue λ^i\hat{\lambda}_{i} of T^[θτ^]\widehat{T}[\theta_{\hat{\tau}}] contains only one eigenvalue λi\lambda_{i} of T[θτ^]T[\theta_{\hat{\tau}}]. By the Davis-Kahan sin(Θ)\sin(\Theta) theorem (Stewart and Sun, 1990, Theorem 3.4, p. 250), we have that

Therefore, for si:=sign(viv^i)s_{i}:=\operatorname{sign}(v_{i}^{\scriptscriptstyle\top}\hat{v}_{i}),

The bound λiλ^iETγ|\lambda_{i}-\hat{\lambda}_{i}|\leq\mathcal{E}_{T}\gamma follows simply from Weyl’s inequality. ∎

C.8 Overall error analysis

Assume the 1δ1-\delta probability event of Lemma 14 holds, and also assume that EM21/3\mathcal{E}_{M_{2}}\leq 1/3, ET1/4\mathcal{E}_{T}\leq 1/4, and ϵ11/3\epsilon_{1}\leq 1/3. Then there exists a permutation π\pi on [k][k] such that

To simplify notation, we assume throughout that the permutation π\pi from Lemma 15 is the identity permutation. Let V:=[v1v2vk]V:=[v_{1}|v_{2}|\dotsb|v_{k}]. We first bound B^siv^iwiμi\widehat{B}s_{i}\hat{v}_{i}-\sqrt{w_{i}}\mu_{i}. This quantity can be split into two parts: the part in the range of U^\widehat{U}, and the rest. The part in the range of U^\widehat{U} is bounded as

where the third inequality follows from Lemma 9 and Lemma 15. To bound the second term in the last step, recall that W=W^(W^M2W^)1/2W=\widehat{W}(\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W})^{-1/2} (using Lemma 10 to guarantee the positive definiteness of W^M2W^\widehat{W}^{\scriptscriptstyle\top}M_{2}\widehat{W}), so we may write U^AV\widehat{U}^{\scriptscriptstyle\top}AV^{\scriptscriptstyle\top} as

where the last inequality uses Lemma 9 and Lemma 10. Thus

Now consider the part not in the range of U^\widehat{U}. This is simply bounded as

using Lemma 9. Therefore, overall, we have

Since the actual estimate of μi\mu_{i} is (λ^i/θv^i)B^v^i(\hat{\lambda}_{i}/\theta^{\scriptscriptstyle\top}\hat{v}_{i})\widehat{B}\hat{v}_{i}, we need to show that θv^i\theta^{\scriptscriptstyle\top}\hat{v}_{i} is approximately siwiλ^is_{i}\sqrt{w_{i}}\hat{\lambda}_{i}. Indeed,

where the last inequality uses Lemma 10 and Lemma 15. Therefore

where the fourth inequality uses Lemma 15. We thus conclude that

We can now prove Theorem 3, stated below with the explicit polynomial sample complexity bound (up to constants).

There exists a constant C>0C>0 such that the following holds. Let bmax:=maxi[k]μi2b_{\max}:=\max_{i\in[k]}\|\mu_{i}\|_{2}. Pick any ε,δ(0,1)\varepsilon,\delta\in(0,1). Suppose the sample size 2n2n satisfies

(as defined in Lemma 13). Then with probability at least 13δ1-3\delta over the random sample and the internal randomness of the algorithm, there exists a permutation π\pi on [k][k] such that

Throughout, C1,c1,C2,c2,C_{1},c_{1},C_{2},c_{2},\dotsc will represent absolute positive constants. First, observe that the sample size bound

and Lemma 4 ensure that Ew1\mathcal{E}_{w}\leq 1 (where Ew\mathcal{E}_{w} is defined in Lemma 6). Therefore, from Lemma 5 and Lemma 6 (together with union bounds), with probability at least 1δ1-\delta,

Now condition on the above event and take W^\widehat{W} as given. By Lemma 10,

Therefore, Lemma 5 and Lemma 6 imply that with probability at least 1δ1-\delta,

Using the inequalities (6), (7), and (4) with (5) gives

Since t=log2(1/δ)t=\log_{2}(1/\delta), the inequality from Lemma 14 holds with probability at least 1δ1-\delta over the internal randomness of the algorithm. Therefore, using (4) and (8) with Lemma 16 gives in this event:

for all i[k]i\in[k], for some permutation π\pi on [k][k]. ∎

Appendix D Probability tail inequalities

We recall and derive some probability tail inequalities used in the analysis.

Let z12,z22,,zm2z_{1}^{2},z_{2}^{2},\dotsc,z_{m}^{2} be i.i.d. χ2\chi^{2} random variables, each with one degree of freedom. Then for any δ(0,1)\delta\in(0,1),

Let z1,z2,,zmz_{1},z_{2},\dotsc,z_{m} be i.i.d. N(0,1)N(0,1) random variables. Then for any δ(0,1)\delta\in(0,1),

We use Markov’s inequality via the pp-th moment to derive the tail inequality. Pick some even integer p2p\geq 2, and observe that

By the independence of the ziz_{i}’s, a term in the summation is zero if any index i[m]i\in[m] is selected an odd number of times (i.e., {j[p]:ij=i}|\{j\in[p]:i_{j}=i\}| is odd, for any i[m]i\in[m]). Therefore the summation can be written as

The bound is at most δ\delta for t:=e27emln(1/δ)3t:=e\sqrt{27em\lceil\ln(1/\delta)\rceil^{3}} and p:=ln(1/δ)p:=\lceil\ln(1/\delta)\rceil. ∎

which implies Y2ϵ/(13ϵ0)\|Y\|_{2}\leq\epsilon/(1-3\epsilon_{0}). This proves that