Learning Mixtures of Gaussians in High Dimensions

Rong Ge, Qingqing Huang, Sham M. Kakade

Introduction

Learning mixtures of Gaussians is a fundamental problem in statistics and learning theory, whose study dates back to Pearson (1894). Gaussian mixture models arise in numerous areas including physics, biology and the social sciences (McLachlan and Peel (2004); Titterington et al. (1985)), as well as in image processing (Reynolds and Rose (1995)) and speech (Permuter et al. (2003)).

In a Gaussian mixture model, there are kk unknown nn-dimensional multivariate Gaussian distributions. Samples are generated by first picking one of the kk Gaussians, then drawing a sample from that Gaussian distribution. Given samples from the mixture distribution, our goal is to estimate the means and covariance matrices of these underlying Gaussian distributions This is different from the problem of density estimation considered in Feldman et al. (2006); Chan et al. (2014).

This problem has a long history in theoretical computer science. The seminal work of Dasgupta (1999) gave an algorithm for learning spherical Gaussian mixtures when the means are well separated. Subsequent works (Dasgupta and Schulman (2000); Sanjeev and Kannan (2001); Vempala and Wang (2004); Brubaker and Vempala (2008)) developed better algorithms in the well-separated case, relaxing the spherical assumption and the amount of separation required.

When the means of the Gaussians are not separated, after several works (Belkin and Sinha (2009); Kalai et al. (2010)), Belkin and Sinha (2010) and Moitra and Valiant (2010) independently gave algorithms that run in polynomial time and with polynomial number of samples for a fixed number of Gaussians. However, both running time and sample complexity depend super exponentially on the number of components kk In fact, it is in the order of O(eO(k)k)O({e^{O(k)}}^{k}) as shown in Theorem 11.3 in Valiant (2012). . Their algorithm is based on the method of moments introduced by Pearson (1894): first estimate the O(k)O(k)-order moments of the distribution, then try to find the parameters that agree with these moments. Moitra and Valiant (2010) also show that the exponential dependency of the sample complexity on the number of components is necessary, by constructing an example of two mixtures of Gaussians with very different parameters, yet with exponentially small statistical distance.

Recently, Hsu and Kakade (2013) applied spectral methods to learning mixture of spherical Gaussians. When nk+1n\geq k+1 and the means of the Gaussians are linearly independent, their algorithm can learn the model in polynomial time and with polynomial number of samples. This result suggests that the lower bound example in Moitra and Valiant (2010) is only a degenerate case in high dimensional space. In fact, most (in general position) mixture of spherical Gaussians are easy to learn. This result is also based on the method of moments, and only uses second and third moments. Several follow-up works (Bhaskara et al. (2014); Anderson et al. (2013)) use higher order moments to get better dependencies on nn and kk.

However, the algorithm in Hsu and Kakade (2013) as well as in the follow-ups all make strong requirements on the covariance matrices. In particular, most of them only apply to learning mixture of spherical Gaussians. For mixture of Gaussians with general covariance matrices, the best known result is still Belkin and Sinha (2010) and Moitra and Valiant (2010), which algorithms are not polynomial in the number of components kk. This leads to the following natural question:

Question: Is it possible to learn most mixture of Gaussians in polynomial time using a polynomial number of samples?

In this paper, we give an algorithm that learns most mixture of Gaussians in high dimensional space (when nΩ(k2)n\geq\Omega(k^{2})), and the argument is formalized under the smoothed analysis framework first proposed in Spielman and Teng (2004).

In the smoothed analysis framework, the adversary first choose an arbitrary mixture of Gaussians. Then the mean vectors and covariance matrices of this Gaussian mixture are randomly perturbed by a small amount ρ\rho See Definition 3.2 in Section 3.1 for the details.. The samples are then generated from the Gaussian mixture model with the perturbed parameters. The goal of the algorithm is to learn the perturbed parameters from the samples.

The smoothed analysis framework is a natural bridge between worst-case and average-case analysis. On one hand, it is similar to worst-case analysis, as the adversary chooses the initial instance, and the perturbation allowed is small. On the other hand, even with small perturbation, we may hope that the instance be different enough from degenerate cases. A successful algorithm in the smoothed analysis setting suggests that the bad instances must be very “sparse” in the parameter space: they are highly unlikely in any small neighborhood of any instance. Recently, the smoothed analysis framework has also motivated several research work (Kalai et al. (2009) Bhaskara et al. (2014)) in analyzing learning algorithms.

In the smoothed analysis setting, we show that it is easy to learn most Gaussian mixtures:

(informal statement of Theorem 3.4) In the smoothed analysis setting, when nΩ(k2)n\geq\Omega(k^{2}), given samples from the perturbed nn-dimensional Gaussian mixture model with kk components, there is an algorithm that learns the correct parameters up to accuracy ϵ\epsilon with high probability, using polynomial time and number of samples.

An important step in our algorithm is to learn Gaussian mixture models whose components all have mean zero, which is also a problem of independent interest (Zoran and Weiss (2012)). Intuitively this is also a “hard” case, as there is no separation in the means. Yet algebraically, this case gives rise to a novel tensor decomposition algorithm. The ideas for solving this decomposition problem are then generalized to tackle the most general case.

(informal statement of Theorem 3.5) In the smoothed analysis setting, when nΩ(k2)n\geq\Omega(k^{2}), given samples from the perturbed mixture of zero-mean nn-dimensional Gaussian mixture model with kk components, there is an algorithm that learns the parameters up to accuracy ϵ\epsilon with high probability, using polynomial running time and number of samples.

Organization

The main part of the paper will focus on learning mixtures of zero-mean Gaussians. The proposed algorithm for this special case contains most of the new ideas and techniques. In Section 2 we introduce the notations for matrices and tensors which are used to handle higher order moments throughout the discussion. Then in Section 3 we introduce the smoothed analysis model for learning mixture of Gaussians and discuss the moment structure of mixture of Gaussians, then we formally state our main theorems. Section 4 outlines our algorithm for learning zero-mean mixture of Gaussians. The details of the steps are presented in Section 5. The detailed proofs for the correctness and the robustness are deferred to Appendix (Sections B to D). In Section 6 we briefly discuss how the ideas for zero-mean case can be generalized to learning mixture of nonzero Gaussians, for which the detailed algorithm and the proofs are deferred to Appendix F.

Notations

We use [n][n] to denote the set {1,2,...,n}\{1,2,...,n\} and [n]×[n][n]\times[n] to denote the set {(i,j):i,j[n]}\{(i,j):i,j\in[n]\}. These are often used as indices of matrices.

Symmetric matrices

Linear subspaces

Tensors

In particular, the vector given by X(ei,ej,I)X(\mathbf{e}_{i},\mathbf{e}_{j},I) is the one-dimensional slice of the 3-way array, with the index for the first dimension to be ii and the second dimension to be jj.

Matrix Products

Main results

In this section, we first formally introduce the smoothed analysis framework for our problem and state our main theorems. Then we will discuss the structure of the moments of Gaussian mixtures, which is crucial for understanding our method of moments based algorithm.

As an interesting special case of the general model, we also consider the mixture of “zero-mean” Gaussians, which has μ(i)=0\mu^{(i)}=0 for all components i[k]i\in[k].

A sample xx from a mixture of Gaussians is generated in two steps:

Sample h[k]h\in[k] from a multinomial distribution, with probability Pr[h=i]=ωi\Pr[h=i]=\omega_{i} for i[k]i\in[k].

The learning problem asks to estimate the parameters of the underlying mixture of Gaussians:

Given NN samples x1,x2,...,xNx_{1},x_{2},...,x_{N} drawn i.i.d. from a mixture of Gaussians G={(ωi,μ(i),Σ(i))}i[k]\mathcal{G}=\{(\omega_{i},\mu^{(i)},\Sigma^{(i)})\}_{i\in[k]}, an algorithm learns the mixture of Gaussians with accuracy ϵ\epsilon, if it outputs an estimation G^={(ω^i,μ^(i),Σ^(i))}i[k]\widehat{\mathcal{G}}=\{(\widehat{\omega}_{i},\widehat{\mu}^{(i)},\widehat{\Sigma}^{(i)})\}_{i\in[k]} such that there exists a permutation π\pi on [k][k], and for all i[k]i\in[k], we have ω^iωπ(i)ϵ|\widehat{\omega}_{i}-\omega_{\pi(i)}|\leq\epsilon, μ^(i)μ(π(i))ϵ\|\widehat{\mu}^{(i)}-\mu^{(\pi(i))}\|\leq\epsilon and Σ^(i)Σ(π(i))ϵ\|\widehat{\Sigma}^{(i)}-\Sigma^{(\pi(i))}\|\leq\epsilon.

In the worst case, learning mixture of Gaussians is a information theoretically hard problem (Moitra and Valiant (2010)). There exists worst-case examples where the number of samples required for learning the instance is at least exponential in the number of components kk (McLachlan and Peel (2004)). The non-convexity arises from the hidden variable hh: without knowing hh we cannot determine which Gaussian component each sample comes from.

The smoothed analysis framework provides a way to circumvent the worst case instances, yet still studying this problem in its most general form. The basic idea is that, with high probability over the small random perturbation to any instance, the instance will not be a “worst-case” instance, and actually has reasonably good condition for the algorithm.

Next, we show how the parameters of the mixture of Gaussians are perturbed in our setup.

For ρ<1/n\rho<1/n, a ρ\rho-smooth nn-dimensional kk-component mixture of Gaussians G~={(ω~i,μ~(i),Σ~(i))}i[k]Gn,k\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k} is generated as follows:

Choose an arbitrary (could be adversarial) instance G={(ωi,μ(i),Σ(i))}i[k]Gn,k\mathcal{G}=\{(\omega_{i},\mu^{(i)},\Sigma^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}. Scale the distribution such that 0Σ(i)12In0\preceq\Sigma^{(i)}\preceq{1\over 2}I_{n} and μ(i)12\|\mu^{(i)}\|\leq{1\over 2} for all i[k]i\in[k].

Set ω~i=ωi\widetilde{\omega}_{i}=\omega_{i}, μ~(i)=μ(i)+δi\widetilde{\mu}^{(i)}=\mu^{(i)}+\delta_{i}, Σ~(i)=Σ(i)+Δi\widetilde{\Sigma}^{(i)}=\Sigma^{(i)}+\Delta_{i}.

Choose the diagonal entries of Σ~(i)\widetilde{\Sigma}^{(i)} arbitrarily, while ensuring the positive semi-definiteness of the covariance matrix Σ~(i)\widetilde{\Sigma}^{(i)}, and the diagonal entries are upper bounded by 11. The perturbation procedure fails if this step is infeasible Note that by standard random matrix theory, with high probability the 4-th step is feasible and the perturbation procedure in Definition 3.2 succeeds. Also, with high probability we have μ~(i)1\|\widetilde{\mu}^{(i)}\|\leq 1 and 0Σ~(i)In0\preceq\widetilde{\Sigma}^{(i)}\preceq I_{n} for all i[k]i\in[k]. .

A ρ\rho-smooth zero-mean mixture of Gaussians is generated using the same procedure, except that we set μ~(i)=μ(i)=0\widetilde{\mu}^{(i)}=\mu^{(i)}=0, for all i[k]i\in[k].

When the original matrix is of low rank, a simple random perturbation may not lead to a positive semidefinite matrix, which is why our procedure of perturbation is more restricted in order to guarantee that the perturbed matrix is still a valid covariance matrix.

There could be other ways of locally perturbing the covariance matrix. Our procedure actually gives more power to the adversary as it can change the diagonals after observing the perturbations for other entries. Note that with high probability if we just let the new diagonal to be 5nρ5\sqrt{n}\rho larger than the original ones, the resulting matrix is still a valid covariance matrix. In other words, the adversary can always keep the perturbation small if it wants to.

Instead of the worst-case problem in Definition 3.1, our algorithms work on the smoothed instance. Here the model first gets perturbed to G~={(ω~i,μ~(i),Σ~(i))}i[k]\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)})\}_{i\in[k]}, the samples are drawn according to the perturbed model, and the algorithm tries to learn the perturbed parameters. We give a polynomial time algorithm in this case:

Consider a ρ\rho-smooth mixture of Gaussians G~={(ω~i,μ~(i),Σ~(i))}i[k]Gn,k\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k} for which the number of components is at least Note that the algorithms of Belkin and Sinha (2010) and Moitra and Valiant (2010) run in polynomial time for fixed kk. kC0k\geq C_{0} and the dimension nC1k2n\geq C_{1}k^{2}, for some fixed constants C0C_{0} and C1C_{1}. Suppose that the mixing weights ω~iωo\widetilde{\omega}_{i}\geq\omega_{o} for all i[k]i\in[k]. Given NN samples drawn i.i.d. from G~\widetilde{\mathcal{G}}, there is an algorithm that learns the parameters of G~\widetilde{\mathcal{G}} up to accuracy ϵ\epsilon, with high probability over the randomness in both the perturbation and the samples. Furthermore, the running time and number of samples NN required are both upper bounded by poly(n,k,1/ωo,1/ϵ,1/ρ)\text{poly}(n,k,1/\omega_{o},1/\epsilon,1/\rho).

To better illustrate the algorithmic ideas for the general case, we first present an algorithm for learning mixtures of zero-mean Gaussians. Note that this is not just a special case of the general case, as with the smoothed analysis, the zero mean vectors are not perturbed.

Consider a ρ\rho-smooth mixture of zero-mean Gaussians G~={(ω~i,0,Σ~(i))}i[k]Gn,k\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},0,\widetilde{\Sigma}^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k} for which the number of components is at least kC0k\geq C_{0} and the dimension nC1k2n\geq C_{1}k^{2}, for some fixed constants C0C_{0} and C1C_{1}. Suppose that the mixing weights ω~iωo\widetilde{\omega}_{i}\geq\omega_{o} for all i[k]i\in[k]. Given NN samples drawn i.i.d. from G~\widetilde{\mathcal{G}}, there is an algorithm that learns the parameters of G~\widetilde{\mathcal{G}} up to accuracy ϵ\epsilon, with high probability over the randomness in both the perturbation and the samples. Furthermore, the running time and number of samples NN are both upper bounded by poly(n,k,1/ωo,1/ϵ,1/ρ)\text{poly}(n,k,1/\omega_{o},1/\epsilon,1/\rho).

Throughout the paper we always assume that nC1k2n\geq C_{1}k^{2} and ω~iωo\widetilde{\omega}_{i}\geq\omega_{o}.

2 Moment Structure of Mixture of Gaussians

Our algorithm is also based on the method of moments, and we only need to estimate the 33-rd, the 44-th and the 66-th order moments. In this part we briefly discuss the structure of 44-th and 66-th moments in the zero-mean case (33-rd moment is always 0 in the zero-mean case). These structures are essential to the proposed algorithm. For more details, and discussions on the general case see Appendix A.

where y(i)y^{(i)} corresponds to the nn-dimensional zero-mean Gaussian distribution N(0,Σ(i))\mathcal{N}(0,\Sigma^{(i)}). The moments for each Gaussian component are characterized by Isserlis’s theorem as below:

Let (y1,,y2t)(y_{1},\dots,y_{2t}) be a multivariate zero-mean Gaussian random vector N(0,Σ)\mathcal{N}(0,\Sigma), then

where the summation is taken over all distinct ways of partitioning y1,,y2ty_{1},\dots,y_{2t} into tt pairs, which correspond to all the perfect matchings in a complete graph.

Ideally, we would like to obtain the following quantities (recall n2=(n+12)n_{2}={n+1\choose 2}):

Note that the entries in X4X_{4} and X6X_{6} are quadratic and cubic monomials of the covariance matrices, respectively. If we have X4X_{4} and X6X_{6}, the tensor decomposition algorithm in Anandkumar et al. (2014) can be immediately applied to recover ωi\omega_{i}’s and Σ(i)\Sigma^{(i)}’s under mild conditions. It is easy to verify that those conditions are indeed satisfied with high probability in the smoothed analysis setting.

By Isserlis’s theorem, the entries of the moments M4M_{4} and M6M_{6} are indeed quadratic and cubic functions of the covariance matrices, respectively. However, the structure of the true moments M4M_{4} and M6M_{6} have more symmetries, consider for example,

Note that due to symmetry, the number of distinct entries in M4M_{4} ( (n+34)n4/24{n+3\choose 4}\approx n^{4}/24) is much smaller than the number of distinct entries in X4X_{4} ((n2+12)n4/8{n_{2}+1\choose 2}\approx n^{4}/8). Similar observation can be made about M6M_{6} and X6X_{6}.

Therefore, it is not immediate how to find the desired X4X_{4} and X6X_{6} based on M4M_{4} and M6M_{6}. We call the moments M4,M6M_{4},M_{6} the folded moments as they have more symmetry, and the corresponding X4,X6X_{4},X_{6} the unfolded moments. One of the key steps in our algorithm is to unfold the true moments M4,M6M_{4},M_{6} to get X4,X6X_{4},X_{6} by exploiting special structure of M4,M6M_{4},M_{6}.

In some cases, it is easier to restrict our attention to the entries in M4M_{4} with indices corresponding to distinct variables. In particular, we define

Moreover F4\mathcal{F}_{4} is a projection from a (n2+12){n_{2}+1\choose 2}-dimensional subspace to a n4n_{4}-dimensional subspace, and F6\mathcal{F}_{6} is a projection from a (n2+23){n_{2}+2\choose 3}-dimensional subspace to a n6{n_{6}}-dimensional subspace.

Algorithm Outline for Learning Mixture of Zero-Mean Gaussians

In this section, we present our algorithm for learning zero-mean Gaussian mixture model. The algorithmic ideas and the analysis are at the core of this paper. Later we show that it is relatively easy to generalize the basic ideas and the techniques to handle the general case.

For simplicity we state our algorithm using the exact moments M~4\widetilde{M}_{4} and M~6\widetilde{M}_{6}, while in implementation the empirical moments M^4\widehat{M}_{4} and M^6\widehat{M}_{6} obtained with the samples are used. In later sections, we verify the correctness of the algorithm and show that it is robust: the algorithm learns the parameters up to arbitrary accuracy using polynomial number of samples.

Span Finding: Find the span of covariance matrices .

For a set of indices H[n]\mathcal{H}\subset[n] of size H=n|\mathcal{H}|=\sqrt{n}, find the span:

Find the span of the covariance matrices with the columns projected onto S\mathcal{S}^{\perp}, namely,

For two disjoint sets of indices H1\mathcal{H}_{1} and H2\mathcal{H}_{2}, repeat Step 1 (a) and Step 1 (b) to obtain U1\mathcal{U}_{1} and U2\mathcal{U}_{2}, namely the span of covariance matrices projected onto two subspaces S1\mathcal{S}_{1}^{\perp} and S2\mathcal{S}_{2}^{\perp}. Merge U1\mathcal{U}_{1} and U2\mathcal{U}_{2} to obtain the span of covariance matrices U\mathcal{U}:

The unfolded moments X~4,X~6\widetilde{X}_{4},\widetilde{X}_{6} are then given by X~4=UY~4U,X~6=Y~6(U,U,U).\widetilde{X}_{4}=U\widetilde{Y}_{4}U^{\top},\widetilde{X}_{6}=\widetilde{Y}_{6}(U^{\top},U^{\top},U^{\top}).

Tensor Decomposition: learn ω~i\widetilde{\omega}_{i} and Σ~(i)\widetilde{\Sigma}^{(i)} from Y~4\widetilde{Y}_{4} and Y~6\widetilde{Y}_{6}. Given UU, and given Y~4\widetilde{Y}_{4} and Y~6\widetilde{Y}_{6} which are relate to the parameters as follows:

we apply tensor decomposition techniques to recover Σ~(i)\widetilde{\Sigma}^{(i)}’s and ω~i\widetilde{\omega}_{i}’s.

Implementing the Steps for Mixture of Zero-Mean Gaussians

In this part we show how to accomplish each step of the algorithm outlined in Section 4 and sketch the proof ideas.

For each step, we first explain the detailed algorithm, and list the deterministic conditions on the underlying parameters as well as on the exact moments for the step to work correctly. Then we show that these deterministic conditions are satisfied with high probability over the ρ\rho-perturbation of the parameters in the smoothed analysis setting. In order to analyze the sample complexity, we further show that when we are given the empirical moments which are close to the exact moments, the output of the step is also close to that in the exact case.

In particular we show the correctness and the stability of each step in the algorithm with two main lemmas: the first lemma shows that with high probability over the random perturbation of the covariance matrices, the exact moments satisfy the deterministic conditions that ensure the correctness of each step; the second lemma shows that when the algorithm for each step works correctly, it is actually stable even when the moments are estimated from finite samples and have only inverse polynomial accuracy to the exact moments.

The detailed proofs are deferred to Section B to D in the appendix.

Given the 4-th order moments M~4\widetilde{M}_{4}, Step 1 finds the span of covariance matrices U\mathcal{U} as defined in (6). Note that by definition of the unfolded moments X~4\widetilde{X}_{4} in (1), the subspace U\mathcal{U} coincides with the column span of the matrix X~4\widetilde{X}_{4}.

By Lemma 3.7, we know that the entries in M~4\widetilde{M}_{4} are linear mappings of entries in X~4\widetilde{X}_{4}. Since the matrix X~4\widetilde{X}_{4} is of low rank (kn2k\ll n_{2}), this corresponds to the matrix sensing problem first studied in Recht et al. (2010). In general, matrix sensing problems can be hard even when we have many linear observations (Hardt et al. (2014b)). Previous works (Recht et al. (2010); Hardt et al. (2014a); Jain et al. (2013)) showed that if the linear mapping satisfy matrix RIP property, one can uniquely recover X~4\widetilde{X}_{4} from M~4\widetilde{M}_{4}.

However, properties like RIP do not hold in our setting where the linear mapping is determined by Isserlis’ Theorem. We can construct two different mixtures of Gaussians with different unfolded moments X~4\widetilde{X}_{4}, but the same folded moment M~4\widetilde{M}_{4} (see Section A.3). Therefore the existing matrix recovery algorithm cannot be applied, and we need to develop new tools by exploiting the special moment structure of Gaussian mixtures.

Step 1 (a). Find the Span of a Subset of Columns of the Covariance Matrices.

The key observation for this step is that if we hit M~4\widetilde{M}_{4} with three basis vectors, we get a vector that lies in the span of the columns of the covariance matrices:

For a mixture of zero-mean Gaussians G={(ωi,0,Σ(i))}i[k]Gn,k\mathcal{G}=\{(\omega_{i},0,\Sigma^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}, the one-dimensional slices of the 4-th order moments M4M_{4} are given by:

In particular, if we pick the indices j1,j2,j3j_{1},j_{2},j_{3} in the index set H\mathcal{H}, the vector M4(ej1,ej2,ej3,I)M_{4}(\mathbf{e}_{j_{1}},\mathbf{e}_{j_{2}},\mathbf{e}_{j_{3}},I) lies in the desired span S={Σ[:,j](i):i[k],jH}\mathcal{S}=\left\{\Sigma^{(i)}_{[:,j]}:i\in[k],j\in\mathcal{H}\right\}.

We shall partition the set H\mathcal{H} into three disjoint subsets H(i)\mathcal{H}^{(i)} of equal size n/3\sqrt{n}/3, and pick jiH(i)j_{i}\in H^{(i)} for i=1,2,3i=1,2,3. In this way, we have (H/3)3=Ω(n1.5)(|\mathcal{H}|/3)^{3}=\Omega(n^{1.5}) such one-dimensional slices of M4M_{4}, which all lie in the desired subspace S\mathcal{S}. Moreover, the dimension of the subspace S\mathcal{S} is at most kHn1.5k|\mathcal{H}|\ll n^{1.5}. Therefore, with the ρ\rho-perturbed parameters Σ~(i)\widetilde{\Sigma}^{(i)}’s, we can expect that with high probability the slices of M~4\widetilde{M}_{4} span the entire subspace S\mathcal{S}.

We first show that this deterministic condition is satisfied with high probability by bounding the kHk|\mathcal{H}|-th singular value of Q~S\widetilde{Q}_{S} with smoothed analysis.

Given the exact 4-th order moments M~4\widetilde{M}_{4}, for any index set H\mathcal{H} of size H=n|\mathcal{H}|=\sqrt{n}, With high probability, the kHk|\mathcal{H}|-th singular value of Q~S\widetilde{Q}_{S} is at least Ω(ωoρ2n)\Omega(\omega_{o}\rho^{2}n).

The proof idea involves writing the matrix Q~S\widetilde{Q}_{S} as a product of three matrices, and using the results on spectral properties of random matrices Rudelson and Vershynin (2009) to show that with high probability the smallest singular value of each factor is lower bounded.

Since this step only involves the singular value decomposition of the matrix Q~S\widetilde{Q}_{S}, we then use the standard matrix perturbation theory to show that this step is stable:

Note that we need the high dimension assumption (nkn\gg k) to guarantee the correctness of this step: in order to span the subspace S\mathcal{S}, the number of distinct vectors should be equal or larger than the dimension of the subspace, namely H3kH|\mathcal{H}|^{3}\geq k|\mathcal{H}|; and the subspace should be non-trivial, namely kH<nk|\mathcal{H}|<n. These two inequalities suggest that we need nΩ(k1.5)n\geq\Omega(k^{1.5}). However, we used the stronger assumption nΩ(k2)n\geq\Omega(k^{2}) to obtain the lower bound of the smallest singular value in the proof.

Step 1 (b). Find the Span of Projected Covariance Matrices.

In this step, we continue to use the structural properties of the 4-th order moments. In particular, we look at the two-dimensional slices of M4M_{4} obtained by hitting it with two basis vectors:

For a mixture of zero-mean Gaussians G={(ωi,0,Σ(i))}i[k]Gn,k\mathcal{G}=\{(\omega_{i},0,\Sigma^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}, the two-dimensional slices of the 4-th order moments M4M_{4} are given by:

Note that if we take the indices j1j_{1} and j2j_{2} in the index set H\mathcal{H}, the slice M4(ej1,ej2,I,I)M_{4}(\mathbf{e}_{j_{1}},\mathbf{e}_{j_{2}},I,I) is almost in the span of the covariance matrices, except 2k2k additive rank-one terms in the form of Σ[:,j1](i)(Σ[:,j2](i))\Sigma^{(i)}_{[:,j_{1}]}(\Sigma^{(i)}_{[:,j_{2}]})^{\top}. These rank-one terms can be eliminated by projecting the slice to the subspace S\mathcal{S}^{\perp} obtained in Step 1 (a), namely,

and this projected two-dimensional slice lies in the desired span US\mathcal{U}_{S} as defined in (5). Moreover, there are (H+12)=Ω(n){|\mathcal{H}|+1\choose 2}=\Omega(n) such projected two-dimensional slices, while the dimension of the desired span US\mathcal{U}_{S} is at most kk.

We show that this deterministic condition is satisfied by bounding the kk-th singular value of Q~US\widetilde{Q}_{U_{S}} in the smoothed analysis setting:

Given the exact 4-th order moments M~4\widetilde{M}_{4}, with high probability, the kk-th singular value of Q~US\widetilde{Q}_{U_{S}} is at least Ω(ωoρ2n1.5)\Omega(\omega_{o}\rho^{2}n^{1.5}).

Similar to Lemma 5.3, the proof is based on writing the matrix QUSQ_{U_{S}} as a product of three matrices, then bound their kk-th singular values using random matrix theory. The stability analysis also relies on the matrix perturbation theory.

Given the empirical 4-th order moments M^4=M~4+E4\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}, assume that the absolute value of entries of E4E_{4} are at most δ2\delta_{2}. Also, given the output ProjS^\text{Proj}_{\widehat{S}^{\perp}} from Step 1 (a), and assume that ProjS^ProjS~δ1\|\text{Proj}_{\widehat{S}^{\perp}}-\text{Proj}_{\widetilde{S}^{\perp}}\|\leq\delta_{1}. When δ1\delta_{1} and δ2\delta_{2} are inverse polynomially small, we have ProjU^SProjU~SO(n2.5(δ2+2δ1)/σk(Q~US))\|\text{Proj}_{\widehat{U}_{S}}-\text{Proj}_{\widetilde{U}_{S}}\|\leq O\left({n^{2.5}\left(\delta_{2}+2\delta_{1}\right)/\sigma_{k}(\widetilde{Q}_{U_{S}})}\right).

Note that for a given index set H\mathcal{H}, the span US\mathcal{U}_{S} obtained in Step 1 (b) only gives partial information about the span of the covariance matrices. The idea of getting the span of the full covariance matrices is to obtain two sets of such partial information and then merge them.

In order to achieve that, we repeat Step 1 (a) and Step 1 (b) for two disjoint sets H1\mathcal{H}_{1} and H2\mathcal{H}_{2}, each of size n\sqrt{n}. The two subspace S1S_{1} and S2S_{2} thus correspond to the span of two disjoint sets of covariance matrix columns. Therefore, we can hope that U1U_{1} and U2U_{2}, the span of covariance matrices projected to S1S_{1}^{\perp} and S2S_{2}^{\perp} contain enough information to recover the full span UU.

In particular, we prove the following claim:

The main idea in the proof is that since ss is not too large, the two subspaces S1S_{1}^{\perp} and S2S_{2}^{\perp} have a large intersection. Using this intersection we can “align” the two basis V1V_{1} and V2V_{2} and obtain V1V2V_{1}^{\dagger}V_{2}, and then it is easy to merge the two projections of the same matrix (instead of a subspace).

Moreover, we show that when applying this result to the projected span of covariance matrices, we have s=kHn/3s=k|\mathcal{H}|\leq n/3, and the two deterministic conditions σ2s([S1,S2])>0\sigma_{2s}([S_{1},S_{2}])>0 and σk(ProjS3V1)>0\sigma_{k}(\text{Proj}_{S_{3}}V_{1})>0 are indeed satisfied with high probability over the parameter perturbation. The detailed smoothed analysis (Lemma B.13 and B.14) and the stability analysis (Lemma B.11) are provided in Section B.3 in the appendix.

We show that given the span of covariance matrices U\mathcal{U} obtained from Step 1, finding the unfolded moments X~4\widetilde{X}_{4}, X~6\widetilde{X}_{6} is reduced to solving two systems of linear equations.

Note that once we know Y~4\widetilde{Y}_{4} and Y~6\widetilde{Y}_{6}, the unfolded moments X~4\widetilde{X}_{4} and X~6\widetilde{X}_{6} are given by X~4=UY~4U\widetilde{X}_{4}=U\widetilde{Y}_{4}U^{\top} and X~6=Y~6(U,U,U)\widetilde{X}_{6}=\widetilde{Y}_{6}(U^{\top},U^{\top},U^{\top}). Therefore, after changing the variable, we need to solve the two linear equation systems given in (7) with the variables Y~4\widetilde{Y}_{4} and Y~6\widetilde{Y}_{6}.

This change of variable significantly reduces the number of unknown variables. Note that the number of distinct entries in Y~4\widetilde{Y}_{4} and Y~6\widetilde{Y}_{6} are k2=(k+12)k_{2}={k+1\choose 2} and k3=(k+23)k_{3}={k+2\choose 3}, respectively. Since k2n4k_{2}\leq n_{4} and k3n6k_{3}\leq n_{6}, we can expect that the linear mapping from Y~4\widetilde{Y}_{4} to M~4\overline{\widetilde{M}}_{4} and the one from Y~6\widetilde{Y}_{6} to M~6\overline{\widetilde{M}}_{6} are linearly invertible. This argument is formalized below.

We show with smoothed analysis that the smallest singular value of the two coefficient matrices are lower bounded with high probability:

With high probability over the parameter random perturbation, the k2k_{2}-th singular value of the coefficient matrix H~4\widetilde{H}_{4} is at least Ω(ρ2n/k)\Omega(\rho^{2}n/k), and the k3k_{3}-th singular value of the coefficient matrix H~6\widetilde{H}_{6} is at least Ω(ρ3(n/k)1.5)\Omega(\rho^{3}(n/k)^{1.5}).

To prove this lemma we rewrite the coefficient matrix as product of two matrices and bound their smallest singular values separately. One of the two matrices corresponds to a projection of the Kronecker product Σ~krΣ~\widetilde{\Sigma}\otimes_{kr}\widetilde{\Sigma}. In the smoothed analysis setting, this matrix is not necessarily incoherent. In order to provide a lower bound to its smallest singular value, we further apply a carefully designed projection to it, and then we use the concentration bounds for Gaussian chaoses to show that after the projection its columns are incoherent, finally we apply Gershgorin’s Theorem to bound the smallest singular value Note that the idea of unfolding using system of linear equations also appeared in the work of Jain and Oh (2014). However, in order to show the system of linear equations in their setup is robust, i.e., the coefficient matrix has full rank, they heavily rely on the incoherence assumption, which we do not impose in the smoothed analysis setting. .

When implementing this step with the empirical moments, we solve two least squares problems instead of solving the system of linear equations. Again using results in matrix perturbation theory and using the lower bound of the smallest singular values of the two coefficient matrices, we show the stability of the solution to the least squares problems:

Given the empirical moments M^4=M~4+E4\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}, M^6=M~6+E6\widehat{M}_{6}=\widetilde{M}_{6}+E_{6}, and suppose that the absolute value of entries of E4E_{4} and E6E_{6} are at most δ1\delta_{1}. Let U^\widehat{U}, the output of Step 1, be the estimation for the span of the covariance matrices, and suppose that U^U~δ2\|\widehat{U}-\widetilde{U}\|\leq\delta_{2}. Let Y^4\widehat{Y}_{4} and Y^6\widehat{Y}_{6} be the least squares solution respectively. When δ1\delta_{1} and δ2\delta_{2} are inverse polynomially small, we have Y~4Y^4FO(n4(δ1+δ2/σmin(H~4))\|\widetilde{Y}_{4}-\widehat{Y}_{4}\|_{F}\leq O(\sqrt{n_{4}}(\delta_{1}+\delta_{2}/\sigma_{min}(\widetilde{H}_{4})) and Y~6Y^6FO(n6(δ1+δ2/σmin(H~6))\|\widetilde{Y}_{6}-\widehat{Y}_{6}\|_{F}\leq O(\sqrt{n_{6}}(\delta_{1}+\delta_{2}/\sigma_{min}(\widetilde{H}_{6})).

Step 3. Tensor Decomposition.

Given Y~4\widetilde{Y}_{4}, Y~6\widetilde{Y}_{6} and U~\widetilde{U}, the symmetric tensor decomposition algorithm can correctly and robustly find the mixing weights ω~i\widetilde{\omega}_{i}’s and the vectors σ~i\widetilde{\sigma}_{i}’s, up to some unknown permutation over [k][k], with high probability over both the randomized algorithm and the parameter perturbation.

The algorithm and its analysis mostly follow the algorithm of symmetric tensor decomposition in Anandkumar et al. (2014), and the details are provided in Section D in the appendix.

Proof Sketch for the Main Theorem of Zero-mean Case.

Theorem 3.5 follows from the previous smoothed analysis and stability analysis lemmas for each step.

First, exploiting the randomness of parameter perturbation, the smoothed analysis lemmas show that the deterministic conditions, which guarantee the correctness of each step, are satisfied with high probability. Then using concentration bounds of Gaussian variables, we show that with high probability over the random samples, the empirical moments M^4\widehat{M}_{4} and M^6\widehat{M}_{6} are entrywise δ\delta-close to the exact moments M~4\widetilde{M}_{4} and M~6\widetilde{M}_{6}. In order to achieve ϵ\epsilon accuracy in the parameter estimation, we choose δ\delta to be inverse polynomially small, and therefore the number of samples required will be polynomial in the relevant parameters. The stability lemmas show how the errors propagate only “polynomially” through the steps of the algorithm, which is visualized in Figure 1.

A more detailed illustration is provided in Section E in the appendix.

Algorithm Outline for Learning Mixture of General Gaussians

In this section, we briefly discuss the algorithm for learning mixture of general Gaussians. Figure 2 shows the inputs and outputs of each step in this algorithm. Many steps share similar ideas to those of the algorithm for the zero-mean case in previous sections. We only highlight the basic ideas and defer the details to Section F in the appendix.

Similar to Step 1 in the zero-mean case, this step makes use of the structure of the 4-th order moments M~4\widetilde{M}_{4}, and is achieved in three small steps:

For a subset H[n]\mathcal{H}\subset[n] of size H=n|\mathcal{H}|=\sqrt{n}, find the span:

Find the span of the covariance matrices with the columns projected onto S\mathcal{S}^{\perp}, namely,

For disjoint subsets H1\mathcal{H}_{1} and H2\mathcal{H}_{2}, repeat Step 1 (a) and Step 1 (b) to obtain U1\mathcal{U}_{1} and U2\mathcal{U}_{2}, the span of the covariance matrices projected onto the subspaces S1\mathcal{S}_{1}^{\perp} and S2\mathcal{S}_{2}^{\perp}. The intersection of the two subspaces U1\mathcal{U}_{1} and U2\mathcal{U}_{2} gives the span of the mean vectors Z~=span{μ~(i),i[k]}\widetilde{Z}=\text{span}\left\{\widetilde{\mu}^{(i)},i\in[k]\right\}. Merge the two subspaces U1\mathcal{U}_{1} and U2\mathcal{U}_{2} to obtain the span of the covariance matrices projected to the subspace orthogonal to Z~\widetilde{Z}, namely Σ~o=span{ProjZ~Σ~(i)ProjZ~:i[k]}\widetilde{\Sigma}_{o}=\text{span}\left\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}:i\in[k]\right\}.

The key observation of this step is that when the samples are projected to the subspace orthogonal to all the mean vectors, they are equivalent to samples from a mixture of zero-mean Gaussians with covariance matrices Σ~o(i)=ProjZ~Σ~(i)ProjZ~\widetilde{\Sigma}^{(i)}_{o}=\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}} and with the same mixing weights ω~i\widetilde{\omega}_{i}’s. Therefore, projecting the samples to Z~\widetilde{Z}^{\perp}, the subspace orthogonal to the mean vectors, and use the algorithm for the zero-mean case, we can obtain Σ~o(i)\widetilde{\Sigma}^{(i)}_{o}’s, the covariance matrices projected to this subspace, as well as the mixing weights ω~i\widetilde{\omega}_{i}’s.

Step 3. Find the means

With simple algebra, this step extracts the projected covariance matrices Σ~o(i)\widetilde{\Sigma}_{o}^{(i)}’s from the 33-rd order moments M~3\widetilde{M}_{3}, the mixing weights ω~i\widetilde{\omega}_{i} and the projected covariance matrices Σ~o(i)\widetilde{\Sigma}_{o}^{(i)}’s obtained in Step 2.

Step 4. Find the full covariance matrices

In Step 2, we obtained Σ~o(i)\widetilde{\Sigma}^{(i)}_{o}, the covariance matrices projected to the subspace orthogonal to all the means. Note that they are equal to matrices (Σ~(i)+μ~(i)(μ~(i)))(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}) projected to the same subspace. We claim that if we can find the span of these matrices ((Σ~(i)+μ~(i)(μ~(i)))(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})’s), we can get each matrix (Σ~(i)+μ~(i)(μ~(i)))(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}), and then subtracting the known rank-one component to find the covariance matrix Σ~(i)\widetilde{\Sigma}^{(i)}. This is similar to the idea of merging two projections of the same subspace in Step 1 (c) for the zero-mean case.

The idea of finding the desired span is to construct a 44-th order tensor:

which corresponds to the 4-th order moments of a mixture of zero-mean Gaussians with covariance matrices Σ~(i)+μ~(i)(μ~(i))\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top} and the same mixing weights ω~i\widetilde{\omega}_{i}’s. Then we can then use Step 1 of the algorithm for the zero-mean case to obtain the span of the new covariance matrices, i.e. span{Σ~(i)+μ~(i)(μ~(i)):i[k]}span\{\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}:i\in[k]\}.

Conclusion

In this paper we give the first efficient algorithm for learning mixture of general Gaussians in the smoothed analysis setting. In the algorithm we developed new ways of extracting information from lower-order moment structure. This suggests that although the method of moments often involves solving systems of polynomial equations that are intractable in general, for natural models there is still hope of utilizing their special structure to obtain algebraic solution.

Smoothed analysis is a very useful way of avoiding degenerate examples in analyzing algorithms. In the analysis, we proved several new results for bounding the smallest singular values of structured random matrices. We believe the lemmas and techniques can be useful in more general settings.

Our algorithm uses only up to 66-th order moments. We conjecture that using higher order moments can reduce the number of dimension required to nΩ(k1+ϵ)n\geq\Omega(k^{1+\epsilon}), or maybe even nΩ(kϵ)n\geq\Omega(k^{\epsilon}).

We thank Santosh Vempala for many insights and for help in earlier attempts at solving this problem.

References

Appendix A Moment Structures

In this section we characterize the structure of the 3-rd, 4-th and 6-th moments of Gaussians mixtures.

where y(i)y^{(i)} corresponds to the nn-dimensional Gaussian distribution N(μ(i),Σ(i))\mathcal{N}(\mu^{(i)},\Sigma^{(i)}).

Gaussian distribution is a highly symmetric distribution, and in the zero-mean case the higher moments are well-understood by Isserlis’ Theorem:

Let y=(y1,,y2t)\mathbf{y}=(y_{1},\dots,y_{2t}) be a multivariate Gaussian random vector with mean zero and covariance Σ\Sigma, then

where the summation is taken over all distinct ways of partitioning y1,,y2ty_{1},\dots,y_{2t} into tt pairs, which correspond to all the perfect matchings in a complete graph. Thus there are (2t1)!!(2t-1)!! terms in the sum, and each summand is a product of tt terms.

The non-zero mean case is a direct corollary using Isserlis’ Theorem and linearity of expectation.

Let y=(y1,,yt)\mathbf{y}=(y_{1},\dots,y_{t}) be a multivariate Gaussian random vector with mean μ\mu and covariance Σ\Sigma, then

where the summation is taken over all distinct ways of partitioning y1,,yty_{1},\dots,y_{t} into pp pairs of (u,v)(u,v) and ss singletons of (w)(w), where p0p\geq 0, s0s\geq 0 and 2p+s=t2p+s=t.

We shall first prove Lemma 3.7 in Section 3.2. Recall that this lemma shows that for mixture of zero-mean Gaussians, the 4-th moments M4\overline{M}_{4} and the 6-th moments M6\overline{M}_{6} with distinct indices can be viewed as a linear projection of the unfolded moment X4X_{4} and X6X_{6} defined in (1).

By Isserlis Theorem A.1, the mapping 3F4\sqrt{3}\mathcal{F}_{4} is characterized by: (1j1<j2<j3<j4n\forall 1\leq j_{1}<j_{2}<j_{3}<j_{4}\leq n)

Therefore, with the normalization constant 3\sqrt{3}, the (j1,j2,j3,j4)(j_{1},j_{2},j_{3},j_{4})-th mapping of F4\mathcal{F}_{4} is a projection of the three elements in X4X_{4}. Similarly, we have for 15F6\sqrt{15}\mathcal{F}_{6}: (1j1<j2<<j6n\forall 1\leq j_{1}<j_{2}<\dots<j_{6}\leq n)

Thus with the normalization constant 15\sqrt{15}, the mapping F6\mathcal{F}_{6} is a linear projection. ∎

A.2 Slices of Moments

Next we shall characterize the slices of the moments of mixture of Gaussians.

For mixture of zero-mean Gaussians, a one-dimensional slice of the 4th moment tensor is a vector in the span of corresponding columns of the covariance matrices:

For a mixture of zero-mean Gaussians, the one-dimensional slices of the 4-th moments M4M_{4} are given by:

By the definition of multilinear map, M4(ej1,ej2,ej3,I)M_{4}(e_{j_{1}},e_{j_{2}},e_{j_{3}},I) is a vector whose pp-th entry is equal to M4(ej1,ej2,ej3,ep)M_{4}(e_{j_{1}},e_{j_{2}},e_{j_{3}},e_{p}). We can compute this entry by Isserlis’ Theorem:

For mixture of zero-mean Gaussians, a two-dimensional slice of the 4th moment M4M_{4} is a matrix, and it is a linear combination of the covariance matrices with some additive rank one matrices:

For a mixture of zero-mean Gaussians, the two-dimensional slices of the 4-th moment M4M_{4} are given by:

Again this follows from Isserlis’ theorem. By definition of multilinear map this is a matrix whose (p,q)(p,q)-th entry is equal to

Similarly, for mixture of general Gaussians, we prove the following claims:

For a mixture of general Gaussians, the (j1,j2,j3)(j_{1},j_{2},j_{3})-th one-dimensional slice of M4M_{4} is given by:

This is very similar to Claim 5.1 and follows from the corollary of Isserlis’s theorem (Corollary A.2). There are 10 ways to partition the indices {j1,j2,j3,j4}\{j_{1},j_{2},j_{3},j_{4}\} into pairs and singletons: ((j1),(j2),(j3),(j4))((j_{1}),(j_{2}),(j_{3}),(j_{4})), ((j1,j2),(j3),(j4))((j_{1},j_{2}),(j_{3}),(j_{4})), ((j1,j3),(j2),(j4))((j_{1},j_{3}),(j_{2}),(j_{4})), ((j1,j4),(j2),(j3))((j_{1},j_{4}),(j_{2}),(j_{3})), ((j2,j3),(j1),(j4))((j_{2},j_{3}),(j_{1}),(j_{4})), ((j2,j4),(j1),(j3))((j_{2},j_{4}),(j_{1}),(j_{3})), ((j3,j4),(j1),(j2))((j_{3},j_{4}),(j_{1}),(j_{2})), ((j1,j2),(j3,j4))((j_{1},j_{2}),(j_{3},j_{4})), ((j1,j3),(j2,j4))((j_{1},j_{3}),(j_{2},j_{4})), ((j1,j4),(j2,j3))((j_{1},j_{4}),(j_{2},j_{3})). From this enumeration, we can specify each element in the vector of the one-dimensional slice. ∎

Since M4M_{4} gives linear observations on the symmetric low rank matrix X4X_{4}, it is natural to wonder whether we can use matrix completion techniques to recover X4X_{4} from M4M_{4}. Here we show this is impossible by giving a counter example: there are two mixture of Gaussians that generates the same 4th moment M4M_{4}, but has different X4X_{4} (even the span of Σ(i)\Sigma^{(i)}’s are different).

By ((a,b),(c,d))((a,b),(c,d)) we denote a 5×55\times 5 matrix AA which has 22’s on diagonals, and the only nonzero off-diagonal entries are Aa,b=Ab,a=Ac,d=Ad,c=1A_{a,b}=A_{b,a}=A_{c,d}=A_{d,c}=1. For example, ((1,2),(4,5))((1,2),(4,5)) will be the following matrix:

where all the missing entries are 0’s. Now we construct two mixtures of 3 Gaussians, all with mean 0 and weight 1/31/3. The covariance matrices are ((1,2),(4,5)),((1,3),(2,5)),((1,4),(3,5))((1,2),(4,5)),((1,3),(2,5)),((1,4),(3,5)) for the first mixture and ((1,2),(3,5)),((1,3),(4,5)),((1,4),(2,5))((1,2),(3,5)),((1,3),(4,5)),((1,4),(2,5)) for the second mixture. These are clearly different mixtures with different span of Σ(i)\Sigma^{(i)}’s: in the first mixture, Σ1,2(i)=Σ4,5(i)\Sigma^{(i)}_{1,2}=\Sigma^{(i)}_{4,5} for all matrices, but this is not true for the second mixture.

These two mixture of Gaussians have the same 4th moment M4M_{4}. This can be checked by using Isserlis’ theorem to compute the moments. Intuitively, this is true because all the pairs (1,i)(1,i) and (i,5)(i,5) appeared exactly twice in the covariance matrices for both mixtures; also, every 4-tuple (1,i,j,5)(1,i,j,5) appeared exactly once in the covariance matrices for both mixtures.

Appendix B Step 1: Span Finding

Recall that in Step 1 of the algorithm for learning mixture of zero-mean Gaussians, we find the span of the covariance matrices in three small steps. In this section, we prove the correctness and the robustness of each step with smoothed analysis.

For completeness we restate each substep and highlight the key properties we need, followed by the detailed proofs.

In Step 1 (a), for any set H\mathcal{H} of size n\sqrt{n}, we want to show that the one-dimensional slices of M4M_{4} span the entire subspace S=span{Σ~[:,j](i):i[k],jH}\mathcal{S}=\text{span}\left\{\widetilde{\Sigma}^{(i)}_{[:,j]}:i\in[k],j\in\mathcal{H}\right\}, which is the span of a subset of the columns in the covariance matrices.

This in particular means when j1,j2,j3Hj_{1},j_{2},j_{3}\in\mathcal{H}, the vector M~4(ej1,ej2,ej3,I)\widetilde{M}_{4}(e_{j_{1}},e_{j_{2}},e_{j_{3}},I) is in S\mathcal{S}. We need to show that the columns of the matrix QQ indeed span the entire subspace S\mathcal{S}.

It is sufficient to show that a subset of the column span the entire subspace. Form a three-way even partition of the set H\mathcal{H}, i.e., H(1)=H(2)=H(3)=H/3=n/3|\mathcal{H}^{(1)}|=|\mathcal{H}^{(2)}|=|\mathcal{H}^{(3)}|=|\mathcal{H}|/3=\sqrt{n}/3, and only consider the one-dimensional slices of M~4\widetilde{M}_{4} corresponding to the indices jiH(i)j_{i}\in\mathcal{H}^{(i)} for i=1,2,3i=1,2,3. In particular, we define matrix Q~S\widetilde{Q}_{S} with these one-dimensional slices of M~4\widetilde{M}_{4}:

Define matrix P~S\widetilde{P}_{S} with the corresponding columns of the covariance matrices, forming a basis (although not orthogonal) of the desired subspace S\mathcal{S}:

In the following two lemmas, we show that with high probability over the random perturbation, the column span of Q~S\widetilde{Q}_{S} is exactly equal to the column span of P~S\widetilde{P}_{S}, and robustly so.

Given M~4\widetilde{M}_{4}, the exact 4-th order moment of the ρ\rho-smooth mixture of zero-mean Gaussians, for any subset H[n]\mathcal{H}\in[n] with cardinality H=n|\mathcal{H}|=\sqrt{n}, let Q~S\widetilde{Q}_{S} be the matrix defined as in (12) with the one-dimensional slices of M~4\widetilde{M}_{4}. For any ϵ>0\epsilon>0, and for some absolute constant C1,C2,C3>0C_{1},C_{2},C_{3}>0, with probability at least 1(C1ϵ)C2n1-(C_{1}\epsilon)^{C_{2}n}, the kHk|\mathcal{H}|-th singular value of Q~S\widetilde{Q}_{S} is bounded below by:

In order to prove this lemma, we first write Q~S\widetilde{Q}_{S} as the product of three matrices.

Under the same assumptions of Lemma B.1, the matrix Q~S\widetilde{Q}_{S} can be written as

With some fixed and known row permutation π(2)\pi^{(2)} and π(3)\pi^{(3)}, the matrix B~(2)\widetilde{B}^{(2)} and B~(3)\widetilde{B}^{(3)} can be made block diagonal with identical blocks equal to Σ~H(1),H(3)\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(3)}} and Σ~H(1),H(2)\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(2)}}, respectively. Note that the three parts B~(1),B~(2),B~(3)\widetilde{B}^{(1)},\widetilde{B}^{(2)},\widetilde{B}^{(3)} do not have any common entry, nor do they involve any diagonal entry of the covariance matrices, therefore the three parts are independent when the covariances are randomly perturbed in the smoothed analysis.

It is easier to understand the structure by picture, see Figure 3. The rows of the matrix should be indexed by (j1,j2,j3)H(1)×H(2)×H(3)(j_{1},j_{2},j_{3})\in\mathcal{H}^{(1)}\times\mathcal{H}^{(2)}\times\mathcal{H}^{(3)}, which can also be interpreted as a cube (in the right). The block structure in the first part B~(1)\widetilde{B}^{(1)} correspond to a slice in H(2)×H(3)\mathcal{H}^{(2)}\times\mathcal{H}^{(3)} direction (for each block, the element in H(1)\mathcal{H}^{(1)} is fixed, the elements in H(2)\mathcal{H}^{(2)} and H(3)\mathcal{H}^{(3)} take all possible values). Similarly for B~(2)\widetilde{B}^{(2)} and B~(3)\widetilde{B}^{(3)} (as shown in figure).

(of Claim B.2 ) The proof of this claim is using Claim 5.1, the definition of matrices and the rule of matrix multiplication. Consider the column in Q~S\widetilde{Q}_{S} corresponding to the index (j1,j2,j3)(j_{1},j_{2},j_{3}) for j1H(1),j2H(2),j3H(3)j_{1}\in\mathcal{H}^{(1)},j_{2}\in\mathcal{H}^{(2)},j_{3}\in\mathcal{H}^{(3)}, and the row of B~S\widetilde{B}_{S} together with the mixing wights specifies how this column is formed as a linear combination of 3k3k columns of P~S\widetilde{P}_{S}. By the structure of M4M_{4} in Claim 5.1, the (j1,j2,j3)(j_{1},j_{2},j_{3})-th row of B~(1)\widetilde{B}^{(1)} has exactly kk entries corresponding to Σ~j2,j3(i)\widetilde{\Sigma}^{(i)}_{j_{2},j_{3}} for i[k]i\in[k], these entries are multiplied by ω~i\widetilde{\omega}_{i} in the middle (diagonal) matrix. Therefore, these directly correspond to the kk terms in Claim 5.1. Similarly the entries in B~(2)\widetilde{B}^{(2)} and B~(3)\widetilde{B}^{(3)} correspond to the other 2k2k terms. ∎

Using Claim B.2, we need to bound the smallest singular value for each of the matrices in order to bound the kHk|\mathcal{H}|-th singular value of Q~S\widetilde{Q}_{S}, this is deferred to the end of this part. The most important tool is a corollary (Lemma G.16) of the random matrix result proved in Rudelson and Vershynin (2009), which gives a lowerbound on the smallest singular value of perturbed rectangular matrices.

By Lemma B.1, we know Q~S\widetilde{Q}_{S} has exactly rank kHk|\mathcal{H}|, and is robust in the sense that its kHk|\mathcal{H}|-th singular value is large (polynomial in the amount of perturbation ρ\rho). By standard matrix perturbation theory, if we get Q^S\widehat{Q}_{S} close to Q~S\widetilde{Q}_{S} up to a high accuracy (inverse polynomial in the relevant parameters), the top kHk|\mathcal{H}| singular vectors will span a subspace that is very close to the span of Q~S\widetilde{Q}_{S}. We formalize this in the following lemma.

Note that the columns of SS are the leading left singular vectors of QSQ_{S}. We apply the standard matrix perturbation bound of singular vectors. Recall that SS is defined to be the first kHk|\mathcal{H}| left singular vector of QSQ_{S}, and we have

Therefore by Wedin’s Theorem (in particular the corollary Lemma G.5), we can conclude (16). ∎

We first use Claim B.2 to write Q~S=P~S(Dω~krIH)(B~S)\widetilde{Q}_{S}=\widetilde{P}_{S}\left(D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|}\right)(\widetilde{B}_{S})^{\top}, note that the matrix (Dω~krIH)(D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|}) has dimension kH×kHk|\mathcal{H}|\times k|\mathcal{H}|, therefore we just need to show with high probability each of the three factor matrix has large kHk|\mathcal{H}|-th singular value, and that implies a bound on the kHk|\mathcal{H}|-th singular value of Q~S\widetilde{Q}_{S} by union bound. The smallest singular value of P~S\widetilde{P}_{S} and B~S\widetilde{B}_{S} are bounded below by the following two Claims.

With high probability σkH(P~S)Ω(ρn)\sigma_{k|\mathcal{H}|}(\widetilde{P}_{S})\geq\Omega(\rho\sqrt{n}).

Now P~S\widetilde{P}^{\prime}_{S} is a randomly perturbed rectangular matrix, whose smallest singular value can be lower bounded using Lemma G.16, and we conclude that with probability at least 1(Cϵ)0.25n1-(C\epsilon)^{0.25n},

Next, we bound the smallest singular value of B~S\widetilde{B}_{S}.

With high probability σkH(B~S)Ω(ρn)\sigma_{k|\mathcal{H}|}(\widetilde{B}_{S})\geq\Omega(\rho\sqrt{n}).

We make use of the special structure of the three blocks of B~S\widetilde{B}_{S} to lower bound its smallest singular value.

First, we prove that the block diagonal matrix B~(1)\widetilde{B}^{(1)} has large singular values, even after projecting to the orthogonal subspace of the column span of B~(2)\widetilde{B}^{(2)} and B~(3)\widetilde{B}^{(3)}. This idea appeared several times in our proof and is abstracted in Lemma G.12. Apply the lemma and we have:

where the jj-th block of [B~(2),B~(3)][\widetilde{B}^{(2)},\widetilde{B}^{(3)}] has dimension (H/3)2×2kH/3(|\mathcal{H}|/3)^{2}\times 2k|\mathcal{H}|/3. Since

this means for each block, even after projection it has more than 3k3k rows. Note that by definition the three blocks B~(1)\widetilde{B}^{(1)}, B~(2)\widetilde{B}^{(2)} and B~(3)\widetilde{B}^{(3)} are independent and do not involve any diagonal elements of the covariance matrices, so each block after the two projections is again a rectangular random matrix. We can apply Lemma G.15, for any jj, for some absolute constant C1,C2,C3C_{1},C_{2},C_{3} (not fixed throughout the discussion), with probability at least 1(C1ϵ)C2n1-(C_{1}\epsilon)^{C_{2}n} over the randomness of Σ~H(2),H(3)\widetilde{\Sigma}_{\mathcal{H}^{(2)},\mathcal{H}^{(3)}}, we have:

Now we can take a union bound over the blocks and conclude that with high probability, the smallest singular value of each block is large.

In order to bound σk(2H/3)([B~(2),B~(3)])\sigma_{k(2|\mathcal{H}|/3)}([\widetilde{B}^{(2)},\widetilde{B}^{(3)}]), we use the same strategy. Note that B~(2)\widetilde{B}^{(2)} also has a block structure that corresponds to the H(1)×H(3)\mathcal{H}^{(1)}\times\mathcal{H}^{(3)} faces (see Figure 3). Again check the condition on dimension (H/3)2kkH/3Ω(n)>3k(|\mathcal{H}|/3)^{2}-k-k|\mathcal{H}|/3\geq\Omega(n)>3k, we can apply Lemma G.12 again to show that for any jj, with probability at least 1(C1ϵ)C2n1-(C_{1}\epsilon)^{C_{2}n} over the randomness of Σ~H(1),H(3)\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(3)}}, we have:

Again by Lemma G.15, for any jj, with probability at least 1(C1ϵ)C2n1-(C_{1}\epsilon)^{C_{2}n} over the randomness of Σ~H(1),H(3)\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(3)}}, we have:

Finally, for B~(3)\widetilde{B}^{(3)} it is a block diagonal structure with blocks correspond to H(1)×H(2)\mathcal{H}^{(1)}\times\mathcal{H}^{(2)} faces (see Figure 3). Each block is a perturbed rectangular matrix, therefore we apply Lemma G.15 to have that with high probability over the randomness of Σ~H(1),H(2)\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(2)}},

Now plug in the lower bounds in (18) (20) (21) into the inequalities in (17) and (19). By union bound we conclude that with high probability:

Finally, the diagonal matrix in the middle is given by the Kronecker product of IHI_{|\mathcal{H}|} and Dω~D_{\widetilde{\omega}}. Recall that Dω~D_{\widetilde{\omega}} is the diagonal matrix with the mixing weights ω~i\widetilde{\omega}_{i}’s on its diagonal. By property of Kronecker product and the assumption on the mixing weights, the smallest diagonal element of Dω~krIHD_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|} is at least ω0\omega_{0}. Therefore σkH(Dω~krIH)ω0\sigma_{k|\mathcal{H}|}(D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|})\geq\omega_{0}.

We have shown that the smallest singular value of all the three factor matrices are large with high probability. Therefore, apply union bound, we conclude that with probability at least 1exp(Ω(n))1-\exp(-\Omega(n)),

In Step 1 (b), given the subset of indices H\mathcal{H} and the subspace S\mathcal{S} obtained in Step 1 (a), we want to show that the projected two-dimensional slices of M~4\widetilde{M}_{4} span the subspace US\mathcal{U}_{S} defined in (5), which is the span of the covariance matrices with the columns projected the subspace S\mathcal{S}^{\perp}:

Recall that in Claim 5.6, we characterized the two dimensional slices of the 4-th moments M4M_{4} of mixture of zero-mean Gaussians as below:

In the following claim, we show that the columns of Q~US\widetilde{Q}_{U_{S}} indeed lie in the column span of P~US\widetilde{P}_{U_{S}}:

Given SS obtained in Step 1(a), the span of Σ~[:,j](i)\widetilde{\Sigma}_{[:,j]}^{(i)} for jHj\in\mathcal{H} and for all ii, then for j1,j2Hj_{1},j_{2}\in\mathcal{H}, we have:

Similar as in Step 1(a), in the next lemma we show that the columns of Q~US\widetilde{Q}_{U_{S}} indeed span the entire column span of P~US\widetilde{P}_{U_{S}}. Since the dimension of the column span of P~US\widetilde{P}_{U_{S}} is no larger than kk, it is enough to the kk-th singular value of Q~US\widetilde{Q}_{U_{S}}:

Given M~4\widetilde{M}_{4}, the exact 4-th order moment of the ρ\rho-smooth mixture of Gaussians , define the matrix Q~US\widetilde{Q}_{U_{S}} as in (23) with the two-dimensional slices of M~4\widetilde{M}_{4}. For any ϵ>0\epsilon>0, and for some absolute constant C1,C2,C3>0C_{1},C_{2},C_{3}>0, with probability at least 12(C1ϵ)C2n1-2(C_{1}\epsilon)^{C_{2}n}, the kk-th singular value of Q~US\widetilde{Q}_{U_{S}} is bounded below by:

Similar as before, we first examine the structure of the matrix Q~US\widetilde{Q}_{U_{S}}:

Under the same assumption as Lemma B.7, we can write Q~US\widetilde{Q}_{U_{S}} in the following matrix product form:

This claim follows from Claim B.6, and the rule of matrix product. The coefficients ω~iΣ~j1,j2(i)\widetilde{\omega}_{i}\widetilde{\Sigma}_{j_{1},j_{2}}^{(i)} for the linear combinations of vec(ProjSΣ~(i))\text{vec}(\text{Proj}_{\mathcal{S}^{\perp}}\widetilde{\Sigma}^{(i)}) are given by the columns of the product Dω~Σ~JD_{\widetilde{\omega}}\widetilde{\Sigma}_{J}^{\top}. The coefficients are then multiplied by P~US\widetilde{P}_{U_{S}} to select the correct columns. ∎

To prove Lemma B.7, similar to the proof ideas of Lemma B.1, we lower bound the kk-th singular value of all the three factors.

By the structural Claim B.8, we know the matrix Q~US\widetilde{Q}_{U_{S}} can be written as a product of the three matrices as Q~US=P~USDω~Σ~J\widetilde{Q}_{U_{S}}=\widetilde{P}_{U_{S}}D_{\widetilde{\omega}}{\widetilde{\Sigma}_{J}}^{\top}.

We lower bound the kk-th singular value of each of the three factors. It is easy for the last two matrices. Note that by assumption σk(Dω~)ωo\sigma_{k}(D_{\widetilde{\omega}})\geq\omega_{o}, and since Σ~J{\widetilde{\Sigma}_{J}}^{\top} is just a perturbed rectangular matrix, we can apply Lemma G.15 and with high probability we have σk(Σ~J)Ω(ρn)\sigma_{k}({\widetilde{\Sigma}_{J}})\geq\Omega(\rho\sqrt{n}).

However, we cannot apply the same trick to directly bound the smallest singular value of DSD_{S^{\perp}} and ProjDSΣ~\text{Proj}_{D_{S^{\perp}}}\widetilde{\Sigma} separately. The problem here is that DSD_{\mathcal{S}^{\perp}} and Σ~\widetilde{\Sigma} are not independent, as the subspace SS obtained in Step 1(a) also depends on the perturbation on Σ~\widetilde{\Sigma}, therefore ProjDSΣ~\text{Proj}_{D_{S^{\perp}}}\widetilde{\Sigma} is not simply a projected perturbed matrix. Instead, we show that even conditioned on the part of randomness that is common in SS and Σ~\widetilde{\Sigma}, Σ~\widetilde{\Sigma} still has sufficient randomness due to the high dimensions, and we can still extract a tall random matrix out of it. This is elaborated in the following claim:

Under the assumptions of Lemma B.7, with high probability the matrix P~US=DSΣ~\widetilde{P}_{U_{S}}=D_{\mathcal{S}^{\perp}}\widetilde{\Sigma} has smallest singular value at least Ω(ρn)\Omega(\rho n).

Let L\mathcal{L} be the set of the (j1,j2)(j_{1},j_{2})-th entries of Σ~(i)\widetilde{\Sigma}^{(i)} for all ii and one of j1,j2j_{1},j_{2} is in the set H\mathcal{H}. By Step 1(a), the subspace S=span(S,ej:jH)\mathcal{S}^{\prime}=\text{span}(S,e_{j}:j\in\mathcal{H}) is only dependent on the entries in L\mathcal{L}. Here we need to include the span of eje_{j}’s for jHj\in\mathcal{H} because the diagonal entries can depend on the other random perturbations. By adding the span of the vector eje_{j}’s for jHj\in\mathcal{H} the subspace remains invariant no matter how the diagonal entries change.

Let Z=span(Σ,SkrIn)\mathcal{Z}=\text{span}(\Sigma,S^{\prime}\otimes_{kr}I_{n}), and recall that the columns of Σ\Sigma are the factorization of the unperturbed covariance matrices. The subspace Z\mathcal{Z} has dimension no larger than H(k+1)n+kn2/10|\mathcal{H}|(k+1)n+k\leq n^{2}/10, and depends on the randomness of L\mathcal{L}.

Let Σ~=Σ+E\widetilde{\Sigma}=\Sigma+E where EE is the random perturbation matrix. Now we condition on the randomness in L\mathcal{L}. By definition the subspace Z\mathcal{Z} is deterministic conditional on L\mathcal{L}. However, even if we only consider entries of E\LE\backslash\mathcal{L} there are still at least (nkH2)n2/4{n-k|\mathcal{H}|\choose 2}\geq n^{2}/4 independent random variables. We shall show the randomness is enough to guarantee that the smallest singular value of ProjDSΣ~\text{Proj}_{D_{\mathcal{S}^{\perp}}}\widetilde{\Sigma} is lower bounded with high probability conditioned on L\mathcal{L}:

Here we used the fact that projection to a subspace cannot increase the singular values (Lemma G.11).

Conditioned on the randomness of entries in L\mathcal{L}, E\LE\backslash\mathcal{L} still has at least n2/4n^{2}/4 random directions, while the dimension of the deterministic subspace Z\mathcal{Z} is at most n2/10n^{2}/10. Therefore we can apply Lemma G.15 again to argue that conditionally, for every ϵ>0\epsilon>0, with probability at least 1(C1ϵ)C2n21-(C_{1}\epsilon)^{C_{2}n^{2}} we have:

In summary, apply union bound and we can conclude that with probability at least 1(C1ϵ)C2n1-(C_{1}\epsilon)^{C_{2}n},

Next, we again use matrix perturbation bounds to prove the robustness of this step, which depends on the singular value decomposition of the matrix Q~US\widetilde{Q}_{U_{S}}.

Given the empirical 4-th order moments M^4=M~4+E4\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}, and given the output ProjS^\text{Proj}_{\widehat{S}^{\perp}} from Step 1 (a). Suppose that ProjS^ProjS~δ1\|\text{Proj}_{\widehat{S}^{\perp}}-\text{Proj}_{\widetilde{S}^{\perp}}\|\leq\delta_{1}, and suppose that the absolute value of entries of E4E_{4} are at most δ2\delta_{2} for δ2Q~USF/n3\delta_{2}\leq\|\widetilde{Q}_{U_{S}}\|_{F}/\sqrt{n^{3}}. Conditioned on the high probability event σk(Q~US)>0\sigma_{k}(\widetilde{Q}_{U_{S}})>0, we have:

Proof of Lemma B.10

Note that the columns of USU_{S} are the leading left singular vectors of Q~US\widetilde{Q}_{U_{S}}. We want to apply the perturbation bound of singular vectors.

Similar to the proof of Lemma B.3, we first need to bound the spectral distance between Q^US\widehat{Q}_{U_{S}} and Q~US\widetilde{Q}_{U_{S}}. In fact we will even bound the Frobenius norm difference:

where we used the assumption Σ~(i)1\|\widetilde{\Sigma}^{(i)}\|\leq 1 to bound Q~U0F\|\widetilde{Q}_{U_{0}}\|_{F}, used the upperbound on Q^U0Q~U0F\|\widehat{Q}_{U_{0}}-\widetilde{Q}_{U_{0}}\|_{F} to bound the term (D^SD~S)(Q^U0Q~U0)F(D^SD~S)Fδ2n2J(D^SD~S)FQ~U0F\|(\widehat{D}_{S^{\perp}}-\widetilde{D}_{S^{\perp}})(\widehat{Q}_{U_{0}}-\widetilde{Q}_{U_{0}})\|_{F}\leq\|(\widehat{D}_{S^{\perp}}-\widetilde{D}_{S^{\perp}})\|_{F}\delta_{2}\sqrt{n^{2}|\mathcal{J}|}\leq\|(\widehat{D}_{S^{\perp}}-\widetilde{D}_{S^{\perp}})\|_{F}\|\widetilde{Q}_{U_{0}}\|_{F}, and used the fact that Frobenius norm is sub-multiplicative. Apply Wedin’s Theorem (in particular the corollary Lemma G.5), we can conclude (26). ∎

B.3 Step 1 (c). Finding 𝒰𝒰\mathcal{U} by Merging the Two Projected Span

Pick two disjoint sets of indices H1,H2\mathcal{H}_{1},\mathcal{H}_{2}, and repeat Step 1 (a) and Step 1 (b) on each of them to get S~j\widetilde{S}_{j}^{\perp} and U~j\widetilde{U}_{j} for j=1,2j=1,2. In Step 1 (c), we merge the two span U~1\widetilde{U}_{1} and U~2\widetilde{U}_{2} to get U\mathcal{U}.

If we are given two projections ProjS1U\text{Proj}_{S_{1}^{\perp}}U and ProjS2U\text{Proj}_{S_{2}^{\perp}}U of a matrix UU, and if the union of the two subspaces S1S_{1}^{\perp} and S2S_{2}^{\perp} have full rank, namely dim(S1S2)=n\text{dim}(S_{1}\cup S_{2})=n, then we can recover UU by:

However, it is slightly different if we are given two projections of a subspace U\mathcal{U}, since a subspace can be equivalently represented by different orthonormal basis up to linear transformation.

In particular, in our setting for j=1,2,j=1,2, we can write U~j=(ProjSjkrIn)Σ~Wj\widetilde{U}_{j}=(\text{Proj}_{S_{j}^{\perp}}\otimes_{kr}I_{n})\widetilde{\Sigma}W_{j} for some fixed but unknown full rank matrix WjW_{j} (which makes the columns of matrix Σ~Wj\widetilde{\Sigma}W_{j} an orthonormal basis of U\mathcal{U}). Recall that we define Σ~[vec(Σ~(i)):i[k]]\widetilde{\Sigma}\equiv[\text{vec}(\widetilde{\Sigma}^{(i)}):i\in[k]], and DSjProjSjkrInD_{S_{j}^{\perp}}\equiv\text{Proj}_{S_{j}^{\perp}}\otimes_{kr}I_{n} for j=1,2j=1,2.

The following Lemma shows that we can still robustly recover the subspace U\mathcal{U} if the two projections have sufficiently large overlapping. The basic idea is to use the overlapping part to align the two basis of the subspace which the two projections act on.

This is the detailed statement of Condition 5.10.

For two ss-dimensional known subspaces S1S_{1} and S2S_{2}, Let the columns of AA be the first 2s2s singular vectors of [S1,S2][S_{1},S_{2}], and let the columns of S3S_{3} correspond to the first (n2s)(n-2s) singular vectors of (InProjA)(I_{n}-\text{Proj}_{A}), therefore S3(S1S2)S_{3}\subset(S_{1}\cup S_{2})^{\perp}. Suppose that σk(ProjS3U)>0\sigma_{k}(\text{Proj}_{S_{3}}U)>0 and that σ2s([S1,S2])>0\sigma_{2s}([S_{1},S_{2}])>0. Define matrices U1=ProjS1V1U_{1}=\text{Proj}_{S_{1}^{\perp}}V_{1} and U2=ProjS2V2U_{2}=\text{Proj}_{S_{2}^{\perp}}V_{2} and we know that U1U1=U2U2=IkU_{1}^{\top}U_{1}=U_{2}^{\top}U_{2}=I_{k}.

We are given S^1,S^2\widehat{S}_{1},\widehat{S}_{2} and U^1,U^2\widehat{U}_{1},\widehat{U}_{2}, and suppose that for j=1,2,j=1,2, we have S^jSjFδs\|\widehat{S}_{j}-S_{j}\|_{F}\leq\delta_{s} and U^jUjFδu\|\widehat{U}_{j}-U_{j}\|_{F}\leq\delta_{u}, for δs1,δu1\delta_{s}\leq 1,\delta_{u}\leq 1.

If σk(ProjS3U)>0\sigma_{k}(\text{Proj}_{S_{3}}U)>0 and σ2s([S1,S2])>0\sigma_{2s}([S_{1},S_{2}])>0, then for some absolute constant CC we have:

The proof will proceed in two steps, we first show that if we are given the exact inputs, namely δs=δu=0\delta_{s}=\delta_{u}=0, then the column span of U^\widehat{U} defined in (28) is identical to the desired subspace U\mathcal{U}. Then we give a stability result using matrix perturbation bounds.

1. Solving the problem using exact inputs.

Given the exact inputs S1,S2S_{1},S_{2}, U1,U2U_{1},U_{2}, first we show that under the conditions σ2s([S1,S2])>0\sigma_{2s}([S_{1},S_{2}])>0 and σk(ProjS3U)>0\sigma_{k}(\text{Proj}_{S_{3}}U)>0, then the column span of the matrix [U2, U1(S3U1)(S3U2)]\left[U_{2},\ U_{1}(S_{3}^{\top}U_{1})^{\dagger}(S_{3}^{\top}U_{2})\right] is indeed identical to U=span(V1)=span(V2)\mathcal{U}=span(V_{1})=span(V_{2}).

Given V=V1V2V=V_{1}^{\dagger}V_{2}, then U1V=ProjS1V1V=ProjS1V2U_{1}V=\text{Proj}_{S_{1}^{\perp}}V_{1}V=\text{Proj}_{S_{1}^{\perp}}V_{2}. Recall that by definition U2=ProjS2V2U_{2}=\text{Proj}_{S_{2}^{\perp}}V_{2}, then the problem is now reduced to the simple problem of merging two projections (U2=ProjS2V2U_{2}=\text{Proj}_{S_{2}^{\perp}}V_{2} and U1V=ProjS1V2U_{1}V=\text{Proj}_{S_{1}^{\perp}}V_{2}) of the same matrix (V2V_{2}). Therefore, to show that the columns of U0=[U2,U1V]U_{0}=[U_{2},U_{1}V] indeed span V2V_{2} and thus the desired subspace UU, we only need to show that [ProjS1,ProjS2][\text{Proj}_{S_{1}^{\perp}},\text{Proj}_{S_{2}^{\perp}}] has full column span. We show this by bounding the smallest singular value of it:

where the last inequality is by the assumption that σ2s([S1,S2])>0\sigma_{2s}([S_{1},S_{2}])>0. ∎

Next, we show that in the exact case, the matrix V=V1V2V=V_{1}^{{\dagger}}V_{2} can be computed by V=(S3U1)(S3U2)V=(S_{3}^{\top}U_{1})^{\dagger}(S_{3}^{\top}U_{2}). The basic idea is to use the overlapping part of the two projections U1U_{1} and U2U_{2} to align the two basis V1V_{1} and V2V_{2}. Recall that by its construction, S3=(S1S2)=S1S2S_{3}=(S_{1}\cup S_{2})^{\perp}=S_{1}^{\perp}\cap S_{2}^{\perp}, and ProjS3=ProjS1S2\text{Proj}_{S_{3}}=\text{Proj}_{S_{1}^{\perp}\cap S_{2}^{\perp}}. Then for j=1j=1 and 22, we have:

Moreover, since Uj=ProjSjVjU_{j}=\text{Proj}_{S_{j}^{\perp}}V_{j} is an orthonormal matrix, we have that all singular values of VjV_{j} are equal or greater than 1. Also note that UU is an orthonormal matrix, so we have that σk(ProjS3Vj)σk(ProjS3U)>0\sigma_{k}(\text{Proj}_{S_{3}}V_{j})\geq\sigma_{k}(\text{Proj}_{S_{3}}U)>0. In other words, S3VjS_{3}^{\top}V_{j} has full column rank kk. Therefore,

Given S^1,S^2\widehat{S}_{1},\widehat{S}_{2} and U^1,U^2\widehat{U}_{1},\widehat{U}_{2} which are close to the exact S1,S2,U1S_{1},S_{2},U_{1} and U2U_{2}, we then need to bound the distance ProjU^ProjU\|\text{Proj}_{\widehat{U}}-\text{Proj}_{U}\|. This follows the standard perturbation analysis. In order to apply Lemma G.5 we need to bound the distance between U^U0F\|\widehat{U}-U_{0}\|_{F}, and lower bound the smallest singular value of U0U_{0}, namely σk(U0)\sigma_{k}(U_{0}). Recall that we define U0U_{0} same as in (28) for the exact case with δs=δu=0\delta_{s}=\delta_{u}=0.

First, we bound U^U0F\|\widehat{U}-U_{0}\|_{F}. Note that we can write U0U_{0}^{\top} as U0=U2BU_{0}^{\top}=U_{2}B, where B=[I,U1(S3U1)S3]B=[I,\quad U_{1}(S_{3}^{\top}U_{1})^{{\dagger}}S_{3}]^{\top}.

Recall that S3=(S1S2)S_{3}=(S_{1}\cup S_{2})^{\perp}, apply Lemma G.5 and we have:

Next, note that S^3S3<1\|{\widehat{S}_{3}}-{S_{3}}\|<1 and U^1U1δu<1\|\widehat{U}_{1}-U_{1}\|\leq\delta_{u}<1, apply Lemma G.6 we have:

Next, note that σk(S3U1)=σk(ProjS3V1)>0\sigma_{k}(S_{3}^{\top}U_{1})=\sigma_{k}(\text{Proj}_{S_{3}}V_{1})>0 by assumption. Apply Lemma G.8, we have:

Next, apply Lemma G.6 again we can bound the perturbation of matrix product:

where CC is some absolute constant, and the last inequality summarizes the previous three inequalities, and used the fact that σk(ProjS3V1)<1\sigma_{k}(\text{Proj}_{S_{3}}V_{1})<1. Note that U^U0FkU^U0\|\widehat{U}-U_{0}\|_{F}\leq\sqrt{k}\|\widehat{U}-U_{0}\|.

We are left to bound σk(U0)\sigma_{k}(U_{0}). Recall that σk(V2)σk(U2)=1\sigma_{k}(V_{2})\geq\sigma_{k}(U_{2})=1, and we have shown that in the exact case U0=[ProjS2V2,ProjS1V2]U_{0}=[\text{Proj}_{S_{2}^{\perp}}V_{2},\quad\text{Proj}_{S_{1}^{\perp}}V_{2}]. Then we can bound the smallest singular value of U0U_{0} following the inequality in (47):

Finally we can apply Lemma G.5 to bound the distance between the projections by:

In Step 1 (c), we are given the output U~1\widetilde{U}_{1} and U~2\widetilde{U}_{2} from Step 1 (b), as well as the output S~1\widetilde{S}_{1}^{\perp} and S~2\widetilde{S}_{2}^{\perp} from Step 1 (a). Recall that U=span{vec(Σ~(i)):i[k]}\mathcal{U}=span\{\text{vec}(\widetilde{\Sigma}^{(i)}):i\in[k]\}, and for j=1,2j=1,2, the matrix U~j\widetilde{U}_{j} given by Step 1 (b) corresponds to the subspace U\mathcal{U} projected to the subspace B~j=S~jkrIn\widetilde{B}_{j}=\widetilde{S}_{j}^{\perp}\otimes_{kr}I_{n}.

Let matrix S~3=S~1S~2=(S~1S~2)\widetilde{S}_{3}=\widetilde{S}_{1}^{\perp}\cap\widetilde{S}_{2}^{\perp}=(\widetilde{S}_{1}\cup\widetilde{S}_{2})^{\perp} (obtained by taking the singular vectors of (InAA)(I_{n}-AA^{\top}), where AA corresponds to the first 2kH2k|\mathcal{H}| singular vectors of [S~1,S~2][\widetilde{S}_{1},\widetilde{S}_{2}]), and denote B~3=S~3krIn\widetilde{B}_{3}=\widetilde{S}_{3}\otimes_{kr}I_{n}. Define the matrix Q~U\widetilde{Q}_{U} to be:

and similarly define the perturbed version Q^U\widehat{Q}_{U} to be:

Now we want to apply Lemma B.11 to show that ProjQ~U=ProjΣ~\text{Proj}_{\widetilde{Q}_{U}}=\text{Proj}_{\widetilde{\Sigma}} and bound the distance ProjQ^UProjΣ~\|\text{Proj}_{\widehat{Q}_{U}}-\text{Proj}_{\widetilde{\Sigma}}\|. In order to use the lemma, we first use smoothed analysis to show (in Lemma B.13 and Lemma B.14 )that the conditions required by the lemma are all satisfied with high probability over the ρ\rho-perturbation of the covariance matrices, then conclude the robustness of Step 1 (c) in Lemma B.15.

With high probability, for some constant CC

This is in fact exactly the same as Claim B.9.

Given Σ~=Σ+E{\widetilde{\Sigma}}=\Sigma+E, by the definition of S~3\widetilde{S}_{3} and B~3\widetilde{B}_{3} we know that B~3\widetilde{B}_{3} only depends on the randomness of PJEP_{J}E for i=1,2i=1,2, where

and PJP_{J} denotes the mapping that only keeps the coordinates corresponding to the set J\mathcal{J}. Therefore, we have:

Note that the rank of B~3\widetilde{B}_{3}^{\perp} is 2nkH)2nk|\mathcal{H}|) and J=2nH|\mathcal{J}|=2n|\mathcal{H}|, thus n2J2nkHk=Ω(n2)>2kn_{2}-|\mathcal{J}|-2nk|\mathcal{H}|-k=\Omega(n^{2})>2k. So we can apply Lemma G.15 to conclude that for some absolute constants C1,C2,C3C_{1},C_{2},C_{3}, with probability at least 1(C1ϵ)C2n21-(C_{1}\epsilon)^{C_{2}n^{2}}, σk(B~3Σ~)ϵρC3n2.\sigma_{k}(\widetilde{B}_{3}^{\top}{\widetilde{\Sigma}})\geq\epsilon\rho\sqrt{C_{3}n^{2}}.

With high probability, for some constant CC,

For i=1,2i=1,2, recall that S~i\widetilde{S}_{i} is the singular vectors of Q~Si\widetilde{Q}_{S_{i}}, where Q~Si\widetilde{Q}_{S_{i}} is defined with the set Hi\mathcal{H}_{i} as in (12). We can write the singular value decomposition of Q~Si\widetilde{Q}_{S_{i}} as Q~Si=S~iD~iV~i\widetilde{Q}_{S_{i}}=\widetilde{S}_{i}\widetilde{D}_{i}\widetilde{V}_{i}^{\top} for some diagonal matrix D~i\widetilde{D}_{i} and orthonormal matrix V~i\widetilde{V}_{i}, and

Note that we can write [Q~S1,Q~S2]=[P~S1,P~S2](diag(BS~1,BS~2))[\widetilde{Q}_{S_{1}},\widetilde{Q}_{S_{2}}]=[\widetilde{P}_{S_{1}},\widetilde{P}_{S_{2}}](\text{diag}(B_{\widetilde{S}_{1}},B_{\widetilde{S}_{2}}))^{\top}, and following almost exactly with the proof of Lemma B.1, we can argue that, with probability at least 1(C1ϵ)C2n1-(C_{1}\epsilon)^{C_{2}n},

Moreover, by the structure of M4M_{4} and the bounds on Σ~(i)12I\widetilde{\Sigma}^{(i)}\prec{1\over 2}I, we can bound Q~Si3n(H/3)3\|\widetilde{Q}_{S_{i}}\|\leq 3\sqrt{n(|\mathcal{H}|/3)^{3}}, and thus:

Therefore, we can conclude that, for some absolute constant CC, we have:

In the next lemma, we apply Lemma B.11 to show that under perturbation, with high probability the column span of ProjQ~U=ProjΣ~\text{Proj}_{\widetilde{Q}_{U}}=\text{Proj}_{\widetilde{\Sigma}} and this step is robust.

Note that σ2kHn([B~1,B~2])=σ2kH([S~1,S~2])\sigma_{2k|\mathcal{H}|n}([\widetilde{B}_{1},\widetilde{B}_{2}])=\sigma_{2k|\mathcal{H}|}([\widetilde{S}_{1},\widetilde{S}_{2}]), and for i=1,2i=1,2, we have B^iB~iFnS^iS~iFnδs\|\widehat{B}_{i}-\widetilde{B}_{i}\|_{F}\leq\sqrt{n}\|\widehat{S}_{i}-\widetilde{S}_{i}\|_{F}\leq\sqrt{n}\delta_{s}. Therefore, with the above two smoothed analysis Lemmas showing polynomial bound of σ2kH([S~1,S~2])\sigma_{2k|\mathcal{H}|}([\widetilde{S}_{1},\widetilde{S}_{2}]) and σk(ProjB~3(Σ~))\sigma_{k}(\text{Proj}_{\widetilde{B}_{3}}(\widetilde{\Sigma})), the proof of Lemma B.15 follows by applying Lemma B.11.

Appendix C Step 2. Unfolding the Moments

In the second step of the algorithm, we solve two systems of linear equations to recover the unfolded moments.

Recall the first system of linear equations is

Since UU is the column span matrix of Σ~\widetilde{\Sigma}, there must exist a Y4Y_{4} such that X4=Σ~Dω~Σ~=UY4UX_{4}=\widetilde{\Sigma}D_{\widetilde{\omega}}\widetilde{\Sigma}^{\top}=UY_{4}U^{\top} (recall that Dω~D_{\widetilde{\omega}} is the diagonal matrix with entries ω~i\widetilde{\omega}_{i}), so the system must have at least one solution.

The main theorem of this section shows that with high probability over the ρ\rho-perturbation the system has a unique solution:

With high probability over the ρ\rho-perturbation of Σ~\widetilde{\Sigma}, the smallest singular value of the coefficient matrix H~4\widetilde{H}_{4} is lower bounded by σmin(H~4)Ω(ρ2n/k)\sigma_{min}(\widetilde{H}_{4})\geq\Omega(\rho^{2}n/k). As a corollary, the system has a unique solution.

In order to prove this theorem, we first need the following structural lemma:

Next we need to prove the bounds on the smallest singular values for A~4\widetilde{A}_{4} and B~4\widetilde{B}_{4}. The first matrix A~4\widetilde{A}_{4} is essentially a projection of the Kronecker product (Σ~krΣ~)(\widetilde{\Sigma}\otimes_{kr}\widetilde{\Sigma}). In particular, this projection satisfy the “symmetric off-diagonal” property defined below:

Since symmetric off-diagonal is a property on the structure of rows of the basis PP. If one basis of the subspace P\mathcal{P} is symmetric off-diagonal, then any basis is too. Moreover, any orthogonal basis of the subspace P\mathcal{P} will still be symmetric off-diagonal.

In the following main lemma, we show even after projection to any symmetric off-diagonal space with sufficiently many dimensions, the “unique” columns of a Kronecker product of random matrices still has good condition number.

Let us first see how Theorem C.1 follows from the two lemmas (Lemma C.2 and Lemma C.5 ).

(of Theorem C.1) Using the structural Lemma C.2, we know we only need to bound the smallest singular value of A~4\widetilde{A}_{4} and B~4\widetilde{B}_{4} separately. The following two claims directly imply the theorem.

σmin(A~4)Ω(ρ2n2).\sigma_{min}(\widetilde{A}_{4})\geq\Omega(\rho^{2}n_{2}).

σmin(B~4)1/(4Σ~2)1/(4nk).\sigma_{min}(\widetilde{B}_{4})\geq 1/(4\|\widetilde{\Sigma}\|^{2})\geq 1/(4nk).

We apply Lemma C.5 to prove Claim C.6. Note that the ρ\rho-perturbed covariances Σ~\widetilde{\Sigma} is not a random Gaussian matrix, yet it is equal to the unperturbed matrix Σ\Sigma plus a random Gaussian matrix EΣ=ρEE_{\Sigma}=\rho ENote that the diagonal entries are then arbitrarily perturbed, but we will project on a symmetric off-diagonal subspace so changes on diagonal entries do not change the result.. Since we consider arbitrary Σ\Sigma, the columns of Σ~\widetilde{\Sigma} as well as the columns A~4\widetilde{A}_{4} may not be incoherent.

Instead, we project A~4\widetilde{A}_{4} to a subspace to strip away the terms involving the original matrix Σ\Sigma. Let SS be the range space corresponding to the projection F4\mathcal{F}_{4}. Recall that S=n4=Ω(n22)|S|=n_{4}=\Omega(n_{2}^{2}), and by the definition of F4\mathcal{F}_{4}, SS is symmetric off-diagonal. Define the subspace S=\mboxspan(S,ΣkrIn2,In2krΣ)S^{\prime}=\mbox{span}(S^{\perp},\Sigma\otimes_{kr}I_{n_{2}},I_{n_{2}}\otimes_{kr}\Sigma). Let P=(S)P=(S^{\prime})^{\perp}. By construction PS2kn2=Ω(n22)|P|\geq|S|-2kn_{2}=\Omega(n_{2}^{2}). Also, since P=(S)P=(S^{\prime})^{\perp} is a subspace of SS, it must also be symmetric off-diagonal (see Remark C.4). After projecting A~4\widetilde{A}_{4} to PP, we know that the (i,j)(i,j)-th column (1ijk)(1\leq i\leq j\leq k) of PA~4P^{\top}\widetilde{A}_{4} is given by:

Thus in PA~4P^{\top}\widetilde{A}_{4} all the terms involving Σ\Sigma disappears. Therefore

where the first inequality is because the smallest singular value cannot become larger after projection, the first equality is by definition, the second equality is by the property of PP, and the final step uses Lemma C.5Note that although diagonal entries are not perturbed, we also have P[i,i]=0P_{[i,i]}=0 so we can still apply the lemma..

In this part we prove the structural Lemma C.2.

(of Lemma C.2) First, assume we know the true Σ~\widetilde{\Sigma} matrix, then in order to get the unfolded moments X4X_{4}, we only need to solve the equation F4(Σ~D4Σ~)=M4\mathcal{F}_{4}(\widetilde{\Sigma}D_{4}\widetilde{\Sigma}^{\top})=\overline{M}_{4} with the k×kk\times k symmetric variable D4D_{4}, and the solution should be equal to the diagonal matrix Dω~D_{\widetilde{\omega}}.

However, we only know UU which is the column span of Σ~\widetilde{\Sigma}, so we can only use UY4UUY_{4}U^{\top} and let UY4U=Σ~D4Σ~UY_{4}U^{\top}=\widetilde{\Sigma}D_{4}\widetilde{\Sigma}^{\top}. Note that there is a one-to-one correspondence between Y4Y_{4} and D4D_{4}. In particular we know D4=(Σ~U)Y4(Σ~U)D_{4}=(\widetilde{\Sigma}^{\dagger}U)Y_{4}(\widetilde{\Sigma}^{\dagger}U)^{\top}, this is exactly the second part B~4\widetilde{B}_{4}.

Now the first matrix A~4\widetilde{A}_{4} should map \mboxvec(D4)\mbox{vec}(D_{4}) to M4M_{4}. By construction, the (i,j)(i,j)-th column (i<j)(i<j) of A~4\widetilde{A}_{4} is equal to F4(Σ~(i)Σ~(j)+Σ~(j)Σ~(i))=2F4(Σ~(i)Σ~(j))\mathcal{F}_{4}(\widetilde{\Sigma}^{(i)}\odot\widetilde{\Sigma}^{(j)}+\widetilde{\Sigma}^{(j)}\odot\widetilde{\Sigma}^{(i)})=2\mathcal{F}_{4}(\widetilde{\Sigma}^{(i)}\odot\widetilde{\Sigma}^{(j)}), since F4\mathcal{F}_{4} is symmetric off-diagonal we know F4(v1v2)=F4(v2v1)\mathcal{F}_{4}(v_{1}\odot v_{2})=\mathcal{F}_{4}(v_{2}\odot v_{1}) for any two vectors v1,v2v_{1},v_{2}. For the (i,i)(i,i)-th column, by construction they are equal to F4(Σ~(i)Σ~(i))\mathcal{F}_{4}(\widetilde{\Sigma}^{(i)}\odot\widetilde{\Sigma}^{(i)}) as we wanted. ∎

Main Lemma on Projection of Kronecker Product

The singular values of Kronecker Product between two matrices are well-understood: they are just the products of the singular values of the two matrices. Therefore, the Kronecker product of two rank kk matrices will have rank k2k^{2}. However, in our case the problem becomes more complicated because we only look at a projection of the resulting matrix. The projected Kronecker product may no longer have rank k2k^{2} because of symmetry. Here we are able to show that even with projection to a low dimensional space, the rank of the new matrix is still as large as (k+12){k+1\choose 2}.

The basic idea of the proof is to consider the inner-products between columns, and show that the columns are incoherent even after projection.

(of Lemma C.5) Consider the matrix (EkrE)uniqPP(EkrE)uniq(E\otimes_{kr}E)_{uniq}^{\top}PP^{\top}(E\otimes_{kr}E)_{uniq}, we shall show the matrix is diagonally dominant and hence its smallest singular value must be large. In order to do that we need to prove the following two claims:

For any i,jki,j\leq k, iji\leq j, with high probability P(E[:,i]E[:,j])2Ω(n22).\|P^{\top}(E_{[:,i]}\odot E_{[:,j]})\|^{2}\geq\Omega(n_{2}^{2}).

For any i,jki,j\leq k, iji\leq j, with high probability

With this two claims, we can apply Gershgorin’s Disk Theorem G.9 to conclude that σmin((EkrE)uniqPP(EkrE)uniq)Ω(n22)\sigma_{min}((E\otimes_{kr}E)_{uniq}^{\top}PP^{\top}(E\otimes_{kr}E)_{uniq})\geq\Omega(n_{2}^{2}). Therefore σmin(P(EkrE)uniq)Ω(n2)\sigma_{min}(P^{\top}(E\otimes_{kr}E)_{uniq})\geq\Omega(n_{2}).

Now we prove the two claims. For Claim C.8, it essentially says the projection of a random vector to a fixed subspace should have large norm. If the vector has independent entries, this is first shown in Tao and Vu (2006). Recently Vu and Wang (2013) generalized the result to KK-concentrated vectors, see Lemma G.18. By Lemma G.19 we know conditioned on E[:,i],E[:,j]2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}, (E[:,i]E[:,j])p,q(pq)(E_{[:,i]}\odot E_{[:,j]})_{p,q}(p\neq q) is O(n2)O(\sqrt{n_{2}})-concentrated. By assumption PP ignores all the (E[:,i]E[:,j])p,p(E_{[:,i]}\odot E_{[:,j]})_{p,p} entries. Therefore Pr[P(E[:,i]E[:,j])2d22td2+t2]CeΩ(t2/n2)+eΩ(n2)\Pr[|\|P^{\top}(E_{[:,i]}\odot E_{[:,j]})\|^{2}-d_{2}|\geq 2t\sqrt{d_{2}}+t^{2}]\leq Ce^{-\Omega(t^{2}/n_{2})}+e^{-\Omega(n_{2})}. We then pick t=d2/5Ω(n2)t=\sqrt{d_{2}}/5\geq\Omega(n_{2}), which implies Pr[P(E[:,i]E[:,j])2d2/2]CeΩ(n2)\Pr[\|P(E_{[:,i]}\odot E_{[:,j]})\|^{2}\leq d_{2}/2]\leq Ce^{-\Omega(n_{2})}. This is what we need for Claim C.8.

For Claim C.9, we need to bound terms of the form <P(E[:,i]E[:,j]),P(E[:,i]E[:,j])>\left<P^{\top}(E_{[:,i]}\odot E_{[:,j]}),P^{\top}(E_{[:,i^{\prime}]}\odot E_{[:,j^{\prime}]})\right>. These are degree-4 Gaussian chaoses and are well-studied in Latała et al. (2006).

We break the terms according to how many of i,ji^{\prime},j^{\prime} appears in i,ji,j.

Case 1: i,j∉{i,j}i^{\prime},j^{\prime}\not\in\{i,j\}. In this case we first randomly pick E[:,i],E[:,j]E_{[:,i]},E_{[:,j]}, and condition on the high probability event that E[:,i],E[:,j]2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}. In this case the inner-product can be rewritten as <PP(E[:,i]E[:,j]),(E[:,i]E[:,j])>\left<PP^{\top}(E_{[:,i]}\odot E_{[:,j]}),(E_{[:,i^{\prime}]}\odot E_{[:,j^{\prime}]})\right>, and we know PP(E[:,i]E[:,j])4n2\|PP^{\top}(E_{[:,i]}\odot E_{[:,j]})\|\leq 4n_{2}. Also, since PP is symmetric off-diagonal we know in this degree-2 Gaussian chaos (only E[:,i]E_{[:,i^{\prime}]} and E[:,j]E_{[:,j^{\prime}]} are random now) there are no “diagonal” terms. Therefore the Decoupling Theorem G.23 shows without loss of generality we can assume iji^{\prime}\neq j^{\prime}. Apply Theorem G.21 we know this term is bounded by O(n21+ϵ)O(n_{2}^{1+\epsilon}) with high probability for any ϵ>0\epsilon>0.

Case 2: One of i,ji^{\prime},j^{\prime} is in {i,j}\{i,j\}. Without loss of generality assume i{i,j}i^{\prime}\in\{i,j\} (the other case is symmetric). Again we first randomly pick E[:,i],E[:,j]E_{[:,i]},E_{[:,j]} and condition on the high probability event that E[:,i],E[:,j]2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}} (but this will also determine E[:,i]E_{[:,i^{\prime}]}). After the conditioning, only E[:,j]E_{[:,j^{\prime}]} is still random, and the inner-product can be rewritten as <\mboxmat(PP(E[:,i]E[:,j])E[:,i],E[:,j]>\left<\mbox{mat}(PP^{\top}(E_{[:,i]}\odot E_{[:,j]})E_{[:,i^{\prime}]},E_{[:,j^{\prime}]}\right> where the fixed vector \mboxmat(PP(E[:,i]E[:,j]))E[:,i]\mbox{mat}(PP^{\top}(E_{[:,i]}\odot E_{[:,j]}))E_{[:,i^{\prime}]} has norm bounded by PP(E[:,i]E[:,j])E[:,i]8n23/2\|PP^{\top}(E_{[:,i]}\odot E_{[:,j]})\|\|E_{[:,i^{\prime}]}\|\leq 8n_{2}^{3/2}. By property of Gaussian with high probability the inner-product is bounded by O(n23/2+ϵ)O(n_{2}^{3/2+\epsilon}) for any ϵ>0\epsilon>0.

Case 3: i,j{i,j}i^{\prime},j^{\prime}\in\{i,j\}. Since i,ji^{\prime},j^{\prime} cannot be equal to i,ji,j, there is only one possibility: i,ji^{\prime},j^{\prime} are both equal to one of i,ji,j and iji\neq j. Without loss of generality assume i=j=iji^{\prime}=j^{\prime}=i\neq j. We can swap i,ji,j with i,ji^{\prime},j^{\prime} and this actually becomes Case 2. By the same argument we know this term is bounded by O(n23/2+ϵ)O(n_{2}^{3/2+\epsilon}) for any ϵ>0\epsilon>0.

There are O(k2)O(k^{2}) terms in Case 1, O(k)O(k) terms in Case 2 and O(1)O(1) terms in Case 3. Therefore by union bound we know the sum is bounded by O(kn23/2+ϵ+k2n21+ϵ)O(kn_{2}^{3/2+\epsilon}+k^{2}n_{2}^{1+\epsilon}) with high probability. Recall we are assuming n2k2+Cn_{2}\geq k^{2+C} (which only requires nk1+C/2n\geq k^{1+C/2}). Choose ϵ\epsilon to be a small enough constant depending on CC gives the result. ∎

C.2 Unfolding 666-th Order Moments

Recall the second system of linear equations is

The second system of linear equations tries to unfold the 66-th order moment M6\overline{M}_{6} to get Y6Y_{6}. Similar to Theorem C.1 the following theorem guarantees that with high probability over the perturbation the system has a unique solution.

With high probability over the perturbation, the coefficient matrix H~6\widetilde{H}_{6} has smallest singular value σmin(H~6)Ω(ρ3(n/k)1.5)\sigma_{min}(\widetilde{H}_{6})\geq\Omega(\rho^{3}(n/k)^{1.5}). As a corollary, the system has a unique solution.

The proof of this theorem is very similar to the proof of Theorem C.1. Here we list the important steps and highlight the differences.

As before the theorem relies on a structural lemma (Lemma C.11), and a main lemma about the symmetric off-diagonal projection of a Kronecker product of three identical matrices (Lemma C.13).

Before stating the main lemma, we update the definition of symmetric off-diagonal subspace.

It is easy to verify that since the moments in M6\overline{M}_{6} all have indices corresponding to distinct variables, the projection F6\mathcal{F}_{6} is indeed symmetric off-diagonal. The constraints in this definition is closely related to the decoupling Theorem G.23 of Gaussian chaoses.

The proofs of Theorem C.10 are based on the above two lemmas. The proof of Lemma C.11 is essentially the same as Lemma C.2. The proof of Lemma C.13 is very similar to that of Lemma C.5, and we highlight the only different case below:

As before we try to prove that the columns of P(EkrEkrE)uniqP^{\top}(E\otimes_{kr}E\otimes_{kr}E)_{uniq} are incoherent. Recall we needed the following two claims:

For any 1i1i2i3k1\leq i_{1}\leq i_{2}\leq i_{3}\leq k, with high probability P(E[:,i1]E[:,i2]E[:,i3])2Ω(n23).\|P^{\top}(E_{[:,i_{1}]}\odot E_{[:,i_{2}]}\odot E_{[:,i_{3}]})\|^{2}\geq\Omega(n_{2}^{3}).

For any 1i1i2i3k1\leq i_{1}\leq i_{2}\leq i_{3}\leq k, with high probability

The first claim can still be proved by the projection Lemma G.18, except the vector E[:,i1]E[:,i2]E[:,i3]E_{[:,i_{1}]}\odot E_{[:,i_{2}]}\odot E_{[:,i_{3}]} is now O(n2)O(n_{2})-concentrated (the proof is an immediate generalization of Lemma G.19).

The second claim can be proved using similar ideas, however there is one new case. We again separate the terms according to the number of i1,i2,i3i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime} that do not appear in {i1,i2,i3}\{i_{1},i_{2},i_{3}\}.

Case 1: At least one of i1,i2,i3i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime} does not appear in {i1,i2,i3}\{i_{1},i_{2},i_{3}\}. Suppose there are tt of i1,i2,i3i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime} that do not appear in {i1,i2,i3}\{i_{1},i_{2},i_{3}\}, similar to before we first sample Ei1,Ei2,Ei3E_{i_{1}},E_{i_{2}},E_{i_{3}} and condition on the event that they all have norm at most 2n22\sqrt{n_{2}}. The inner-product then becomes an order tt Gaussian chaos with Frobenius norm n26t/2n_{2}^{6-t/2}. By Theorem G.23 and Theorem G.21 we know with high probability all these terms are bounded by n26t/2+ϵn_{2}^{6-t/2+\epsilon} for any constant ϵ>0\epsilon>0.

Case 2: All of i1,i2,i3i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime} appear in {i1,i2,i3}\{i_{1},i_{2},i_{3}\}. In the previous proof (of Lemma C.5), there was only one possibility and it reduces to Case 1. However for 66-th moment we have a new case: i=i1=i2=i1<i2=i3=i3=ji=i_{1}=i_{2}=i_{1}^{\prime}<i_{2}^{\prime}=i_{3}^{\prime}=i_{3}=j (and the symmetric case i1=i1=i2<i2=i3=i3i_{1}=i_{1}^{\prime}=i_{2}^{\prime}<i_{2}=i_{3}=i_{3}^{\prime}). For this we will treat T=PPT=PP^{\top} as a 66-th order tensor with Frobenius norm at most n23/2n_{2}^{3/2} (as a matrix it has spectral norm 1, and rank at most n23n_{2}^{3}). The tensor is applied to the vectors E[:,i]E_{[:,i]} and E[:,j]E_{[:,j]} as T(E[:,i],E[:,i],E[:,j],E[:,i],E[:,j],E[:,j])T(E_{[:,i]},E_{[:,i]},E_{[:,j]},E_{[:,i]},E_{[:,j]},E_{[:,j]}). First we sample E[:,i]E_{[:,i]}, by Lemma G.24 we know with high probability what remains will be a 33-rd order tensor T(E[:,i],E[:,i],I,E[:,i],I,I)T(E_{[:,i]},E_{[:,i]},I,E_{[:,i]},I,I) with Frobenius norm bounded by O(n22+ϵ)O(n_{2}^{2+\epsilon}). Notice that here it is important that Lemma G.24 can handle diagonal entries, because E[:,i]E_{[:,i]} appears on the 1,2,41,2,4-th coordinate (instead of the first three). We the apply Lemma G.24 again on T(E[:,i],E[:,i],I,E[:,i],I,I)(E[:,j],E[:,j],E[:,j])T(E_{[:,i]},E_{[:,i]},I,E_{[:,i]},I,I)(E_{[:,j]},E_{[:,j]},E_{[:,j]})The notation might be confusing here: T(E[:,i],E[:,i],I,E[:,i],I,I)T(E_{[:,i]},E_{[:,i]},I,E_{[:,i]},I,I) is a 33rd order tensor, and we are applying it to E[:,j],E[:,j],E[:,j]E_{[:,j]},E_{[:,j]},E_{[:,j]}. The whole expression is equal to T(E[:,i],E[:,i],E[:,j],E[:,i],E[:,j],E[:,j])T(E_{[:,i]},E_{[:,i]},E_{[:,j]},E_{[:,i]},E_{[:,j]},E_{[:,j]})., and conclude that with high probability the term is bounded by O(n22.5+2ϵ)O(n_{2}^{2.5+2\epsilon}) which is still much smaller than n23n_{2}^{3}.

Finally we take the sum over all terms and choose ϵ\epsilon to be small enough (depending on CC), then when k2+Cn2k^{2+C}\leq n_{2} the sum is a lower-order term. ∎

C.3 Stability Bounds

For the two linear equation systems in (7), we can write them in canonical form with coefficient matrices H~4,H~6\widetilde{H}_{4},\widetilde{H}_{6} and the unknown variable vec(Y4),vec(Y6)\text{vec}(Y_{4}),\text{vec}(Y_{6}), corresponding to the k2,k3k_{2},k_{3} distinct elements in symmetric Y4,Y6Y_{4},Y_{6}, namely:

When M^4,M^6\widehat{M}_{4},\widehat{M}_{6}, the empirical moment estimations for M~4,M~6\widetilde{M}_{4},\widetilde{M}_{6}, are used throughout the algorithm, both the coefficient matrices H~4,H~6\widetilde{H}_{4},\widetilde{H}_{6} and the constant terms M4,M6\overline{M}_{4},\overline{M}_{6} are affected by the noise from empirical estimation. In practice, instead of solving systems of linear equations, we solve the least square problem:

and the solution to the least square problems are given by: vec(Y^4)=H^4M^4\text{vec}(\widehat{Y}_{4})=\widehat{H}_{4}^{\dagger}\overline{\widehat{M}}_{4} and vec(Y^6)=H^6M^6\text{vec}(\widehat{Y}_{6})=\widehat{H}_{6}^{\dagger}\overline{\widehat{M}}_{6}.

Given the empirical 4-th and 6-th order moments M^4=M~4+E4\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}, M^6=M~6+E6\widehat{M}_{6}=\widetilde{M}_{6}+E_{6}, and suppose that the absolute value of entries in E4E_{4} and E6E_{6} are at most δ1\delta_{1}. Let U^\widehat{U} be the output of Step 1 for the span of the covariance matrices, and suppose that U^U~δ2\|\widehat{U}-\widetilde{U}\|\leq\delta_{2}. Suppose that δ1min{M~4F/n4,M~6F/n6}\delta_{1}\leq\min\{\|\widetilde{M}_{4}\|_{F}/\sqrt{n_{4}},\|\widetilde{M}_{6}\|_{F}/\sqrt{n_{6}}\}, and δ2min{1,σk2(H~4)/2,σk3(H~6)/2}\delta_{2}\leq\min\{1,\sigma_{k_{2}}(\widetilde{H}_{4})/2,\sigma_{k_{3}}(\widetilde{H}_{6})/2\}. Then, conditioned on the high probability event that both σk2(H~4),σk3(H~6)\sigma_{k_{2}}(\widetilde{H}_{4}),\sigma_{k_{3}}(\widetilde{H}_{6}) are bounded below, we have:

Recall that the coefficient matrix H~4\widetilde{H}_{4} corresponds to the composition of two linear mappings F4(UY4U)\mathcal{F}_{4}(UY_{4}U^{\top}) on the variable Y4Y_{4}. Since we have showed that F4\mathcal{F}_{4} is a projection determined by the Isserlis’ Theorem and independent of the empirical estimation of the moments, we can bound the perturbation on the coefficient matrices by:

Similarly, we have H^6H~6U^3U~37δ2H~6\|\widehat{H}_{6}-\widetilde{H}_{6}\|\leq\|\widehat{U}\odot^{3}-\widetilde{U}\odot^{3}\|\leq 7\delta_{2}\leq\|\widetilde{H}_{6}\|.

Therefore we can analyze the stability of the solution to the least square problems in (51) as follows:

where the first inequality is by applying Lemma G.6 and note that (M^4M~4)Fδ1n4M~4F\|(\overline{\widehat{M}}_{4}-\overline{\widetilde{M}}_{4})\|_{F}\leq\delta_{1}\sqrt{n_{4}}\leq\|\overline{\widetilde{M}}_{4}\|_{F}, the second inequality is because M~4FO(n4)\|\overline{\widetilde{M}}_{4}\|_{F}\leq O(\sqrt{n_{4}}), the third inequality is by applying the perturbation bound of pseudo-inverse in Theorem G.7, the fourth inequality is by the assumption that δ2\delta_{2} is sufficiently small compared to the smallest singular value of H~4\widetilde{H}_{4} thus σk2(H^4)=O(σk2(H~4))\sigma_{k_{2}}(\widehat{H}_{4})=O(\sigma_{k_{2}}(\widetilde{H}_{4})).

Appendix D Step 3: Tensor Decomposition

Given the estimations of the unfolded moments Y4Y_{4} and Y6Y_{6} from Step 2, and given the span of covariance matrices UU from Step 1, Step 3 use tensor decomposition to robustly find the parameters of the mixture of zero-mean Gaussians.

Recall that in the coordinate system with basis UU, the covariance matrices (vectorized with distinct entries) are given by Σ~(i)=U~σ~(i)\widetilde{\Sigma}^{(i)}=\widetilde{U}\widetilde{\sigma}^{(i)} for all ii. The unfolded moments in the same coordinate system are:

We will apply tensor decomposition algorithm to find the σ~(i)\widetilde{\sigma}^{(i)}’s. We restate the theorem for orthogonal symmetric tensor decomposition in Anandkumar et al. Anandkumar et al. (2014) below:

In order to reduce our problem to the orthogonal tensor decomposition so that the tensor power method (Algorithm 1, page 21 in Anandkumar et al. (2014)) can be applied, we use the same “whitening” technique as in Anandkumar et al. (2014). We first compute the SVD of the unfolded 4-th moments Y~4=V~2Λ~2V~2\widetilde{Y}_{4}=\widetilde{V}_{2}\widetilde{\Lambda}_{2}\widetilde{V}_{2}^{\top}, then use the singular vectors to transform the unfolded 6-th moments Y6Y_{6} into an orthogonal symmetric tensor Y~6(V~2Λ~21/2,V~2Λ~21/2,V~2Λ~21/2)\widetilde{Y}_{6}(\widetilde{V}_{2}\widetilde{\Lambda}_{2}^{-1/2},\widetilde{V}_{2}\widetilde{\Lambda}_{2}^{-1/2},\widetilde{V}_{2}\widetilde{\Lambda}_{2}^{-1/2}).

Next we complete the stability analysis for the two-step procedure, i.e. whitening and orthogonal tensor decomposition, which was not analyzed in Anandkumar et al. (2014).

There exists an algorithm that finds a^i\widehat{a}_{i} and ω^i\widehat{\omega}_{i} in polynomial (in variables (n,k,1/σmin(G2))(n,k,1/\sigma_{min}(G_{2}))) running time with the following guarantee: with probability at least 1en1-e^{-n}, for some permutation π\pi over [k][k] and for all i[k]i\in[k] we have:

Note th at in the exact case with G2G_{2} and G3G_{3}, we have that:

where λi=ωi1/2\lambda_{i}=\omega_{i}^{-1/2}, and the vectors vi=λi1V2Λ21/2aiv_{i}=\lambda_{i}^{-1}V_{2}^{\top}\Lambda_{2}^{-1/2}a_{i} and they are orthonormal. Also note that λmin1\lambda_{min}\geq 1 and λmaxωo1/2\lambda_{max}\leq\omega_{o}^{-1/2}. We can then apply orthogonal tensor decomposition (Algorithm 1 in Anandkumar et al. (2014)) to G^\widehat{G} to robustly obtain estimations of viv_{i}’s and λi\lambda_{i}’s. After obtaining the estimation v^i\widehat{v}_{i} and λ^i\widehat{\lambda}_{i}’s, we can further obtain the estimation of aia_{i}’s and ωi\omega_{i}’s as:

The estimation of the vectors and weights are given in (52). In order to bound the distance a^iai\|\widehat{a}_{i}-a_{i}\| and ω^iωi\|\widehat{\omega}_{i}-\omega_{i}\|, we show the stability of the estimation V^2\widehat{V}_{2}, Λ^2\widehat{\Lambda}_{2}, and v^i\widehat{v}_{i}, λ^i\widehat{\lambda}_{i} separately.

First, note that by assumption G^2G2Fδ2\|\widehat{G}_{2}-G_{2}\|_{F}\leq\delta_{2}, we can apply Lemma G.2 and Lemma G.3 to bound the singular values and the singular vectors of G^2\widehat{G}_{2} by:

Define X=V2Λ21/2X=V_{2}\Lambda_{2}^{-1/2} and define ΔX=X^X\Delta_{X}=\widehat{X}-X. By the assumption that δ2o(γmin)\delta_{2}\leq o(\gamma_{min}), we have V^2V21\|\widehat{V}_{2}-V_{2}\|\leq 1 and Λ^21/2Λ21/2Λ21/2γmin1/2\|\widehat{\Lambda}_{2}^{-1/2}-\Lambda_{2}^{-1/2}\|\leq\|\Lambda_{2}^{-1/2}\|\leq\gamma_{min}^{-1/2}. Therefore we can apply Lemma G.6 to bound ΔX\|\Delta_{X}\|:

Moreover, since δ2o(γmin)\delta_{2}\leq o(\gamma_{min}), we also have ΔXX=γmin0.5\|\Delta_{X}\|\leq\|X\|=\gamma_{min}^{-0.5}.

Next, we bound the distance G^G\|\widehat{G}-G\|. Recall that G^=G^3(X^,X^,X^)\widehat{G}=\widehat{G}_{3}(\widehat{X},\widehat{X},\widehat{X}). Using the fact that tensor is a multi-linear operator, and by the assumption that G^3G3δ3\|\widehat{G}_{3}-G_{3}\|\leq\delta_{3}, we have:

Note that by the assumption δ2o(γmin2.5kG3)\delta_{2}\leq o({\gamma_{min}^{2.5}\over k\|G_{3}\|}), δ3o(γmin1.5k)\delta_{3}\leq o({\gamma_{min}^{1.5}\over k}), we have ϵo(1k)\epsilon\leq o({1\over k}). Therefore we can apply Theorem D.2 to conclude that with probability at least 1en1-e^{-n} (over the randomness of the randomized algorithm itself), the tensor power algorithm runs in time poly(n,k,1/λmin)\text{poly}(n,k,1/\lambda_{min}) and for some permutation π\pi over [k][k] it returns:

Finally, since we also have 5ϵ1/2λmin/25\epsilon\leq 1/2\leq\lambda_{min}/2 we can bound the estimation error of a^i\widehat{a}_{i} and ω^i\widehat{\omega}_{i} as defined in (52) by:

Now we can apply Theorem D.2 to our case.

Given Y^4\widehat{Y}_{4}, Y^6\widehat{Y}_{6}, U^\widehat{U} and suppose that Y^4Y~4F\|\widehat{Y}_{4}-\widetilde{Y}_{4}\|_{F}, Y^6Y~6F\|\widehat{Y}_{6}-\widetilde{Y}_{6}\|_{F} as well as U^U~\|\widehat{U}-\widetilde{U}\| are bounded by some inverse poly(n,k,1/ωo,1/ρ)δinverse\ poly(n,k,1/\omega_{o},1/\rho)\delta. There exists an algorithm that with high probability, returns Σ^(i)\widehat{\Sigma}^{(i)}’s and ω^i\widehat{\omega}_{i}’s such that for some permutation π\pi over [k][k], we have the distance Σ^(i)Σ~(i)\|\widehat{\Sigma}^{(i)}-\widetilde{\Sigma}^{(i)}\| and ω^iω~i\|\widehat{\omega}_{i}-\widetilde{\omega}_{i}\| are bounded by δ\delta. Moreover, the running time of the algorithm is upperbounded by poly(n,k,1/ωo,1/ρ)\text{poly}(n,k,1/\omega_{o},1/\rho).

We apply Theorem D.2, and pick G2=Y~4G_{2}=\widetilde{Y}_{4}, G3=Y~6G_{3}=\widetilde{Y}_{6}. We only need to verify that Y~6\|\widetilde{Y}_{6}\| and 1/σmin(Y~4)1/\sigma_{min}(\widetilde{Y}_{4}) are polynomials of the relevant parameters. This is easy to see, since σmin(Y~4)ωoσmin(Σ~)2\sigma_{min}(\widetilde{Y}_{4})\geq\omega_{o}\sigma_{min}(\widetilde{\Sigma})^{2}, and the matrix Σ~\widetilde{\Sigma} is a perturbed rectangular matrix which by Lemma G.15 has σmin(Σ~)Ω(ρn2)\sigma_{min}(\widetilde{\Sigma})\geq\Omega(\rho\sqrt{n_{2}}) with high probability.

Finally, given σ^(i)\widehat{\sigma}^{(i)}, and given the output of Step 2, i.e. U^\widehat{U}, with inverse polynomial accuracy, we can recover Σ^(i)=U^σ^(i)\widehat{\Sigma}^{(i)}=\widehat{U}\widehat{\sigma}^{(i)} up to accuracy polynomial in the relevant parameters. ∎

Appendix E Proofs of Theorem 3.5

The results in all previous sections showed the correctness and robustness of each individual step for the algorithm for zero-mean case, In this section, we summarize those results to prove that the overall algorithm has polynomial time/sample complexity.

Given NN samples x1,,xNx_{1},\dots,x_{N} drawn i.i.d. from the nn-dimensional mixture of kk Gaussians, if Nn7/δ2N\geq n^{7}/\delta^{2}, then with high probability, we have that for all j1,,j6[n]j_{1},\dots,j_{6}\in[n]:

Let xx denote the random vector of this mixture of Gaussians. We first truncate its tail probabilities to make all the entries ([x]j[x]_{j} for j[n]j\in[n]) in the vector xx be in the range [n,n][-\sqrt{n},\sqrt{n}]. Apply union bound, we know that with high probability (at least 1O(en)1-O(e^{-n})), for all indices j1,,j6[n]j_{1},\dots,j_{6}\in[n], we have \Big{|}[x]_{j_{1}}\dots[x]_{j_{6}}\Big{|}\leq n^{3}. Then we can apply Hoeffding’s inequality to bound the empirical moments by:

We show that, to achieve ϵ\epsilon accuracy in the output of Step 3 in the algorithm for the zero-mean case, the number of samples we need to estimate the moments M4M_{4} and M6M_{6} is bounded by a polynomial of relevant parameters, namely poly(n,k,1/ωo,1/ϵ,1/ρ)\text{poly}(n,k,1/\omega_{o},1/\epsilon,1/\rho), and each step of the algorithm can be done in polynomial time.

We backtrack the input-output relations from Step 3 to Step 2 and to Step 1, and we show that the estimation error in the empirical moments and the inputs / outputs only polynomially propagate throughout the steps.

First note that we have shown that every steps fails with negligible probability (O(enC)O(e^{-n^{C}}) for any absolute constant CC). Then apply union bound, we have that the entire algorithm works correctly with high probability.

By Lemma D.3, in order to achieve ϵ\epsilon accuracy in the final estimation of the mixing weights and the covariance matrices, we need to drive the input accuracy of Step 3 (also the output accuracy of Step 2) to be bounded by some inverse polynomial in (n,1/ϵ,1/ρ,1/ωo)(n,1/\epsilon,1/\rho,1/\omega_{o}), Also recall that this step has running time poly(n,k,1/ρ,1/ωo)\text{poly}(n,k,1/\rho,1/\omega_{o}).

Theorem C.1 and Theorem C.10 guarantee that with smoothed analysis σmin(H~4)\sigma_{min}(\widetilde{H}_{4}) and σmin(H~6)\sigma_{min}(\widetilde{H}_{6}) are lower bounded polynomially. Then by Lemma C.16, in order to have the output accuracy of Step 2 be bounded by inverse poly(n,1/ϵ,1/ρ,1/ωo)\text{poly}(n,1/\epsilon,1/\rho,1/\omega_{o}), we need to drive the input accuracy of Step 2 (U^\widehat{U}, M^4\widehat{M}_{4}) to be bounded by some other inverse polynomial. Step 2 involves solving linear systems of dimension n4k2n_{4}k_{2} and n6k3n_{6}k_{3}, thus it running time is polynomial.

Lemma B.13 and B.14 guarantees that with smoothed analysis σk(Q~U)\sigma_{k}(\widetilde{Q}_{U}) is lower bounded polynomially. Then by Lemma B.15, in order to have the output accuracy of Step 1 (c) (U^\widehat{U}) be bounded by inverse polynomial, we need to drive the input accuracy (output S^i\widehat{S}_{i} of Step 1 (a) and output U^i\widehat{U}_{i} of Step 1 (b) ) to be bounded by some other inverse polynomial. Step 1 (c) involves multiplications and factorization of matrices of polynomial size, and thus the running time is also polynomial.

Lemma B.7 guarantees that with smoothed analysis σk(Q~US)\sigma_{k}(\widetilde{Q}_{U_{S}}) is lower bounded polynomially. Then by Lemma B.10, in order to have the output accuracy of Step 1 (b) (U^S\widehat{U}_{S}) be bounded by inverse polynomial, we need to drive the input accuracy (output S^i\widehat{S}_{i} of Step 1 (a) ) to be bounded by some other inverse polynomial. Step 1 (b) involves multiplications and factorization of matrices of polynomial size, and thus the running time is also polynomial.

Lemma B.1 guarantees that with smoothed analysis σk(Q~S)\sigma_{k}(\widetilde{Q}_{S}) is lower bounded by inverse polynomial. Then by Lemma B.3, in order to have the output accuracy of Step 1 (a) (S^\widehat{S}) be bounded by inverse polynomial, we need to drive the input accuracy (the moment estimation M^4\widehat{M}_{4}) to be bounded by some other inverse polynomial. Step 1 (a) involves multiplications and factorization of matrices of polynomial size, and thus the running time is also polynomial.

Finally, by Lemma E.1, in order to have the accuracy of moment estimation (M^4,M^6)(\widehat{M}_{4},\widehat{M}_{6}) be bounded by inverse polynomial, we need the number of samples NN polynomial in all the relevant parameters, including kk.

Appendix F General Case

In this section, we present the algorithm for learning mixture of Gaussians with general means. The algorithm generalizes the insights obtained from the algorithm for the zero-mean case. The steps are very similar, and we will highlight the differences.

In this step, we find the following two subspaces:

This is very similar to Step 1 in the algorithm for the zero-mean case, and can be achieved in three small steps:

Step 1 (a). For a subset H\mathcal{H} of size 12n12\sqrt{n}, find the span S\mathcal{S} of the mean vectors and a subset of columns of the covariance matrices:

Step 1 (b). Find the span of covariance matrices projected to the subspace SS^{\perp}:

Step 1 (c). Run 1(a) and 1(b) on two disjoint subsets H1\mathcal{H}_{1} and H2\mathcal{H}_{2}. Merge the two spans U1U_{1} and U2U_{2} to get Z~\widetilde{Z} and span{ProjZ~Σ~(i):i[k]}span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}:i\in[k]\}.

Next, we discuss each small step and compare it with the similar analysis of the algorithm for the zero-mean case.

Step 1 (a). Find the span 𝒮𝒮\mathcal{S} of the means and a subset of the columns of the covariance matrices

Similar to Step 1 (a) for the zero-mean case, in this step we want to find a subspace S\mathcal{S} which contains the span of a subset of columns of Σ~(i)\widetilde{\Sigma}^{(i)}’s. However, with the mean vector μ~(i)\widetilde{\mu}^{(i)}’s appearing in the moments, the subspace we find also contains the span of all the mean vectors. In particular, for a subset H[n]\mathcal{H}\in[n] with H=n|\mathcal{H}|=\sqrt{n}, we aim to find the following subspace:

Similar to Claim 5.1 for the zero-mean case, the key observation for finding the subspace is the structure of the one-dimensional slices of the 44-th order moments for the general case:

For any indices j1,j2,j3[n]j_{1},j_{2},j_{3}\in[n], the one-dimensional slices of M~4\widetilde{M}_{4} are given by:

The proof of this step is similar to the Lemmas B.1 (for smoothed analysis) and B.3 (for stability analysis). The main difference is that in the matrix B~\widetilde{B} defined in the structural Claim B.2, there is now another block B~(0)\widetilde{B}^{(0)} with kk columns that corresponds to the μ~(i)\widetilde{\mu}^{(i)} directions, which we can again handle with Lemma G.12.

Lemma F.2 shows the deterministic conditions for Step 1 (a) to correctly identify the subspace S\mathcal{S} from the columns of Q~S\widetilde{Q}_{S}, and uses smoothed analysis to show that the conditions hold with high probability.

Given M~4\widetilde{M}_{4} of a general mixture of Gaussians , for any subset H[n]\mathcal{H}\in[n] and H=c2k|\mathcal{H}|=c_{2}{k} with the constant c2>9c_{2}>9, let Q~S\widetilde{Q}_{S} be the matrix defined as in (55). The columns of Q~S\widetilde{Q}_{S} give the desired span SS defined in (53) if the matrix Q~S\widetilde{Q}_{S} achieves the maximal column rank k+kHk+k|\mathcal{H}|. With probability (over the ρ\rho-perturbation) at least 1Cϵ0.5n1-C\epsilon^{0.5n} for some constant CC, the k(1+H)k(1+|\mathcal{H}|)-th singular value of Q~S\widetilde{Q}_{S} is bounded below by:

Note that the dimension of the subspace S\mathcal{S} is at most k(H+1)<n/3k(|\mathcal{H}|+1)<n/3. Then we show by the Claim about the moment structure that the matrix Q~S\widetilde{Q}_{S} can be written as a product of P~S\widetilde{P}_{S} and some coefficient matrix B~S\widetilde{B}_{S}. Then we bound the smallest singular value of the two matrices P~S\widetilde{P}_{S} and B~S\widetilde{B}_{S} via smoothed analysis separately. The coefficient matrix B~S\widetilde{B}_{S} is slightly different than that in the zero-mean case, but has similar block-diagonal structure properties.

Similar to structural property in Claim B.2 for the zero-mean case, we can write the matrix Q~S\widetilde{Q}_{S} in a product form:

We will bound the smallest singular value for each of the factor, and apply union bound to conclude the lower bound of σk(1+H)(Q~S)\sigma_{k(1+|\mathcal{H}|)}(\widetilde{Q}_{S}).

In order to lower bound σmin(B~S)\sigma_{min}(\widetilde{B}_{S}), we first analyze the structure of this coefficient matrix. The matrix B~S\widetilde{B}_{S} has the following block structure:

Note that we can write the block B~(0)\widetilde{B}^{(0)} as:

where it is easy to see the first summand (μ~H(3)μ~H(2)+Σ~H(3),H(2))μ~H(1)(\widetilde{\mu}_{\mathcal{H}^{(3)}}\odot\widetilde{\mu}_{\mathcal{H}^{(2)}}+\widetilde{\Sigma}_{\mathcal{H}^{(3)},\mathcal{H}^{(2)}})\odot\widetilde{\mu}_{\mathcal{H}^{(1)}} is a linear combination of the columns of the block diagonal matrix B~(1)\widetilde{B}^{(1)}, and similarly the second and third summands are linear combinations of the columns of B~(2)\widetilde{B}^{(2)} and B~(3)\widetilde{B}^{(3)}, and the last summand is simply 2B~0(0)-2\widetilde{B}_{0}^{(0)}. Therefore for some absolute constant CC (the smallest singular value corresponding to the linear transformation) we have that:

Note that B~0(0)=μ~H(3)μ~H(2)μ~H(1)\widetilde{B}_{0}^{(0)}=\widetilde{\mu}_{\mathcal{H}^{(3)}}\odot\widetilde{\mu}_{\mathcal{H}^{(2)}}\odot\widetilde{\mu}_{\mathcal{H}^{(1)}} only depends on the randomness over the mean vectors. Note that the Khatri-Rao product is a submatrix of the Kronecker product, therefore for tall matrices Q1Q_{1} and Q2Q_{2}, we have that σmin(Q1Q2)σmin(Q1krQ2)=σmin(Q1)σmin(Q2)\sigma_{min}(Q_{1}\odot Q_{2})\leq\sigma_{min}(Q_{1}\otimes_{kr}Q_{2})=\sigma_{min}(Q_{1})\sigma_{min}(Q_{2}). In particular, we can bound the smallest singular value of B~0(0)\widetilde{B}_{0}^{(0)} with high probability (at least 1Cϵ0.5n1-C\epsilon^{0.5n}) as follows:

Then condition on the value of the means, we further exploit the randomness over the covariance matrices to lower bound σkH(ProjB~0(0)[B~(1),B~(2),B~(3)])\sigma_{k|\mathcal{H}|}\left(\text{Proj}_{\widetilde{B}_{0}^{(0)\perp}}[\widetilde{B}^{(1)},\widetilde{B}^{(2)},\widetilde{B}^{(3)}]\right). It is almost the same as the argument of the proof for Proposition B.1. For example, compared to (18) we have the following inequality instead:

and note that any block in B~(0)\widetilde{B}^{(0)} is independent of the randomness of covariance matrices, and we have (H/3)2k2kH/32k(|\mathcal{H}|/3)^{2}-k-2k|\mathcal{H}|/3\geq 2k. Similar modifications apply to the inequalities in (20),(21).

Finally by the argument of Lemma G.12 we can bound σmin(B~S)\sigma_{min}(\widetilde{B}_{S}) with probability at least 1Cϵ0.5n1-C\epsilon^{0.5n} (over the randomness of both the perturbed means and covariance matrices):

as we assume ρ\rho to be small perturbation and ρϵn<1\rho\epsilon\sqrt{n}<1.

Step 1 (b). Find the projected span of covariance matrices

Given the subspace S=span{μ~(i),Σ~[:,H](i):i[k]}\mathcal{S}=span\{\widetilde{\mu}^{(i)},\widetilde{\Sigma}_{[:,\mathcal{H}]}^{(i)}:i\in[k]\} obtained from Step 1 (a), Step 1(b) finds the span of the covariance matrices with the columns projected to SS^{\perp}, namely:

This is in parallel with Step 1 (b) for the zero-mean case, and we rely on the structure of the two-dimensional slices of M~4\widetilde{M}_{4} to find the span of the projected covariance matrices. Similar to Claim B.6 for the zero-mean case, the following claim shows how the structure of the two-dimensional slices is related to the desired span.

For a mixture of general Gaussians, the two-dimensional slices of M~4\widetilde{M}_{4} are given by:

Note that given the set of indices H\mathcal{H} we chose in Step 1 (a) and the subspace SS, if we pick the indices j1,j2Hj_{1},j_{2}\in\mathcal{H}, project the two-dimensional slice to SS^{\perp}, all the rank one terms in the sum are eliminated and the projected slice lies in the desired span US\mathcal{U}_{S}:

Applying the same argument as in Lemma B.7 for the zero-mean case, we can show that with high probability over the perturbation, all the projected slices span the subspace US\mathcal{U}_{S}.

Step 1 (c). Merge the two projections of covariance matrices

Pick two disjoint index set H1\mathcal{H}_{1} and H2\mathcal{H}_{2} and repeat the previous two steps 1 (a) and 1 (b), we can obtain the two spans U1U_{1} and U2U_{2}, corresponding to the subspace of the covariance matrices projected to S1\mathcal{S}_{1} and S2\mathcal{S}_{2}, respectively.

In this step, we apply similar techniques as in Step 1 (c) for the zero-mean case to merge the two spans U1U_{1} and U2U_{2}: we first use the overlapping part of the two projections ProjS1\text{Proj}_{S_{1}^{\perp}} and ProjS2\text{Proj}_{S_{2}^{\perp}} to align the basis of U1U_{1} and U2U_{2}, then merge the two spans using the same basis.

Note that for the general case, by definition the span of the mean vectors Z~\widetilde{Z} lie in both subspaces S1\mathcal{S}_{1} and S2\mathcal{S}_{2}, therefore we have S1Z~\mathcal{S}_{1}^{\perp}\subset\widetilde{Z}^{\perp} and S2Z~\mathcal{S}_{2}^{\perp}\subset\widetilde{Z}^{\perp}. We can show that S1S2=Z~\mathcal{S}_{1}^{\perp}\cup\mathcal{S}_{2}^{\perp}=\widetilde{Z}^{\perp} by lower bounding σnk([ProjS1,ProjS2])\sigma_{n-k}([\text{Proj}_{\mathcal{S}_{1}^{\perp}},\text{Proj}_{\mathcal{S}_{2}^{\perp}}]) with high probability, similar to that in (47). This gives us the span of the mean vectors Z~\widetilde{Z}.

Moreover, in the general case, from merging U1U_{1} and U2U_{2} we are only able to find the span of covariance matrices projected to the subspace Z~\widetilde{Z}^{\perp}. In particular, we can follow Lemma B.11 and Lemma B.15 in Step 1 (c) for the zero-mean case to show that for the general case, we can merge U1U_{1} and U2U_{2} to obtain the span span{ProjZ~Σ~(i):i[k]span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}:i\in[k]. By further projecting the span to Z~\widetilde{Z}^{\perp} from the right side, we can also obtain Σ~o=span{ProjZ~Σ~(i)ProjZ~:i[k]}\widetilde{\Sigma}_{o}=span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}:i\in[k]\}.

Step 2. Find the covariance matrices in the subspace orthogonal to the means

Given the subspace Z~\widetilde{Z} and Σ~o=span{ProjZ~Σ~(i)ProjZ~:i[k]}\widetilde{\Sigma}_{o}=span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}:i\in[k]\} obtained from Step 1, Step 2 applies the zero-mean case algorithm to find the covariance matrices projected to the subspace Z~\widetilde{Z}^{\perp}, i.e., ProjZ~Σ~(i)ProjZ~\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}’s, as well as find the mixing weights ω~i\widetilde{\omega}_{i}’s.

This follows the same arguments as in Step 2 and Step 3 for the zero mean case. Consider projecting all the samples to Z~\widetilde{Z}^{\perp}, the subspace orthogonal to all the means. In this subspace, the samples are like from a mixture of zero-mean Gaussians with the projected covariance matrices, and the 44-th and 66-th order moment are given by M~4(ProjZ~,ProjZ~,ProjZ~,ProjZ~)\widetilde{M}_{4}(\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}}) and M~6(ProjZ~,ProjZ~,ProjZ~,ProjZ~,ProjZ~,ProjZ~)\widetilde{M}_{6}(\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}}). Since Z~\widetilde{Z} is of dimension kk, the dimension of the zero-mean Gaussian in the projected space is at least nk=O(n)n-k=O(n).

Note that the subspace Z~\widetilde{Z}^{\perp} only depends on the randomness of the means, and random perturbation on the covariance matrices is independent of that of μ~\widetilde{\mu}. The smoothed analysis for the moment unfolding in Step 2 and tensor decomposition in Step 3 for the zero-mean case, which only depend on the randomness of the covariance matrices, still go through in the projected space.

Step 3. Find the means

This step finds the mean vectors based on the outputs of the previous steps. The key observation for this step is about the structure of the 3-rd order moments in the following claim:

The following lemma shows how to extract the means μ~(i)\widetilde{\mu}^{(i)}’s from M~3(1)\widetilde{M}_{3(1)} using the information of the covariance matrices projected to the subspace orthogonal to the means, i.e. Σ~o\widetilde{\Sigma}_{o}, and the mixing weights ω~i\widetilde{\omega}_{i}’s.

The mean μ~(i)\widetilde{\mu}^{(i)} of the ii-th component can be obtained by:

This step correctly finds the means if the Σ~o\widetilde{\Sigma}_{o} is full rank with good condition number, and this holds with high probability over the perturbation.

The basic idea is that since Σ~o\widetilde{\Sigma}_{o} lies in the span of P~=ProjZ~krProjZ~\widetilde{P}=\text{Proj}_{\widetilde{Z}^{\perp}}\otimes_{kr}\text{Proj}_{\widetilde{Z}^{\perp}}, and the last three summands in the parenthesis in (57) all lie in span{InkrProjZ~, ProjZ~krIn}=span{P~}span\{I_{n}\otimes_{kr}\text{Proj}_{\widetilde{Z}},\ \text{Proj}_{\widetilde{Z}}\otimes_{kr}I_{n}\}=span\{\widetilde{P}^{\perp}\}. Therefore hitting the matrix M~3(1)\widetilde{M}_{3(1)} with Σ~o\widetilde{\Sigma}_{o}^{\dagger} from the right will eliminate those summands and pull out only the mean vectors.

Recall that the columns of the matrix Σ~o\widetilde{\Sigma}_{o} are vec(ProjZ~Σ~(i)ProjZ~)=P~vec(Σ~(i))\text{vec}(\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}})=\widetilde{P}\text{vec}(\widetilde{\Sigma}^{(i)})’s, and the columns of Σ~\widetilde{\Sigma} are vec(Σ~(i))\text{vec}(\widetilde{\Sigma}^{(i)})’s.

Note that T~=(P~Σ~)=P~Σ~\widetilde{T}=(\widetilde{P}{\widetilde{\Sigma}})^{{\dagger}\top}=\widetilde{P}{\widetilde{\Sigma}}^{{\dagger}\top}, and the columns of T~\widetilde{T} lie in span{P~}span\{\widetilde{P}\}. Also note that for all i,j[k]i,j\in[k] the vectors μ~(i)μ~(i)\widetilde{\mu}^{(i)}\odot\widetilde{\mu}^{(i)}, Σ~[:,j](i)μ~(i)\widetilde{\Sigma}^{(i)}_{[:,j]}\odot\widetilde{\mu}^{(i)} and μ~(i)Σ~[:,j](i)\widetilde{\mu}^{(i)}\odot\widetilde{\Sigma}^{(i)}_{[:,j]} all lie in the subspace span{InkrProjZ~, ProjZ~krIn}=span{P~}span\{I_{n}\otimes_{kr}\text{Proj}_{\widetilde{Z}},\ \text{Proj}_{\widetilde{Z}}\otimes_{kr}I_{n}\}=span\{\widetilde{P}^{\perp}\}. Therefore these terms will be eliminated if we multiply the columns of T~\widetilde{T} to the right of M~3(1)\widetilde{M}_{3(1)}. For the first term μ~j(i)vec(Σ~(i))\widetilde{\mu}_{j}^{(i)}\text{vec}(\widetilde{\Sigma}^{(i)}), since vec(Σ~(j))T~[:,i]=(P~vec(Σ~(j)))T~[:,i]=1[i=j]\text{vec}(\widetilde{\Sigma}^{(j)})^{\top}\widetilde{T}_{[:,i]}=(\widetilde{P}\text{vec}(\widetilde{\Sigma}^{(j)}))^{\top}\widetilde{T}_{[:,i]}=1_{[i=j]}. Therefore, we have M~3(1)T~[:,i]=ω~iμ~(i)\widetilde{M}_{3(1)}\widetilde{T}_{[:,i]}=\widetilde{\omega}_{i}\widetilde{\mu}^{(i)}.

The smoothed analysis for the correctness of this step is easy. We only need to show that both Σ~o{\widetilde{\Sigma}}_{o} and Σ~{\widetilde{\Sigma}} robustly have full column rank with high probability over perturbation of the covariance matrices, and thus the pseudo-inverse T~\widetilde{T} is well defined. This follows from Lemma G.15.

Finally, the stability analysis for this step is also straightforward using the perturbation bound for pseudo-inverse in Theorem G.7. ∎

Step 4. Find the unprojected covariance matrices

Note that by definition Z~=span{μ~(i):i[k]}\widetilde{Z}=span\{\widetilde{\mu}^{(i)}:i\in[k]\}, the projected covariance ProjZ~(Σ~(i))\text{Proj}_{\widetilde{Z}^{\perp}}(\widetilde{\Sigma}^{(i)}) we obtained in Step 2 is also equal to ProjZ~(Σ~(i)+μ~(i)(μ~(i)))\text{Proj}_{\widetilde{Z}^{\perp}}(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}). In Step 4 we try to recover the missing part of the covariance matrices in the subspace Z~\widetilde{Z}. Note that since we have also obtained the means in Step 3, it is equivalent to finding (Σ~(i)+μ~(i)(μ~(i)))(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}) for all ii. We will show that if we can find the span{(Σ~(i)+μ~(i)(μ~(i))):i[k]}span\{(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}):i\in[k]\}, the projected vector ProjZ~(Σ~(i)+μ~(i)(μ~(i)))\text{Proj}_{\widetilde{Z}^{\perp}}(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}) can be used as anchor to pin down the unprojected vector.

They key observation for finding the span of span{(Σ~(i)+μ~(i)(μ~(i))):i[k]}span\{(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}):i\in[k]\} is to first construct a 4-th order tensor M~4\widetilde{M}^{\prime}_{4} which corresponds to the 4-th moment of a mixture of zero-mean Gaussians with covariance matrices (Σ~(i)+μ~(i)(μ~(i)))(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}), and then follow Step 1 in the algorithm for zero-mean case to find the span of the covariance matrices for this new mixture of Gaussians.

The next lemma shows how to construct such 4-th order tensor:

Given the 44-th moment M~4\widetilde{M}_{4} for a mixture of Gaussians with parameters {ω~i,μ~(i),Σ~(i)}\{\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)}\}, define the 4-th order tensor M~4\widetilde{M}^{\prime}_{4} to be:

then M~4\widetilde{M}^{\prime}_{4} is equal to the 44-th moment of a mixture Gaussians with parameters {ω~i,0,Σ~(i)+μ~(i)(μ~(i))}\{\widetilde{\omega}_{i},0,\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}\}.

The proof follows directly from Isserlis’ Theorem. Therefore we can repeat Step 1 in the zero-mean case here to find the span of the space {vec(Σ~(i))+μ~(i)μ~(i):i[k]}\{\text{vec}(\widetilde{\Sigma}^{(i)})+\widetilde{\mu}^{(i)}\odot\widetilde{\mu}^{(i)}:i\in[k]\}. Since we also know the projection of Σ~(i)\widetilde{\Sigma}^{(i)}’s in a large subspace (in the subspace ProjZ~krProjZ~\text{Proj}_{\widetilde{Z}^{\perp}}\otimes_{kr}\text{Proj}_{\widetilde{Z}^{\perp}} obtained from Step 2), we can easily recover Σ~(i)\widetilde{\Sigma}^{(i)}’s:

Further, this procedure is stable if σmin(PS)\sigma_{min}(P^{\top}S) is lower bounded.

This is a special case of the Step 1 (c) where we merge two projections of an unknown subspace.

The span SS is equal to UVUV for some unknown matrix VV. We can compute V=(PU)PSV=(P^{\top}U)^{\dagger}P^{\top}S, and hence U=SV1=S(PS)(PU)U=SV^{-1}=S(P^{\top}S)^{\dagger}(P^{\top}U). The stability analysis is similar (and simpler than) Lemma B.11. ∎

We will apply this lemma to where the subspace PP is ProjZ~krProjZ~\text{Proj}_{\widetilde{Z}^{\perp}}\otimes_{kr}\text{Proj}_{\widetilde{Z}^{\perp}}. Since the perturbation of the means and the covariance matrices are independent, we can lower bound the smallest singular value of PSP^{\top}S.

F.1 Proof Sketch of the Main Theorem 3.4

The proof follows the same strategy as Theorem 3.5. First we apply the union bound to all the smoothed analysis lemmas, this will ensure the matrices we are inverting all have good condition number, and the whole algorithm is robust to noise.

Then in order to get the desired accuracy ϵ\epsilon, we need to guarantee inverse polynomial accuracy in different steps (through the stability lemmas). The flow of the algorithm is illustrated in Figure 5. In the end all the requirements becomes a inverse polynomial accuracy requirement on M^4\widehat{M}_{4} and M^6\widehat{M}_{6}, which we obtain by Lemma E.1.

Appendix G Matrix Perturbation, Concentration Bounds and Auxiliary Lemmas

In this section we collect known results on matrix perturbation and concentration bounds. In general, matrix perturbation bounds are the key for the perturbation lemmas, and concentration bounds are crucial for the smoothed analysis lemmas. We also prove some corollaries of known results that are very useful in our settings.

Given a matrix A^=A+E\widehat{A}=A+E where EE is a small perturbation, how does the singular values and singular vectors of AA change? This is a well-studied problem and many results can be found in Stewart and Sun Stewart (1977). Here we review some results used in this paper, and prove some corollaries.

Given A^=A+E\widehat{A}=A+E, the perturbation in individual singular values can be bounded by Weyl’s theorem:

Given A^=A+E\widehat{A}=A+E, we know σk(A)Eσk(A^)σk(A)+E\sigma_{k}(A)-\|E\|\leq\sigma_{k}(\widehat{A})\leq\sigma_{k}(A)+\|E\|.

For singular vectors, the perturbation is bounded by Wedin’s Theorem:

Let A^=A+E\widehat{A}=A+E, with analogous singular value decomposition. Let Φ\Phi be the matrix of canonical angles between the column span of U1U_{1} and that of U^1\widehat{U}_{1}, and Θ\Theta be the matrix of canonical angles between the column span of V1V_{1} and that of V^1\widehat{V}_{1}. Suppose that there exists a δ\delta such that

We do not go into the definition of canonical angles here. The only way we will be using this lemma is by combining it with the following:

Let Φ\Phi be the matrix of canonical angles between the column span of UU and that of U^\widehat{U}, then

Moreover, if EFσk(A)/2\|E\|_{F}\leq{\sigma_{k}(A)/\sqrt{2}} we have S^S~2Eσk(A).\|\widehat{S}-\widetilde{S}\|\leq{\sqrt{2}\|E\|\over\sigma_{k}(A)}.

The equality is because ProjS=IProjS\text{Proj}_{S^{\perp}}=I-\text{Proj}_{S} so the two differences are the same. The final step follows from Wedin’s Theorem and Lemma G.4. ∎

Often we need to bound the perturbation of a product of perturbed matrices, where we apply the following lemma:

Consider a product of matrices A1AkA_{1}\cdots A_{k}, and consider any sub-multiplicative norm on matrix \|\cdot\|. Given A^1,,A^k\widehat{A}_{1},\dots,\widehat{A}_{k} and assume that A^iAiAi\|\widehat{A}_{i}-A_{i}\|\leq\|A_{i}\|, then we have:

The proof of this lemma is straightforward by induction.

When we have a lowerbound on σmin(A)\sigma_{min}(A), it is easy to get bounds for the perturbation of pseudoinverse.

We first apply Theorem G.7, and then bound A\|A^{\dagger}\| and B\|B^{{\dagger}}\|. By definition we know A=1/σmin(A)\|A^{\dagger}\|=1/\sigma_{min}(A). By Weyl’s theorem σmin(B)σmin(A)Eσmin(A)/2\sigma_{min}(B)\geq\sigma_{min}(A)-\|E\|\geq\sigma_{min}(A)/2, hence B=σmin(B)12σmin(A)1\|B^{\dagger}\|=\sigma_{min}(B)^{-1}\leq 2\sigma_{min}(A)^{-1}. ∎

G.2 Lowerbounding the Smallest Singular Value

Gershgorin’s Disk Theorem is very useful in bounding the singular values.

Sometimes, it is easier to consider the projection of a matrix. Lowerbounding the smallest singular value of a projection will imply the same lowerbound on the original matrix:

Observe that (PA)(PA)=A(PP)AAA(P^{\top}A)^{\top}(P^{\top}A)=A^{\top}(PP^{\top})A\preceq A^{\top}A (because PP is a subspace). Therefore the eigenvalues of (PA)(PA)(P^{\top}A)^{\top}(P^{\top}A) must be dominated by the eigenvalues of AAA^{\top}A. Then the lemma follows from the definition of singular values. ∎

As a corollary we have the following lemma:

In several places of this work we want to bound the singular value of a matrix, where part of the matrix has a block structure.

The smallest singular value is bounded by:

The idea is to break the matrix into two parts A=ProjBA+ProjBAA=\text{Proj}_{B}A+\text{Proj}_{B^{\perp}}A.Since these two spaces are orthogonal we know σ(n+dn)(A)min{σn(ProjBA),σdn(ProjBA)}\sigma_{(n+dn^{\prime})}(A)\geq\min\{\sigma_{n}(\text{Proj}_{B}A),\sigma_{dn^{\prime}}(\text{Proj}_{B^{\perp}}A)\}.

For the first part, clearly σn(ProjBA)σn(B)\sigma_{n}(\text{Proj}_{B}A)\geq\sigma_{n}(B), as BB is a submatrix of ProjBA\text{Proj}_{B}A.

For the second part, we actually do the projection to a smaller subspace: for each block we project to the orthogonal subspace of B(i)B^{(i)}. Under this projection, the block structure is preserved. The dndn^{\prime}-th singular value must be at least the minimum of the nn^{\prime}-th singular value of the blocks. In summary we have:

In our analysis, we often also want to bound the smallest singular value of a matrix whose entries are Gaussian random variables. Our analysis mostly builds on the following results in random matrix theory.

For a random rectangular matrix, Rudelson and Vershynin (2009) gives the following nice result:

We will mostly use an immediate corollary of the above lemma with slightly simpler form:

This lemma can also be applied to a projection of a Gaussian matrix:

Then for any ϵ>0\epsilon>0, we have that with probability at least 1(Cϵ)0.5(mJr)1-(C\epsilon)^{0.5(m-|\mathcal{J}|-r)}, for some absolute constant CC, the smallest singular value of the projected random matrix is bounded by:

We justify the last equality below. Note that

and note that the columns of (PJcProjSE)(P_{J^{c}}\text{Proj}_{S}E) lie in the column span of PJcSP_{J^{c}}S, therefore,

Finally, note that PJcSP_{J^{c}}S, with column rank no more than rr, is independent of PJcEP_{J^{c}}E, which is a random Gaussian matrix of size (mJ)×n(m-|\mathcal{J}|)\times n, therefore we have that Proj(PJcS)PJcE\text{Proj}_{(P_{J^{c}}S)^{\perp}}P_{J^{c}}E is equivalent to a (mJr)×n(m-|\mathcal{J}|-r)\times n random Gaussian matrix. Since (58) is satisfied, we can apply Lemma G.13 and conclude (59) with high probability. ∎

However, in the smoothed analysis setting, the matrix we are interested in are often not random Gaussian matrices. Instead they are fixed matrices perturbed by Gaussian variables. We call these “perturbed rectangular matrices”, their singular values can be bounded as follows:

Since mn>2nm-n>2n, we can apply Lemma G.15 to conclude that for any ϵ>0\epsilon>0,

with probability at least 1(Cϵ)0.5(mn)1(Cϵ)0.25m1-(C\epsilon)^{0.5(m-n)}\leq 1-(C\epsilon)^{0.25m}. ∎

G.3 Projection of random vectors

In Step 2, we need to bound the norm of a random vector of the form uvu\odot v after a projection, where uu and vv are two Gaussian vectors. In order to show this, we apply the result in Vu and Wang (2013) which provides a concentration bound of projection of well-behaved (KK-concentrated) random vectors.

First we cite the definition of “KK-concentrated” below:

where med()med(\cdot) denotes the median of a random variable (choose an arbitrary one if there are many).

In order to apply this lemma in our setting, we need to prove the vectors that we are interested in is KK-concentrated:

Conditioned on the high probability event that E[:,i],E[:,j]2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}, the vector [[E[:,i]E[:,j]]s,s:s<s][[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}] is 2n2)2\sqrt{n_{2}})-concentrated.

For any 11-Lipschitz function FF on [[E[:,i]E[:,j]]s,s:s<s][[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}], we can define a function G(E[:,i],E[:,j])=F([[E[:,i]E[:,j]]s,s:s<s])G(E_{[:,i]},E_{[:,j]})=F([[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}]) (if i=ji=j then the function GG only takes E[:,i]E_{[:,i]} as the variable). Under the assumption that E[:,i],E[:,j]2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}, this new function GG is 2n22\sqrt{n_{2}}-Lipschitz.

Now we extend GG to GG^{*} when the input E[:,i],E[:,j]>2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|>2\sqrt{n_{2}}. Define the truncation function \mboxtrunc(v)=v\mbox{trunc}(v)=v for v2n2\|v\|\leq 2\sqrt{n_{2}}, and \mboxtrunc(v)=2n2v/v\mbox{trunc}(v)=2\sqrt{n_{2}}v/\|v\| for v>2n2\|v\|>2\sqrt{n_{2}}. Define the extended function G(E[:,i],E[:,j])=G(\mboxtrunc(E[:,i]),\mboxtrunc(E[:,j]))G^{*}(E_{[:,i]},E_{[:,j]})=G(\mbox{trunc}(E_{[:,i]}),\mbox{trunc}(E_{[:,j]})), which is still 2n22\sqrt{n_{2}}-Lipschitz since the truncation function is 11-Lipschitz.

Note that for the two Gaussian random vectors E[:,i],E[:,j]N(0,I){E_{[:,i]},E_{[:,j]}\sim N(0,I)}, we can apply Gaussian concentration bound in Theorem G.20 on GG^{*}, which implies

Since the probability of the event E[:,i],E[:,j]>2n2\|E_{[:,i]}\|,\|E_{[:,j]}\|>2\sqrt{n_{2}} is very small (exp(Ω(n2))\sim\exp(-\Omega(n_{2}))), we have δ=\mboxmed(G(E[:,i],E[:,j]))\mboxmed(G(E[:,i],E[:,j]))\delta=|\mbox{med}(G(E_{[:,i]},E_{[:,j]}))-\mbox{med}(G^{*}(E_{[:,i]},E_{[:,j]}))| in the order of O(n2)O(\sqrt{n_{2}}). Therefore, for tΩ(n2)t\sim\Omega(\sqrt{n_{2}}), we have

Therefore the random vector [[E[:,i]E[:,j]]s,s:s<s][[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}] is 2n22\sqrt{n_{2}}-concentrated. ∎

for all s>0s>0 and some absolute constant C>0C>0.

G.4 Gaussian Chaoses

In Step 2, we want to show that the inner product of two random vectors of the form <Proj(uv),Proj(uv)><\text{Proj}(u\odot v),\text{Proj}(u\odot v)> is small, where u,uu,u^{\prime} and v,vv,v^{\prime} are Gaussian vectors. In order to show this, we treat the inner product as a (homogeneous) Gaussian chaos, which is defined to be a homogeneous polynomial over Gaussian random variablesIn fact, the squared norm of projected random vectors considered previously is a special case of Gaussian chaos, and we treat it separately.. Our analysis builds on the results of many works studying the concentration bound of Gaussian chaoses.

For decoupled Gaussian chaoses, we mostly use the following theorem, which is a simple corollary of Lemma G.22.

Suppose a=(ai1,...,id)1i1,...,idna=(a_{i_{1},...,i_{d}})_{1\leq i_{1},...,i_{d}\leq n} is a dd-indexed array, and aF\|a\|_{F} denotes its Frobenius norm. Let (Xi(j))1in,j=1,...,d(X^{(j)}_{i})_{1\leq i\leq n,j=1,...,d} be independent copies of XN(0,In)X\sim\mathcal{N}(0,I_{n}). For any fixed ϵ>0\epsilon>0, with probability at least 1Cexp(Cn2ϵ/d)1-C\text{exp}\left(-C^{\prime}{n^{2\epsilon/d}}\right),

Suppose a=(ai1,...,id)1i1,...,idna=(a_{i_{1},...,i_{d}})_{1\leq i_{1},...,i_{d}\leq n} is a dd-indexed array. Consider a decoupled Gaussian chaos G=i1,...,idai1,...,idXi1(1)Xid(d)G=\sum_{i_{1},...,i_{d}}a_{i_{1},...,i_{d}}X_{i_{1}}^{(1)}\cdots X_{i_{d}}^{(d)}, where Xi(k)X_{i}^{(k)} are independent copies of the standard normal random variable for all i[n],k[d]i\in[n],k\in[d].

where Cd(0,)C_{d}\in(0,\infty) depends only on dd, and S(k,d)S(k,d) denotes a set of all partitions of {1,,d}\{1,\dots,d\} into kk nonempty disjoint sets I1,,IkI_{1},\dots,I_{k}, and the norm I1,,Ik\|\cdot\|_{I_{1},\dots,I_{k}} is given by:

For coupled Gaussian chaoses, namely when X(j)X^{(j)}’s are identical copies of the same XX, we first cite the following decoupling theorem in de la Peña and Montgomery-Smith (1995).

(Decoupling) Let (ai1,...,id)1i1,...,idn(a_{i_{1},...,i_{d}})_{1\leq i_{1},...,i_{d}\leq n} be a symmetric dd-indexed array such that ai1,...,id=0a_{i_{1},...,i_{d}}=0 whenever there exists klk\neq l such that ik=ili_{k}=i_{l}. Let X1,...,XnX_{1},...,X_{n} be independent random variables and (Xi(j))1in(X^{(j)}_{i})_{1\leq i\leq n} for j=1,dots,dj=1,dots,d, be independent copies of the sequence (Xi)1in(X_{i})_{1\leq i\leq n}, then for all t0t\geq 0,

where Ld(0,)L_{d}\in(0,\infty) depends only on dd.

Essentially this theorem shows for a symmetric tensor with no “diagonal” terms, i.e., ai1,...,id=0a_{i_{1},...,i_{d}}=0 whenever there exists klk\neq l such that ik=ili_{k}=i_{l}), there is only a constant factor difference between the coupled and decoupled Gaussian chaos distribution.

In most of our applications, we do have symmetric tensors with no “diagonal” terms. However there is one case where we do have diagonal terms, for which we need the following lemma.

Let (ai1,i2,i3)1i1,...,i3n(a_{i_{1},i_{2},i_{3}})_{1\leq i_{1},...,i_{3}\leq n} be a symmetric 33-indexed array and let aF\|a\|_{F} denote its Frobenius norm. Let XN(0,In)X\sim\mathcal{N}(0,I_{n}), then for any ϵ>0\epsilon>0, with probability at least 1Cnexp(Cn2ϵ/3)1-Cn\text{exp}(-C^{\prime}n^{2\epsilon/3}),

The sum of the “diagonal” terms is equal to 3ijai,i,jXi2Xj+1/2iai,i,iXi33\sum_{i\neq j}a_{i,i,j}X_{i}^{2}X_{j}+1/2\sum_{i}a_{i,i,i}X_{i}^{3}. Since XiX_{i} are independent standard Gaussian random variables, with probability at least 1Cnexp(Cn2ϵ/3)1-Cn\text{exp}(-C^{\prime}n^{2\epsilon/3}) (union bound), Xinϵ/3|X_{i}|\leq n^{\epsilon/3} for all i[n]i\in[n]. Conditioned on this high probability event, the absolute value of the sum is bounded by:

By Theorem G.21, we know that with probability at least 1Cexp(Cn2ϵ/3)1-C\text{exp}\left(-C^{\prime}{n^{2\epsilon/3}}\right), the absolute value of the sum of the “non-diagonal” terms is bounded by aFnϵ\|a\|_{F}n^{\epsilon}. Therefore we can conclude the proof by applying the union bound. ∎