Learning Mixtures of Gaussians in High Dimensions

Rong Ge, Qingqing Huang, Sham M. Kakade

Introduction

Learning mixtures of Gaussians is a fundamental problem in statistics and learning theory, whose study dates back to Pearson (1894). Gaussian mixture models arise in numerous areas including physics, biology and the social sciences (McLachlan and Peel (2004); Titterington et al. (1985)), as well as in image processing (Reynolds and Rose (1995)) and speech (Permuter et al. (2003)).

In a Gaussian mixture model, there are $k$ unknown $n$ -dimensional multivariate Gaussian distributions. Samples are generated by first picking one of the $k$ Gaussians, then drawing a sample from that Gaussian distribution. Given samples from the mixture distribution, our goal is to estimate the means and covariance matrices of these underlying Gaussian distributions This is different from the problem of density estimation considered in Feldman et al. (2006); Chan et al. (2014).

This problem has a long history in theoretical computer science. The seminal work of Dasgupta (1999) gave an algorithm for learning spherical Gaussian mixtures when the means are well separated. Subsequent works (Dasgupta and Schulman (2000); Sanjeev and Kannan (2001); Vempala and Wang (2004); Brubaker and Vempala (2008)) developed better algorithms in the well-separated case, relaxing the spherical assumption and the amount of separation required.

When the means of the Gaussians are not separated, after several works (Belkin and Sinha (2009); Kalai et al. (2010)), Belkin and Sinha (2010) and Moitra and Valiant (2010) independently gave algorithms that run in polynomial time and with polynomial number of samples for a fixed number of Gaussians. However, both running time and sample complexity depend super exponentially on the number of components $k$ In fact, it is in the order of $O({e^{O(k)}}^{k})$ as shown in Theorem 11.3 in Valiant (2012). . Their algorithm is based on the method of moments introduced by Pearson (1894): first estimate the $O(k)$ -order moments of the distribution, then try to find the parameters that agree with these moments. Moitra and Valiant (2010) also show that the exponential dependency of the sample complexity on the number of components is necessary, by constructing an example of two mixtures of Gaussians with very different parameters, yet with exponentially small statistical distance.

Recently, Hsu and Kakade (2013) applied spectral methods to learning mixture of spherical Gaussians. When $n\geq k+1$ and the means of the Gaussians are linearly independent, their algorithm can learn the model in polynomial time and with polynomial number of samples. This result suggests that the lower bound example in Moitra and Valiant (2010) is only a degenerate case in high dimensional space. In fact, most (in general position) mixture of spherical Gaussians are easy to learn. This result is also based on the method of moments, and only uses second and third moments. Several follow-up works (Bhaskara et al. (2014); Anderson et al. (2013)) use higher order moments to get better dependencies on $n$ and $k$ .

However, the algorithm in Hsu and Kakade (2013) as well as in the follow-ups all make strong requirements on the covariance matrices. In particular, most of them only apply to learning mixture of spherical Gaussians. For mixture of Gaussians with general covariance matrices, the best known result is still Belkin and Sinha (2010) and Moitra and Valiant (2010), which algorithms are not polynomial in the number of components $k$ . This leads to the following natural question:

Question: Is it possible to learn most mixture of Gaussians in polynomial time using a polynomial number of samples?

In this paper, we give an algorithm that learns most mixture of Gaussians in high dimensional space (when $n\geq\Omega(k^{2})$ ), and the argument is formalized under the smoothed analysis framework first proposed in Spielman and Teng (2004).

In the smoothed analysis framework, the adversary first choose an arbitrary mixture of Gaussians. Then the mean vectors and covariance matrices of this Gaussian mixture are randomly perturbed by a small amount $\rho$ See Definition 3.2 in Section 3.1 for the details.. The samples are then generated from the Gaussian mixture model with the perturbed parameters. The goal of the algorithm is to learn the perturbed parameters from the samples.

The smoothed analysis framework is a natural bridge between worst-case and average-case analysis. On one hand, it is similar to worst-case analysis, as the adversary chooses the initial instance, and the perturbation allowed is small. On the other hand, even with small perturbation, we may hope that the instance be different enough from degenerate cases. A successful algorithm in the smoothed analysis setting suggests that the bad instances must be very “sparse” in the parameter space: they are highly unlikely in any small neighborhood of any instance. Recently, the smoothed analysis framework has also motivated several research work (Kalai et al. (2009) Bhaskara et al. (2014)) in analyzing learning algorithms.

In the smoothed analysis setting, we show that it is easy to learn most Gaussian mixtures:

(informal statement of Theorem 3.4) In the smoothed analysis setting, when $n\geq\Omega(k^{2})$ , given samples from the perturbed $n$ -dimensional Gaussian mixture model with $k$ components, there is an algorithm that learns the correct parameters up to accuracy $\epsilon$ with high probability, using polynomial time and number of samples.

An important step in our algorithm is to learn Gaussian mixture models whose components all have mean zero, which is also a problem of independent interest (Zoran and Weiss (2012)). Intuitively this is also a “hard” case, as there is no separation in the means. Yet algebraically, this case gives rise to a novel tensor decomposition algorithm. The ideas for solving this decomposition problem are then generalized to tackle the most general case.

(informal statement of Theorem 3.5) In the smoothed analysis setting, when $n\geq\Omega(k^{2})$ , given samples from the perturbed mixture of zero-mean $n$ -dimensional Gaussian mixture model with $k$ components, there is an algorithm that learns the parameters up to accuracy $\epsilon$ with high probability, using polynomial running time and number of samples.

Organization

The main part of the paper will focus on learning mixtures of zero-mean Gaussians. The proposed algorithm for this special case contains most of the new ideas and techniques. In Section 2 we introduce the notations for matrices and tensors which are used to handle higher order moments throughout the discussion. Then in Section 3 we introduce the smoothed analysis model for learning mixture of Gaussians and discuss the moment structure of mixture of Gaussians, then we formally state our main theorems. Section 4 outlines our algorithm for learning zero-mean mixture of Gaussians. The details of the steps are presented in Section 5. The detailed proofs for the correctness and the robustness are deferred to Appendix (Sections B to D). In Section 6 we briefly discuss how the ideas for zero-mean case can be generalized to learning mixture of nonzero Gaussians, for which the detailed algorithm and the proofs are deferred to Appendix F.

Notations

We use $[n]$ to denote the set $\{1,2,...,n\}$ and $[n]\times[n]$ to denote the set $\{(i,j):i,j\in[n]\}$ . These are often used as indices of matrices.

Symmetric matrices

Linear subspaces

Tensors

In particular, the vector given by $X(\mathbf{e}_{i},\mathbf{e}_{j},I)$ is the one-dimensional slice of the 3-way array, with the index for the first dimension to be $i$ and the second dimension to be $j$ .

Matrix Products

Main results

In this section, we first formally introduce the smoothed analysis framework for our problem and state our main theorems. Then we will discuss the structure of the moments of Gaussian mixtures, which is crucial for understanding our method of moments based algorithm.

As an interesting special case of the general model, we also consider the mixture of “zero-mean” Gaussians, which has $\mu^{(i)}=0$ for all components $i\in[k]$ .

A sample $x$ from a mixture of Gaussians is generated in two steps:

Sample $h\in[k]$ from a multinomial distribution, with probability $\Pr[h=i]=\omega_{i}$ for $i\in[k]$ .

The learning problem asks to estimate the parameters of the underlying mixture of Gaussians:

Given $N$ samples $x_{1},x_{2},...,x_{N}$ drawn i.i.d. from a mixture of Gaussians $\mathcal{G}=\{(\omega_{i},\mu^{(i)},\Sigma^{(i)})\}_{i\in[k]}$ , an algorithm learns the mixture of Gaussians with accuracy $\epsilon$ , if it outputs an estimation $\widehat{\mathcal{G}}=\{(\widehat{\omega}_{i},\widehat{\mu}^{(i)},\widehat{\Sigma}^{(i)})\}_{i\in[k]}$ such that there exists a permutation $\pi$ on $[k]$ , and for all $i\in[k]$ , we have $|\widehat{\omega}_{i}-\omega_{\pi(i)}|\leq\epsilon$ , $\|\widehat{\mu}^{(i)}-\mu^{(\pi(i))}\|\leq\epsilon$ and $\|\widehat{\Sigma}^{(i)}-\Sigma^{(\pi(i))}\|\leq\epsilon$ .

In the worst case, learning mixture of Gaussians is a information theoretically hard problem (Moitra and Valiant (2010)). There exists worst-case examples where the number of samples required for learning the instance is at least exponential in the number of components $k$ (McLachlan and Peel (2004)). The non-convexity arises from the hidden variable $h$ : without knowing $h$ we cannot determine which Gaussian component each sample comes from.

The smoothed analysis framework provides a way to circumvent the worst case instances, yet still studying this problem in its most general form. The basic idea is that, with high probability over the small random perturbation to any instance, the instance will not be a “worst-case” instance, and actually has reasonably good condition for the algorithm.

Next, we show how the parameters of the mixture of Gaussians are perturbed in our setup.

For $\rho<1/n$ , a $\rho$ -smooth $n$ -dimensional $k$ -component mixture of Gaussians $\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}$ is generated as follows:

Choose an arbitrary (could be adversarial) instance $\mathcal{G}=\{(\omega_{i},\mu^{(i)},\Sigma^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}$ . Scale the distribution such that $0\preceq\Sigma^{(i)}\preceq{1\over 2}I_{n}$ and $\|\mu^{(i)}\|\leq{1\over 2}$ for all $i\in[k]$ .

Set $\widetilde{\omega}_{i}=\omega_{i}$ , $\widetilde{\mu}^{(i)}=\mu^{(i)}+\delta_{i}$ , $\widetilde{\Sigma}^{(i)}=\Sigma^{(i)}+\Delta_{i}$ .

Choose the diagonal entries of $\widetilde{\Sigma}^{(i)}$ arbitrarily, while ensuring the positive semi-definiteness of the covariance matrix $\widetilde{\Sigma}^{(i)}$ , and the diagonal entries are upper bounded by $1$ . The perturbation procedure fails if this step is infeasible Note that by standard random matrix theory, with high probability the 4-th step is feasible and the perturbation procedure in Definition 3.2 succeeds. Also, with high probability we have $\|\widetilde{\mu}^{(i)}\|\leq 1$ and $0\preceq\widetilde{\Sigma}^{(i)}\preceq I_{n}$ for all $i\in[k]$ . .

A $\rho$ -smooth zero-mean mixture of Gaussians is generated using the same procedure, except that we set $\widetilde{\mu}^{(i)}=\mu^{(i)}=0$ , for all $i\in[k]$ .

When the original matrix is of low rank, a simple random perturbation may not lead to a positive semidefinite matrix, which is why our procedure of perturbation is more restricted in order to guarantee that the perturbed matrix is still a valid covariance matrix.

There could be other ways of locally perturbing the covariance matrix. Our procedure actually gives more power to the adversary as it can change the diagonals after observing the perturbations for other entries. Note that with high probability if we just let the new diagonal to be $5\sqrt{n}\rho$ larger than the original ones, the resulting matrix is still a valid covariance matrix. In other words, the adversary can always keep the perturbation small if it wants to.

Instead of the worst-case problem in Definition 3.1, our algorithms work on the smoothed instance. Here the model first gets perturbed to $\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)})\}_{i\in[k]}$ , the samples are drawn according to the perturbed model, and the algorithm tries to learn the perturbed parameters. We give a polynomial time algorithm in this case:

Consider a $\rho$ -smooth mixture of Gaussians $\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}$ for which the number of components is at least Note that the algorithms of Belkin and Sinha (2010) and Moitra and Valiant (2010) run in polynomial time for fixed $k$ . $k\geq C_{0}$ and the dimension $n\geq C_{1}k^{2}$ , for some fixed constants $C_{0}$ and $C_{1}$ . Suppose that the mixing weights $\widetilde{\omega}_{i}\geq\omega_{o}$ for all $i\in[k]$ . Given $N$ samples drawn i.i.d. from $\widetilde{\mathcal{G}}$ , there is an algorithm that learns the parameters of $\widetilde{\mathcal{G}}$ up to accuracy $\epsilon$ , with high probability over the randomness in both the perturbation and the samples. Furthermore, the running time and number of samples $N$ required are both upper bounded by $\text{poly}(n,k,1/\omega_{o},1/\epsilon,1/\rho)$ .

To better illustrate the algorithmic ideas for the general case, we first present an algorithm for learning mixtures of zero-mean Gaussians. Note that this is not just a special case of the general case, as with the smoothed analysis, the zero mean vectors are not perturbed.

Consider a $\rho$ -smooth mixture of zero-mean Gaussians $\widetilde{\mathcal{G}}=\{(\widetilde{\omega}_{i},0,\widetilde{\Sigma}^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}$ for which the number of components is at least $k\geq C_{0}$ and the dimension $n\geq C_{1}k^{2}$ , for some fixed constants $C_{0}$ and $C_{1}$ . Suppose that the mixing weights $\widetilde{\omega}_{i}\geq\omega_{o}$ for all $i\in[k]$ . Given $N$ samples drawn i.i.d. from $\widetilde{\mathcal{G}}$ , there is an algorithm that learns the parameters of $\widetilde{\mathcal{G}}$ up to accuracy $\epsilon$ , with high probability over the randomness in both the perturbation and the samples. Furthermore, the running time and number of samples $N$ are both upper bounded by $\text{poly}(n,k,1/\omega_{o},1/\epsilon,1/\rho)$ .

Throughout the paper we always assume that $n\geq C_{1}k^{2}$ and $\widetilde{\omega}_{i}\geq\omega_{o}$ .

2 Moment Structure of Mixture of Gaussians

Our algorithm is also based on the method of moments, and we only need to estimate the $3$ -rd, the $4$ -th and the $6$ -th order moments. In this part we briefly discuss the structure of $4$ -th and $6$ -th moments in the zero-mean case ( $3$ -rd moment is always 0 in the zero-mean case). These structures are essential to the proposed algorithm. For more details, and discussions on the general case see Appendix A.

where $y^{(i)}$ corresponds to the $n$ -dimensional zero-mean Gaussian distribution $\mathcal{N}(0,\Sigma^{(i)})$ . The moments for each Gaussian component are characterized by Isserlis’s theorem as below:

Let $(y_{1},\dots,y_{2t})$ be a multivariate zero-mean Gaussian random vector $\mathcal{N}(0,\Sigma)$ , then

where the summation is taken over all distinct ways of partitioning $y_{1},\dots,y_{2t}$ into $t$ pairs, which correspond to all the perfect matchings in a complete graph.

Ideally, we would like to obtain the following quantities (recall $n_{2}={n+1\choose 2}$ ):

Note that the entries in $X_{4}$ and $X_{6}$ are quadratic and cubic monomials of the covariance matrices, respectively. If we have $X_{4}$ and $X_{6}$ , the tensor decomposition algorithm in Anandkumar et al. (2014) can be immediately applied to recover $\omega_{i}$ ’s and $\Sigma^{(i)}$ ’s under mild conditions. It is easy to verify that those conditions are indeed satisfied with high probability in the smoothed analysis setting.

By Isserlis’s theorem, the entries of the moments $M_{4}$ and $M_{6}$ are indeed quadratic and cubic functions of the covariance matrices, respectively. However, the structure of the true moments $M_{4}$ and $M_{6}$ have more symmetries, consider for example,

Note that due to symmetry, the number of distinct entries in $M_{4}$ ( ${n+3\choose 4}\approx n^{4}/24$ ) is much smaller than the number of distinct entries in $X_{4}$ ( ${n_{2}+1\choose 2}\approx n^{4}/8$ ). Similar observation can be made about $M_{6}$ and $X_{6}$ .

Therefore, it is not immediate how to find the desired $X_{4}$ and $X_{6}$ based on $M_{4}$ and $M_{6}$ . We call the moments $M_{4},M_{6}$ the folded moments as they have more symmetry, and the corresponding $X_{4},X_{6}$ the unfolded moments. One of the key steps in our algorithm is to unfold the true moments $M_{4},M_{6}$ to get $X_{4},X_{6}$ by exploiting special structure of $M_{4},M_{6}$ .

In some cases, it is easier to restrict our attention to the entries in $M_{4}$ with indices corresponding to distinct variables. In particular, we define

Moreover $\mathcal{F}_{4}$ is a projection from a ${n_{2}+1\choose 2}$ -dimensional subspace to a $n_{4}$ -dimensional subspace, and $\mathcal{F}_{6}$ is a projection from a ${n_{2}+2\choose 3}$ -dimensional subspace to a ${n_{6}}$ -dimensional subspace.

Algorithm Outline for Learning Mixture of Zero-Mean Gaussians

In this section, we present our algorithm for learning zero-mean Gaussian mixture model. The algorithmic ideas and the analysis are at the core of this paper. Later we show that it is relatively easy to generalize the basic ideas and the techniques to handle the general case.

For simplicity we state our algorithm using the exact moments $\widetilde{M}_{4}$ and $\widetilde{M}_{6}$ , while in implementation the empirical moments $\widehat{M}_{4}$ and $\widehat{M}_{6}$ obtained with the samples are used. In later sections, we verify the correctness of the algorithm and show that it is robust: the algorithm learns the parameters up to arbitrary accuracy using polynomial number of samples.

Span Finding: Find the span of covariance matrices .

For a set of indices $\mathcal{H}\subset[n]$ of size $|\mathcal{H}|=\sqrt{n}$ , find the span:

Find the span of the covariance matrices with the columns projected onto $\mathcal{S}^{\perp}$ , namely,

For two disjoint sets of indices $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ , repeat Step 1 (a) and Step 1 (b) to obtain $\mathcal{U}_{1}$ and $\mathcal{U}_{2}$ , namely the span of covariance matrices projected onto two subspaces $\mathcal{S}_{1}^{\perp}$ and $\mathcal{S}_{2}^{\perp}$ . Merge $\mathcal{U}_{1}$ and $\mathcal{U}_{2}$ to obtain the span of covariance matrices $\mathcal{U}$ :

The unfolded moments $\widetilde{X}_{4},\widetilde{X}_{6}$ are then given by $\widetilde{X}_{4}=U\widetilde{Y}_{4}U^{\top},\widetilde{X}_{6}=\widetilde{Y}_{6}(U^{\top},U^{\top},U^{\top}).$

Tensor Decomposition: learn $\widetilde{\omega}_{i}$ and $\widetilde{\Sigma}^{(i)}$ from $\widetilde{Y}_{4}$ and $\widetilde{Y}_{6}$ . Given $U$ , and given $\widetilde{Y}_{4}$ and $\widetilde{Y}_{6}$ which are relate to the parameters as follows:

we apply tensor decomposition techniques to recover $\widetilde{\Sigma}^{(i)}$ ’s and $\widetilde{\omega}_{i}$ ’s.

Implementing the Steps for Mixture of Zero-Mean Gaussians

In this part we show how to accomplish each step of the algorithm outlined in Section 4 and sketch the proof ideas.

For each step, we first explain the detailed algorithm, and list the deterministic conditions on the underlying parameters as well as on the exact moments for the step to work correctly. Then we show that these deterministic conditions are satisfied with high probability over the $\rho$ -perturbation of the parameters in the smoothed analysis setting. In order to analyze the sample complexity, we further show that when we are given the empirical moments which are close to the exact moments, the output of the step is also close to that in the exact case.

In particular we show the correctness and the stability of each step in the algorithm with two main lemmas: the first lemma shows that with high probability over the random perturbation of the covariance matrices, the exact moments satisfy the deterministic conditions that ensure the correctness of each step; the second lemma shows that when the algorithm for each step works correctly, it is actually stable even when the moments are estimated from finite samples and have only inverse polynomial accuracy to the exact moments.

The detailed proofs are deferred to Section B to D in the appendix.

Given the 4-th order moments $\widetilde{M}_{4}$ , Step 1 finds the span of covariance matrices $\mathcal{U}$ as defined in (6). Note that by definition of the unfolded moments $\widetilde{X}_{4}$ in (1), the subspace $\mathcal{U}$ coincides with the column span of the matrix $\widetilde{X}_{4}$ .

By Lemma 3.7, we know that the entries in $\widetilde{M}_{4}$ are linear mappings of entries in $\widetilde{X}_{4}$ . Since the matrix $\widetilde{X}_{4}$ is of low rank ( $k\ll n_{2}$ ), this corresponds to the matrix sensing problem first studied in Recht et al. (2010). In general, matrix sensing problems can be hard even when we have many linear observations (Hardt et al. (2014b)). Previous works (Recht et al. (2010); Hardt et al. (2014a); Jain et al. (2013)) showed that if the linear mapping satisfy matrix RIP property, one can uniquely recover $\widetilde{X}_{4}$ from $\widetilde{M}_{4}$ .

However, properties like RIP do not hold in our setting where the linear mapping is determined by Isserlis’ Theorem. We can construct two different mixtures of Gaussians with different unfolded moments $\widetilde{X}_{4}$ , but the same folded moment $\widetilde{M}_{4}$ (see Section A.3). Therefore the existing matrix recovery algorithm cannot be applied, and we need to develop new tools by exploiting the special moment structure of Gaussian mixtures.

Step 1 (a). Find the Span of a Subset of Columns of the Covariance Matrices.

The key observation for this step is that if we hit $\widetilde{M}_{4}$ with three basis vectors, we get a vector that lies in the span of the columns of the covariance matrices:

For a mixture of zero-mean Gaussians $\mathcal{G}=\{(\omega_{i},0,\Sigma^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}$ , the one-dimensional slices of the 4-th order moments $M_{4}$ are given by:

In particular, if we pick the indices $j_{1},j_{2},j_{3}$ in the index set $\mathcal{H}$ , the vector $M_{4}(\mathbf{e}_{j_{1}},\mathbf{e}_{j_{2}},\mathbf{e}_{j_{3}},I)$ lies in the desired span $\mathcal{S}=\left\{\Sigma^{(i)}_{[:,j]}:i\in[k],j\in\mathcal{H}\right\}$ .

We shall partition the set $\mathcal{H}$ into three disjoint subsets $\mathcal{H}^{(i)}$ of equal size $\sqrt{n}/3$ , and pick $j_{i}\in H^{(i)}$ for $i=1,2,3$ . In this way, we have $(|\mathcal{H}|/3)^{3}=\Omega(n^{1.5})$ such one-dimensional slices of $M_{4}$ , which all lie in the desired subspace $\mathcal{S}$ . Moreover, the dimension of the subspace $\mathcal{S}$ is at most $k|\mathcal{H}|\ll n^{1.5}$ . Therefore, with the $\rho$ -perturbed parameters $\widetilde{\Sigma}^{(i)}$ ’s, we can expect that with high probability the slices of $\widetilde{M}_{4}$ span the entire subspace $\mathcal{S}$ .

We first show that this deterministic condition is satisfied with high probability by bounding the $k|\mathcal{H}|$ -th singular value of $\widetilde{Q}_{S}$ with smoothed analysis.

Given the exact 4-th order moments $\widetilde{M}_{4}$ , for any index set $\mathcal{H}$ of size $|\mathcal{H}|=\sqrt{n}$ , With high probability, the $k|\mathcal{H}|$ -th singular value of $\widetilde{Q}_{S}$ is at least $\Omega(\omega_{o}\rho^{2}n)$ .

The proof idea involves writing the matrix $\widetilde{Q}_{S}$ as a product of three matrices, and using the results on spectral properties of random matrices Rudelson and Vershynin (2009) to show that with high probability the smallest singular value of each factor is lower bounded.

Since this step only involves the singular value decomposition of the matrix $\widetilde{Q}_{S}$ , we then use the standard matrix perturbation theory to show that this step is stable:

Note that we need the high dimension assumption ( $n\gg k$ ) to guarantee the correctness of this step: in order to span the subspace $\mathcal{S}$ , the number of distinct vectors should be equal or larger than the dimension of the subspace, namely $|\mathcal{H}|^{3}\geq k|\mathcal{H}|$ ; and the subspace should be non-trivial, namely $k|\mathcal{H}|<n$ . These two inequalities suggest that we need $n\geq\Omega(k^{1.5})$ . However, we used the stronger assumption $n\geq\Omega(k^{2})$ to obtain the lower bound of the smallest singular value in the proof.

Step 1 (b). Find the Span of Projected Covariance Matrices.

In this step, we continue to use the structural properties of the 4-th order moments. In particular, we look at the two-dimensional slices of $M_{4}$ obtained by hitting it with two basis vectors:

For a mixture of zero-mean Gaussians $\mathcal{G}=\{(\omega_{i},0,\Sigma^{(i)})\}_{i\in[k]}\in\mathcal{G}_{n,k}$ , the two-dimensional slices of the 4-th order moments $M_{4}$ are given by:

Note that if we take the indices $j_{1}$ and $j_{2}$ in the index set $\mathcal{H}$ , the slice $M_{4}(\mathbf{e}_{j_{1}},\mathbf{e}_{j_{2}},I,I)$ is almost in the span of the covariance matrices, except $2k$ additive rank-one terms in the form of $\Sigma^{(i)}_{[:,j_{1}]}(\Sigma^{(i)}_{[:,j_{2}]})^{\top}$ . These rank-one terms can be eliminated by projecting the slice to the subspace $\mathcal{S}^{\perp}$ obtained in Step 1 (a), namely,

and this projected two-dimensional slice lies in the desired span $\mathcal{U}_{S}$ as defined in (5). Moreover, there are ${|\mathcal{H}|+1\choose 2}=\Omega(n)$ such projected two-dimensional slices, while the dimension of the desired span $\mathcal{U}_{S}$ is at most $k$ .

We show that this deterministic condition is satisfied by bounding the $k$ -th singular value of $\widetilde{Q}_{U_{S}}$ in the smoothed analysis setting:

Given the exact 4-th order moments $\widetilde{M}_{4}$ , with high probability, the $k$ -th singular value of $\widetilde{Q}_{U_{S}}$ is at least $\Omega(\omega_{o}\rho^{2}n^{1.5})$ .

Similar to Lemma 5.3, the proof is based on writing the matrix $Q_{U_{S}}$ as a product of three matrices, then bound their $k$ -th singular values using random matrix theory. The stability analysis also relies on the matrix perturbation theory.

Given the empirical 4-th order moments $\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}$ , assume that the absolute value of entries of $E_{4}$ are at most $\delta_{2}$ . Also, given the output $\text{Proj}_{\widehat{S}^{\perp}}$ from Step 1 (a), and assume that $\|\text{Proj}_{\widehat{S}^{\perp}}-\text{Proj}_{\widetilde{S}^{\perp}}\|\leq\delta_{1}$ . When $\delta_{1}$ and $\delta_{2}$ are inverse polynomially small, we have $\|\text{Proj}_{\widehat{U}_{S}}-\text{Proj}_{\widetilde{U}_{S}}\|\leq O\left({n^{2.5}\left(\delta_{2}+2\delta_{1}\right)/\sigma_{k}(\widetilde{Q}_{U_{S}})}\right)$ .

Note that for a given index set $\mathcal{H}$ , the span $\mathcal{U}_{S}$ obtained in Step 1 (b) only gives partial information about the span of the covariance matrices. The idea of getting the span of the full covariance matrices is to obtain two sets of such partial information and then merge them.

In order to achieve that, we repeat Step 1 (a) and Step 1 (b) for two disjoint sets $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ , each of size $\sqrt{n}$ . The two subspace $S_{1}$ and $S_{2}$ thus correspond to the span of two disjoint sets of covariance matrix columns. Therefore, we can hope that $U_{1}$ and $U_{2}$ , the span of covariance matrices projected to $S_{1}^{\perp}$ and $S_{2}^{\perp}$ contain enough information to recover the full span $U$ .

In particular, we prove the following claim:

The main idea in the proof is that since $s$ is not too large, the two subspaces $S_{1}^{\perp}$ and $S_{2}^{\perp}$ have a large intersection. Using this intersection we can “align” the two basis $V_{1}$ and $V_{2}$ and obtain $V_{1}^{\dagger}V_{2}$ , and then it is easy to merge the two projections of the same matrix (instead of a subspace).

Moreover, we show that when applying this result to the projected span of covariance matrices, we have $s=k|\mathcal{H}|\leq n/3$ , and the two deterministic conditions $\sigma_{2s}([S_{1},S_{2}])>0$ and $\sigma_{k}(\text{Proj}_{S_{3}}V_{1})>0$ are indeed satisfied with high probability over the parameter perturbation. The detailed smoothed analysis (Lemma B.13 and B.14) and the stability analysis (Lemma B.11) are provided in Section B.3 in the appendix.

We show that given the span of covariance matrices $\mathcal{U}$ obtained from Step 1, finding the unfolded moments $\widetilde{X}_{4}$ , $\widetilde{X}_{6}$ is reduced to solving two systems of linear equations.

Note that once we know $\widetilde{Y}_{4}$ and $\widetilde{Y}_{6}$ , the unfolded moments $\widetilde{X}_{4}$ and $\widetilde{X}_{6}$ are given by $\widetilde{X}_{4}=U\widetilde{Y}_{4}U^{\top}$ and $\widetilde{X}_{6}=\widetilde{Y}_{6}(U^{\top},U^{\top},U^{\top})$ . Therefore, after changing the variable, we need to solve the two linear equation systems given in (7) with the variables $\widetilde{Y}_{4}$ and $\widetilde{Y}_{6}$ .

This change of variable significantly reduces the number of unknown variables. Note that the number of distinct entries in $\widetilde{Y}_{4}$ and $\widetilde{Y}_{6}$ are $k_{2}={k+1\choose 2}$ and $k_{3}={k+2\choose 3}$ , respectively. Since $k_{2}\leq n_{4}$ and $k_{3}\leq n_{6}$ , we can expect that the linear mapping from $\widetilde{Y}_{4}$ to $\overline{\widetilde{M}}_{4}$ and the one from $\widetilde{Y}_{6}$ to $\overline{\widetilde{M}}_{6}$ are linearly invertible. This argument is formalized below.

We show with smoothed analysis that the smallest singular value of the two coefficient matrices are lower bounded with high probability:

With high probability over the parameter random perturbation, the $k_{2}$ -th singular value of the coefficient matrix $\widetilde{H}_{4}$ is at least $\Omega(\rho^{2}n/k)$ , and the $k_{3}$ -th singular value of the coefficient matrix $\widetilde{H}_{6}$ is at least $\Omega(\rho^{3}(n/k)^{1.5})$ .

To prove this lemma we rewrite the coefficient matrix as product of two matrices and bound their smallest singular values separately. One of the two matrices corresponds to a projection of the Kronecker product $\widetilde{\Sigma}\otimes_{kr}\widetilde{\Sigma}$ . In the smoothed analysis setting, this matrix is not necessarily incoherent. In order to provide a lower bound to its smallest singular value, we further apply a carefully designed projection to it, and then we use the concentration bounds for Gaussian chaoses to show that after the projection its columns are incoherent, finally we apply Gershgorin’s Theorem to bound the smallest singular value Note that the idea of unfolding using system of linear equations also appeared in the work of Jain and Oh (2014). However, in order to show the system of linear equations in their setup is robust, i.e., the coefficient matrix has full rank, they heavily rely on the incoherence assumption, which we do not impose in the smoothed analysis setting. .

When implementing this step with the empirical moments, we solve two least squares problems instead of solving the system of linear equations. Again using results in matrix perturbation theory and using the lower bound of the smallest singular values of the two coefficient matrices, we show the stability of the solution to the least squares problems:

Given the empirical moments $\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}$ , $\widehat{M}_{6}=\widetilde{M}_{6}+E_{6}$ , and suppose that the absolute value of entries of $E_{4}$ and $E_{6}$ are at most $\delta_{1}$ . Let $\widehat{U}$ , the output of Step 1, be the estimation for the span of the covariance matrices, and suppose that $\|\widehat{U}-\widetilde{U}\|\leq\delta_{2}$ . Let $\widehat{Y}_{4}$ and $\widehat{Y}_{6}$ be the least squares solution respectively. When $\delta_{1}$ and $\delta_{2}$ are inverse polynomially small, we have $\|\widetilde{Y}_{4}-\widehat{Y}_{4}\|_{F}\leq O(\sqrt{n_{4}}(\delta_{1}+\delta_{2}/\sigma_{min}(\widetilde{H}_{4}))$ and $\|\widetilde{Y}_{6}-\widehat{Y}_{6}\|_{F}\leq O(\sqrt{n_{6}}(\delta_{1}+\delta_{2}/\sigma_{min}(\widetilde{H}_{6}))$ .

Step 3. Tensor Decomposition.

Given $\widetilde{Y}_{4}$ , $\widetilde{Y}_{6}$ and $\widetilde{U}$ , the symmetric tensor decomposition algorithm can correctly and robustly find the mixing weights $\widetilde{\omega}_{i}$ ’s and the vectors $\widetilde{\sigma}_{i}$ ’s, up to some unknown permutation over $[k]$ , with high probability over both the randomized algorithm and the parameter perturbation.

The algorithm and its analysis mostly follow the algorithm of symmetric tensor decomposition in Anandkumar et al. (2014), and the details are provided in Section D in the appendix.

Proof Sketch for the Main Theorem of Zero-mean Case.

Theorem 3.5 follows from the previous smoothed analysis and stability analysis lemmas for each step.

First, exploiting the randomness of parameter perturbation, the smoothed analysis lemmas show that the deterministic conditions, which guarantee the correctness of each step, are satisfied with high probability. Then using concentration bounds of Gaussian variables, we show that with high probability over the random samples, the empirical moments $\widehat{M}_{4}$ and $\widehat{M}_{6}$ are entrywise $\delta$ -close to the exact moments $\widetilde{M}_{4}$ and $\widetilde{M}_{6}$ . In order to achieve $\epsilon$ accuracy in the parameter estimation, we choose $\delta$ to be inverse polynomially small, and therefore the number of samples required will be polynomial in the relevant parameters. The stability lemmas show how the errors propagate only “polynomially” through the steps of the algorithm, which is visualized in Figure 1.

A more detailed illustration is provided in Section E in the appendix.

Algorithm Outline for Learning Mixture of General Gaussians

In this section, we briefly discuss the algorithm for learning mixture of general Gaussians. Figure 2 shows the inputs and outputs of each step in this algorithm. Many steps share similar ideas to those of the algorithm for the zero-mean case in previous sections. We only highlight the basic ideas and defer the details to Section F in the appendix.

Similar to Step 1 in the zero-mean case, this step makes use of the structure of the 4-th order moments $\widetilde{M}_{4}$ , and is achieved in three small steps:

For a subset $\mathcal{H}\subset[n]$ of size $|\mathcal{H}|=\sqrt{n}$ , find the span:

Find the span of the covariance matrices with the columns projected onto $\mathcal{S}^{\perp}$ , namely,

For disjoint subsets $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ , repeat Step 1 (a) and Step 1 (b) to obtain $\mathcal{U}_{1}$ and $\mathcal{U}_{2}$ , the span of the covariance matrices projected onto the subspaces $\mathcal{S}_{1}^{\perp}$ and $\mathcal{S}_{2}^{\perp}$ . The intersection of the two subspaces $\mathcal{U}_{1}$ and $\mathcal{U}_{2}$ gives the span of the mean vectors $\widetilde{Z}=\text{span}\left\{\widetilde{\mu}^{(i)},i\in[k]\right\}$ . Merge the two subspaces $\mathcal{U}_{1}$ and $\mathcal{U}_{2}$ to obtain the span of the covariance matrices projected to the subspace orthogonal to $\widetilde{Z}$ , namely $\widetilde{\Sigma}_{o}=\text{span}\left\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}:i\in[k]\right\}$ .

The key observation of this step is that when the samples are projected to the subspace orthogonal to all the mean vectors, they are equivalent to samples from a mixture of zero-mean Gaussians with covariance matrices $\widetilde{\Sigma}^{(i)}_{o}=\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}$ and with the same mixing weights $\widetilde{\omega}_{i}$ ’s. Therefore, projecting the samples to $\widetilde{Z}^{\perp}$ , the subspace orthogonal to the mean vectors, and use the algorithm for the zero-mean case, we can obtain $\widetilde{\Sigma}^{(i)}_{o}$ ’s, the covariance matrices projected to this subspace, as well as the mixing weights $\widetilde{\omega}_{i}$ ’s.

Step 3. Find the means

With simple algebra, this step extracts the projected covariance matrices $\widetilde{\Sigma}_{o}^{(i)}$ ’s from the $3$ -rd order moments $\widetilde{M}_{3}$ , the mixing weights $\widetilde{\omega}_{i}$ and the projected covariance matrices $\widetilde{\Sigma}_{o}^{(i)}$ ’s obtained in Step 2.

Step 4. Find the full covariance matrices

In Step 2, we obtained $\widetilde{\Sigma}^{(i)}_{o}$ , the covariance matrices projected to the subspace orthogonal to all the means. Note that they are equal to matrices $(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ projected to the same subspace. We claim that if we can find the span of these matrices ( $(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ ’s), we can get each matrix $(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ , and then subtracting the known rank-one component to find the covariance matrix $\widetilde{\Sigma}^{(i)}$ . This is similar to the idea of merging two projections of the same subspace in Step 1 (c) for the zero-mean case.

The idea of finding the desired span is to construct a $4$ -th order tensor:

which corresponds to the 4-th order moments of a mixture of zero-mean Gaussians with covariance matrices $\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}$ and the same mixing weights $\widetilde{\omega}_{i}$ ’s. Then we can then use Step 1 of the algorithm for the zero-mean case to obtain the span of the new covariance matrices, i.e. $span\{\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}:i\in[k]\}$ .

Conclusion

In this paper we give the first efficient algorithm for learning mixture of general Gaussians in the smoothed analysis setting. In the algorithm we developed new ways of extracting information from lower-order moment structure. This suggests that although the method of moments often involves solving systems of polynomial equations that are intractable in general, for natural models there is still hope of utilizing their special structure to obtain algebraic solution.

Smoothed analysis is a very useful way of avoiding degenerate examples in analyzing algorithms. In the analysis, we proved several new results for bounding the smallest singular values of structured random matrices. We believe the lemmas and techniques can be useful in more general settings.

Our algorithm uses only up to $6$ -th order moments. We conjecture that using higher order moments can reduce the number of dimension required to $n\geq\Omega(k^{1+\epsilon})$ , or maybe even $n\geq\Omega(k^{\epsilon})$ .

We thank Santosh Vempala for many insights and for help in earlier attempts at solving this problem.

References

Appendix A Moment Structures

In this section we characterize the structure of the 3-rd, 4-th and 6-th moments of Gaussians mixtures.

where $y^{(i)}$ corresponds to the $n$ -dimensional Gaussian distribution $\mathcal{N}(\mu^{(i)},\Sigma^{(i)})$ .

Gaussian distribution is a highly symmetric distribution, and in the zero-mean case the higher moments are well-understood by Isserlis’ Theorem:

Let $\mathbf{y}=(y_{1},\dots,y_{2t})$ be a multivariate Gaussian random vector with mean zero and covariance $\Sigma$ , then

where the summation is taken over all distinct ways of partitioning $y_{1},\dots,y_{2t}$ into $t$ pairs, which correspond to all the perfect matchings in a complete graph. Thus there are $(2t-1)!!$ terms in the sum, and each summand is a product of $t$ terms.

The non-zero mean case is a direct corollary using Isserlis’ Theorem and linearity of expectation.

Let $\mathbf{y}=(y_{1},\dots,y_{t})$ be a multivariate Gaussian random vector with mean $\mu$ and covariance $\Sigma$ , then

where the summation is taken over all distinct ways of partitioning $y_{1},\dots,y_{t}$ into $p$ pairs of $(u,v)$ and $s$ singletons of $(w)$ , where $p\geq 0$ , $s\geq 0$ and $2p+s=t$ .

We shall first prove Lemma 3.7 in Section 3.2. Recall that this lemma shows that for mixture of zero-mean Gaussians, the 4-th moments $\overline{M}_{4}$ and the 6-th moments $\overline{M}_{6}$ with distinct indices can be viewed as a linear projection of the unfolded moment $X_{4}$ and $X_{6}$ defined in (1).

By Isserlis Theorem A.1, the mapping $\sqrt{3}\mathcal{F}_{4}$ is characterized by: ( $\forall 1\leq j_{1}<j_{2}<j_{3}<j_{4}\leq n$ )

Therefore, with the normalization constant $\sqrt{3}$ , the $(j_{1},j_{2},j_{3},j_{4})$ -th mapping of $\mathcal{F}_{4}$ is a projection of the three elements in $X_{4}$ . Similarly, we have for $\sqrt{15}\mathcal{F}_{6}$ : ( $\forall 1\leq j_{1}<j_{2}<\dots<j_{6}\leq n$ )

Thus with the normalization constant $\sqrt{15}$ , the mapping $\mathcal{F}_{6}$ is a linear projection. ∎

A.2 Slices of Moments

Next we shall characterize the slices of the moments of mixture of Gaussians.

For mixture of zero-mean Gaussians, a one-dimensional slice of the 4th moment tensor is a vector in the span of corresponding columns of the covariance matrices:

For a mixture of zero-mean Gaussians, the one-dimensional slices of the 4-th moments $M_{4}$ are given by:

By the definition of multilinear map, $M_{4}(e_{j_{1}},e_{j_{2}},e_{j_{3}},I)$ is a vector whose $p$ -th entry is equal to $M_{4}(e_{j_{1}},e_{j_{2}},e_{j_{3}},e_{p})$ . We can compute this entry by Isserlis’ Theorem:

For mixture of zero-mean Gaussians, a two-dimensional slice of the 4th moment $M_{4}$ is a matrix, and it is a linear combination of the covariance matrices with some additive rank one matrices:

For a mixture of zero-mean Gaussians, the two-dimensional slices of the 4-th moment $M_{4}$ are given by:

Again this follows from Isserlis’ theorem. By definition of multilinear map this is a matrix whose $(p,q)$ -th entry is equal to

Similarly, for mixture of general Gaussians, we prove the following claims:

For a mixture of general Gaussians, the $(j_{1},j_{2},j_{3})$ -th one-dimensional slice of $M_{4}$ is given by:

This is very similar to Claim 5.1 and follows from the corollary of Isserlis’s theorem (Corollary A.2). There are 10 ways to partition the indices $\{j_{1},j_{2},j_{3},j_{4}\}$ into pairs and singletons: $((j_{1}),(j_{2}),(j_{3}),(j_{4}))$ , $((j_{1},j_{2}),(j_{3}),(j_{4}))$ , $((j_{1},j_{3}),(j_{2}),(j_{4}))$ , $((j_{1},j_{4}),(j_{2}),(j_{3}))$ , $((j_{2},j_{3}),(j_{1}),(j_{4}))$ , $((j_{2},j_{4}),(j_{1}),(j_{3}))$ , $((j_{3},j_{4}),(j_{1}),(j_{2}))$ , $((j_{1},j_{2}),(j_{3},j_{4}))$ , $((j_{1},j_{3}),(j_{2},j_{4}))$ , $((j_{1},j_{4}),(j_{2},j_{3}))$ . From this enumeration, we can specify each element in the vector of the one-dimensional slice. ∎

Since $M_{4}$ gives linear observations on the symmetric low rank matrix $X_{4}$ , it is natural to wonder whether we can use matrix completion techniques to recover $X_{4}$ from $M_{4}$ . Here we show this is impossible by giving a counter example: there are two mixture of Gaussians that generates the same 4th moment $M_{4}$ , but has different $X_{4}$ (even the span of $\Sigma^{(i)}$ ’s are different).

By $((a,b),(c,d))$ we denote a $5\times 5$ matrix $A$ which has $2$ ’s on diagonals, and the only nonzero off-diagonal entries are $A_{a,b}=A_{b,a}=A_{c,d}=A_{d,c}=1$ . For example, $((1,2),(4,5))$ will be the following matrix:

where all the missing entries are 0’s. Now we construct two mixtures of 3 Gaussians, all with mean 0 and weight $1/3$ . The covariance matrices are $((1,2),(4,5)),((1,3),(2,5)),((1,4),(3,5))$ for the first mixture and $((1,2),(3,5)),((1,3),(4,5)),((1,4),(2,5))$ for the second mixture. These are clearly different mixtures with different span of $\Sigma^{(i)}$ ’s: in the first mixture, $\Sigma^{(i)}_{1,2}=\Sigma^{(i)}_{4,5}$ for all matrices, but this is not true for the second mixture.

These two mixture of Gaussians have the same 4th moment $M_{4}$ . This can be checked by using Isserlis’ theorem to compute the moments. Intuitively, this is true because all the pairs $(1,i)$ and $(i,5)$ appeared exactly twice in the covariance matrices for both mixtures; also, every 4-tuple $(1,i,j,5)$ appeared exactly once in the covariance matrices for both mixtures.

Appendix B Step 1: Span Finding

Recall that in Step 1 of the algorithm for learning mixture of zero-mean Gaussians, we find the span of the covariance matrices in three small steps. In this section, we prove the correctness and the robustness of each step with smoothed analysis.

For completeness we restate each substep and highlight the key properties we need, followed by the detailed proofs.

In Step 1 (a), for any set $\mathcal{H}$ of size $\sqrt{n}$ , we want to show that the one-dimensional slices of $M_{4}$ span the entire subspace $\mathcal{S}=\text{span}\left\{\widetilde{\Sigma}^{(i)}_{[:,j]}:i\in[k],j\in\mathcal{H}\right\}$ , which is the span of a subset of the columns in the covariance matrices.

This in particular means when $j_{1},j_{2},j_{3}\in\mathcal{H}$ , the vector $\widetilde{M}_{4}(e_{j_{1}},e_{j_{2}},e_{j_{3}},I)$ is in $\mathcal{S}$ . We need to show that the columns of the matrix $Q$ indeed span the entire subspace $\mathcal{S}$ .

It is sufficient to show that a subset of the column span the entire subspace. Form a three-way even partition of the set $\mathcal{H}$ , i.e., $|\mathcal{H}^{(1)}|=|\mathcal{H}^{(2)}|=|\mathcal{H}^{(3)}|=|\mathcal{H}|/3=\sqrt{n}/3$ , and only consider the one-dimensional slices of $\widetilde{M}_{4}$ corresponding to the indices $j_{i}\in\mathcal{H}^{(i)}$ for $i=1,2,3$ . In particular, we define matrix $\widetilde{Q}_{S}$ with these one-dimensional slices of $\widetilde{M}_{4}$ :

Define matrix $\widetilde{P}_{S}$ with the corresponding columns of the covariance matrices, forming a basis (although not orthogonal) of the desired subspace $\mathcal{S}$ :

In the following two lemmas, we show that with high probability over the random perturbation, the column span of $\widetilde{Q}_{S}$ is exactly equal to the column span of $\widetilde{P}_{S}$ , and robustly so.

Given $\widetilde{M}_{4}$ , the exact 4-th order moment of the $\rho$ -smooth mixture of zero-mean Gaussians, for any subset $\mathcal{H}\in[n]$ with cardinality $|\mathcal{H}|=\sqrt{n}$ , let $\widetilde{Q}_{S}$ be the matrix defined as in (12) with the one-dimensional slices of $\widetilde{M}_{4}$ . For any $\epsilon>0$ , and for some absolute constant $C_{1},C_{2},C_{3}>0$ , with probability at least $1-(C_{1}\epsilon)^{C_{2}n}$ , the $k|\mathcal{H}|$ -th singular value of $\widetilde{Q}_{S}$ is bounded below by:

In order to prove this lemma, we first write $\widetilde{Q}_{S}$ as the product of three matrices.

Under the same assumptions of Lemma B.1, the matrix $\widetilde{Q}_{S}$ can be written as

With some fixed and known row permutation $\pi^{(2)}$ and $\pi^{(3)}$ , the matrix $\widetilde{B}^{(2)}$ and $\widetilde{B}^{(3)}$ can be made block diagonal with identical blocks equal to $\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(3)}}$ and $\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(2)}}$ , respectively. Note that the three parts $\widetilde{B}^{(1)},\widetilde{B}^{(2)},\widetilde{B}^{(3)}$ do not have any common entry, nor do they involve any diagonal entry of the covariance matrices, therefore the three parts are independent when the covariances are randomly perturbed in the smoothed analysis.

It is easier to understand the structure by picture, see Figure 3. The rows of the matrix should be indexed by $(j_{1},j_{2},j_{3})\in\mathcal{H}^{(1)}\times\mathcal{H}^{(2)}\times\mathcal{H}^{(3)}$ , which can also be interpreted as a cube (in the right). The block structure in the first part $\widetilde{B}^{(1)}$ correspond to a slice in $\mathcal{H}^{(2)}\times\mathcal{H}^{(3)}$ direction (for each block, the element in $\mathcal{H}^{(1)}$ is fixed, the elements in $\mathcal{H}^{(2)}$ and $\mathcal{H}^{(3)}$ take all possible values). Similarly for $\widetilde{B}^{(2)}$ and $\widetilde{B}^{(3)}$ (as shown in figure).

(of Claim B.2 ) The proof of this claim is using Claim 5.1, the definition of matrices and the rule of matrix multiplication. Consider the column in $\widetilde{Q}_{S}$ corresponding to the index $(j_{1},j_{2},j_{3})$ for $j_{1}\in\mathcal{H}^{(1)},j_{2}\in\mathcal{H}^{(2)},j_{3}\in\mathcal{H}^{(3)}$ , and the row of $\widetilde{B}_{S}$ together with the mixing wights specifies how this column is formed as a linear combination of $3k$ columns of $\widetilde{P}_{S}$ . By the structure of $M_{4}$ in Claim 5.1, the $(j_{1},j_{2},j_{3})$ -th row of $\widetilde{B}^{(1)}$ has exactly $k$ entries corresponding to $\widetilde{\Sigma}^{(i)}_{j_{2},j_{3}}$ for $i\in[k]$ , these entries are multiplied by $\widetilde{\omega}_{i}$ in the middle (diagonal) matrix. Therefore, these directly correspond to the $k$ terms in Claim 5.1. Similarly the entries in $\widetilde{B}^{(2)}$ and $\widetilde{B}^{(3)}$ correspond to the other $2k$ terms. ∎

Using Claim B.2, we need to bound the smallest singular value for each of the matrices in order to bound the $k|\mathcal{H}|$ -th singular value of $\widetilde{Q}_{S}$ , this is deferred to the end of this part. The most important tool is a corollary (Lemma G.16) of the random matrix result proved in Rudelson and Vershynin (2009), which gives a lowerbound on the smallest singular value of perturbed rectangular matrices.

By Lemma B.1, we know $\widetilde{Q}_{S}$ has exactly rank $k|\mathcal{H}|$ , and is robust in the sense that its $k|\mathcal{H}|$ -th singular value is large (polynomial in the amount of perturbation $\rho$ ). By standard matrix perturbation theory, if we get $\widehat{Q}_{S}$ close to $\widetilde{Q}_{S}$ up to a high accuracy (inverse polynomial in the relevant parameters), the top $k|\mathcal{H}|$ singular vectors will span a subspace that is very close to the span of $\widetilde{Q}_{S}$ . We formalize this in the following lemma.

Note that the columns of $S$ are the leading left singular vectors of $Q_{S}$ . We apply the standard matrix perturbation bound of singular vectors. Recall that $S$ is defined to be the first $k|\mathcal{H}|$ left singular vector of $Q_{S}$ , and we have

Therefore by Wedin’s Theorem (in particular the corollary Lemma G.5), we can conclude (16). ∎

We first use Claim B.2 to write $\widetilde{Q}_{S}=\widetilde{P}_{S}\left(D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|}\right)(\widetilde{B}_{S})^{\top}$ , note that the matrix $(D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|})$ has dimension $k|\mathcal{H}|\times k|\mathcal{H}|$ , therefore we just need to show with high probability each of the three factor matrix has large $k|\mathcal{H}|$ -th singular value, and that implies a bound on the $k|\mathcal{H}|$ -th singular value of $\widetilde{Q}_{S}$ by union bound. The smallest singular value of $\widetilde{P}_{S}$ and $\widetilde{B}_{S}$ are bounded below by the following two Claims.

With high probability $\sigma_{k|\mathcal{H}|}(\widetilde{P}_{S})\geq\Omega(\rho\sqrt{n})$ .

Now $\widetilde{P}^{\prime}_{S}$ is a randomly perturbed rectangular matrix, whose smallest singular value can be lower bounded using Lemma G.16, and we conclude that with probability at least $1-(C\epsilon)^{0.25n}$ ,

Next, we bound the smallest singular value of $\widetilde{B}_{S}$ .

With high probability $\sigma_{k|\mathcal{H}|}(\widetilde{B}_{S})\geq\Omega(\rho\sqrt{n})$ .

We make use of the special structure of the three blocks of $\widetilde{B}_{S}$ to lower bound its smallest singular value.

First, we prove that the block diagonal matrix $\widetilde{B}^{(1)}$ has large singular values, even after projecting to the orthogonal subspace of the column span of $\widetilde{B}^{(2)}$ and $\widetilde{B}^{(3)}$ . This idea appeared several times in our proof and is abstracted in Lemma G.12. Apply the lemma and we have:

where the $j$ -th block of $[\widetilde{B}^{(2)},\widetilde{B}^{(3)}]$ has dimension $(|\mathcal{H}|/3)^{2}\times 2k|\mathcal{H}|/3$ . Since

this means for each block, even after projection it has more than $3k$ rows. Note that by definition the three blocks $\widetilde{B}^{(1)}$ , $\widetilde{B}^{(2)}$ and $\widetilde{B}^{(3)}$ are independent and do not involve any diagonal elements of the covariance matrices, so each block after the two projections is again a rectangular random matrix. We can apply Lemma G.15, for any $j$ , for some absolute constant $C_{1},C_{2},C_{3}$ (not fixed throughout the discussion), with probability at least $1-(C_{1}\epsilon)^{C_{2}n}$ over the randomness of $\widetilde{\Sigma}_{\mathcal{H}^{(2)},\mathcal{H}^{(3)}}$ , we have:

Now we can take a union bound over the blocks and conclude that with high probability, the smallest singular value of each block is large.

In order to bound $\sigma_{k(2|\mathcal{H}|/3)}([\widetilde{B}^{(2)},\widetilde{B}^{(3)}])$ , we use the same strategy. Note that $\widetilde{B}^{(2)}$ also has a block structure that corresponds to the $\mathcal{H}^{(1)}\times\mathcal{H}^{(3)}$ faces (see Figure 3). Again check the condition on dimension $(|\mathcal{H}|/3)^{2}-k-k|\mathcal{H}|/3\geq\Omega(n)>3k$ , we can apply Lemma G.12 again to show that for any $j$ , with probability at least $1-(C_{1}\epsilon)^{C_{2}n}$ over the randomness of $\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(3)}}$ , we have:

Again by Lemma G.15, for any $j$ , with probability at least $1-(C_{1}\epsilon)^{C_{2}n}$ over the randomness of $\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(3)}}$ , we have:

Finally, for $\widetilde{B}^{(3)}$ it is a block diagonal structure with blocks correspond to $\mathcal{H}^{(1)}\times\mathcal{H}^{(2)}$ faces (see Figure 3). Each block is a perturbed rectangular matrix, therefore we apply Lemma G.15 to have that with high probability over the randomness of $\widetilde{\Sigma}_{\mathcal{H}^{(1)},\mathcal{H}^{(2)}}$ ,

Now plug in the lower bounds in (18) (20) (21) into the inequalities in (17) and (19). By union bound we conclude that with high probability:

Finally, the diagonal matrix in the middle is given by the Kronecker product of $I_{|\mathcal{H}|}$ and $D_{\widetilde{\omega}}$ . Recall that $D_{\widetilde{\omega}}$ is the diagonal matrix with the mixing weights $\widetilde{\omega}_{i}$ ’s on its diagonal. By property of Kronecker product and the assumption on the mixing weights, the smallest diagonal element of $D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|}$ is at least $\omega_{0}$ . Therefore $\sigma_{k|\mathcal{H}|}(D_{\widetilde{\omega}}\otimes_{kr}I_{|\mathcal{H}|})\geq\omega_{0}$ .

We have shown that the smallest singular value of all the three factor matrices are large with high probability. Therefore, apply union bound, we conclude that with probability at least $1-\exp(-\Omega(n))$ ,

In Step 1 (b), given the subset of indices $\mathcal{H}$ and the subspace $\mathcal{S}$ obtained in Step 1 (a), we want to show that the projected two-dimensional slices of $\widetilde{M}_{4}$ span the subspace $\mathcal{U}_{S}$ defined in (5), which is the span of the covariance matrices with the columns projected the subspace $\mathcal{S}^{\perp}$ :

Recall that in Claim 5.6, we characterized the two dimensional slices of the 4-th moments $M_{4}$ of mixture of zero-mean Gaussians as below:

In the following claim, we show that the columns of $\widetilde{Q}_{U_{S}}$ indeed lie in the column span of $\widetilde{P}_{U_{S}}$ :

Given $S$ obtained in Step 1(a), the span of $\widetilde{\Sigma}_{[:,j]}^{(i)}$ for $j\in\mathcal{H}$ and for all $i$ , then for $j_{1},j_{2}\in\mathcal{H}$ , we have:

Similar as in Step 1(a), in the next lemma we show that the columns of $\widetilde{Q}_{U_{S}}$ indeed span the entire column span of $\widetilde{P}_{U_{S}}$ . Since the dimension of the column span of $\widetilde{P}_{U_{S}}$ is no larger than $k$ , it is enough to the $k$ -th singular value of $\widetilde{Q}_{U_{S}}$ :

Given $\widetilde{M}_{4}$ , the exact 4-th order moment of the $\rho$ -smooth mixture of Gaussians , define the matrix $\widetilde{Q}_{U_{S}}$ as in (23) with the two-dimensional slices of $\widetilde{M}_{4}$ . For any $\epsilon>0$ , and for some absolute constant $C_{1},C_{2},C_{3}>0$ , with probability at least $1-2(C_{1}\epsilon)^{C_{2}n}$ , the $k$ -th singular value of $\widetilde{Q}_{U_{S}}$ is bounded below by:

Similar as before, we first examine the structure of the matrix $\widetilde{Q}_{U_{S}}$ :

Under the same assumption as Lemma B.7, we can write $\widetilde{Q}_{U_{S}}$ in the following matrix product form:

This claim follows from Claim B.6, and the rule of matrix product. The coefficients $\widetilde{\omega}_{i}\widetilde{\Sigma}_{j_{1},j_{2}}^{(i)}$ for the linear combinations of $\text{vec}(\text{Proj}_{\mathcal{S}^{\perp}}\widetilde{\Sigma}^{(i)})$ are given by the columns of the product $D_{\widetilde{\omega}}\widetilde{\Sigma}_{J}^{\top}$ . The coefficients are then multiplied by $\widetilde{P}_{U_{S}}$ to select the correct columns. ∎

To prove Lemma B.7, similar to the proof ideas of Lemma B.1, we lower bound the $k$ -th singular value of all the three factors.

By the structural Claim B.8, we know the matrix $\widetilde{Q}_{U_{S}}$ can be written as a product of the three matrices as $\widetilde{Q}_{U_{S}}=\widetilde{P}_{U_{S}}D_{\widetilde{\omega}}{\widetilde{\Sigma}_{J}}^{\top}$ .

We lower bound the $k$ -th singular value of each of the three factors. It is easy for the last two matrices. Note that by assumption $\sigma_{k}(D_{\widetilde{\omega}})\geq\omega_{o}$ , and since ${\widetilde{\Sigma}_{J}}^{\top}$ is just a perturbed rectangular matrix, we can apply Lemma G.15 and with high probability we have $\sigma_{k}({\widetilde{\Sigma}_{J}})\geq\Omega(\rho\sqrt{n})$ .

However, we cannot apply the same trick to directly bound the smallest singular value of $D_{S^{\perp}}$ and $\text{Proj}_{D_{S^{\perp}}}\widetilde{\Sigma}$ separately. The problem here is that $D_{\mathcal{S}^{\perp}}$ and $\widetilde{\Sigma}$ are not independent, as the subspace $S$ obtained in Step 1(a) also depends on the perturbation on $\widetilde{\Sigma}$ , therefore $\text{Proj}_{D_{S^{\perp}}}\widetilde{\Sigma}$ is not simply a projected perturbed matrix. Instead, we show that even conditioned on the part of randomness that is common in $S$ and $\widetilde{\Sigma}$ , $\widetilde{\Sigma}$ still has sufficient randomness due to the high dimensions, and we can still extract a tall random matrix out of it. This is elaborated in the following claim:

Under the assumptions of Lemma B.7, with high probability the matrix $\widetilde{P}_{U_{S}}=D_{\mathcal{S}^{\perp}}\widetilde{\Sigma}$ has smallest singular value at least $\Omega(\rho n)$ .

Let $\mathcal{L}$ be the set of the $(j_{1},j_{2})$ -th entries of $\widetilde{\Sigma}^{(i)}$ for all $i$ and one of $j_{1},j_{2}$ is in the set $\mathcal{H}$ . By Step 1(a), the subspace $\mathcal{S}^{\prime}=\text{span}(S,e_{j}:j\in\mathcal{H})$ is only dependent on the entries in $\mathcal{L}$ . Here we need to include the span of $e_{j}$ ’s for $j\in\mathcal{H}$ because the diagonal entries can depend on the other random perturbations. By adding the span of the vector $e_{j}$ ’s for $j\in\mathcal{H}$ the subspace remains invariant no matter how the diagonal entries change.

Let $\mathcal{Z}=\text{span}(\Sigma,S^{\prime}\otimes_{kr}I_{n})$ , and recall that the columns of $\Sigma$ are the factorization of the unperturbed covariance matrices. The subspace $\mathcal{Z}$ has dimension no larger than $|\mathcal{H}|(k+1)n+k\leq n^{2}/10$ , and depends on the randomness of $\mathcal{L}$ .

Let $\widetilde{\Sigma}=\Sigma+E$ where $E$ is the random perturbation matrix. Now we condition on the randomness in $\mathcal{L}$ . By definition the subspace $\mathcal{Z}$ is deterministic conditional on $\mathcal{L}$ . However, even if we only consider entries of $E\backslash\mathcal{L}$ there are still at least ${n-k|\mathcal{H}|\choose 2}\geq n^{2}/4$ independent random variables. We shall show the randomness is enough to guarantee that the smallest singular value of $\text{Proj}_{D_{\mathcal{S}^{\perp}}}\widetilde{\Sigma}$ is lower bounded with high probability conditioned on $\mathcal{L}$ :

Here we used the fact that projection to a subspace cannot increase the singular values (Lemma G.11).

Conditioned on the randomness of entries in $\mathcal{L}$ , $E\backslash\mathcal{L}$ still has at least $n^{2}/4$ random directions, while the dimension of the deterministic subspace $\mathcal{Z}$ is at most $n^{2}/10$ . Therefore we can apply Lemma G.15 again to argue that conditionally, for every $\epsilon>0$ , with probability at least $1-(C_{1}\epsilon)^{C_{2}n^{2}}$ we have:

In summary, apply union bound and we can conclude that with probability at least $1-(C_{1}\epsilon)^{C_{2}n}$ ,

Next, we again use matrix perturbation bounds to prove the robustness of this step, which depends on the singular value decomposition of the matrix $\widetilde{Q}_{U_{S}}$ .

Given the empirical 4-th order moments $\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}$ , and given the output $\text{Proj}_{\widehat{S}^{\perp}}$ from Step 1 (a). Suppose that $\|\text{Proj}_{\widehat{S}^{\perp}}-\text{Proj}_{\widetilde{S}^{\perp}}\|\leq\delta_{1}$ , and suppose that the absolute value of entries of $E_{4}$ are at most $\delta_{2}$ for $\delta_{2}\leq\|\widetilde{Q}_{U_{S}}\|_{F}/\sqrt{n^{3}}$ . Conditioned on the high probability event $\sigma_{k}(\widetilde{Q}_{U_{S}})>0$ , we have:

Proof of Lemma B.10

Note that the columns of $U_{S}$ are the leading left singular vectors of $\widetilde{Q}_{U_{S}}$ . We want to apply the perturbation bound of singular vectors.

Similar to the proof of Lemma B.3, we first need to bound the spectral distance between $\widehat{Q}_{U_{S}}$ and $\widetilde{Q}_{U_{S}}$ . In fact we will even bound the Frobenius norm difference:

where we used the assumption $\|\widetilde{\Sigma}^{(i)}\|\leq 1$ to bound $\|\widetilde{Q}_{U_{0}}\|_{F}$ , used the upperbound on $\|\widehat{Q}_{U_{0}}-\widetilde{Q}_{U_{0}}\|_{F}$ to bound the term $\|(\widehat{D}_{S^{\perp}}-\widetilde{D}_{S^{\perp}})(\widehat{Q}_{U_{0}}-\widetilde{Q}_{U_{0}})\|_{F}\leq\|(\widehat{D}_{S^{\perp}}-\widetilde{D}_{S^{\perp}})\|_{F}\delta_{2}\sqrt{n^{2}|\mathcal{J}|}\leq\|(\widehat{D}_{S^{\perp}}-\widetilde{D}_{S^{\perp}})\|_{F}\|\widetilde{Q}_{U_{0}}\|_{F}$ , and used the fact that Frobenius norm is sub-multiplicative. Apply Wedin’s Theorem (in particular the corollary Lemma G.5), we can conclude (26). ∎

B.3 Step 1 (c). Finding 𝒰𝒰\mathcal{U} by Merging the Two Projected Span

Pick two disjoint sets of indices $\mathcal{H}_{1},\mathcal{H}_{2}$ , and repeat Step 1 (a) and Step 1 (b) on each of them to get $\widetilde{S}_{j}^{\perp}$ and $\widetilde{U}_{j}$ for $j=1,2$ . In Step 1 (c), we merge the two span $\widetilde{U}_{1}$ and $\widetilde{U}_{2}$ to get $\mathcal{U}$ .

If we are given two projections $\text{Proj}_{S_{1}^{\perp}}U$ and $\text{Proj}_{S_{2}^{\perp}}U$ of a matrix $U$ , and if the union of the two subspaces $S_{1}^{\perp}$ and $S_{2}^{\perp}$ have full rank, namely $\text{dim}(S_{1}\cup S_{2})=n$ , then we can recover $U$ by:

However, it is slightly different if we are given two projections of a subspace $\mathcal{U}$ , since a subspace can be equivalently represented by different orthonormal basis up to linear transformation.

In particular, in our setting for $j=1,2,$ we can write $\widetilde{U}_{j}=(\text{Proj}_{S_{j}^{\perp}}\otimes_{kr}I_{n})\widetilde{\Sigma}W_{j}$ for some fixed but unknown full rank matrix $W_{j}$ (which makes the columns of matrix $\widetilde{\Sigma}W_{j}$ an orthonormal basis of $\mathcal{U}$ ). Recall that we define $\widetilde{\Sigma}\equiv[\text{vec}(\widetilde{\Sigma}^{(i)}):i\in[k]]$ , and $D_{S_{j}^{\perp}}\equiv\text{Proj}_{S_{j}^{\perp}}\otimes_{kr}I_{n}$ for $j=1,2$ .

The following Lemma shows that we can still robustly recover the subspace $\mathcal{U}$ if the two projections have sufficiently large overlapping. The basic idea is to use the overlapping part to align the two basis of the subspace which the two projections act on.

This is the detailed statement of Condition 5.10.

For two $s$ -dimensional known subspaces $S_{1}$ and $S_{2}$ , Let the columns of $A$ be the first $2s$ singular vectors of $[S_{1},S_{2}]$ , and let the columns of $S_{3}$ correspond to the first $(n-2s)$ singular vectors of $(I_{n}-\text{Proj}_{A})$ , therefore $S_{3}\subset(S_{1}\cup S_{2})^{\perp}$ . Suppose that $\sigma_{k}(\text{Proj}_{S_{3}}U)>0$ and that $\sigma_{2s}([S_{1},S_{2}])>0$ . Define matrices $U_{1}=\text{Proj}_{S_{1}^{\perp}}V_{1}$ and $U_{2}=\text{Proj}_{S_{2}^{\perp}}V_{2}$ and we know that $U_{1}^{\top}U_{1}=U_{2}^{\top}U_{2}=I_{k}$ .

We are given $\widehat{S}_{1},\widehat{S}_{2}$ and $\widehat{U}_{1},\widehat{U}_{2}$ , and suppose that for $j=1,2,$ we have $\|\widehat{S}_{j}-S_{j}\|_{F}\leq\delta_{s}$ and $\|\widehat{U}_{j}-U_{j}\|_{F}\leq\delta_{u}$ , for $\delta_{s}\leq 1,\delta_{u}\leq 1$ .

If $\sigma_{k}(\text{Proj}_{S_{3}}U)>0$ and $\sigma_{2s}([S_{1},S_{2}])>0$ , then for some absolute constant $C$ we have:

The proof will proceed in two steps, we first show that if we are given the exact inputs, namely $\delta_{s}=\delta_{u}=0$ , then the column span of $\widehat{U}$ defined in (28) is identical to the desired subspace $\mathcal{U}$ . Then we give a stability result using matrix perturbation bounds.

1. Solving the problem using exact inputs.

Given the exact inputs $S_{1},S_{2}$ , $U_{1},U_{2}$ , first we show that under the conditions $\sigma_{2s}([S_{1},S_{2}])>0$ and $\sigma_{k}(\text{Proj}_{S_{3}}U)>0$ , then the column span of the matrix $\left[U_{2},\ U_{1}(S_{3}^{\top}U_{1})^{\dagger}(S_{3}^{\top}U_{2})\right]$ is indeed identical to $\mathcal{U}=span(V_{1})=span(V_{2})$ .

Given $V=V_{1}^{\dagger}V_{2}$ , then $U_{1}V=\text{Proj}_{S_{1}^{\perp}}V_{1}V=\text{Proj}_{S_{1}^{\perp}}V_{2}$ . Recall that by definition $U_{2}=\text{Proj}_{S_{2}^{\perp}}V_{2}$ , then the problem is now reduced to the simple problem of merging two projections ( $U_{2}=\text{Proj}_{S_{2}^{\perp}}V_{2}$ and $U_{1}V=\text{Proj}_{S_{1}^{\perp}}V_{2}$ ) of the same matrix ( $V_{2}$ ). Therefore, to show that the columns of $U_{0}=[U_{2},U_{1}V]$ indeed span $V_{2}$ and thus the desired subspace $U$ , we only need to show that $[\text{Proj}_{S_{1}^{\perp}},\text{Proj}_{S_{2}^{\perp}}]$ has full column span. We show this by bounding the smallest singular value of it:

where the last inequality is by the assumption that $\sigma_{2s}([S_{1},S_{2}])>0$ . ∎

Next, we show that in the exact case, the matrix $V=V_{1}^{{\dagger}}V_{2}$ can be computed by $V=(S_{3}^{\top}U_{1})^{\dagger}(S_{3}^{\top}U_{2})$ . The basic idea is to use the overlapping part of the two projections $U_{1}$ and $U_{2}$ to align the two basis $V_{1}$ and $V_{2}$ . Recall that by its construction, $S_{3}=(S_{1}\cup S_{2})^{\perp}=S_{1}^{\perp}\cap S_{2}^{\perp}$ , and $\text{Proj}_{S_{3}}=\text{Proj}_{S_{1}^{\perp}\cap S_{2}^{\perp}}$ . Then for $j=1$ and $2$ , we have:

Moreover, since $U_{j}=\text{Proj}_{S_{j}^{\perp}}V_{j}$ is an orthonormal matrix, we have that all singular values of $V_{j}$ are equal or greater than 1. Also note that $U$ is an orthonormal matrix, so we have that $\sigma_{k}(\text{Proj}_{S_{3}}V_{j})\geq\sigma_{k}(\text{Proj}_{S_{3}}U)>0$ . In other words, $S_{3}^{\top}V_{j}$ has full column rank $k$ . Therefore,

Given $\widehat{S}_{1},\widehat{S}_{2}$ and $\widehat{U}_{1},\widehat{U}_{2}$ which are close to the exact $S_{1},S_{2},U_{1}$ and $U_{2}$ , we then need to bound the distance $\|\text{Proj}_{\widehat{U}}-\text{Proj}_{U}\|$ . This follows the standard perturbation analysis. In order to apply Lemma G.5 we need to bound the distance between $\|\widehat{U}-U_{0}\|_{F}$ , and lower bound the smallest singular value of $U_{0}$ , namely $\sigma_{k}(U_{0})$ . Recall that we define $U_{0}$ same as in (28) for the exact case with $\delta_{s}=\delta_{u}=0$ .

First, we bound $\|\widehat{U}-U_{0}\|_{F}$ . Note that we can write $U_{0}^{\top}$ as $U_{0}^{\top}=U_{2}B$ , where $B=[I,\quad U_{1}(S_{3}^{\top}U_{1})^{{\dagger}}S_{3}]^{\top}$ .

Recall that $S_{3}=(S_{1}\cup S_{2})^{\perp}$ , apply Lemma G.5 and we have:

Next, note that $\|{\widehat{S}_{3}}-{S_{3}}\|<1$ and $\|\widehat{U}_{1}-U_{1}\|\leq\delta_{u}<1$ , apply Lemma G.6 we have:

Next, note that $\sigma_{k}(S_{3}^{\top}U_{1})=\sigma_{k}(\text{Proj}_{S_{3}}V_{1})>0$ by assumption. Apply Lemma G.8, we have:

Next, apply Lemma G.6 again we can bound the perturbation of matrix product:

where $C$ is some absolute constant, and the last inequality summarizes the previous three inequalities, and used the fact that $\sigma_{k}(\text{Proj}_{S_{3}}V_{1})<1$ . Note that $\|\widehat{U}-U_{0}\|_{F}\leq\sqrt{k}\|\widehat{U}-U_{0}\|$ .

We are left to bound $\sigma_{k}(U_{0})$ . Recall that $\sigma_{k}(V_{2})\geq\sigma_{k}(U_{2})=1$ , and we have shown that in the exact case $U_{0}=[\text{Proj}_{S_{2}^{\perp}}V_{2},\quad\text{Proj}_{S_{1}^{\perp}}V_{2}]$ . Then we can bound the smallest singular value of $U_{0}$ following the inequality in (47):

Finally we can apply Lemma G.5 to bound the distance between the projections by:

In Step 1 (c), we are given the output $\widetilde{U}_{1}$ and $\widetilde{U}_{2}$ from Step 1 (b), as well as the output $\widetilde{S}_{1}^{\perp}$ and $\widetilde{S}_{2}^{\perp}$ from Step 1 (a). Recall that $\mathcal{U}=span\{\text{vec}(\widetilde{\Sigma}^{(i)}):i\in[k]\}$ , and for $j=1,2$ , the matrix $\widetilde{U}_{j}$ given by Step 1 (b) corresponds to the subspace $\mathcal{U}$ projected to the subspace $\widetilde{B}_{j}=\widetilde{S}_{j}^{\perp}\otimes_{kr}I_{n}$ .

Let matrix $\widetilde{S}_{3}=\widetilde{S}_{1}^{\perp}\cap\widetilde{S}_{2}^{\perp}=(\widetilde{S}_{1}\cup\widetilde{S}_{2})^{\perp}$ (obtained by taking the singular vectors of $(I_{n}-AA^{\top})$ , where $A$ corresponds to the first $2k|\mathcal{H}|$ singular vectors of $[\widetilde{S}_{1},\widetilde{S}_{2}]$ ), and denote $\widetilde{B}_{3}=\widetilde{S}_{3}\otimes_{kr}I_{n}$ . Define the matrix $\widetilde{Q}_{U}$ to be:

and similarly define the perturbed version $\widehat{Q}_{U}$ to be:

Now we want to apply Lemma B.11 to show that $\text{Proj}_{\widetilde{Q}_{U}}=\text{Proj}_{\widetilde{\Sigma}}$ and bound the distance $\|\text{Proj}_{\widehat{Q}_{U}}-\text{Proj}_{\widetilde{\Sigma}}\|$ . In order to use the lemma, we first use smoothed analysis to show (in Lemma B.13 and Lemma B.14 )that the conditions required by the lemma are all satisfied with high probability over the $\rho$ -perturbation of the covariance matrices, then conclude the robustness of Step 1 (c) in Lemma B.15.

With high probability, for some constant $C$

This is in fact exactly the same as Claim B.9.

Given ${\widetilde{\Sigma}}=\Sigma+E$ , by the definition of $\widetilde{S}_{3}$ and $\widetilde{B}_{3}$ we know that $\widetilde{B}_{3}$ only depends on the randomness of $P_{J}E$ for $i=1,2$ , where

and $P_{J}$ denotes the mapping that only keeps the coordinates corresponding to the set $\mathcal{J}$ . Therefore, we have:

Note that the rank of $\widetilde{B}_{3}^{\perp}$ is $2nk|\mathcal{H}|)$ and $|\mathcal{J}|=2n|\mathcal{H}|$ , thus $n_{2}-|\mathcal{J}|-2nk|\mathcal{H}|-k=\Omega(n^{2})>2k$ . So we can apply Lemma G.15 to conclude that for some absolute constants $C_{1},C_{2},C_{3}$ , with probability at least $1-(C_{1}\epsilon)^{C_{2}n^{2}}$ , $\sigma_{k}(\widetilde{B}_{3}^{\top}{\widetilde{\Sigma}})\geq\epsilon\rho\sqrt{C_{3}n^{2}}.$ ∎

With high probability, for some constant $C$ ,

For $i=1,2$ , recall that $\widetilde{S}_{i}$ is the singular vectors of $\widetilde{Q}_{S_{i}}$ , where $\widetilde{Q}_{S_{i}}$ is defined with the set $\mathcal{H}_{i}$ as in (12). We can write the singular value decomposition of $\widetilde{Q}_{S_{i}}$ as $\widetilde{Q}_{S_{i}}=\widetilde{S}_{i}\widetilde{D}_{i}\widetilde{V}_{i}^{\top}$ for some diagonal matrix $\widetilde{D}_{i}$ and orthonormal matrix $\widetilde{V}_{i}$ , and

Note that we can write $[\widetilde{Q}_{S_{1}},\widetilde{Q}_{S_{2}}]=[\widetilde{P}_{S_{1}},\widetilde{P}_{S_{2}}](\text{diag}(B_{\widetilde{S}_{1}},B_{\widetilde{S}_{2}}))^{\top}$ , and following almost exactly with the proof of Lemma B.1, we can argue that, with probability at least $1-(C_{1}\epsilon)^{C_{2}n}$ ,

Moreover, by the structure of $M_{4}$ and the bounds on $\widetilde{\Sigma}^{(i)}\prec{1\over 2}I$ , we can bound $\|\widetilde{Q}_{S_{i}}\|\leq 3\sqrt{n(|\mathcal{H}|/3)^{3}}$ , and thus:

Therefore, we can conclude that, for some absolute constant $C$ , we have:

In the next lemma, we apply Lemma B.11 to show that under perturbation, with high probability the column span of $\text{Proj}_{\widetilde{Q}_{U}}=\text{Proj}_{\widetilde{\Sigma}}$ and this step is robust.

Note that $\sigma_{2k|\mathcal{H}|n}([\widetilde{B}_{1},\widetilde{B}_{2}])=\sigma_{2k|\mathcal{H}|}([\widetilde{S}_{1},\widetilde{S}_{2}])$ , and for $i=1,2$ , we have $\|\widehat{B}_{i}-\widetilde{B}_{i}\|_{F}\leq\sqrt{n}\|\widehat{S}_{i}-\widetilde{S}_{i}\|_{F}\leq\sqrt{n}\delta_{s}$ . Therefore, with the above two smoothed analysis Lemmas showing polynomial bound of $\sigma_{2k|\mathcal{H}|}([\widetilde{S}_{1},\widetilde{S}_{2}])$ and $\sigma_{k}(\text{Proj}_{\widetilde{B}_{3}}(\widetilde{\Sigma}))$ , the proof of Lemma B.15 follows by applying Lemma B.11.

Appendix C Step 2. Unfolding the Moments

In the second step of the algorithm, we solve two systems of linear equations to recover the unfolded moments.

Recall the first system of linear equations is

Since $U$ is the column span matrix of $\widetilde{\Sigma}$ , there must exist a $Y_{4}$ such that $X_{4}=\widetilde{\Sigma}D_{\widetilde{\omega}}\widetilde{\Sigma}^{\top}=UY_{4}U^{\top}$ (recall that $D_{\widetilde{\omega}}$ is the diagonal matrix with entries $\widetilde{\omega}_{i}$ ), so the system must have at least one solution.

The main theorem of this section shows that with high probability over the $\rho$ -perturbation the system has a unique solution:

With high probability over the $\rho$ -perturbation of $\widetilde{\Sigma}$ , the smallest singular value of the coefficient matrix $\widetilde{H}_{4}$ is lower bounded by $\sigma_{min}(\widetilde{H}_{4})\geq\Omega(\rho^{2}n/k)$ . As a corollary, the system has a unique solution.

In order to prove this theorem, we first need the following structural lemma:

Next we need to prove the bounds on the smallest singular values for $\widetilde{A}_{4}$ and $\widetilde{B}_{4}$ . The first matrix $\widetilde{A}_{4}$ is essentially a projection of the Kronecker product $(\widetilde{\Sigma}\otimes_{kr}\widetilde{\Sigma})$ . In particular, this projection satisfy the “symmetric off-diagonal” property defined below:

Since symmetric off-diagonal is a property on the structure of rows of the basis $P$ . If one basis of the subspace $\mathcal{P}$ is symmetric off-diagonal, then any basis is too. Moreover, any orthogonal basis of the subspace $\mathcal{P}$ will still be symmetric off-diagonal.

In the following main lemma, we show even after projection to any symmetric off-diagonal space with sufficiently many dimensions, the “unique” columns of a Kronecker product of random matrices still has good condition number.

Let us first see how Theorem C.1 follows from the two lemmas (Lemma C.2 and Lemma C.5 ).

(of Theorem C.1) Using the structural Lemma C.2, we know we only need to bound the smallest singular value of $\widetilde{A}_{4}$ and $\widetilde{B}_{4}$ separately. The following two claims directly imply the theorem.

$\sigma_{min}(\widetilde{A}_{4})\geq\Omega(\rho^{2}n_{2}).$

$\sigma_{min}(\widetilde{B}_{4})\geq 1/(4\|\widetilde{\Sigma}\|^{2})\geq 1/(4nk).$

We apply Lemma C.5 to prove Claim C.6. Note that the $\rho$ -perturbed covariances $\widetilde{\Sigma}$ is not a random Gaussian matrix, yet it is equal to the unperturbed matrix $\Sigma$ plus a random Gaussian matrix $E_{\Sigma}=\rho E$ Note that the diagonal entries are then arbitrarily perturbed, but we will project on a symmetric off-diagonal subspace so changes on diagonal entries do not change the result.. Since we consider arbitrary $\Sigma$ , the columns of $\widetilde{\Sigma}$ as well as the columns $\widetilde{A}_{4}$ may not be incoherent.

Instead, we project $\widetilde{A}_{4}$ to a subspace to strip away the terms involving the original matrix $\Sigma$ . Let $S$ be the range space corresponding to the projection $\mathcal{F}_{4}$ . Recall that $|S|=n_{4}=\Omega(n_{2}^{2})$ , and by the definition of $\mathcal{F}_{4}$ , $S$ is symmetric off-diagonal. Define the subspace $S^{\prime}=\mbox{span}(S^{\perp},\Sigma\otimes_{kr}I_{n_{2}},I_{n_{2}}\otimes_{kr}\Sigma)$ . Let $P=(S^{\prime})^{\perp}$ . By construction $|P|\geq|S|-2kn_{2}=\Omega(n_{2}^{2})$ . Also, since $P=(S^{\prime})^{\perp}$ is a subspace of $S$ , it must also be symmetric off-diagonal (see Remark C.4). After projecting $\widetilde{A}_{4}$ to $P$ , we know that the $(i,j)$ -th column $(1\leq i\leq j\leq k)$ of $P^{\top}\widetilde{A}_{4}$ is given by:

Thus in $P^{\top}\widetilde{A}_{4}$ all the terms involving $\Sigma$ disappears. Therefore

where the first inequality is because the smallest singular value cannot become larger after projection, the first equality is by definition, the second equality is by the property of $P$ , and the final step uses Lemma C.5Note that although diagonal entries are not perturbed, we also have $P_{[i,i]}=0$ so we can still apply the lemma..

In this part we prove the structural Lemma C.2.

(of Lemma C.2) First, assume we know the true $\widetilde{\Sigma}$ matrix, then in order to get the unfolded moments $X_{4}$ , we only need to solve the equation $\mathcal{F}_{4}(\widetilde{\Sigma}D_{4}\widetilde{\Sigma}^{\top})=\overline{M}_{4}$ with the $k\times k$ symmetric variable $D_{4}$ , and the solution should be equal to the diagonal matrix $D_{\widetilde{\omega}}$ .

However, we only know $U$ which is the column span of $\widetilde{\Sigma}$ , so we can only use $UY_{4}U^{\top}$ and let $UY_{4}U^{\top}=\widetilde{\Sigma}D_{4}\widetilde{\Sigma}^{\top}$ . Note that there is a one-to-one correspondence between $Y_{4}$ and $D_{4}$ . In particular we know $D_{4}=(\widetilde{\Sigma}^{\dagger}U)Y_{4}(\widetilde{\Sigma}^{\dagger}U)^{\top}$ , this is exactly the second part $\widetilde{B}_{4}$ .

Now the first matrix $\widetilde{A}_{4}$ should map $\mbox{vec}(D_{4})$ to $M_{4}$ . By construction, the $(i,j)$ -th column $(i<j)$ of $\widetilde{A}_{4}$ is equal to $\mathcal{F}_{4}(\widetilde{\Sigma}^{(i)}\odot\widetilde{\Sigma}^{(j)}+\widetilde{\Sigma}^{(j)}\odot\widetilde{\Sigma}^{(i)})=2\mathcal{F}_{4}(\widetilde{\Sigma}^{(i)}\odot\widetilde{\Sigma}^{(j)})$ , since $\mathcal{F}_{4}$ is symmetric off-diagonal we know $\mathcal{F}_{4}(v_{1}\odot v_{2})=\mathcal{F}_{4}(v_{2}\odot v_{1})$ for any two vectors $v_{1},v_{2}$ . For the $(i,i)$ -th column, by construction they are equal to $\mathcal{F}_{4}(\widetilde{\Sigma}^{(i)}\odot\widetilde{\Sigma}^{(i)})$ as we wanted. ∎

Main Lemma on Projection of Kronecker Product

The singular values of Kronecker Product between two matrices are well-understood: they are just the products of the singular values of the two matrices. Therefore, the Kronecker product of two rank $k$ matrices will have rank $k^{2}$ . However, in our case the problem becomes more complicated because we only look at a projection of the resulting matrix. The projected Kronecker product may no longer have rank $k^{2}$ because of symmetry. Here we are able to show that even with projection to a low dimensional space, the rank of the new matrix is still as large as ${k+1\choose 2}$ .

The basic idea of the proof is to consider the inner-products between columns, and show that the columns are incoherent even after projection.

(of Lemma C.5) Consider the matrix $(E\otimes_{kr}E)_{uniq}^{\top}PP^{\top}(E\otimes_{kr}E)_{uniq}$ , we shall show the matrix is diagonally dominant and hence its smallest singular value must be large. In order to do that we need to prove the following two claims:

For any $i,j\leq k$ , $i\leq j$ , with high probability $\|P^{\top}(E_{[:,i]}\odot E_{[:,j]})\|^{2}\geq\Omega(n_{2}^{2}).$

For any $i,j\leq k$ , $i\leq j$ , with high probability

With this two claims, we can apply Gershgorin’s Disk Theorem G.9 to conclude that $\sigma_{min}((E\otimes_{kr}E)_{uniq}^{\top}PP^{\top}(E\otimes_{kr}E)_{uniq})\geq\Omega(n_{2}^{2})$ . Therefore $\sigma_{min}(P^{\top}(E\otimes_{kr}E)_{uniq})\geq\Omega(n_{2})$ .

Now we prove the two claims. For Claim C.8, it essentially says the projection of a random vector to a fixed subspace should have large norm. If the vector has independent entries, this is first shown in Tao and Vu (2006). Recently Vu and Wang (2013) generalized the result to $K$ -concentrated vectors, see Lemma G.18. By Lemma G.19 we know conditioned on $\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}$ , $(E_{[:,i]}\odot E_{[:,j]})_{p,q}(p\neq q)$ is $O(\sqrt{n_{2}})$ -concentrated. By assumption $P$ ignores all the $(E_{[:,i]}\odot E_{[:,j]})_{p,p}$ entries. Therefore $\Pr[|\|P^{\top}(E_{[:,i]}\odot E_{[:,j]})\|^{2}-d_{2}|\geq 2t\sqrt{d_{2}}+t^{2}]\leq Ce^{-\Omega(t^{2}/n_{2})}+e^{-\Omega(n_{2})}$ . We then pick $t=\sqrt{d_{2}}/5\geq\Omega(n_{2})$ , which implies $\Pr[\|P(E_{[:,i]}\odot E_{[:,j]})\|^{2}\leq d_{2}/2]\leq Ce^{-\Omega(n_{2})}$ . This is what we need for Claim C.8.

For Claim C.9, we need to bound terms of the form $\left<P^{\top}(E_{[:,i]}\odot E_{[:,j]}),P^{\top}(E_{[:,i^{\prime}]}\odot E_{[:,j^{\prime}]})\right>$ . These are degree-4 Gaussian chaoses and are well-studied in Latała et al. (2006).

We break the terms according to how many of $i^{\prime},j^{\prime}$ appears in $i,j$ .

Case 1: $i^{\prime},j^{\prime}\not\in\{i,j\}$ . In this case we first randomly pick $E_{[:,i]},E_{[:,j]}$ , and condition on the high probability event that $\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}$ . In this case the inner-product can be rewritten as $\left<PP^{\top}(E_{[:,i]}\odot E_{[:,j]}),(E_{[:,i^{\prime}]}\odot E_{[:,j^{\prime}]})\right>$ , and we know $\|PP^{\top}(E_{[:,i]}\odot E_{[:,j]})\|\leq 4n_{2}$ . Also, since $P$ is symmetric off-diagonal we know in this degree-2 Gaussian chaos (only $E_{[:,i^{\prime}]}$ and $E_{[:,j^{\prime}]}$ are random now) there are no “diagonal” terms. Therefore the Decoupling Theorem G.23 shows without loss of generality we can assume $i^{\prime}\neq j^{\prime}$ . Apply Theorem G.21 we know this term is bounded by $O(n_{2}^{1+\epsilon})$ with high probability for any $\epsilon>0$ .

Case 2: One of $i^{\prime},j^{\prime}$ is in $\{i,j\}$ . Without loss of generality assume $i^{\prime}\in\{i,j\}$ (the other case is symmetric). Again we first randomly pick $E_{[:,i]},E_{[:,j]}$ and condition on the high probability event that $\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}$ (but this will also determine $E_{[:,i^{\prime}]}$ ). After the conditioning, only $E_{[:,j^{\prime}]}$ is still random, and the inner-product can be rewritten as $\left<\mbox{mat}(PP^{\top}(E_{[:,i]}\odot E_{[:,j]})E_{[:,i^{\prime}]},E_{[:,j^{\prime}]}\right>$ where the fixed vector $\mbox{mat}(PP^{\top}(E_{[:,i]}\odot E_{[:,j]}))E_{[:,i^{\prime}]}$ has norm bounded by $\|PP^{\top}(E_{[:,i]}\odot E_{[:,j]})\|\|E_{[:,i^{\prime}]}\|\leq 8n_{2}^{3/2}$ . By property of Gaussian with high probability the inner-product is bounded by $O(n_{2}^{3/2+\epsilon})$ for any $\epsilon>0$ .

Case 3: $i^{\prime},j^{\prime}\in\{i,j\}$ . Since $i^{\prime},j^{\prime}$ cannot be equal to $i,j$ , there is only one possibility: $i^{\prime},j^{\prime}$ are both equal to one of $i,j$ and $i\neq j$ . Without loss of generality assume $i^{\prime}=j^{\prime}=i\neq j$ . We can swap $i,j$ with $i^{\prime},j^{\prime}$ and this actually becomes Case 2. By the same argument we know this term is bounded by $O(n_{2}^{3/2+\epsilon})$ for any $\epsilon>0$ .

There are $O(k^{2})$ terms in Case 1, $O(k)$ terms in Case 2 and $O(1)$ terms in Case 3. Therefore by union bound we know the sum is bounded by $O(kn_{2}^{3/2+\epsilon}+k^{2}n_{2}^{1+\epsilon})$ with high probability. Recall we are assuming $n_{2}\geq k^{2+C}$ (which only requires $n\geq k^{1+C/2}$ ). Choose $\epsilon$ to be a small enough constant depending on $C$ gives the result. ∎

C.2 Unfolding 666-th Order Moments

Recall the second system of linear equations is

The second system of linear equations tries to unfold the $6$ -th order moment $\overline{M}_{6}$ to get $Y_{6}$ . Similar to Theorem C.1 the following theorem guarantees that with high probability over the perturbation the system has a unique solution.

With high probability over the perturbation, the coefficient matrix $\widetilde{H}_{6}$ has smallest singular value $\sigma_{min}(\widetilde{H}_{6})\geq\Omega(\rho^{3}(n/k)^{1.5})$ . As a corollary, the system has a unique solution.

The proof of this theorem is very similar to the proof of Theorem C.1. Here we list the important steps and highlight the differences.

As before the theorem relies on a structural lemma (Lemma C.11), and a main lemma about the symmetric off-diagonal projection of a Kronecker product of three identical matrices (Lemma C.13).

Before stating the main lemma, we update the definition of symmetric off-diagonal subspace.

It is easy to verify that since the moments in $\overline{M}_{6}$ all have indices corresponding to distinct variables, the projection $\mathcal{F}_{6}$ is indeed symmetric off-diagonal. The constraints in this definition is closely related to the decoupling Theorem G.23 of Gaussian chaoses.

The proofs of Theorem C.10 are based on the above two lemmas. The proof of Lemma C.11 is essentially the same as Lemma C.2. The proof of Lemma C.13 is very similar to that of Lemma C.5, and we highlight the only different case below:

As before we try to prove that the columns of $P^{\top}(E\otimes_{kr}E\otimes_{kr}E)_{uniq}$ are incoherent. Recall we needed the following two claims:

For any $1\leq i_{1}\leq i_{2}\leq i_{3}\leq k$ , with high probability $\|P^{\top}(E_{[:,i_{1}]}\odot E_{[:,i_{2}]}\odot E_{[:,i_{3}]})\|^{2}\geq\Omega(n_{2}^{3}).$

For any $1\leq i_{1}\leq i_{2}\leq i_{3}\leq k$ , with high probability

The first claim can still be proved by the projection Lemma G.18, except the vector $E_{[:,i_{1}]}\odot E_{[:,i_{2}]}\odot E_{[:,i_{3}]}$ is now $O(n_{2})$ -concentrated (the proof is an immediate generalization of Lemma G.19).

The second claim can be proved using similar ideas, however there is one new case. We again separate the terms according to the number of $i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime}$ that do not appear in $\{i_{1},i_{2},i_{3}\}$ .

Case 1: At least one of $i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime}$ does not appear in $\{i_{1},i_{2},i_{3}\}$ . Suppose there are $t$ of $i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime}$ that do not appear in $\{i_{1},i_{2},i_{3}\}$ , similar to before we first sample $E_{i_{1}},E_{i_{2}},E_{i_{3}}$ and condition on the event that they all have norm at most $2\sqrt{n_{2}}$ . The inner-product then becomes an order $t$ Gaussian chaos with Frobenius norm $n_{2}^{6-t/2}$ . By Theorem G.23 and Theorem G.21 we know with high probability all these terms are bounded by $n_{2}^{6-t/2+\epsilon}$ for any constant $\epsilon>0$ .

Case 2: All of $i_{1}^{\prime},i_{2}^{\prime},i_{3}^{\prime}$ appear in $\{i_{1},i_{2},i_{3}\}$ . In the previous proof (of Lemma C.5), there was only one possibility and it reduces to Case 1. However for $6$ -th moment we have a new case: $i=i_{1}=i_{2}=i_{1}^{\prime}<i_{2}^{\prime}=i_{3}^{\prime}=i_{3}=j$ (and the symmetric case $i_{1}=i_{1}^{\prime}=i_{2}^{\prime}<i_{2}=i_{3}=i_{3}^{\prime}$ ). For this we will treat $T=PP^{\top}$ as a $6$ -th order tensor with Frobenius norm at most $n_{2}^{3/2}$ (as a matrix it has spectral norm 1, and rank at most $n_{2}^{3}$ ). The tensor is applied to the vectors $E_{[:,i]}$ and $E_{[:,j]}$ as $T(E_{[:,i]},E_{[:,i]},E_{[:,j]},E_{[:,i]},E_{[:,j]},E_{[:,j]})$ . First we sample $E_{[:,i]}$ , by Lemma G.24 we know with high probability what remains will be a $3$ -rd order tensor $T(E_{[:,i]},E_{[:,i]},I,E_{[:,i]},I,I)$ with Frobenius norm bounded by $O(n_{2}^{2+\epsilon})$ . Notice that here it is important that Lemma G.24 can handle diagonal entries, because $E_{[:,i]}$ appears on the $1,2,4$ -th coordinate (instead of the first three). We the apply Lemma G.24 again on $T(E_{[:,i]},E_{[:,i]},I,E_{[:,i]},I,I)(E_{[:,j]},E_{[:,j]},E_{[:,j]})$ The notation might be confusing here: $T(E_{[:,i]},E_{[:,i]},I,E_{[:,i]},I,I)$ is a $3$ rd order tensor, and we are applying it to $E_{[:,j]},E_{[:,j]},E_{[:,j]}$ . The whole expression is equal to $T(E_{[:,i]},E_{[:,i]},E_{[:,j]},E_{[:,i]},E_{[:,j]},E_{[:,j]})$ ., and conclude that with high probability the term is bounded by $O(n_{2}^{2.5+2\epsilon})$ which is still much smaller than $n_{2}^{3}$ .

Finally we take the sum over all terms and choose $\epsilon$ to be small enough (depending on $C$ ), then when $k^{2+C}\leq n_{2}$ the sum is a lower-order term. ∎

C.3 Stability Bounds

For the two linear equation systems in (7), we can write them in canonical form with coefficient matrices $\widetilde{H}_{4},\widetilde{H}_{6}$ and the unknown variable $\text{vec}(Y_{4}),\text{vec}(Y_{6})$ , corresponding to the $k_{2},k_{3}$ distinct elements in symmetric $Y_{4},Y_{6}$ , namely:

When $\widehat{M}_{4},\widehat{M}_{6}$ , the empirical moment estimations for $\widetilde{M}_{4},\widetilde{M}_{6}$ , are used throughout the algorithm, both the coefficient matrices $\widetilde{H}_{4},\widetilde{H}_{6}$ and the constant terms $\overline{M}_{4},\overline{M}_{6}$ are affected by the noise from empirical estimation. In practice, instead of solving systems of linear equations, we solve the least square problem:

and the solution to the least square problems are given by: $\text{vec}(\widehat{Y}_{4})=\widehat{H}_{4}^{\dagger}\overline{\widehat{M}}_{4}$ and $\text{vec}(\widehat{Y}_{6})=\widehat{H}_{6}^{\dagger}\overline{\widehat{M}}_{6}$ .

Given the empirical 4-th and 6-th order moments $\widehat{M}_{4}=\widetilde{M}_{4}+E_{4}$ , $\widehat{M}_{6}=\widetilde{M}_{6}+E_{6}$ , and suppose that the absolute value of entries in $E_{4}$ and $E_{6}$ are at most $\delta_{1}$ . Let $\widehat{U}$ be the output of Step 1 for the span of the covariance matrices, and suppose that $\|\widehat{U}-\widetilde{U}\|\leq\delta_{2}$ . Suppose that $\delta_{1}\leq\min\{\|\widetilde{M}_{4}\|_{F}/\sqrt{n_{4}},\|\widetilde{M}_{6}\|_{F}/\sqrt{n_{6}}\}$ , and $\delta_{2}\leq\min\{1,\sigma_{k_{2}}(\widetilde{H}_{4})/2,\sigma_{k_{3}}(\widetilde{H}_{6})/2\}$ . Then, conditioned on the high probability event that both $\sigma_{k_{2}}(\widetilde{H}_{4}),\sigma_{k_{3}}(\widetilde{H}_{6})$ are bounded below, we have:

Recall that the coefficient matrix $\widetilde{H}_{4}$ corresponds to the composition of two linear mappings $\mathcal{F}_{4}(UY_{4}U^{\top})$ on the variable $Y_{4}$ . Since we have showed that $\mathcal{F}_{4}$ is a projection determined by the Isserlis’ Theorem and independent of the empirical estimation of the moments, we can bound the perturbation on the coefficient matrices by:

Similarly, we have $\|\widehat{H}_{6}-\widetilde{H}_{6}\|\leq\|\widehat{U}\odot^{3}-\widetilde{U}\odot^{3}\|\leq 7\delta_{2}\leq\|\widetilde{H}_{6}\|$ .

Therefore we can analyze the stability of the solution to the least square problems in (51) as follows:

where the first inequality is by applying Lemma G.6 and note that $\|(\overline{\widehat{M}}_{4}-\overline{\widetilde{M}}_{4})\|_{F}\leq\delta_{1}\sqrt{n_{4}}\leq\|\overline{\widetilde{M}}_{4}\|_{F}$ , the second inequality is because $\|\overline{\widetilde{M}}_{4}\|_{F}\leq O(\sqrt{n_{4}})$ , the third inequality is by applying the perturbation bound of pseudo-inverse in Theorem G.7, the fourth inequality is by the assumption that $\delta_{2}$ is sufficiently small compared to the smallest singular value of $\widetilde{H}_{4}$ thus $\sigma_{k_{2}}(\widehat{H}_{4})=O(\sigma_{k_{2}}(\widetilde{H}_{4}))$ .

Appendix D Step 3: Tensor Decomposition

Given the estimations of the unfolded moments $Y_{4}$ and $Y_{6}$ from Step 2, and given the span of covariance matrices $U$ from Step 1, Step 3 use tensor decomposition to robustly find the parameters of the mixture of zero-mean Gaussians.

Recall that in the coordinate system with basis $U$ , the covariance matrices (vectorized with distinct entries) are given by $\widetilde{\Sigma}^{(i)}=\widetilde{U}\widetilde{\sigma}^{(i)}$ for all $i$ . The unfolded moments in the same coordinate system are:

We will apply tensor decomposition algorithm to find the $\widetilde{\sigma}^{(i)}$ ’s. We restate the theorem for orthogonal symmetric tensor decomposition in Anandkumar et al. Anandkumar et al. (2014) below:

In order to reduce our problem to the orthogonal tensor decomposition so that the tensor power method (Algorithm 1, page 21 in Anandkumar et al. (2014)) can be applied, we use the same “whitening” technique as in Anandkumar et al. (2014). We first compute the SVD of the unfolded 4-th moments $\widetilde{Y}_{4}=\widetilde{V}_{2}\widetilde{\Lambda}_{2}\widetilde{V}_{2}^{\top}$ , then use the singular vectors to transform the unfolded 6-th moments $Y_{6}$ into an orthogonal symmetric tensor $\widetilde{Y}_{6}(\widetilde{V}_{2}\widetilde{\Lambda}_{2}^{-1/2},\widetilde{V}_{2}\widetilde{\Lambda}_{2}^{-1/2},\widetilde{V}_{2}\widetilde{\Lambda}_{2}^{-1/2})$ .

Next we complete the stability analysis for the two-step procedure, i.e. whitening and orthogonal tensor decomposition, which was not analyzed in Anandkumar et al. (2014).

There exists an algorithm that finds $\widehat{a}_{i}$ and $\widehat{\omega}_{i}$ in polynomial (in variables $(n,k,1/\sigma_{min}(G_{2}))$ ) running time with the following guarantee: with probability at least $1-e^{-n}$ , for some permutation $\pi$ over $[k]$ and for all $i\in[k]$ we have:

Note th at in the exact case with $G_{2}$ and $G_{3}$ , we have that:

where $\lambda_{i}=\omega_{i}^{-1/2}$ , and the vectors $v_{i}=\lambda_{i}^{-1}V_{2}^{\top}\Lambda_{2}^{-1/2}a_{i}$ and they are orthonormal. Also note that $\lambda_{min}\geq 1$ and $\lambda_{max}\leq\omega_{o}^{-1/2}$ . We can then apply orthogonal tensor decomposition (Algorithm 1 in Anandkumar et al. (2014)) to $\widehat{G}$ to robustly obtain estimations of $v_{i}$ ’s and $\lambda_{i}$ ’s. After obtaining the estimation $\widehat{v}_{i}$ and $\widehat{\lambda}_{i}$ ’s, we can further obtain the estimation of $a_{i}$ ’s and $\omega_{i}$ ’s as:

The estimation of the vectors and weights are given in (52). In order to bound the distance $\|\widehat{a}_{i}-a_{i}\|$ and $\|\widehat{\omega}_{i}-\omega_{i}\|$ , we show the stability of the estimation $\widehat{V}_{2}$ , $\widehat{\Lambda}_{2}$ , and $\widehat{v}_{i}$ , $\widehat{\lambda}_{i}$ separately.

First, note that by assumption $\|\widehat{G}_{2}-G_{2}\|_{F}\leq\delta_{2}$ , we can apply Lemma G.2 and Lemma G.3 to bound the singular values and the singular vectors of $\widehat{G}_{2}$ by:

Define $X=V_{2}\Lambda_{2}^{-1/2}$ and define $\Delta_{X}=\widehat{X}-X$ . By the assumption that $\delta_{2}\leq o(\gamma_{min})$ , we have $\|\widehat{V}_{2}-V_{2}\|\leq 1$ and $\|\widehat{\Lambda}_{2}^{-1/2}-\Lambda_{2}^{-1/2}\|\leq\|\Lambda_{2}^{-1/2}\|\leq\gamma_{min}^{-1/2}$ . Therefore we can apply Lemma G.6 to bound $\|\Delta_{X}\|$ :

Moreover, since $\delta_{2}\leq o(\gamma_{min})$ , we also have $\|\Delta_{X}\|\leq\|X\|=\gamma_{min}^{-0.5}$ .

Next, we bound the distance $\|\widehat{G}-G\|$ . Recall that $\widehat{G}=\widehat{G}_{3}(\widehat{X},\widehat{X},\widehat{X})$ . Using the fact that tensor is a multi-linear operator, and by the assumption that $\|\widehat{G}_{3}-G_{3}\|\leq\delta_{3}$ , we have:

Note that by the assumption $\delta_{2}\leq o({\gamma_{min}^{2.5}\over k\|G_{3}\|})$ , $\delta_{3}\leq o({\gamma_{min}^{1.5}\over k})$ , we have $\epsilon\leq o({1\over k})$ . Therefore we can apply Theorem D.2 to conclude that with probability at least $1-e^{-n}$ (over the randomness of the randomized algorithm itself), the tensor power algorithm runs in time $\text{poly}(n,k,1/\lambda_{min})$ and for some permutation $\pi$ over $[k]$ it returns:

Finally, since we also have $5\epsilon\leq 1/2\leq\lambda_{min}/2$ we can bound the estimation error of $\widehat{a}_{i}$ and $\widehat{\omega}_{i}$ as defined in (52) by:

Now we can apply Theorem D.2 to our case.

Given $\widehat{Y}_{4}$ , $\widehat{Y}_{6}$ , $\widehat{U}$ and suppose that $\|\widehat{Y}_{4}-\widetilde{Y}_{4}\|_{F}$ , $\|\widehat{Y}_{6}-\widetilde{Y}_{6}\|_{F}$ as well as $\|\widehat{U}-\widetilde{U}\|$ are bounded by some $inverse\ poly(n,k,1/\omega_{o},1/\rho)\delta$ . There exists an algorithm that with high probability, returns $\widehat{\Sigma}^{(i)}$ ’s and $\widehat{\omega}_{i}$ ’s such that for some permutation $\pi$ over $[k]$ , we have the distance $\|\widehat{\Sigma}^{(i)}-\widetilde{\Sigma}^{(i)}\|$ and $\|\widehat{\omega}_{i}-\widetilde{\omega}_{i}\|$ are bounded by $\delta$ . Moreover, the running time of the algorithm is upperbounded by $\text{poly}(n,k,1/\omega_{o},1/\rho)$ .

We apply Theorem D.2, and pick $G_{2}=\widetilde{Y}_{4}$ , $G_{3}=\widetilde{Y}_{6}$ . We only need to verify that $\|\widetilde{Y}_{6}\|$ and $1/\sigma_{min}(\widetilde{Y}_{4})$ are polynomials of the relevant parameters. This is easy to see, since $\sigma_{min}(\widetilde{Y}_{4})\geq\omega_{o}\sigma_{min}(\widetilde{\Sigma})^{2}$ , and the matrix $\widetilde{\Sigma}$ is a perturbed rectangular matrix which by Lemma G.15 has $\sigma_{min}(\widetilde{\Sigma})\geq\Omega(\rho\sqrt{n_{2}})$ with high probability.

Finally, given $\widehat{\sigma}^{(i)}$ , and given the output of Step 2, i.e. $\widehat{U}$ , with inverse polynomial accuracy, we can recover $\widehat{\Sigma}^{(i)}=\widehat{U}\widehat{\sigma}^{(i)}$ up to accuracy polynomial in the relevant parameters. ∎

Appendix E Proofs of Theorem 3.5

The results in all previous sections showed the correctness and robustness of each individual step for the algorithm for zero-mean case, In this section, we summarize those results to prove that the overall algorithm has polynomial time/sample complexity.

Given $N$ samples $x_{1},\dots,x_{N}$ drawn i.i.d. from the $n$ -dimensional mixture of $k$ Gaussians, if $N\geq n^{7}/\delta^{2}$ , then with high probability, we have that for all $j_{1},\dots,j_{6}\in[n]$ :

Let $x$ denote the random vector of this mixture of Gaussians. We first truncate its tail probabilities to make all the entries ( $[x]_{j}$ for $j\in[n]$ ) in the vector $x$ be in the range $[-\sqrt{n},\sqrt{n}]$ . Apply union bound, we know that with high probability (at least $1-O(e^{-n})$ ), for all indices $j_{1},\dots,j_{6}\in[n]$ , we have $\Big{|}[x]_{j_{1}}\dots[x]_{j_{6}}\Big{|}\leq n^{3}$ . Then we can apply Hoeffding’s inequality to bound the empirical moments by:

We show that, to achieve $\epsilon$ accuracy in the output of Step 3 in the algorithm for the zero-mean case, the number of samples we need to estimate the moments $M_{4}$ and $M_{6}$ is bounded by a polynomial of relevant parameters, namely $\text{poly}(n,k,1/\omega_{o},1/\epsilon,1/\rho)$ , and each step of the algorithm can be done in polynomial time.

We backtrack the input-output relations from Step 3 to Step 2 and to Step 1, and we show that the estimation error in the empirical moments and the inputs / outputs only polynomially propagate throughout the steps.

First note that we have shown that every steps fails with negligible probability ( $O(e^{-n^{C}})$ for any absolute constant $C$ ). Then apply union bound, we have that the entire algorithm works correctly with high probability.

By Lemma D.3, in order to achieve $\epsilon$ accuracy in the final estimation of the mixing weights and the covariance matrices, we need to drive the input accuracy of Step 3 (also the output accuracy of Step 2) to be bounded by some inverse polynomial in $(n,1/\epsilon,1/\rho,1/\omega_{o})$ , Also recall that this step has running time $\text{poly}(n,k,1/\rho,1/\omega_{o})$ .

Theorem C.1 and Theorem C.10 guarantee that with smoothed analysis $\sigma_{min}(\widetilde{H}_{4})$ and $\sigma_{min}(\widetilde{H}_{6})$ are lower bounded polynomially. Then by Lemma C.16, in order to have the output accuracy of Step 2 be bounded by inverse $\text{poly}(n,1/\epsilon,1/\rho,1/\omega_{o})$ , we need to drive the input accuracy of Step 2 ( $\widehat{U}$ , $\widehat{M}_{4}$ ) to be bounded by some other inverse polynomial. Step 2 involves solving linear systems of dimension $n_{4}k_{2}$ and $n_{6}k_{3}$ , thus it running time is polynomial.

Lemma B.13 and B.14 guarantees that with smoothed analysis $\sigma_{k}(\widetilde{Q}_{U})$ is lower bounded polynomially. Then by Lemma B.15, in order to have the output accuracy of Step 1 (c) ( $\widehat{U}$ ) be bounded by inverse polynomial, we need to drive the input accuracy (output $\widehat{S}_{i}$ of Step 1 (a) and output $\widehat{U}_{i}$ of Step 1 (b) ) to be bounded by some other inverse polynomial. Step 1 (c) involves multiplications and factorization of matrices of polynomial size, and thus the running time is also polynomial.

Lemma B.7 guarantees that with smoothed analysis $\sigma_{k}(\widetilde{Q}_{U_{S}})$ is lower bounded polynomially. Then by Lemma B.10, in order to have the output accuracy of Step 1 (b) ( $\widehat{U}_{S}$ ) be bounded by inverse polynomial, we need to drive the input accuracy (output $\widehat{S}_{i}$ of Step 1 (a) ) to be bounded by some other inverse polynomial. Step 1 (b) involves multiplications and factorization of matrices of polynomial size, and thus the running time is also polynomial.

Lemma B.1 guarantees that with smoothed analysis $\sigma_{k}(\widetilde{Q}_{S})$ is lower bounded by inverse polynomial. Then by Lemma B.3, in order to have the output accuracy of Step 1 (a) ( $\widehat{S}$ ) be bounded by inverse polynomial, we need to drive the input accuracy (the moment estimation $\widehat{M}_{4}$ ) to be bounded by some other inverse polynomial. Step 1 (a) involves multiplications and factorization of matrices of polynomial size, and thus the running time is also polynomial.

Finally, by Lemma E.1, in order to have the accuracy of moment estimation $(\widehat{M}_{4},\widehat{M}_{6})$ be bounded by inverse polynomial, we need the number of samples $N$ polynomial in all the relevant parameters, including $k$ .

Appendix F General Case

In this section, we present the algorithm for learning mixture of Gaussians with general means. The algorithm generalizes the insights obtained from the algorithm for the zero-mean case. The steps are very similar, and we will highlight the differences.

In this step, we find the following two subspaces:

This is very similar to Step 1 in the algorithm for the zero-mean case, and can be achieved in three small steps:

Step 1 (a). For a subset $\mathcal{H}$ of size $12\sqrt{n}$ , find the span $\mathcal{S}$ of the mean vectors and a subset of columns of the covariance matrices:

Step 1 (b). Find the span of covariance matrices projected to the subspace $S^{\perp}$ :

Step 1 (c). Run 1(a) and 1(b) on two disjoint subsets $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ . Merge the two spans $U_{1}$ and $U_{2}$ to get $\widetilde{Z}$ and $span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}:i\in[k]\}$ .

Next, we discuss each small step and compare it with the similar analysis of the algorithm for the zero-mean case.

Step 1 (a). Find the span 𝒮𝒮\mathcal{S} of the means and a subset of the columns of the covariance matrices

Similar to Step 1 (a) for the zero-mean case, in this step we want to find a subspace $\mathcal{S}$ which contains the span of a subset of columns of $\widetilde{\Sigma}^{(i)}$ ’s. However, with the mean vector $\widetilde{\mu}^{(i)}$ ’s appearing in the moments, the subspace we find also contains the span of all the mean vectors. In particular, for a subset $\mathcal{H}\in[n]$ with $|\mathcal{H}|=\sqrt{n}$ , we aim to find the following subspace:

Similar to Claim 5.1 for the zero-mean case, the key observation for finding the subspace is the structure of the one-dimensional slices of the $4$ -th order moments for the general case:

For any indices $j_{1},j_{2},j_{3}\in[n]$ , the one-dimensional slices of $\widetilde{M}_{4}$ are given by:

The proof of this step is similar to the Lemmas B.1 (for smoothed analysis) and B.3 (for stability analysis). The main difference is that in the matrix $\widetilde{B}$ defined in the structural Claim B.2, there is now another block $\widetilde{B}^{(0)}$ with $k$ columns that corresponds to the $\widetilde{\mu}^{(i)}$ directions, which we can again handle with Lemma G.12.

Lemma F.2 shows the deterministic conditions for Step 1 (a) to correctly identify the subspace $\mathcal{S}$ from the columns of $\widetilde{Q}_{S}$ , and uses smoothed analysis to show that the conditions hold with high probability.

Given $\widetilde{M}_{4}$ of a general mixture of Gaussians , for any subset $\mathcal{H}\in[n]$ and $|\mathcal{H}|=c_{2}{k}$ with the constant $c_{2}>9$ , let $\widetilde{Q}_{S}$ be the matrix defined as in (55). The columns of $\widetilde{Q}_{S}$ give the desired span $S$ defined in (53) if the matrix $\widetilde{Q}_{S}$ achieves the maximal column rank $k+k|\mathcal{H}|$ . With probability (over the $\rho$ -perturbation) at least $1-C\epsilon^{0.5n}$ for some constant $C$ , the $k(1+|\mathcal{H}|)$ -th singular value of $\widetilde{Q}_{S}$ is bounded below by:

Note that the dimension of the subspace $\mathcal{S}$ is at most $k(|\mathcal{H}|+1)<n/3$ . Then we show by the Claim about the moment structure that the matrix $\widetilde{Q}_{S}$ can be written as a product of $\widetilde{P}_{S}$ and some coefficient matrix $\widetilde{B}_{S}$ . Then we bound the smallest singular value of the two matrices $\widetilde{P}_{S}$ and $\widetilde{B}_{S}$ via smoothed analysis separately. The coefficient matrix $\widetilde{B}_{S}$ is slightly different than that in the zero-mean case, but has similar block-diagonal structure properties.

Similar to structural property in Claim B.2 for the zero-mean case, we can write the matrix $\widetilde{Q}_{S}$ in a product form:

We will bound the smallest singular value for each of the factor, and apply union bound to conclude the lower bound of $\sigma_{k(1+|\mathcal{H}|)}(\widetilde{Q}_{S})$ .

In order to lower bound $\sigma_{min}(\widetilde{B}_{S})$ , we first analyze the structure of this coefficient matrix. The matrix $\widetilde{B}_{S}$ has the following block structure:

Note that we can write the block $\widetilde{B}^{(0)}$ as:

where it is easy to see the first summand $(\widetilde{\mu}_{\mathcal{H}^{(3)}}\odot\widetilde{\mu}_{\mathcal{H}^{(2)}}+\widetilde{\Sigma}_{\mathcal{H}^{(3)},\mathcal{H}^{(2)}})\odot\widetilde{\mu}_{\mathcal{H}^{(1)}}$ is a linear combination of the columns of the block diagonal matrix $\widetilde{B}^{(1)}$ , and similarly the second and third summands are linear combinations of the columns of $\widetilde{B}^{(2)}$ and $\widetilde{B}^{(3)}$ , and the last summand is simply $-2\widetilde{B}_{0}^{(0)}$ . Therefore for some absolute constant $C$ (the smallest singular value corresponding to the linear transformation) we have that:

Note that $\widetilde{B}_{0}^{(0)}=\widetilde{\mu}_{\mathcal{H}^{(3)}}\odot\widetilde{\mu}_{\mathcal{H}^{(2)}}\odot\widetilde{\mu}_{\mathcal{H}^{(1)}}$ only depends on the randomness over the mean vectors. Note that the Khatri-Rao product is a submatrix of the Kronecker product, therefore for tall matrices $Q_{1}$ and $Q_{2}$ , we have that $\sigma_{min}(Q_{1}\odot Q_{2})\leq\sigma_{min}(Q_{1}\otimes_{kr}Q_{2})=\sigma_{min}(Q_{1})\sigma_{min}(Q_{2})$ . In particular, we can bound the smallest singular value of $\widetilde{B}_{0}^{(0)}$ with high probability (at least $1-C\epsilon^{0.5n}$ ) as follows:

Then condition on the value of the means, we further exploit the randomness over the covariance matrices to lower bound $\sigma_{k|\mathcal{H}|}\left(\text{Proj}_{\widetilde{B}_{0}^{(0)\perp}}[\widetilde{B}^{(1)},\widetilde{B}^{(2)},\widetilde{B}^{(3)}]\right)$ . It is almost the same as the argument of the proof for Proposition B.1. For example, compared to (18) we have the following inequality instead:

and note that any block in $\widetilde{B}^{(0)}$ is independent of the randomness of covariance matrices, and we have $(|\mathcal{H}|/3)^{2}-k-2k|\mathcal{H}|/3\geq 2k$ . Similar modifications apply to the inequalities in (20),(21).

Finally by the argument of Lemma G.12 we can bound $\sigma_{min}(\widetilde{B}_{S})$ with probability at least $1-C\epsilon^{0.5n}$ (over the randomness of both the perturbed means and covariance matrices):

as we assume $\rho$ to be small perturbation and $\rho\epsilon\sqrt{n}<1$ .

Step 1 (b). Find the projected span of covariance matrices

Given the subspace $\mathcal{S}=span\{\widetilde{\mu}^{(i)},\widetilde{\Sigma}_{[:,\mathcal{H}]}^{(i)}:i\in[k]\}$ obtained from Step 1 (a), Step 1(b) finds the span of the covariance matrices with the columns projected to $S^{\perp}$ , namely:

This is in parallel with Step 1 (b) for the zero-mean case, and we rely on the structure of the two-dimensional slices of $\widetilde{M}_{4}$ to find the span of the projected covariance matrices. Similar to Claim B.6 for the zero-mean case, the following claim shows how the structure of the two-dimensional slices is related to the desired span.

For a mixture of general Gaussians, the two-dimensional slices of $\widetilde{M}_{4}$ are given by:

Note that given the set of indices $\mathcal{H}$ we chose in Step 1 (a) and the subspace $S$ , if we pick the indices $j_{1},j_{2}\in\mathcal{H}$ , project the two-dimensional slice to $S^{\perp}$ , all the rank one terms in the sum are eliminated and the projected slice lies in the desired span $\mathcal{U}_{S}$ :

Applying the same argument as in Lemma B.7 for the zero-mean case, we can show that with high probability over the perturbation, all the projected slices span the subspace $\mathcal{U}_{S}$ .

Step 1 (c). Merge the two projections of covariance matrices

Pick two disjoint index set $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ and repeat the previous two steps 1 (a) and 1 (b), we can obtain the two spans $U_{1}$ and $U_{2}$ , corresponding to the subspace of the covariance matrices projected to $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ , respectively.

In this step, we apply similar techniques as in Step 1 (c) for the zero-mean case to merge the two spans $U_{1}$ and $U_{2}$ : we first use the overlapping part of the two projections $\text{Proj}_{S_{1}^{\perp}}$ and $\text{Proj}_{S_{2}^{\perp}}$ to align the basis of $U_{1}$ and $U_{2}$ , then merge the two spans using the same basis.

Note that for the general case, by definition the span of the mean vectors $\widetilde{Z}$ lie in both subspaces $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ , therefore we have $\mathcal{S}_{1}^{\perp}\subset\widetilde{Z}^{\perp}$ and $\mathcal{S}_{2}^{\perp}\subset\widetilde{Z}^{\perp}$ . We can show that $\mathcal{S}_{1}^{\perp}\cup\mathcal{S}_{2}^{\perp}=\widetilde{Z}^{\perp}$ by lower bounding $\sigma_{n-k}([\text{Proj}_{\mathcal{S}_{1}^{\perp}},\text{Proj}_{\mathcal{S}_{2}^{\perp}}])$ with high probability, similar to that in (47). This gives us the span of the mean vectors $\widetilde{Z}$ .

Moreover, in the general case, from merging $U_{1}$ and $U_{2}$ we are only able to find the span of covariance matrices projected to the subspace $\widetilde{Z}^{\perp}$ . In particular, we can follow Lemma B.11 and Lemma B.15 in Step 1 (c) for the zero-mean case to show that for the general case, we can merge $U_{1}$ and $U_{2}$ to obtain the span $span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}:i\in[k]$ . By further projecting the span to $\widetilde{Z}^{\perp}$ from the right side, we can also obtain $\widetilde{\Sigma}_{o}=span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}:i\in[k]\}$ .

Step 2. Find the covariance matrices in the subspace orthogonal to the means

Given the subspace $\widetilde{Z}$ and $\widetilde{\Sigma}_{o}=span\{\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}:i\in[k]\}$ obtained from Step 1, Step 2 applies the zero-mean case algorithm to find the covariance matrices projected to the subspace $\widetilde{Z}^{\perp}$ , i.e., $\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}}$ ’s, as well as find the mixing weights $\widetilde{\omega}_{i}$ ’s.

This follows the same arguments as in Step 2 and Step 3 for the zero mean case. Consider projecting all the samples to $\widetilde{Z}^{\perp}$ , the subspace orthogonal to all the means. In this subspace, the samples are like from a mixture of zero-mean Gaussians with the projected covariance matrices, and the $4$ -th and $6$ -th order moment are given by $\widetilde{M}_{4}(\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}})$ and $\widetilde{M}_{6}(\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}},\text{Proj}_{\widetilde{Z}^{\perp}})$ . Since $\widetilde{Z}$ is of dimension $k$ , the dimension of the zero-mean Gaussian in the projected space is at least $n-k=O(n)$ .

Note that the subspace $\widetilde{Z}^{\perp}$ only depends on the randomness of the means, and random perturbation on the covariance matrices is independent of that of $\widetilde{\mu}$ . The smoothed analysis for the moment unfolding in Step 2 and tensor decomposition in Step 3 for the zero-mean case, which only depend on the randomness of the covariance matrices, still go through in the projected space.

Step 3. Find the means

This step finds the mean vectors based on the outputs of the previous steps. The key observation for this step is about the structure of the 3-rd order moments in the following claim:

The following lemma shows how to extract the means $\widetilde{\mu}^{(i)}$ ’s from $\widetilde{M}_{3(1)}$ using the information of the covariance matrices projected to the subspace orthogonal to the means, i.e. $\widetilde{\Sigma}_{o}$ , and the mixing weights $\widetilde{\omega}_{i}$ ’s.

The mean $\widetilde{\mu}^{(i)}$ of the $i$ -th component can be obtained by:

This step correctly finds the means if the $\widetilde{\Sigma}_{o}$ is full rank with good condition number, and this holds with high probability over the perturbation.

The basic idea is that since $\widetilde{\Sigma}_{o}$ lies in the span of $\widetilde{P}=\text{Proj}_{\widetilde{Z}^{\perp}}\otimes_{kr}\text{Proj}_{\widetilde{Z}^{\perp}}$ , and the last three summands in the parenthesis in (57) all lie in $span\{I_{n}\otimes_{kr}\text{Proj}_{\widetilde{Z}},\ \text{Proj}_{\widetilde{Z}}\otimes_{kr}I_{n}\}=span\{\widetilde{P}^{\perp}\}$ . Therefore hitting the matrix $\widetilde{M}_{3(1)}$ with $\widetilde{\Sigma}_{o}^{\dagger}$ from the right will eliminate those summands and pull out only the mean vectors.

Recall that the columns of the matrix $\widetilde{\Sigma}_{o}$ are $\text{vec}(\text{Proj}_{\widetilde{Z}^{\perp}}\widetilde{\Sigma}^{(i)}\text{Proj}_{\widetilde{Z}^{\perp}})=\widetilde{P}\text{vec}(\widetilde{\Sigma}^{(i)})$ ’s, and the columns of $\widetilde{\Sigma}$ are $\text{vec}(\widetilde{\Sigma}^{(i)})$ ’s.

Note that $\widetilde{T}=(\widetilde{P}{\widetilde{\Sigma}})^{{\dagger}\top}=\widetilde{P}{\widetilde{\Sigma}}^{{\dagger}\top}$ , and the columns of $\widetilde{T}$ lie in $span\{\widetilde{P}\}$ . Also note that for all $i,j\in[k]$ the vectors $\widetilde{\mu}^{(i)}\odot\widetilde{\mu}^{(i)}$ , $\widetilde{\Sigma}^{(i)}_{[:,j]}\odot\widetilde{\mu}^{(i)}$ and $\widetilde{\mu}^{(i)}\odot\widetilde{\Sigma}^{(i)}_{[:,j]}$ all lie in the subspace $span\{I_{n}\otimes_{kr}\text{Proj}_{\widetilde{Z}},\ \text{Proj}_{\widetilde{Z}}\otimes_{kr}I_{n}\}=span\{\widetilde{P}^{\perp}\}$ . Therefore these terms will be eliminated if we multiply the columns of $\widetilde{T}$ to the right of $\widetilde{M}_{3(1)}$ . For the first term $\widetilde{\mu}_{j}^{(i)}\text{vec}(\widetilde{\Sigma}^{(i)})$ , since $\text{vec}(\widetilde{\Sigma}^{(j)})^{\top}\widetilde{T}_{[:,i]}=(\widetilde{P}\text{vec}(\widetilde{\Sigma}^{(j)}))^{\top}\widetilde{T}_{[:,i]}=1_{[i=j]}$ . Therefore, we have $\widetilde{M}_{3(1)}\widetilde{T}_{[:,i]}=\widetilde{\omega}_{i}\widetilde{\mu}^{(i)}$ .

The smoothed analysis for the correctness of this step is easy. We only need to show that both ${\widetilde{\Sigma}}_{o}$ and ${\widetilde{\Sigma}}$ robustly have full column rank with high probability over perturbation of the covariance matrices, and thus the pseudo-inverse $\widetilde{T}$ is well defined. This follows from Lemma G.15.

Finally, the stability analysis for this step is also straightforward using the perturbation bound for pseudo-inverse in Theorem G.7. ∎

Step 4. Find the unprojected covariance matrices

Note that by definition $\widetilde{Z}=span\{\widetilde{\mu}^{(i)}:i\in[k]\}$ , the projected covariance $\text{Proj}_{\widetilde{Z}^{\perp}}(\widetilde{\Sigma}^{(i)})$ we obtained in Step 2 is also equal to $\text{Proj}_{\widetilde{Z}^{\perp}}(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ . In Step 4 we try to recover the missing part of the covariance matrices in the subspace $\widetilde{Z}$ . Note that since we have also obtained the means in Step 3, it is equivalent to finding $(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ for all $i$ . We will show that if we can find the $span\{(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}):i\in[k]\}$ , the projected vector $\text{Proj}_{\widetilde{Z}^{\perp}}(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ can be used as anchor to pin down the unprojected vector.

They key observation for finding the span of $span\{(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}):i\in[k]\}$ is to first construct a 4-th order tensor $\widetilde{M}^{\prime}_{4}$ which corresponds to the 4-th moment of a mixture of zero-mean Gaussians with covariance matrices $(\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top})$ , and then follow Step 1 in the algorithm for zero-mean case to find the span of the covariance matrices for this new mixture of Gaussians.

The next lemma shows how to construct such 4-th order tensor:

Given the $4$ -th moment $\widetilde{M}_{4}$ for a mixture of Gaussians with parameters $\{\widetilde{\omega}_{i},\widetilde{\mu}^{(i)},\widetilde{\Sigma}^{(i)}\}$ , define the 4-th order tensor $\widetilde{M}^{\prime}_{4}$ to be:

then $\widetilde{M}^{\prime}_{4}$ is equal to the $4$ -th moment of a mixture Gaussians with parameters $\{\widetilde{\omega}_{i},0,\widetilde{\Sigma}^{(i)}+\widetilde{\mu}^{(i)}(\widetilde{\mu}^{(i)})^{\top}\}$ .

The proof follows directly from Isserlis’ Theorem. Therefore we can repeat Step 1 in the zero-mean case here to find the span of the space $\{\text{vec}(\widetilde{\Sigma}^{(i)})+\widetilde{\mu}^{(i)}\odot\widetilde{\mu}^{(i)}:i\in[k]\}$ . Since we also know the projection of $\widetilde{\Sigma}^{(i)}$ ’s in a large subspace (in the subspace $\text{Proj}_{\widetilde{Z}^{\perp}}\otimes_{kr}\text{Proj}_{\widetilde{Z}^{\perp}}$ obtained from Step 2), we can easily recover $\widetilde{\Sigma}^{(i)}$ ’s:

Further, this procedure is stable if $\sigma_{min}(P^{\top}S)$ is lower bounded.

This is a special case of the Step 1 (c) where we merge two projections of an unknown subspace.

The span $S$ is equal to $UV$ for some unknown matrix $V$ . We can compute $V=(P^{\top}U)^{\dagger}P^{\top}S$ , and hence $U=SV^{-1}=S(P^{\top}S)^{\dagger}(P^{\top}U)$ . The stability analysis is similar (and simpler than) Lemma B.11. ∎

We will apply this lemma to where the subspace $P$ is $\text{Proj}_{\widetilde{Z}^{\perp}}\otimes_{kr}\text{Proj}_{\widetilde{Z}^{\perp}}$ . Since the perturbation of the means and the covariance matrices are independent, we can lower bound the smallest singular value of $P^{\top}S$ .

F.1 Proof Sketch of the Main Theorem 3.4

The proof follows the same strategy as Theorem 3.5. First we apply the union bound to all the smoothed analysis lemmas, this will ensure the matrices we are inverting all have good condition number, and the whole algorithm is robust to noise.

Then in order to get the desired accuracy $\epsilon$ , we need to guarantee inverse polynomial accuracy in different steps (through the stability lemmas). The flow of the algorithm is illustrated in Figure 5. In the end all the requirements becomes a inverse polynomial accuracy requirement on $\widehat{M}_{4}$ and $\widehat{M}_{6}$ , which we obtain by Lemma E.1.

Appendix G Matrix Perturbation, Concentration Bounds and Auxiliary Lemmas

In this section we collect known results on matrix perturbation and concentration bounds. In general, matrix perturbation bounds are the key for the perturbation lemmas, and concentration bounds are crucial for the smoothed analysis lemmas. We also prove some corollaries of known results that are very useful in our settings.

Given a matrix $\widehat{A}=A+E$ where $E$ is a small perturbation, how does the singular values and singular vectors of $A$ change? This is a well-studied problem and many results can be found in Stewart and Sun Stewart (1977). Here we review some results used in this paper, and prove some corollaries.

Given $\widehat{A}=A+E$ , the perturbation in individual singular values can be bounded by Weyl’s theorem:

Given $\widehat{A}=A+E$ , we know $\sigma_{k}(A)-\|E\|\leq\sigma_{k}(\widehat{A})\leq\sigma_{k}(A)+\|E\|$ .

For singular vectors, the perturbation is bounded by Wedin’s Theorem:

Let $\widehat{A}=A+E$ , with analogous singular value decomposition. Let $\Phi$ be the matrix of canonical angles between the column span of $U_{1}$ and that of $\widehat{U}_{1}$ , and $\Theta$ be the matrix of canonical angles between the column span of $V_{1}$ and that of $\widehat{V}_{1}$ . Suppose that there exists a $\delta$ such that

We do not go into the definition of canonical angles here. The only way we will be using this lemma is by combining it with the following:

Let $\Phi$ be the matrix of canonical angles between the column span of $U$ and that of $\widehat{U}$ , then

Moreover, if $\|E\|_{F}\leq{\sigma_{k}(A)/\sqrt{2}}$ we have $\|\widehat{S}-\widetilde{S}\|\leq{\sqrt{2}\|E\|\over\sigma_{k}(A)}.$

The equality is because $\text{Proj}_{S^{\perp}}=I-\text{Proj}_{S}$ so the two differences are the same. The final step follows from Wedin’s Theorem and Lemma G.4. ∎

Often we need to bound the perturbation of a product of perturbed matrices, where we apply the following lemma:

Consider a product of matrices $A_{1}\cdots A_{k}$ , and consider any sub-multiplicative norm on matrix $\|\cdot\|$ . Given $\widehat{A}_{1},\dots,\widehat{A}_{k}$ and assume that $\|\widehat{A}_{i}-A_{i}\|\leq\|A_{i}\|$ , then we have:

The proof of this lemma is straightforward by induction.

When we have a lowerbound on $\sigma_{min}(A)$ , it is easy to get bounds for the perturbation of pseudoinverse.

We first apply Theorem G.7, and then bound $\|A^{\dagger}\|$ and $\|B^{{\dagger}}\|$ . By definition we know $\|A^{\dagger}\|=1/\sigma_{min}(A)$ . By Weyl’s theorem $\sigma_{min}(B)\geq\sigma_{min}(A)-\|E\|\geq\sigma_{min}(A)/2$ , hence $\|B^{\dagger}\|=\sigma_{min}(B)^{-1}\leq 2\sigma_{min}(A)^{-1}$ . ∎

G.2 Lowerbounding the Smallest Singular Value

Gershgorin’s Disk Theorem is very useful in bounding the singular values.

Sometimes, it is easier to consider the projection of a matrix. Lowerbounding the smallest singular value of a projection will imply the same lowerbound on the original matrix:

Observe that $(P^{\top}A)^{\top}(P^{\top}A)=A^{\top}(PP^{\top})A\preceq A^{\top}A$ (because $P$ is a subspace). Therefore the eigenvalues of $(P^{\top}A)^{\top}(P^{\top}A)$ must be dominated by the eigenvalues of $A^{\top}A$ . Then the lemma follows from the definition of singular values. ∎

As a corollary we have the following lemma:

In several places of this work we want to bound the singular value of a matrix, where part of the matrix has a block structure.

The smallest singular value is bounded by:

The idea is to break the matrix into two parts $A=\text{Proj}_{B}A+\text{Proj}_{B^{\perp}}A$ .Since these two spaces are orthogonal we know $\sigma_{(n+dn^{\prime})}(A)\geq\min\{\sigma_{n}(\text{Proj}_{B}A),\sigma_{dn^{\prime}}(\text{Proj}_{B^{\perp}}A)\}$ .

For the first part, clearly $\sigma_{n}(\text{Proj}_{B}A)\geq\sigma_{n}(B)$ , as $B$ is a submatrix of $\text{Proj}_{B}A$ .

For the second part, we actually do the projection to a smaller subspace: for each block we project to the orthogonal subspace of $B^{(i)}$ . Under this projection, the block structure is preserved. The $dn^{\prime}$ -th singular value must be at least the minimum of the $n^{\prime}$ -th singular value of the blocks. In summary we have:

In our analysis, we often also want to bound the smallest singular value of a matrix whose entries are Gaussian random variables. Our analysis mostly builds on the following results in random matrix theory.

For a random rectangular matrix, Rudelson and Vershynin (2009) gives the following nice result:

We will mostly use an immediate corollary of the above lemma with slightly simpler form:

This lemma can also be applied to a projection of a Gaussian matrix:

Then for any $\epsilon>0$ , we have that with probability at least $1-(C\epsilon)^{0.5(m-|\mathcal{J}|-r)}$ , for some absolute constant $C$ , the smallest singular value of the projected random matrix is bounded by:

We justify the last equality below. Note that

and note that the columns of $(P_{J^{c}}\text{Proj}_{S}E)$ lie in the column span of $P_{J^{c}}S$ , therefore,

Finally, note that $P_{J^{c}}S$ , with column rank no more than $r$ , is independent of $P_{J^{c}}E$ , which is a random Gaussian matrix of size $(m-|\mathcal{J}|)\times n$ , therefore we have that $\text{Proj}_{(P_{J^{c}}S)^{\perp}}P_{J^{c}}E$ is equivalent to a $(m-|\mathcal{J}|-r)\times n$ random Gaussian matrix. Since (58) is satisfied, we can apply Lemma G.13 and conclude (59) with high probability. ∎

However, in the smoothed analysis setting, the matrix we are interested in are often not random Gaussian matrices. Instead they are fixed matrices perturbed by Gaussian variables. We call these “perturbed rectangular matrices”, their singular values can be bounded as follows:

Since $m-n>2n$ , we can apply Lemma G.15 to conclude that for any $\epsilon>0$ ,

with probability at least $1-(C\epsilon)^{0.5(m-n)}\leq 1-(C\epsilon)^{0.25m}$ . ∎

G.3 Projection of random vectors

In Step 2, we need to bound the norm of a random vector of the form $u\odot v$ after a projection, where $u$ and $v$ are two Gaussian vectors. In order to show this, we apply the result in Vu and Wang (2013) which provides a concentration bound of projection of well-behaved ( $K$ -concentrated) random vectors.

First we cite the definition of “ $K$ -concentrated” below:

where $med(\cdot)$ denotes the median of a random variable (choose an arbitrary one if there are many).

In order to apply this lemma in our setting, we need to prove the vectors that we are interested in is $K$ -concentrated:

Conditioned on the high probability event that $\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}$ , the vector $[[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}]$ is $2\sqrt{n_{2}})$ -concentrated.

For any $1$ -Lipschitz function $F$ on $[[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}]$ , we can define a function $G(E_{[:,i]},E_{[:,j]})=F([[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}])$ (if $i=j$ then the function $G$ only takes $E_{[:,i]}$ as the variable). Under the assumption that $\|E_{[:,i]}\|,\|E_{[:,j]}\|\leq 2\sqrt{n_{2}}$ , this new function $G$ is $2\sqrt{n_{2}}$ -Lipschitz.

Now we extend $G$ to $G^{*}$ when the input $\|E_{[:,i]}\|,\|E_{[:,j]}\|>2\sqrt{n_{2}}$ . Define the truncation function $\mbox{trunc}(v)=v$ for $\|v\|\leq 2\sqrt{n_{2}}$ , and $\mbox{trunc}(v)=2\sqrt{n_{2}}v/\|v\|$ for $\|v\|>2\sqrt{n_{2}}$ . Define the extended function $G^{*}(E_{[:,i]},E_{[:,j]})=G(\mbox{trunc}(E_{[:,i]}),\mbox{trunc}(E_{[:,j]}))$ , which is still $2\sqrt{n_{2}}$ -Lipschitz since the truncation function is $1$ -Lipschitz.

Note that for the two Gaussian random vectors ${E_{[:,i]},E_{[:,j]}\sim N(0,I)}$ , we can apply Gaussian concentration bound in Theorem G.20 on $G^{*}$ , which implies

Since the probability of the event $\|E_{[:,i]}\|,\|E_{[:,j]}\|>2\sqrt{n_{2}}$ is very small ( $\sim\exp(-\Omega(n_{2}))$ ), we have $\delta=|\mbox{med}(G(E_{[:,i]},E_{[:,j]}))-\mbox{med}(G^{*}(E_{[:,i]},E_{[:,j]}))|$ in the order of $O(\sqrt{n_{2}})$ . Therefore, for $t\sim\Omega(\sqrt{n_{2}})$ , we have

Therefore the random vector $[[E_{[:,i]}\odot E_{[:,j]}]_{s,s^{\prime}}:s<s^{\prime}]$ is $2\sqrt{n_{2}}$ -concentrated. ∎

for all $s>0$ and some absolute constant $C>0$ .

G.4 Gaussian Chaoses

In Step 2, we want to show that the inner product of two random vectors of the form $<\text{Proj}(u\odot v),\text{Proj}(u\odot v)>$ is small, where $u,u^{\prime}$ and $v,v^{\prime}$ are Gaussian vectors. In order to show this, we treat the inner product as a (homogeneous) Gaussian chaos, which is defined to be a homogeneous polynomial over Gaussian random variablesIn fact, the squared norm of projected random vectors considered previously is a special case of Gaussian chaos, and we treat it separately.. Our analysis builds on the results of many works studying the concentration bound of Gaussian chaoses.

For decoupled Gaussian chaoses, we mostly use the following theorem, which is a simple corollary of Lemma G.22.

Suppose $a=(a_{i_{1},...,i_{d}})_{1\leq i_{1},...,i_{d}\leq n}$ is a $d$ -indexed array, and $\|a\|_{F}$ denotes its Frobenius norm. Let $(X^{(j)}_{i})_{1\leq i\leq n,j=1,...,d}$ be independent copies of $X\sim\mathcal{N}(0,I_{n})$ . For any fixed $\epsilon>0$ , with probability at least $1-C\text{exp}\left(-C^{\prime}{n^{2\epsilon/d}}\right)$ ,

Suppose $a=(a_{i_{1},...,i_{d}})_{1\leq i_{1},...,i_{d}\leq n}$ is a $d$ -indexed array. Consider a decoupled Gaussian chaos $G=\sum_{i_{1},...,i_{d}}a_{i_{1},...,i_{d}}X_{i_{1}}^{(1)}\cdots X_{i_{d}}^{(d)}$ , where $X_{i}^{(k)}$ are independent copies of the standard normal random variable for all $i\in[n],k\in[d]$ .

where $C_{d}\in(0,\infty)$ depends only on $d$ , and $S(k,d)$ denotes a set of all partitions of $\{1,\dots,d\}$ into $k$ nonempty disjoint sets $I_{1},\dots,I_{k}$ , and the norm $\|\cdot\|_{I_{1},\dots,I_{k}}$ is given by:

For coupled Gaussian chaoses, namely when $X^{(j)}$ ’s are identical copies of the same $X$ , we first cite the following decoupling theorem in de la Peña and Montgomery-Smith (1995).

(Decoupling) Let $(a_{i_{1},...,i_{d}})_{1\leq i_{1},...,i_{d}\leq n}$ be a symmetric $d$ -indexed array such that $a_{i_{1},...,i_{d}}=0$ whenever there exists $k\neq l$ such that $i_{k}=i_{l}$ . Let $X_{1},...,X_{n}$ be independent random variables and $(X^{(j)}_{i})_{1\leq i\leq n}$ for $j=1,dots,d$ , be independent copies of the sequence $(X_{i})_{1\leq i\leq n}$ , then for all $t\geq 0$ ,

where $L_{d}\in(0,\infty)$ depends only on $d$ .

Essentially this theorem shows for a symmetric tensor with no “diagonal” terms, i.e., $a_{i_{1},...,i_{d}}=0$ whenever there exists $k\neq l$ such that $i_{k}=i_{l}$ ), there is only a constant factor difference between the coupled and decoupled Gaussian chaos distribution.

In most of our applications, we do have symmetric tensors with no “diagonal” terms. However there is one case where we do have diagonal terms, for which we need the following lemma.

Let $(a_{i_{1},i_{2},i_{3}})_{1\leq i_{1},...,i_{3}\leq n}$ be a symmetric $3$ -indexed array and let $\|a\|_{F}$ denote its Frobenius norm. Let $X\sim\mathcal{N}(0,I_{n})$ , then for any $\epsilon>0$ , with probability at least $1-Cn\text{exp}(-C^{\prime}n^{2\epsilon/3})$ ,

The sum of the “diagonal” terms is equal to $3\sum_{i\neq j}a_{i,i,j}X_{i}^{2}X_{j}+1/2\sum_{i}a_{i,i,i}X_{i}^{3}$ . Since $X_{i}$ are independent standard Gaussian random variables, with probability at least $1-Cn\text{exp}(-C^{\prime}n^{2\epsilon/3})$ (union bound), $|X_{i}|\leq n^{\epsilon/3}$ for all $i\in[n]$ . Conditioned on this high probability event, the absolute value of the sum is bounded by:

By Theorem G.21, we know that with probability at least $1-C\text{exp}\left(-C^{\prime}{n^{2\epsilon/3}}\right)$ , the absolute value of the sum of the “non-diagonal” terms is bounded by $\|a\|_{F}n^{\epsilon}$ . Therefore we can conclude the proof by applying the union bound. ∎