Settling the Polynomial Learnability of Mixtures of Gaussians

Ankur Moitra, Gregory Valiant

Introduction

Given access to random samples generated from a mixture of (multivariate) Gaussians, the algorithmic problem of learning the parameters of the underlying distribution is of fundamental importance in physics, biology, geology, social sciences – any area in which such finite mixture models arise . Starting with Dasgupta , a series of work in theoretical computer science has sought to find (or disprove the existence of) an efficient algorithm for this task . In this paper, we settle the polynomial-time learnability of mixtures of Gaussians, giving an algorithm that uses a polynomial amount of data and estimates the components at an inverse polynomial rate under provably minimal assumptions on the mixture (specifically, that the mixing weights and the statistical distance between the components are bounded away from zero). As a corollary, our efficient learning algorithm can be employed to yield the first provably efficient algorithm for near-optimal clustering and density estimation, without any restrictions on the Gaussian mixture. Finally, we note that the runtime and data requirements of our algorithm are exponential in the number of Gaussian components; however, as we show in Section 6, this exponential dependence is necessary. In the remainder of this section, we briefly summarize previous work on this problem, formally state our main result, and then discuss the differences between learning mixtures of 22 Gaussians, and mixtures of many Gaussians, which motivates the high-level outline of our algorithm presented in Section 2. We first define a Gaussian Mixture Model (GMM).

The most popular solution for recovering reasonable estimates of the components of GMMs in practice is the EM algorithm given by Dempster, Laird and Rubin . This algorithm is a local-search heuristic that converges to a set of parameters that locally maximizes the probability of generated the observed samples. However, the EM algorithm is a heuristic only, and makes no guarantees about converging to an estimate that is close to the true parameters. Worse still, the EM algorithm (even for univariate mixtures of just two Gaussians) has been observed to converge very slowly (see Redner and Walker for a thorough treatment ).

In order to even hope for an algorithm (not necessarily even polynomial time), we would need a uniqueness property – that two distinct mixtures of Gaussians must have different probability density functions. Teicher demonstrated that a mixture of Gaussians can be uniquely identified (up to a relabeling components) by considering the probability density function at points sufficiently far from the centers (in the tails). However, such a result sheds little light on the rate of convergence of an estimator: If distinguishing Gaussian mixtures really required analyzing the tails of the distribution, then we would require an enormous number of data samples!

Dasgupta introduced theoretical computer science to the algorithmic problem of provably recovering good estimates for the parameters in polynomial time (and a polynomial number of samples). His technique is based on projecting data down to a randomly chosen low-dimensional subspace, finding an accurate clustering. Given enough accurately clustered points, the empirical means and co-variances of these points will be a good estimate for the actual parameters. Arora and Kannan extended these ideas to work in the much more general setting in which the co-variances of each Gaussian component could be arbitrary, and not necessarily almost spherical as in . Yet both of these techniques are based on the concentration of distances (under random projections), and consequently required that the centers of the components be separated by at least n\sqrt{n} times the largest variance. Vempala and Wang and Achlioptas and McSherry introduced the use of spectral techniques, and were able to overcome this barrier (of relying on distance concentration) by choosing a subspace on which to project based on large principle components. Brubaker and Vempala later gave the first affine-invariant algorithm for learning mixtures of Gaussians, and these ideas proved to be central in subsequent work .

Yet all of these approaches for provably learning good estimates require, at the very least, that the statistical overlap (i.e. one minus the statistical distance) between each pair of components be at least smaller than some constant (in some cases, it is even required that the statistical overlap be exponentially small). Recently, Felman et al gave a polynomial time algorithm for the related problem of density estimation (without any separation condition) for the special case of axis-aligned GMMs (GMMs where each component has principle coordinates aligned with the coordinate axes). Also without any separation requirements, Belkin and Sinha showed that one can efficiently learn GMMs in the special case that all components are identical spherical Gaussians. Most similar to the present work is the recent work of Kalai et al , that gave a learning algorithm for the case of mixtures of two arbitrary Gaussians with provably minimal assumptions.

2 Main Results

In this section we state our main results. To motivate these results, we first state three obvious lower bounds for recovering the parameters of a GMM F=i=1kwiFi,F=\sum_{i=1}^{k}w_{i}F_{i}, which motivate our defintion of ϵ\epsilon-statistically learnable. We provide a formal definition of statistical distance in Section 2.1.

Permuting the order of the components does not change the resulting density, thus at best the hope is to recover the parameter set, {(w1,μ1,Σ1),,(wk,μk,Σk)}.\{(w_{1},\mu_{1},\Sigma_{1}),\ldots,(w_{k},\mu_{k},\Sigma_{k})\}.

We require at least Ω(1/mini(wi))\Omega(1/\min_{i}(w_{i})) samples to estimate the parameters, since we require this number of samples to ensure that we have seen, with reasonable probability, any sample from each component.

If Fi=FjF_{i}=F_{j}, then it is impossible to accurately estimate wi,w_{i}, and in general we require at least Ω(1/D(Fi,Fj))\Omega(1/D(F_{i},F_{j})) samples to estimate wiw_{i}, where D(Fi,Fj)D(F_{i},F_{j}) denotes the statistical distance between the two distributions.

We call a GMM F=iwiFiF=\sum_{i}w_{i}F_{i} ϵ\epsilon-statistically learnable if miniwiϵ\min_{i}w_{i}\geq\epsilon and minijD(Fi,Fj)ϵ\min_{i\neq j}D(F_{i},F_{j})\geq\epsilon.

We now consider what it means to “accurately recover the mixture components”.

Given two nn-dimensional GMMs of kk Gaussians, F=iwiN(μi,Σi)F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\Sigma_{i}) and F^=iw^iN(μ^i,Σ^i)\hat{F}=\sum_{i}\hat{w}_{i}\mathcal{N}(\hat{\mu}_{i},\hat{\Sigma}_{i}), we call F^\hat{F} an ϵ\epsilon-close estimate for FF if there is permutation function π:[k][k]\pi:[k]\rightarrow[k] such that for all i[k]i\in[k]

D(N(μi,Σi),N(μ^π(i),Σ^π(i)))ϵD(\mathcal{N}(\mu_{i},\Sigma_{i}),\mathcal{N}(\hat{\mu}_{\pi(i)},\hat{\Sigma}_{\pi(i)}))\leq\epsilon

Note that the above definition of an ϵ\epsilon-close estimate is affine invariant. This is more natural than defining a good estimate in terms of additive errors, since in general, even estimating the mean of an arbitrary Gaussian to some fixed additive precision is impossible without restrictions on the covariance, as scaling the data will scale the error linearly. We can now state our main theorem:

Given any nn dimensional mixture of kk Gaussians FF that is ϵ\epsilon-statistically learnable, we can output an ϵ\epsilon-close estimate F^\hat{F} and the running time and data requirements of our algorithm (for any fixed kk) are polynomial in nn, and 1ϵ\frac{1}{\epsilon}.

The guarantee in the main theorem implies that the estimated parameters are off by an additive O(ϵσmax2)O(\epsilon\sigma^{2}_{max}), where σmax2\sigma^{2}_{max} is the largest (projected) variance of any Gaussian in any direction.

Throughout this paper, we favor clarity of proof and exposition above optimization of runtime. Since our main goal is show that these problems can be solved in polynomial time, we make very little effort to optimize the exponent. Our algorithms are polynomial in the dimension, inverse of the success probability, and inverse of the target accuracy for any fixed number of Gaussians, kk. The dependency on kk, however, is severe: the degree of our polynomials are linear in kk. In Section 6, we give a natural construction of two GMMs F,FF,F^{\prime} of kk components that are each 1/k1/k-statistically learnable, satisfy D(F,F)ekD(F,F^{\prime})\leq e^{-k}, but FF is not even a 1/41/4-close estimate of FF. Thus we require an exponential in kk number of samples to even distinguish these two mixtures, demonstrating that the exponential dependency on kk in our learning algorithms is inevitable.

There exists two GMMs F,FF,F^{\prime} of kk components each that satisfies the following properties:

F,FF,F^{\prime} are 1/k1/k-statistically learnable.

FF is not a 1/41/4-close estimate of FF^{\prime}.

3 Applications

We can leverage our main theorem to show that we can efficiently perform density estimation for arbitrary GMMs. For density estimation—as opposed to parameter recovery—we only care to recover a distribution that is similar to the GMM, without worrying about matching each component; in particular, if the true weight of one of the components is negligible, we can simply disregard that component with negligible effect on the statistical distance; if two components are nearly identical in statistical distance, we can simply regard them as being merged into one component. For these reasons, we can perform density estimation efficiently without the restriction to ϵ\epsilon-statistically learnable distributions, that was required for Theorem 1.

For any n1,n\geq 1, ϵ,δ>0,\epsilon,\delta>0, and any nn-dimensional GMM F=i=1kwiFi,F=\sum_{i=1}^{k}w_{i}F_{i}, given access to independent samples from FF, there is an algorithm that outputs F^=i=1kwi^Fi^\hat{F}=\sum_{i=1}^{k}\hat{w_{i}}\hat{F_{i}} such that with probability at least 1δ1-\delta over the randomization in the algorithm and in selecting the samples, D(F,F^)ϵ.D(F,\hat{F})\leq\epsilon. Additionally, the runtime and number of samples is bounded by poly(n,1/ϵ,1/δ).poly(n,1/\epsilon,1/\delta).

The proof of this corollary follows immediately from combining our main theorem, with the arguments in Appendix D. In fact, an almost identical approach to how we construct the General Univariate Algorithm from the Basic Univariate Algorithm (again in Appendix D) will work because we can run our main algorithm with many different parameter ranges so that most estimates are correct, and determine a consensus among the estimate so that we can recover a good statistical approximation to FF without any assumptions on the mixture - not even ϵ\epsilon-statistical learnability.

For any n1,n\geq 1, ϵ,δ>0,\epsilon,\delta>0, and any nn-dimensional ϵ\epsilon-statistically learnable GMM F=i=1kwiFi,F=\sum_{i=1}^{k}w_{i}F_{i}, given access to independent samples from FF, there is an algorithm that outputs a classifier CFC^{\prime}_{F} such that with probability at least 1δ1-\delta over the randomization in the algorithm and in selecting the samples, the error of CFC_{F} is at most ϵ\epsilon larger than the error of any classifier CC^{\prime}. Additionally, the runtime and number of samples used is bounded by poly(n,1/ϵ,1/δ).poly(n,1/\epsilon,1/\delta).

The proof of this corollary follows immediately from our main theorem (yet here we need the assumption of ϵ\epsilon-statistical learnability in this case).

4 Comparing Learning Two Gaussians to Learning Many

This work leverages several key ideas initially presented in which were used to show that learning mixtures of two arbitrary Gaussians can be done efficiently. Nevertheless, additional high-level insights, and technical details were required to extend the previous work to give an efficient learning algorithm for an arbitrary mixture of many Gaussians. In this section we briefly summarize the algorithm for learning mixtures of two Gaussians given in , and then describe the hurdles to extending it to the general case. This discussion will provide insights and motivate the high-level structure of the algorithm presented in this paper, as well as clarify which components of the proof are new, and which are straight-forward adaptations of ideas from .

Throughout this discussion, it will be helpful to refer to parameters ϵ1,ϵ2,ϵ3,\epsilon_{1},\epsilon_{2},\epsilon_{3}, which are polynomially related to each other, and satisfy ϵ1<<ϵ2<<ϵ3\epsilon_{1}<<\epsilon_{2}<<\epsilon_{3}.

There are three key components to the proof that mixtures of two Gaussians can be learned efficiently: the 1-d Learnability Lemma, the Random Projection Lemma, and the Parameter Recovery Lemma. The 1-d Learnability Lemma states that given a mixture of two univariate Gaussians whose two components have nonnegligible statistical distance, one can efficiently recover accurate estimates of the parameters of the mixture. It is worth noting that in the univariate case, saying that the statistical distance between two Gaussians is non-negligible is roughly equivalent (polynomially related) to saying that the two sets of parameters are non-negligibly different, ie. the parameter distance, μμ+σ2σ2,|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|, is non-negligible. The Random Projection Lemma states that, given an nn-dimensional mixture of two Gaussians which is in isotropic position and whose components have nonnegligible statistical distance, with high probability over the choice of a random unit vector r,r, the projection of the mixture onto rr will yield a univariate mixture of two Gaussians that have nonnegligible statistical distance (say ϵ3\epsilon_{3}). The final component—the Parameter Recovery Lemma—states that, given a Gaussian GG in nn dimensions, if one has extremely accurate estimates (say to within some ϵ1\epsilon_{1}) of the mean and variance of GG projected onto n2n^{2} sufficiently distinct directions (directions that differ by at least ϵ2>>ϵ1\epsilon_{2}>>\epsilon_{1}) one can accurately recover the parameters of GG.

Given these three pieces, the high-level algorithm for learning mixtures of two Gaussians is straight-forward:

Pick n2n^{2} vectors r1,,rn2,r_{1},\ldots,r_{n^{2}}, that are “close” to rr, say rirϵ2.|r_{i}-r|\approx\epsilon_{2}.

For each i=1,,n2,i=1,\ldots,n^{2}, learn extremely accurate (to accuracy ϵ1<<ϵ2\epsilon_{1}<<\epsilon_{2}) univariate parameters wi,μi,σi,μi,σiw_{i},\mu_{i},\sigma_{i},\mu^{\prime}_{i},\sigma^{\prime}_{i} for the projection of the mixture onto the vector rir_{i}.

Since rirjϵ2,|r_{i}-r_{j}|\approx\epsilon_{2}, it is not hard to show that with high probability, μiμj<<ϵ3,σiσj<<ϵ3|\mu_{i}-\mu_{j}|<<\epsilon_{3},|\sigma_{i}-\sigma_{j}|<<\epsilon_{3} and by the Random Projection Lemma, (μi,σi)(μi,σi)>>ϵ3||(\mu_{i},\sigma_{i})-(\mu^{\prime}_{i},\sigma^{\prime}_{i})||>>\epsilon_{3} thus it will be easy to accurately match up which parameters come from which component in the different projections, and we can apply the Parameter Recovery Lemma to each of the two components.

Some of the above ideas are immediately applicable to the problem of learning mixtures of many Gaussians: we can clearly use the Parameter Recovery Lemma without modification. Additionally, we prove a generalization of the 1-d Learnability Lemma for mixtures of arbitrary numbers of Gaussians, provided each component has non-negligible statistical distance (which, while technically tedious, employs the key idea from of “deconvolving” by a suitably chosen Gaussian—see Appendix B). Given this extension, if we were given a mixture of kk Gaussians in isotropic position, and were guaranteed that the projection onto some vector rr resulted in a univariate mixture of Gaussians for which all pairs of components either had reasonably different means or reasonably different variances, then we could piece together the parts more-or-less as in the 2-Gaussians case.

Unfortunately, however, the Random Projection Lemma, ceases to hold in the general setting. There exist mixtures of just three Gaussians with significant pairwise statistical distances, that are in isotropic position, but have the property that with extremely high probability over choices of random unit vector rr, the projection of the mixture onto rr yields a distribution that is extremely close to a univariate mixture of two Gaussians. This observation would foil the approach employed in the case of just two Gaussians! Another difficulty is that if we take n2n^{2} slightly different projections of our mixture of kk Gaussians, then it is possible that in some of the projections we see what looks like a mixture of k<kk^{\prime}<k univariate Gaussians, and in some other projections we see what looks like a mixture of kk^{\prime\prime} univariate Gaussians. How do we match up estimates from projections onto different directions when the number of Gaussians in the estimate can differ? Or what if each projection results in an estimate that is a mixture of k<kk^{\prime}<k Gaussians. Then how can we recover an nn-dimensional estimate that is a mixture of kk Gaussians?

Outline and Definitions

We now discuss the high-level structure of our learning algorithm, building from the intuition given in the preceding section. At the highest level, our learning algorithm has the following form: Given access to samples from a mixture of kk Gaussians,

Learn the parameters of some mixture of kkk^{\prime}\leq k Gaussians, where each learned Gaussian component roughly corresponds to one or more of the Gaussians in the original mixture.

If k<kk^{\prime}<k, for each of the kk^{\prime} components recovered in the previous step, examine it closely and figure out whether it corresponds to a single Gaussian component of the original mixture, or whether it is a mixture of several of the original components (in which case we will then need to learn the parameters of these sub-components).

To accomplish the first step, we will require accurate parameters of the projection of each of the kk^{\prime} “clusters” of components, onto n2n^{2} univariate projections. To do this, we employ a robust univariate algorithm which, given access to samples from a univariate GMM, essentially searches for some target resolution window (w1,w2)(w_{1},w_{2}) with w1<<w2,w_{1}<<w_{2}, such that the GMM is very close (w1w_{1}-close) to a GMM of kkk^{\prime}\leq k statistically very distinct components (each pair of components is at least w2w_{2} far apart).

Given our robust univariate algorithm, we embark on a partition pursuit where we try to find n2n^{2} vectors that yield consistent and compatible univariate parameter sets–in particular, we require that each of the n2n^{2} univariate projections yields parameters that satisfy three conditions: 1) they have the same number of components, 2) the recovered parameters are much more precise than the distances between the n2n^{2} projections, and 3) that the distance between the components is large enough so as to ensure an accurate matching of the components in the different projections.

Finally, given the ability to accurately recover kkk^{\prime}\leq k high-dimensional Gaussians, where each learned Gaussian component roughly corresponds to one or more of the Gaussians in the original mixture, we want to be able to examine each recovered component, and determine whether it corresponds to a single component of the original mixture, or a set of original components. We first claim that, with high probability, the only way a subset of original components will end up being grouped into a single recovered component is if the covariance of the mixture of that subset of components has a very small minimum eigenvalue. The existence of such an eigenvalue implies that we can accurately cluster the given sample points (whose covariance, recall, is roughly 1). Thus, given a recovered set of k<kk^{\prime}<k parameters, we examine one of these kk^{\prime} components; if the minimum eigenvalue is sufficiently small, we project the set of data samples onto the corresponding eigenvector, and then partition the sample points into two clusters (provided the eigenvalue is sufficiently small, since the overall mixture is in roughly isotropic position, we cluster so as to almost exactly respect some partition of the original components). Given the set of sample points corresponding (roughly) to the recovered component that had small eigenvalue, we simply re-scale the data so that this subsample is now in isotropic position, and recursively run the entire algorithm on this rescaled subsample of the data, which, as we argue, consists of a mixture of k<kk^{\prime\prime}<k components of the original mixture, with high probability. We call this clustering step hierarchical clustering.

We give a detailed summary in Appendix A of the main elements of each of these three main components: the robust univariate algorithm, partition pursuit, and hierarchical clustering.

Given two probability distributions f(x),g(x)f(x),g(x) on n\Re^{n} we can define the statistical distance between these distributions as

We will also be interested in a related notion of the parameter distance between two univariate Gaussians:

Given two univariate Gaussians, F1=N(μ1,σ12),F2=N(μ2,σ22)F_{1}=\mathcal{N}(\mu_{1},\sigma_{1}^{2}),F_{2}=\mathcal{N}(\mu_{2},\sigma_{2}^{2}) we define the parameter distance as

In general, the parameter distance and the statistical distance between two univariate Gaussians can be unrelated. There are pairs of univariate Gaussians with arbitrarily small parameter distance, and yet statistical distance close to 11, and there are pairs of univariate Gaussians with arbitrarily small statistical distance, and yet arbitrarily large parameter distances. But these scenarios can only occur if the variances can be arbitrarily small or arbitrarily large. In many instances in this paper, we will have reasonable upper and lower bounds on the variances and this will allow us to move back and forth from statistical distance and parameter distance, but we will highlight when we are doing so and note why we are able to assume an upper and lower bound on variance in that particular situation.

As we noted, there are ϵ\epsilon-statistically learnable mixtures of three Gaussians that are in isotropic position, but for which with overwhelming probability over a random direction rr, in the projection onto rr, there will be some pair of univariate Gaussians that are arbitrarily close in parameter distance. In these cases, our univariate algorithm may not return an estimate with three components, but will return a mixture which has only two components but is still a good estimate for the parameters of the projected mixture. To formalize this notion, we introduce what we call an ϵ\epsilon-correct sub-division.

Given a GMM of kk Gaussians, F=iwiN(μi,σi2)F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2}) and a GMM of kkk^{\prime}\leq k Gaussians F^=iw^iN(μ^i,σ^i2)\hat{F}=\sum_{i}\hat{w}_{i}\mathcal{N}(\hat{\mu}_{i},\hat{\sigma}_{i}^{2}), we call F^\hat{F} an ϵ\epsilon-correct subdivision of FF if there is a function π:[k][k]\pi:[k]\rightarrow[k^{\prime}] that is onto and

j[k]iπ(i)=jwiw^jϵ\forall_{j\in[k^{\prime}]}|\sum_{i|\pi(i)=j}w_{i}-\hat{w}_{j}|\leq\epsilon

i[k]Dp(Fi,F^π(i))ϵ\forall_{i\in[k]}D_{p}(F_{i},\hat{F}_{\pi(i)})\leq\epsilon

When considering high-dimensional mixtures, we replace the above parameter distance by μiμ^π(i)+ΣiΣ^π(i)Fϵ,\|\mu_{i}-\hat{\mu}_{\pi(i)}\|+\|\Sigma_{i}-\hat{\Sigma}_{\pi(i)}\|_{F}\leq\epsilon, where F\|_{F} denotes the Frobenius norm.

Notationally, we will write (F^,π)Dϵ(F)(\hat{F},\pi)\in\mathcal{D}_{\epsilon}(F) as shorthand for the statement that F^\hat{F} is an ϵ\epsilon-correct subdivision for FF and π\pi is the (onto) function from kk to kk^{\prime} that groups FF into F^\hat{F} as above.

Note that this definition, unlike the definition for ϵ\epsilon-close estimate, uses parameter distance as opposed to statistical distance. This is critical because our univariate algorithm will only be able to return an estimate that is an ϵ\epsilon-correct subdivision when the notion of “close” is in parameter distance, and not statistical distance because in general there could be a component of the univariate mixture of arbitrarily small variance, and we will only be able to match this to an additive guarantee and this implies nothing about the statistical distance between our estimate and the actual component.

A Robust Univariate Algorithm

In this section, we give a learning algorithm for univariate mixtures of Gaussians that will be the building block for our learning algorithm in nn-dimensions. Unlike in the case of , our univariate algorithm will not necessarily be given a mixture of Gaussians for which all pairwise parameter distances are reasonably large. Instead, it could happen that we are given a mixture of (say) three Gaussians so that some pair has arbitrarily small parameter distance.

In the case in which we are guaranteed that all pairwise parameter distances are reasonably large, we can iterate the technical ideas in to give an inductive proof that a simple brute force search algorithm will return good estimates. We call this algorithm the Basic Univariate Algorithm. From this, we build a General Univariate Algorithm that will return a good estimate regardless of the parameter distances, although in order to do so we will need to relax the notion of a good estimate to something weaker: the algorithm return an ϵ\epsilon-correct subdivision.

In this section, we show that we can efficiently learn the parameters of univariate mixtures of Gaussians, provided that the components of the mixture have nonnegligible pairwise parameter distances. We refer to this algorithm as the Basic Univariate Algorithm. Such an algorithm will follow easily from Theorem 4—the polynomially robust identifiability of univariate mixtures. Throughout this section we will consider two univariate mixtures of Gaussians:

We will call the pair F,FF,F^{\prime} ϵ\epsilon-standard if σi2,σi21\sigma_{i}^{2},\sigma_{i}^{\prime 2}\leq 1 and if ϵ\epsilon satisfies:

μi,μi1ϵ|\mu_{i}|,|\mu^{\prime}_{i}|\leq\frac{1}{\epsilon}

μiμj+σi2σj2ϵ|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon and μiμj+σi2σj2ϵ|\mu^{\prime}_{i}-\mu^{\prime}_{j}|+|\sigma_{i}^{\prime 2}-\sigma_{j}^{\prime 2}|\geq\epsilon for all iji\neq j

ϵminπi(wiwπ(i)+μiμπ(i)+σi2σπ(i)2)\epsilon\leq\min_{\pi}\sum_{i}\left(|w_{i}-w_{\pi(i)}^{\prime}|+|\mu_{i}-\mu_{\pi(i)}^{\prime}|+|\sigma_{i}^{2}-\sigma_{\pi(i)}^{\prime 2}|\right), where the minimization is taken over all mappings π:{1,,n}{1,,k}.\pi:\{1,\ldots,n\}\rightarrow\{1,\ldots,k\}.

There is a constant c>0c>0 such that, for any ϵ\epsilon-standard F,FF,F^{\prime} and any ϵ<c\epsilon<c,

While the dependency on kk in Theorem 4 is very bad, as we show in Section 6, this exponential dependency on kk is necessary. Specifically, we give a construction of two 1/k1/k-standard distributions whose statistical distance is O(ek).O(e^{-k}).

Given the polynomially robust identifiability guaranteed by the above theorem, and simple concentration bounds on the ithi^{th} sample moment, it is easy to see that a brute-force search over a set of candidate parameter sets will yield an efficient algorithm that recovers the parameters for a univariate mixtures of Gaussians whose components have pairwise parameter distance at least ϵ\epsilon: roughly, the Basic Univariate Algorithm will take a polynomial number of samples, compute the first 4k24k-2 sample moments, and compare those with the first 4k24k-2 moments of each of the candidate parameter sets. The algorithm then returns the parameter set whose moments most closely match the sample moments. Theorem 4 guarantees that if the first 4k24k-2 sample moments closely match those of the chosen parameter set, then the parameter set must be nearly accurate. To conclude the proof, we argue that a polynomial-sized set of candidate parameters suffices to guarantee that at least one set of parameters will yield moments sufficiently close to the sample moments, which, by simple concentration bounds, will be close to the true moments of the GMM. We state the corollary below, and defer the details of the algorithm and the proof of its correctness to Appendix C.

Suppose we are given access to independent samples from a GMM i=1kwiN(μi,σi2,x)\sum_{i=1}^{k}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x) with mean 0 and variance in the interval [1/2,2],[1/2,2], where wiϵw_{i}\geq\epsilon, and μiμj+σi2σj2ϵ|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon. There exists a Basic Univariate Algorithm that, for any fixed kk, has runtime at most poly(1ϵ,1δ)poly(\frac{1}{\epsilon},\frac{1}{\delta}) samples and with probability at least 1δ1-\delta will output mixture parameters w^i,μ^i,σi^2\hat{w}_{i},\hat{\mu}_{i},\hat{\sigma_{i}}^{2}, so that there is a permutation π:[k][k]\pi:[k]\rightarrow[k] and

2 The General Univariate Algorithm

In this section we seek to extend the Basic Univariate Algorithm of Corollary 5 to the general setting of a univariate mixture of kk Gaussians without any requirements that the components have significant pair-wise parameter distances. In particular, given some target accuracy ϵ,\epsilon, and access to independent samples from a mixture FF of kk univariate Gaussians, we want to efficiently compute a mixture FF^{\prime} of kkk^{\prime}\leq k Gaussians that is an ϵ\epsilon-correct subdivision of F.F.

There is a General Univariate Algorithm which, given ϵ,δ>0\epsilon,\delta>0, and access to a GMM of kk Gaussians, F=iwiN(μi,σi2)F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2}) that is in near isotropic position and satisfies wiϵw_{i}\geq\epsilon, will run in time polynomial in 1/ϵ1/\epsilon and 1/δ,1/\delta, and will return with probability at least 1δ1-\delta a GMM of kkk^{\prime}\leq k Gaussians F^\hat{F} that is an ϵ\epsilon-correct subdivision of FF.

The critical insight in building up such a General Univariate Algorithm is that if two components are actually close enough (in statistical distance), then the Basic Univariate Algorithm could never tell these two components apart from a single (appropriately) chosen Gaussian, because this algorithm only requires a polynomial number of samples. So given a target precision ϵ1\epsilon_{1} for the Basic Univariate Algorithm, there is some window that describes whether or not the algorithm will work correctly. If all pairwise parameter distances are either sufficiently large or sufficiently small, then the Basic Univariate Algorithm will function as if it were given sample access to a mixture that actually does meet the requirements of the algorithm. So as long as no parameter distance falls inside a particular window (which characterizes whether or not the algorithm will behave properly), the algorithm will return a correct computation.

However, when there is some parameter distance that falls inside the Basic Univariate Algorithm’s window, we are not guaranteed that the Basic Univariate Algorithm will fail safely. The idea, then, is to use many disjoint windows (each of which corresponds to running the Basic Univariate Algorithm with some target precision). If we choose enough such windows, each pairwise parameter distance can only corrupt a single run of the Basic Univariate Algorithm so a majority of the computations will be correct. We will never know which computations resulted from cases when no parameter distance fell inside the corresponding window, but we will be able to define a notion of consensus among these different runs of the Basic Univariate Algorithm so that a majority of the runs will agree, and any run which agrees with some computation that was correct will also be close to correct.

We defer the algorithm and proof of correctness to Appendix D

Partition Pursuit

In this section we demonstrate how to use the General Univariate Algorithm to obtain good additive approximations in nn-dimensions. Roughly, we will project the nn-dimensional mixture FF onto many close-by directions, and run the General Univariate Algorithm on each projection. This is also how the algorithm in is able to recover good additive estimates in nn-dimensions. However we will have to cope with the additional complication that our univariate algorithm (the General Univariate Algorithm) does not necessarily return an estimate that is a mixture of kk Gaussians.

We explain in detail how the algorithm in is able to obtain additive approximation guarantees in nn-dimensions, building on a univariate algorithm for learning mixtures of two Gaussians. Let ϵ3>>ϵ2>>ϵ1\epsilon_{3}>>\epsilon_{2}>>\epsilon_{1}. Given any ϵ\epsilon-statistically learnable mixture of two Gaussians in nn-dimensions, with high probability, for a direction rr chosen uniformly at random the parameter distance between the two Gaussians in Pr[F]P_{r}[F] will be at least ϵ3\epsilon_{3}. Then given such a direction rr, we can choose n2n^{2} different directions rx,yr_{x,y} each of which are ϵ2\epsilon_{2}-close to rr (i.e. rrx,yϵ2\|r-r_{x,y}\|\approx\epsilon_{2}). The mean and variance of a component in Pu[F]P_{u}[F] change continuously as we vary the direction uu from rr to rx,yr_{x,y}, and this implies that for ϵ2<<ϵ3\epsilon_{2}<<\epsilon_{3}, we will be able to consistently pair up estimates recovered from each projection, so that for each Gaussian we have n2n^{2} different estimates in different directions of the projected mean and variance. Each of these estimates are accurate to within ϵ1\epsilon_{1} (i.e. this is the target precision that is given to the univariate algorithm). For any Gaussian, an estimate for the projected mean and the projected variance for a direction rr gives a linear constraint on the mean vector μ\mu and the co-variance matrix Σ\Sigma. As a result, if ϵ1<<ϵ2\epsilon_{1}<<\epsilon_{2} then the precision is much finer than the condition number of this system of linear constraints on μ,Σ\mu,\Sigma and this yields an accurate estimate in nn-dimensions.

Let ϵ2,ϵ1>0\epsilon_{2},\epsilon_{1}>0. Suppose m0μr|m^{0}-\mu\cdot r|,mijμrij|{m}^{ij}-\mu\cdot r^{ij}|, v0rTΣr|v^{0}-r^{T}\Sigma r|,vij(rij)TΣrij|v^{ij}-(r^{ij})^{T}\Sigma r^{ij}| are all at most ϵ1\epsilon_{1}. Then Solve outputs μ^Rn\hat{\mu}\in{\bf R}^{n} and Σ^Rn×n\hat{\Sigma}\in{\bf R}^{n\times n} such that μ^μ<ϵ1nϵ2\|\hat{\mu}-\mu\|<\frac{\epsilon_{1}\sqrt{n}}{\epsilon_{2}}, and Σ^ΣF6nϵ1ϵ22\|\hat{\Sigma}-\Sigma\|_{F}\leq\frac{6n\epsilon_{1}}{\epsilon_{2}^{2}}. Furthermore, Σ^0\hat{\Sigma}\succeq 0 and Σ^\hat{\Sigma} is symmetric.

The algorithm to which this lemma refers is given in Appendix F.2

However, the General Univariate Algorithm does not always return a mixture of kk Gaussians, and can in fact return a mixture F^u\hat{F}^{u} of k<kk^{\prime}<k Gaussians provided that this mixture is still an ϵ1\epsilon_{1}-correct subdivision of Pu[F]P_{u}[F] (for some direction uu). But then what happens if we consider two close-by directions, uu and vv and the number of Gaussians in the estimate F^u\hat{F}^{u} is different from the number of Gaussians in the estimate F^v\hat{F}^{v}?

The key insight is that if we choose some direction rr, and close-by directions rx,yr_{x,y}, if any estimate returned for rx,yr_{x,y} has more components than the estimate returned for the direction rr, then we have made progress because we have identified another Gaussian in the original mixture FF. So here, rather than trying to use this estimate for rx,yr_{x,y}, we just start the algorithm over using rx,yr_{x,y} as the original direction, and considering n2n^{2} close-by directions.

The additional complication is that we must make sure every time we see a different number of components, that we’ve made progress. We can do so by maintaining a Window from ϵ1\epsilon_{1} to ϵ3\epsilon_{3}, and we say that a Window is satisfied if the estimate F^r\hat{F}^{r} returned for some direction rr has all pairs of Gaussians either at parameter distance at least ϵ3\epsilon_{3}, or at most the precision ϵ1\epsilon_{1} of the General Univariate Algorithm. Then if we consider close-by directions rx,yr_{x,y} (that are ϵ2\epsilon_{2}-close to rr, for ϵ1<<ϵ2<<ϵ3\epsilon_{1}<<\epsilon_{2}<<\epsilon_{3}), we can ensure that whenever we see a different number of components in the estimate corresponding to some direction rx,yr_{x,y}, there are more components. When we see more components, we may need to shift the Window WW to a Window WW^{\prime} so that in this new direction rx,yr_{x,y}, the Window WW^{\prime} is satisfied. We take rx,yr_{x,y} as the new base direction. But we have made progress because we have identified a new component in the mixture.

We state our main theorem in this section, and defer the algorithm and proof to Appendix F

Given an ϵ\epsilon-statistically learnable GMM FF in isotropic position, the Partition Pursuit Algorithm will recover an ϵ\epsilon-correct sub-division F^\hat{F} and if FF has more than one component, F^\hat{F} also has more than one component.

Clustering and Recursion

In this section, we give an efficient algorithm for learning an estimate F^\hat{F} that is ϵ\epsilon-close to the actual mixture FF. Partition Pursuit assumes that the mixture FF is in isotropic position, and even though FF is not necessarily in isotropic position, we will be able to get around this hurdle by first taking enough samples to compute a transformation that places the mixture FF in nearly isotropic position and then applying this transformation to each sample from the oracle. The main technical challenge in this section is actually what to do when the mixture F^\hat{F} returned by Partition Pursuit is a good additive approximation to FF (i.e. it is an ϵ1\epsilon_{1}-correct subdivision with ϵ1<<ϵ\epsilon_{1}<<\epsilon), but is not ϵ\epsilon-close to the mixture FF. This can only happen if there is a component in FF that has a very small variance in some direction. Consider for example, two Gaussians in one dimension N(0,γ)\mathcal{N}(0,\gamma) and N(0,γ+ϵ1)\mathcal{N}(0,\gamma+\epsilon_{1}). Even if ϵ1\epsilon_{1} is very small, if γ\gamma is much smaller, then the statistical distance between these two Gaussians can be arbitrarily close to 11.

So the high-level idea is that if the estimate F^\hat{F} returned by Partition Pursuit is not ϵ\epsilon-close to FF (but F^\hat{F} is an ϵ1\epsilon_{1}-correct subdivision of FF for ϵ1<<ϵ\epsilon_{1}<<\epsilon), then it must be the case that some component FiF_{i} of FF has a co-variance matrix Σi\Sigma_{i} so that for some direction vv, vTΣivv^{T}\Sigma_{i}v is very small. Then we can use this direction vv to still make progress: If we project the mixture FF onto vv, we will be able to cluster accurately. There will be some partition of the Gaussians in FF into two disjoint, non-empty sets of components S,TS,T and some clustering scheme that can accurately clusters points sampled from FF into points that originated from a component in SS and points that originated from a component in TT. So we can hope to accurately cluster enough points sampled from FF into sets of points that originated from SS and sets of points that originated from TT, and then we can run our learning algorithm (with a smaller maximum of at most k1k-1 components) on each set of points. By induction, this learning algorithm will return close estimates, and if we take a convex combination of these estimates we obtain a new estimate F^\hat{F}^{\prime} that is ϵ\epsilon-close to FF. The main technical challenge is in showing that if there is some component of FF with a small enough variance in some direction vv, then we can accurately cluster points sampled from FF. Given this, our main result follows almost immediately from an inductive argument.

2 How to Cluster

Here we give formalize the notion of a clustering scheme. Additionally, we state the key lemmas that will be useful in showing that if F^\hat{F} is not an ϵ\epsilon-close estimate to FF, then we can use F^\hat{F} to construct a good clustering scheme that makes progress on our learning problem.

We will call A,BnA,B\subset\Re^{n} a clustering scheme if AB=A\cap B=\emptyset

For AnA\subset\Re^{n}, we will write P[Fi,A]P[F_{i},A] to denote PrxFi[xA]Pr_{x\sim F_{i}}[x\in A] - i.e. the probability that a randomly chosen sample from FiF_{i} is in the set AA.

If we have a direction vv and some component F^i\hat{F}_{i} which has small variance in direction vv, we want to use this direction to cluster accurately. The intuition is clearest in the case of mixtures of two Gaussians: Suppose one of the components, say F^1\hat{F}_{1}, had small variance on direction vv. If the entire mixture is in isotropic position, then the variance of the mixture when projected onto direction vv is 11. This can only happen if either the difference in projected means vT(μ^1μ^2)|v^{T}(\hat{\mu}_{1}-\hat{\mu}_{2})| is large or the variance of F^2\hat{F}_{2} on direction vv is large. In the first case, we can choose an interval around each projected (estimate) mean vTμ^1v^{T}\hat{\mu}_{1} and vTμ^2v^{T}\hat{\mu}_{2} so that with high probability, any point sampled from F1F_{1} is contained in the interval around vTμ^1v^{T}\hat{\mu}_{1} and similarly for F2F_{2}. If, instead, the variance of F2F_{2} when projected onto vv is large, then again a small interval around the point vTμ^1v^{T}\hat{\mu}_{1} will contain most samples from F1F_{1}, but because the maximum density of vTF2v^{T}F_{2} is never large and the interval around vTμ^1v^{T}\hat{\mu}_{1} is not too large either, most samples from F2F_{2} will not be contained in the interval. This idea is the basis of our clustering lemmas, although there will be additional complications when the mixture contains more than two Gaussians, the intuition is close to the same.

Let (F^,π)Dϵ1(F)(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F). Suppose also that F^\hat{F} is a mixture of kk^{\prime} components.

Suppose that for some direction vv, for all ii: vTΣ^ivϵ2v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{2}, for ϵ1ϵ22ϵ3\epsilon_{1}\leq\frac{\sqrt{\epsilon_{2}}}{2\epsilon_{3}}. If there is some bi-partition S[k]S\subset[k^{\prime}] s.t. iS,j[k]SvTμ^ivTμ^j3ϵ2ϵ3\forall_{i\in S,j\in[k^{\prime}]-S}|v^{T}\hat{\mu}_{i}-v^{T}\hat{\mu}_{j}|\geq\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}} then there is a clustering scheme (A,B)(A,B) (based only on F^\hat{F}) so that for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i), P[Fi,A]1ϵ3P[F_{i},A]\geq 1-\epsilon_{3} and for all iS,jπ1(i)i\notin S,j\in\pi^{-1}(i), Pr[Fi,B]1ϵ3Pr[F_{i},B]\geq 1-\epsilon_{3}.

This lemma corresponds to the first case in the above thought exercise when there is some bi-partition of the components so that all pairs of projected means across the bi-partition are reasonably separated.

Suppose that for some direction vv and some i[k]i\in[k^{\prime}] such that: vTΣ^ivϵmv^{T}\hat{\Sigma}_{i}v\leq\epsilon_{m}, for ϵm>>ϵ1\epsilon_{m}>>\epsilon_{1}. If there is some bi-partition S[k]S\subset[k^{\prime}] s.t.

(and ϵt<<ϵ33\epsilon_{t}<<\epsilon_{3}^{3}) then there is a clustering scheme A,BA,B such that for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i), P[Fi,A]1ϵ3P[F_{i},A]\geq 1-\epsilon_{3} and for all iS,jπ1(i)i\notin S,j\in\pi^{-1}(i), Pr[Fi,B]1ϵ3Pr[F_{i},B]\geq 1-\epsilon_{3}.

This lemma corresponds to the second case to the second case, when there is some bi-partition of the components so that one side of the bi-partition has projected variances that are much larger than the other.

The proofs of these lemmas, along with additional technical details are given in Appendix G.2

3 Making Progress when there is a Small Variance

We state a lemma from which formalizes the intuition that if there is no component in F^\hat{F} with small variance in any direction, the F^\hat{F} is a good statistical estimate to FF:

Suppose μ^iμiϵ1\|\hat{\mu}_{i}-\mu_{i}\|\leq\epsilon_{1}, Σ^iΣiFϵ1\|\hat{\Sigma}_{i}-\Sigma_{i}\|_{F}\leq\epsilon_{1}, and w^iwiϵ1|\hat{w}_{i}-w_{i}|\leq\epsilon_{1}, if either Σi1212ϵm\|\Sigma^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}} or Σ^i1212ϵm\|\hat{\Sigma}^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}} then

We will use this lemma as a building block to prove:

The Hierarchical Clustering Algorithm either returns an ϵ\epsilon-close statistical estimate F^\hat{F} for FF, or returns a clustering scheme A,BA,B such that there is some bipartition S[k]S\subset[k] such that for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i), P[Fi,A]1ϵ3P[F_{i},A]\geq 1-\epsilon_{3} and for all iS,jπ1(i)i\notin S,j\in\pi^{-1}(i), Pr[Fi,B]1ϵ3Pr[F_{i},B]\geq 1-\epsilon_{3}. And also S,[k]SS,[k]-S are both non-emtpy.

We defer the algorithm and the proof of correctness to Appendix G.

4 Recursion

[Isotropic Projection Lemma] Given a mixture of kk nn-Dimensional Gaussians F=iwiFiF=\sum_{i}w_{i}F_{i} that is in isotropic position and is ϵ\epsilon-statistically learnable, with probability 1δ\geq 1-\delta over a randomly chosen direction uu, there is some pair of Gaussians Fi,FjF_{i},F_{j} s.t. Dp(Pu[Fi],Pu[Fj])ϵ5δ250n2D_{p}(P_{u}[F_{i}],P_{u}[F_{j}])\geq\frac{\epsilon^{5}\delta^{2}}{50n^{2}}.

We defer a proof of this lemma to Appendix H

Let Ha(ϵ,δ,k),Hi(ϵ,δ,k)H_{a}(\epsilon,\delta,k),H_{i}(\epsilon,\delta,k) be the inverse of the number of samples needed by the High Dimensional Anisotropic Algorithm and the High Dimensional Isotropic Algorithm respectively when given target precision ϵ\epsilon (and access to an ϵ\epsilon-statisically learnable distribution), an upper bound kk on the number of Gaussians, and an error parameter δ\delta.

Given k,ϵk,\epsilon, and a mixture of at most kk Gaussians FF that is ϵ\epsilon-statistically learnable High Dimensional Anisotropic Algorithm returns an estimate F^\hat{F} that is ϵ\epsilon-close to the actual mixture FF.

We prove this theorem by induction. Let ϵk1=Ha(ϵ2,δ,k1)\epsilon_{k-1}=H_{a}(\frac{\epsilon}{2},\delta,k-1).

We assume by induction that both the High Dimensional Isotropic Algorithm and the High Dimensional Anisotropic Algorithm return an ϵ\epsilon-close estimate for all values of kk1k^{\prime}\leq k-1. We then consider both algorithms for the case of kk:

Consider the High Dimensional Isotropic Algorithm which is given k,ϵk,\epsilon, and a mixture of at most kk Gaussians FF that is ϵ\epsilon-statistically learnable and is in isotropic position: We first run the Hierarchical Clustering Algorithm with parameters ϵ,δ,ϵ3,k\epsilon,\delta,\epsilon_{3},k where ϵ3=12ϵϵk1δ\epsilon_{3}=\frac{1}{2}\epsilon\epsilon_{k-1}\delta. If this algorithm returns an estimate F^\hat{F}, we can return this estimate and it is guaranteed to be ϵ\epsilon-close to the actual mixture.

Note that if the number of components in FF is 11, then the Hierarchical Clustering Algorithm will necessarily return an estimate F^\hat{F}, because there is no partitioning scheme that partitions FF into two subsets of components that are both non-empty. This establishes the base case in the inductive argument.

Otherwise the output of the Hierarchical Clustering Algorithm is a clustering scheme (A,B)(A,B) with the property that there is some partition S,TS,T of the Gaussians in FF (S,TS,T\neq\emptyset) and for all iSi\in S, PrxFi[xiA]1ϵ3Pr_{x\sim F_{i}}[x_{i}\in A]\geq 1-\epsilon_{3}, and jTj\in T, PrxFi[xiB]1ϵ3Pr_{x\sim F_{i}}[x_{i}\in B]\geq 1-\epsilon_{3}. Let FS,FTF_{S},F_{T} be the (re-weighted) mixtures that result from placing every component in SS from FF into FSF_{S}, and every component in TT from FF into FTF_{T}. Note that FS,FTF_{S},F_{T} are still ϵ\epsilon-statistically learnable, but may not be in isotropic position any longer.

All samples x1,x2,...,xmx_{1},x_{2},...,x_{m} are either in AA or BB

The number of samples in AA and the number of samples in BB will each be at least 1ϵk1\frac{1}{\epsilon_{k-1}}

All samples are clustered correctly - i.e. if xiAx_{i}\in A, then xix_{i} was generated by some Gaussian FjF_{j} with jSj\in S and if xiBx_{i}\in B, then xix_{i} was generated some Gaussian FjF_{j} with jTj\in T.

Let XS,XTX_{S},X_{T} be the samples from x1,x2,...,xmx_{1},x_{2},...,x_{m} that are in A,BA,B respectively. We can then run the the High Dimensional Anisotropic Algorithm with parameters ϵ2,δ,k1\frac{\epsilon}{2},\delta,k-1 on each set XSX_{S} and XTX_{T}. Let the algorithm return the mixtures F^A,F^B\hat{F}_{A},\hat{F}_{B} respectively. We return a convex combination of these mixtures, F^=XSmF^A+XTmF^B\hat{F}=\frac{|X_{S}|}{m}\hat{F}_{A}+\frac{|X_{T}|}{m}\hat{F}_{B}. The estimates F^A,F^B\hat{F}_{A},\hat{F}_{B} are ϵ\epsilon-close estimates to FS,FTF_{S},F_{T} respectively. We can write F=wAFA+wBFBF=w_{A}F_{A}+w_{B}F_{B}, and with high probability XSm\frac{|X_{S}|}{m}, XTm\frac{|X_{T}|}{m} will be close to wA,wBw_{A},w_{B} respectively. Then this implies that F^\hat{F} is ϵ\epsilon-close to FF. Thus by induction, the output of the High Dimensional Isotropic Algorithm is an estimate F^\hat{F} that is ϵ\epsilon-close to FF.

We need to also verify by induction that the output of the High Dimensional Anisotropic Algorithm is also an ϵ\epsilon-close estimate to FF. So suppose that the input to the High Dimensional Anisotropic Algorithm is a mixture of at most kk Gaussians, that is ϵ\epsilon-statistically learnable and is not necessarily in isotropic position.

Exponential Dependence on k𝑘k is Inevitable

In this section, we present a lower bound, showing that the inverse exponential dependency on the number of Gaussian components in each mixture is necessary, even for mixtures in just one dimension. We show this by giving a simple construction of two 1-dimensional distributions, D1,D2D_{1},D_{2} that are 1/(2m)1/(2m)-standard. Specifically, they are mixtures of at most mm Gaussians, such that the weights of all components of each mixture are at least 1/(2m)1/(2m), and the parameter distance between the pair of distributions is at least 1/(2m),1/(2m), but D1D21em/30,||D_{1}-D_{2}||_{1}\leq e^{-m/30}, for sufficiently large mm. The construction hinges on the inverse exponential (in kmk\approx\sqrt{m}) statistical distance between N(0,2),\mathcal{N}(0,2), and the mixtures of infinitely many Gaussians of unit variance whose components are centered at multiples of 1/k1/k, with the weight assigned to the component centered at i/ki/k being given by N(0,1,i/k).N(0,1,i/k). Verifying that this is true is a straight-forward exercise in Fourier analysis. The final construction truncates the mixture of infinitely many Gaussians by removing all the components with centers a distance greater than kk from 0. This truncation clearly has negligibly small effect on the distribution. Finally, we alter the pair of distributions by adding to both distributions, Gaussian components of equal weight with centers at k2/k,(k2+1)/k,,k,-k^{2}/k,(-k^{2}+1)/k,\ldots,k, which ensures that in the final pair of distributions, all components have significant weight.

There exists a pair D1,D2D_{1},D_{2} of 1/(4k2+2)1/(4k^{2}+2)-standard distributions that are each mixtures of k2+1k^{2}+1 Gaussians such that

We can define Fk(x)N=cki=NN1πe(i/k)2N(i/k,1/2,x),F_{k}(x)^{N}=c_{k}\sum_{i=-N}^{N}\frac{1}{\sqrt{\pi}}e^{-(i/k)^{2}}\mathcal{N}(i/k,1/2,x),, and we give a plot of FkNF_{k}^{N} for k=2,N=2k=2,N=2 in Figure 1a and the corresponding plot of each component, and in Figure 1b we give a plot of each component of FkNF_{k}^{N} for k=4,N=8k=4,N=8.

Conclusions

We give an estimator that converges to the true distribution at an inverse polynomial rate, and this result has implications for polynomial-time clustering and density estimation. A natural question is: “What is the optimal rate of convergence?” This question is wide open, and all we can say for certain is that the rate of convergence is at worst polynomial in the dimension and the inverse of the desired accuracy, and exponential in the number of components. We made no attempt here to optimize the constants in the exponent of the rate of convergence and even if we had, the theoretical runtime would still be extremely impractical. This, however, raises the practically relevant question of whether aspects of our algorithm can be combined with existing heuristics that seem to perform well in most applications. For example, the brute-force-search component of our univariate algorithm is clearly expensive; perhaps employing existing heuristics (such as the EM algorithm) for the univariate problems, in conjunction with aspects of our dimension-reduction machinery might yield improved efficiency on real-world instances.

Additionally, we note that much of the machinery we developed—from the “deconvolution” argument for the polynomially robust identifiability, to the partition pursuit and hierarchical clustering for the dimension reduction arguments, seem to be relatively general and robust. We suspect that such tools could be applied to yield corresponding results for other families of distributions.

Acknowledgements

We are grateful to Paul Valiant, for suggesting the lower-bound construction of Section 6 and many helpful discussions throughout; and are indebted to Adam Tauman Kalai for introducing us to this beautiful line of research, and for all his guidance, encouragement and deep insights about mixtures of Gaussians.

References

Appendix A In-Depth Outline

To start, suppose that we are given access to independent samples from a mixture of kk Gaussians, and given a unit vector rr with the following promise: for each pair G1,G2G_{1},G_{2} of components, in the projection of the mixture onto rr, either the projections of G1G_{1} and G2G_{2} have reasonably different parameters (>ϵ2>\epsilon_{2}), or their projections are so close that our algorithm could never tell them apart from a single Gaussian (parameter distance at most ϵ0<<ϵ1,\epsilon_{0}<<\epsilon_{1}, where ϵ1<<ϵ2\epsilon_{1}<<\epsilon_{2} is the desired accuracy of the 1-d parameter learning algorithm. In this case, our 1-d parameter recovery algorithm will perform correctly, and return some ϵ1\epsilon_{1}-accurate parameters for a mixture of kkk^{\prime}\leq k components.

Thus in general, for a given desired accuracy ϵ1\epsilon_{1}, there is some critical window, namely [ϵ0,ϵ2],[\epsilon_{0},\epsilon_{2}], associated with the 1-d learning algorithm that determines if it will function correctly. In a given projection, as long as no pair of components have parameter distances that fall within this window, then any pair of Gaussians is either reasonably different in parameters, or so close in parameters that the algorithm will never be able to tell the difference.

In this way, if an algorithm designer is told the parameters of a given mixture of kk Gaussians, he could construct an algorithm that would have been able to find some of these parameters. The algorithm would project onto a random direction rr, and based on the pairwise parameter distances between the univariate Gaussians, there will be some window (i.e. some choice of a target precision with which to run the algorithm), bounded below by some polynomial in the desired output accuracy, so that the algorithm would function correctly. The problem is that while there is always some window that would work for any mixture of kk univariate Gaussians, we don’t know what window to use, and in general if we run the algorithm on a bad window, we aren’t guaranteed that the algorithm will fail in a safe way.

To get around this, we run the 1-d Learning Algorithm algorithm many times on different windows that do not intersect. Because there are only kk univariate Gaussians, and thus at most k2k^{2} different distances between component parameters in any given projection, at most k2k^{2} of these windows can be corrupted. If we choose sufficiently many windows (but still a polynomial number), a majority of the windows will yield correct parameters. Even though we can never determine which windows were good and which were bad, we can return the parameters generated by some window in consensus with a majority, and in this way, regardless of whether the window was good or bad, it is in consensus with a good window and must also be close to the correct parameters.

It is important to stress that even after the above consensus is conducted on a given projection, we still cannot be guaranteed that our univariate algorithm returns a mixture of kk Gaussians. Instead, it will return some mixture of k<kk^{\prime}<k Gaussians, where an element in the mixture might correspond to (say) a pair of Gaussians in the original mixture that were too close to differentiate in the given projection.

A.2 Partition Pursuit

This brings us to the second obstacle outlined in Section 1.4: in order to recover the nn-dimensional parameters, we will need estimates of the parameters of the Gaussians when projected on many different directions. But, as mentioned above, the univariate algorithm will not necessarily return a mixture of kk Gaussians, and even if we choose a direction rr^{\prime} that is sufficiently close to rr (but still rr>>ϵ1,|r-r^{\prime}|>>\epsilon_{1}, the accuracy of the 1-d algorithm), it may be the case that the univariate algorithm for direction rr returned kk^{\prime} Gaussians, and the univariate algorithm for direction rr returned kkk^{\prime\prime}\neq k^{\prime} Gaussians. How do we pair up these estimates in a consistent manner?

The key insight is that we are actually making progress if we see more Gaussians when projecting onto a different direction. If we choose a new direction, and we see a mixture of Gaussians with more components, we should backtrack and start over as if this was the direction we originally chose. We may have to slide the window corresponding to our 1-d algorithm and learn at a finer precision than what we chose previously, but this finer precision will still be polynomially bounded. Effectively, we are clustering the Gaussian components into clusters with the property that the components of each cluster are indistinguishable in each of the one-dimensional projections that we have considered. In order to make this idea work properly, we also need to ensure that we maintain a minimum parameter distance between all Gaussians clusters that we have seen (i.e. this distance is much larger than our 1-d accuracy ϵ1\epsilon_{1}), so that when we choose a new direction rr^{\prime} sufficiently close to r,r, Gaussian component cannot switch clusters. Thus at each stage, each cluster of Gaussians either continues to be a cluster, or it gets partitioned into several clusters of Gaussians.

A.3 Hierarchical Clustering

The final obstacle outlined in Section 1.4 can be addressed easily via an accurate clustering of the input samples together with a kk-Gaussian analog of the Random Projection Lemma. Intuitively, the only way that a set of high-dimensional Gaussians with significant statistical distance, when projected onto a random vector, will appear nearly identical is if the re-weighted mixture of the Gaussians in this set is very far from isotropic position. This motivates the hope that if we have recovered some mixture of k<kk^{\prime}<k components, then it must be the case that whichever of these components contains multiple original Gaussians has covariance matrix very far from isotropic. Thus such a component must have at least one very small eigenvalue. Given the eigenvector corresponding to such an eigenvalue, we should be able to very accurately cluster the sample points into some partition of the original Gaussians. This motivates the following slightly more specific version of the high-level algorithm approach: Given that we have recovered parameters for a mixture of k<kk^{\prime}<k components:

Learn the parameters of some mixture of kkk^{\prime}\leq k Gaussians, where each learned Gaussian component corresponds to one or more of the Gaussians in the original mixture.

If k<kk^{\prime}<k, for each of the kk^{\prime} components recovered in the previous step:

If the ithi^{th} component has covariance matrix “not too far” from isotropic, then conclude that it corresponds to a single Gaussian in the original mixture.

there is a very small eigenvalue of the covariance matrix, so project the sample points onto the corresponding eigenvector, and accurately cluster the sample points that come from this component

Given the sample points corresponding to one of the components, rescale these data points so this component (which was very far from isotropic), is now in isotropic position, and repeat the entire algorithm on this sub-mixture

The final observation that guarantees that our algorithm will make progress with every iteration, and thus terminate after a polynomial number of steps is the following analog of the Random Projection Lemma for the kk-Gaussians setting. Given a mixture of kk Gaussians in isotropic position, with high probability over random unit vectors rr, there will be some pair of projected Gaussians whose parameters are reasonably different. Thus, in every projection, we will, with high probability, see what appears to be a mixture of at least two components.

Appendix B Polynomially Robust Identifiability

We now sketch the rough outline of the proof of Theorem 4. While there are considerable technical details, the main proof ideas are identical to those used in to prove the analogous theorem in the case that n=k=2.n=k=2.

Our proof will be via induction on max(n,k)\max(n,k). We start by considering the constituent Gaussian of minimal variance in the mixtures. Assume without loss of generality that this minimum variance component is the first component of F,F, and denote it by N1N_{1}. If there is no component of FF^{\prime} whose mean, variance, and mixing weight very closely matches those of N1N_{1}, then we argue that there is a significant disparity in the low order moments of FF and FF^{\prime}, no matter what the other Gaussian components are. (This argument is rather involved, and we will give the high-level sketch in the next paragraph.) If there is a component N1N_{1}^{\prime} of FF^{\prime} whose mean, variance, and mixture weight very closely matches those of N1N_{1}, then we argue that we can remove N1N_{1} from FF and N1N_{1}^{\prime} from FF^{\prime} with only negligible effect on the discrepancy in the low-order moments. More formally, let HH be the mixture of n1n-1 Gaussians obtained by removing N1N_{1} from FF, and rescaling the weights so as to sum to one, and define H,H^{\prime}, a mixture of k1k-1 Gaussians analogously. Then, assuming that NN and NN^{\prime} are very similar, the disparity in the low-order moments of HH and HH^{\prime} is almost the same as the disparity in low-order moments of FF andFF^{\prime}. We can then apply the induction hypothesis to the mixtures HH and HH^{\prime}.

We now return to the problem of showing that if the skinniest Gaussian in FF cannot be paired with a component of FF^{\prime} with similar mean, variance, and weight, that there must be a polynomially-significant discrepancy in the low-order moments of FF and FF^{\prime}. This step relies on ’deconvolving’ by a Gaussian with an appropriately chosen variance (this corresponds to running the heat equation in reverse for a suitable amount of time). We define the operation of deconvolving by a Gaussian of variance α\alpha as Fα\mathcal{F}_{\alpha}; applying this operator to a mixture of Gaussians has a particularly simple effect: subtract α\alpha from the variance of each Gaussian in the mixture (assuming that each constituent Gaussian has variance at least α\alpha). If α\alpha is negative, this is just convolution.

Let F(x)=i=1nwiN(μi,σi2,x)F(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x) be the probability density function of a mixture of Gaussian distributions, and for any α<miniσi2,\alpha<\min_{i}\sigma_{i}^{2}, define

The key step will be to show that if the skinniest Gaussian in either of the two mixtures cannot be paired with a nearly identical Gaussian in the other mixture, then there is some α\alpha for which the resulting mixtures, after applying the operation Fα\mathcal{F}_{\alpha}, have large statistical distance. Intuitively, this deconvolution operation allows us to isolate Gaussians in each mixture and then we can reason about the statistical distance between the two mixtures locally, without worrying about the other Gaussians in the mixture.

Given this statistical distance between the transformed pair of mixtures, we the fact that there are relatively few zero-crossings in the difference in probability density functions of two mixtures of Gaussians (Proposition 19) to show that this statistical distance gives rise to a discrepancy in at least one of the low-order moments of the pair of transformed distributions. To complete the argument, we then show that applying this transform to a pair of distributions, while certainly not preserving statistical distance, roughly preserves the combined disparity between the low-order moments of the pair of distributions. The complete proof can be found in Appendix B.

B.2 Theorem 4

In this section we give the complete proof of the polynomially robust identifiability of univariate mixtures of kk Gaussians (Theorem 4). For convenience, we restate the theorem and all necessary definitions. We make frequent reference to the simple properties of Gaussians and tail bounds provided in Appendix J. Throughout this section we will consider two univariate mixtures of Gaussians:

Definition 6. We will call the pair F,FF,F^{\prime} ϵ\epsilon-standard if σi2,σi21\sigma_{i}^{2},\sigma_{i}^{\prime 2}\leq 1 and if ϵ\epsilon satisfies:

μi,μi1ϵ|\mu_{i}|,|\mu^{\prime}_{i}|\leq\frac{1}{\epsilon}

μiμj+σi2σj2ϵ|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon and μiμj+σi2σj2ϵ|\mu^{\prime}_{i}-\mu^{\prime}_{j}|+|\sigma_{i}^{\prime 2}-\sigma_{j}^{\prime 2}|\geq\epsilon for all iji\neq j

ϵminπi(wiwπ(i)+μiμπ(i)+σi2σπ(i)2)\epsilon\leq\min_{\pi}\sum_{i}\left(|w_{i}-w_{\pi(i)}^{\prime}|+|\mu_{i}-\mu_{\pi(i)}^{\prime}|+|\sigma_{i}^{2}-\sigma_{\pi(i)}^{\prime 2}|\right), where the minimization is taken over all mappings π:{1,,n}{1,,k}.\pi:\{1,\ldots,n\}\rightarrow\{1,\ldots,k\}.

Theorem 4. There is a constant c>0c>0 such that, for any ϵ\epsilon-standard F,FF,F^{\prime} and any ϵ<c\epsilon<c,

The following definition of the deconvolution operation will be central to our proof of Theorem 4:

Definition 10. Let F(x)=i=1nwiN(μi,σi2,x)F(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x) be the probability density function of a mixture of Gaussian distributions, and for any α<miniσi2,\alpha<\min_{i}\sigma_{i}^{2}, define

The following lemma argues that if the skinniest Gaussian in mixture FF can not be matched with a sufficiently similar component in the mixture FF^{\prime}, then there is some α\alpha, possibly negative, such that maxxFα(F)(x)Fα(F)(x)\max_{x}|\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)| is significant. Furthermore, every component in the transformed mixtures have variances that are not too small.

Suppose F,FF,F^{\prime} are ϵ\epsilon-standard. Suppose without loss of generality that the Gaussian of minimal variance is N(μ1,σ12),\mathcal{N}(\mu_{1},\sigma_{1}^{2}), and there is some γ\gamma satisfying ϵ/4>γ>0\epsilon/4>\gamma>0 such that for all i,i, at least one of the following holds:

σ12σi2>γ8|\sigma_{1}^{2}-\sigma_{i}^{\prime 2}|>\gamma^{8}

Then there is some α\alpha such that either

maxx(Fα(F)(x)Fα(F)(x))12γ2π\max_{x}(|\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)|)\geq\frac{1}{2\gamma\sqrt{2\pi}} and the minimum variance in any component of Fα(F)\mathcal{F}_{\alpha}(F) or Fα(F)\mathcal{F}_{\alpha}(F^{\prime}) is at least γ4,\gamma^{4},

maxx(Fα(F)(x)Fα(F)(x))12γ82π\max_{x}(|\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)|)\geq\frac{1}{2\gamma^{8}\sqrt{2\pi}} and the minimum variance in any component of Fα(F)\mathcal{F}_{\alpha}(F) or Fα(F)\mathcal{F}_{\alpha}(F^{\prime}) is at least γ18.\gamma^{18}.

We start by considering the case when there is no Gaussian in FF^{\prime} that matches both the mean and variance to within γ8.\gamma^{8}. Consider applying Fσ12γ18\mathcal{F}_{\sigma_{1}^{2}-\gamma^{18}}. Fσ12γ18(F)(μ1)ϵN(0,γ18,0)=ϵγ92π.\mathcal{F}_{\sigma_{1}^{2}-\gamma^{18}}(F)(\mu_{1})\geq\epsilon\mathcal{N}(0,\gamma^{18},0)=\frac{\epsilon}{\gamma^{9}\sqrt{2\pi}}. Next, by Corollary 60,

Next, consider the case where we have at least one Gaussian component of FF^{\prime} that matches both μ1\mu_{1} and σ12\sigma_{1}^{2} to within γ8,\gamma^{8}, but whose weight differs from w1w_{1} by at least γ.\gamma. By the definition of ϵ\epsilon-standard, there can be at most one such Gaussian component, say the ithi^{th}. If w1>wi,w_{1}>w_{i}^{\prime}, then Fσ12γ4(F)(μ1)Fσ12γ4(F)(μ1)1γ2π+2ϵ2πe,\mathcal{F}_{\sigma_{1}^{2}-\gamma^{4}}(F)(\mu_{1})-\mathcal{F}_{\sigma_{1}^{2}-\gamma^{4}}(F^{\prime})(\mu_{1})\geq\frac{1}{\gamma\sqrt{2\pi}}+\frac{2}{\epsilon\sqrt{2\pi e}}, where the second term is a bound on the contribution of the other Gaussian components, using the fact that F,FF,F^{\prime} are ϵ\epsilon-standard and Corollary 60. Since γ<ϵ/4,\gamma<\epsilon/4, this quantity is at least 12γ2π.\frac{1}{2\gamma\sqrt{2\pi}}.

If w1wi,w_{1}\leq w_{i}^{\prime}, then consider applying Fσ12γ4\mathcal{F}_{\sigma_{1}^{2}-\gamma^{4}} to the pair of distributions. Using the fact that 1x+x21xx,\frac{1}{\sqrt{x+x^{2}}}\geq\frac{1-x}{\sqrt{x}}, we have

Let f(x)Mf(x^{*})\geq M for some x(0,r)x^{*}\in(0,r) and suppose that f(x)0f(x)\geq 0 on (0,r)(0,r) and f(0)=f(r)=0f(0)=f(r)=0. Suppose also that f(x)m|f^{\prime}(x)|\leq m everywhere. Then 0rf(x)dxM2m\int_{0}^{r}f(x)dx\geq\frac{M^{2}}{m}

Consider the continuous function g(x)g(x) that is defined to be for x[0,xM/m][x+M/m,r],x\in[0,x^{*}-M/m]\cup[x^{*}+M/m,r], and has slope mm on the interval (xM/m,x),(x^{*}-M/m,x^{*}), and slope m-m on the interval (x,x+M/m).(x^{*},x^{*}+M/m). Clearly f(x)g(x)f(x)\geq g(x) for x(0,r),x\in(0,r), and thus

The above claim together with Lemma 16 yields the following

For α,γ\alpha,\gamma as defined in Lemma 16,

Let f(x)=Fα(F)(x),Fα(F)(x)f(x)=\mathcal{F}_{\alpha}(F)(x),\mathcal{F}_{\alpha}(F^{\prime})(x), then f(x)Mf(x*)\geq M for M=Ω(1γ)M=\Omega(\frac{1}{\gamma}) and for some xx^{*} contained in an interval II in which f(x)f(x) does not change sign. Similarly, because the minimum variance in any component of Fα(F)\mathcal{F}_{\alpha}(F) or Fα(F)\mathcal{F}_{\alpha}(F^{\prime}) is at least γ18\gamma^{18}, this implies that f(x)=O(1γ18)=mf^{\prime}(x)=O(\frac{1}{\gamma^{18}})=m. So we can apply Claim 17 using m,Mm,M and get that If(x)Ω(γ18)\int_{I}f(x)\geq\Omega(\gamma^{18}) and this implies the corollary. ∎

We use the following proposition from that shows that Fα(D)(x)Fα(D)(x)\mathcal{F}_{\alpha}(D)(x)-\mathcal{F}_{\alpha}(D^{\prime})(x) has few zeros.

[Prop. 7 from .] Given f(x)=i=1maiN(μi,σi2,x),f(x)=\sum_{i=1}^{m}a_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x), the linear combination of mm one-dimensional Gaussian probability density functions, such that σi2σj2\sigma_{i}^{2}\neq\sigma_{j}^{2} for iji\neq j, assuming that not all the aia_{i}’s are zero, the number of solutions to f(x)=0f(x)=0 is at most 2(m1)2(m-1).

Suppose that D(F,F)Ω(γ18)D(F,F^{\prime})\geq\Omega(\gamma^{18}) and that the minimum variance in any component of F,FF,F^{\prime} is at least γ18\gamma^{18} and also let F,FF,F^{\prime} be mixture of nn and kk Gaussians respectively, and the mean of each component of FF and FF^{\prime} is at most 1γ\frac{1}{\gamma}. Then there is some moment i[2(n+k1)]i\in[2(n+k-1)] s.t. EF[xi]EF[xi]Ω(γc)|E_{F}[x^{i}]-E_{F^{\prime}}[x^{i}]|\geq\Omega(\gamma^{c}) for some constant c=c(n,k)c=c(n,k) that depends on n,kn,k.

Using Proposition 19, there are at most 2(n+k1)2(n+k-1) zero crossings of the function f(x)=F(x)F(x)f(x)=F(x)-F^{\prime}(x). Consider the interval I=[2γ,2γ]I=[\frac{-2}{\gamma},\frac{2}{\gamma}]. Using Corollary 62, the contribution to EF[xi]E_{F}[x^{i}] of I\Re-I is at most O(γie18γ2)O(\gamma^{-i}e^{-\frac{1}{8\gamma^{2}}}), and for sufficiently small γ\gamma, this is negligible.

Because D(F,F)Ω(γ18)D(F,F^{\prime})\geq\Omega(\gamma^{18}) and the fact that there are at most 2(n+k1)2(n+k-1) zero crossings of the function f(x)f(x), there must be some interval JJ for which f(x)f(x) does not change signs and Jf(x)dxΩ(γ182n+2k)\int_{J}|f(x)|dx\geq\Omega(\frac{\gamma^{18}}{2n+2k}). If we choose p(x)=±Πzi(xzi)p(x)=\pm\Pi_{z_{i}}(x-z_{i}) for all zeros ziIz_{i}\in I. We can then choose signs so that p(x)p(x) matches f(x)f(x) on JI=JJ\cup I=J^{\prime}. Then Jp(x)f(x)dxJp(x)f(x)dxIp(x)f(x)dxJp(x)f(x)dxO(γi2(n+k1)e18γ2)\int_{J^{\prime}}p(x)|f(x)|dx\geq|\int_{J}p(x)f(x)dx|-\int_{\Re-I}|p(x)f(x)|dx\geq|\int_{J}p(x)f(x)dx|-O(\gamma^{-i-2(n+k-1)}e^{-\frac{1}{8\gamma^{2}}}) because each coefficient in p(x)p(x) is bounded by 1γ2(n+k1)\frac{1}{\gamma^{2(n+k-1)}}. Let JJJ^{\prime\prime}\subset J be the interval [aδ,b+δ]J=[a,b][a-\delta,b+\delta]\subset J=[a,b].

Then Jp(x)f(x)dxδ2(n+k1)Jf(x)dx|\int_{J^{\prime\prime}}p(x)f(x)dx|\geq\delta^{2(n+k-1)}|\int_{J^{\prime\prime}}f(x)dx| and Jf(x)dxJf(x)dxO(δ2γ18)|\int_{J^{\prime\prime}}f(x)dx|\geq|\int_{J}f(x)dx|-O(\frac{\delta^{2}}{\gamma^{18}}) because the derivative of f(x)f(x) is bounded by O(1γ18)O(\frac{1}{\gamma^{18}}), and f(a)=f(b)=0f(a)=f(b)=0. So choosing δ=O(γ19)\delta=O(\gamma^{19}) yields that Jf(x)dxΩ(γ18)|\int_{J^{\prime\prime}}f(x)dx|\geq\Omega(\gamma^{18}) (where the constant hidden in O()O() depends on n,kn,k).

So this implies that Jp(x)f(x)dxΩ(γc(n+k1))|\int_{J^{\prime\prime}}p(x)f(x)dx|\geq\Omega(\gamma^{c(n+k-1)}) for some constant cc (that does not depend on n,kn,k. Using the fact that the coefficients of p(x)p(x) are bounded by 1γ2(n+k1)\frac{1}{\gamma^{2(n+k-1)}}, this implies that there is some i[2(n+k1)]i\in[2(n+k-1)] such that Jxif(x)dxΩ(γc(n+k1))|\int_{J^{\prime\prime}}x^{i}f(x)dx|\geq\Omega(\gamma^{c^{\prime}(n+k-1)}) for some constant cc^{\prime} that does not depend on n,kn,k.

Then using the bound of O(γi2(n+k1)e18γ2)O(\gamma^{-i-2(n+k-1)}e^{-\frac{1}{8\gamma^{2}}}) for EI[p(x)f(x)]E_{\Re-I}[p(x)f(x)], for sufficiently small γ\gamma this implies that EF[xi]EF[xi]Ω(γc(n+k1))|E_{F}[x^{i}]-E_{F^{\prime}}[x^{i}]|\geq\Omega(\gamma^{c^{\prime\prime\prime}(n+k-1)})

Unfortunately, the transformation Fα\mathcal{F}_{\alpha} does not preserve the statistical distance between two distributions. However, we show that it, at least roughly, preserves (up to a polynomial) the disparity in low-order moments of the distributions.

[Lemma 6 from .] Suppose that each constituent Gaussian in FF or FF^{\prime} has variances in the interval [α,1][\alpha,1]. Then

The proof of the above lemma follows easily from the observation that the moments of FF and Fα(F)\mathcal{F}_{\alpha}(F) are related by a simple linear transformation, which can also be viewed as a recurrence relation for Hermite polynomials.

Proof of Theorem 4: The base case for our induction is when n=k=1,n=k=1, and follows from the fact that given parameters μ,μ,σ2,σ2,\mu,\mu^{\prime},\sigma^{2},\sigma^{\prime 2}, such that σ2,σ21,\sigma^{2},\sigma^{\prime 2}\leq 1, and μμ+σ2σ2ϵ,|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|\geq\epsilon, then one of the first two moments of N(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) differs from that of N(μ,σ2)\mathcal{N}(\mu^{\prime},\sigma^{\prime 2}) by at least ϵ/2.\epsilon/2.

For the induction step, assume that for all pairs of ϵ\epsilon-standard mixtures of n,n, and kk Gaussians, respectively, one of the first 2(n+k1)2(n+k-1) moments differ by at least f(ϵ,n+k)f(\epsilon,n+k). Consider ϵ\epsilon-standard mixtures F,F,F,F^{\prime}, mixtures of n,kn^{\prime},k^{\prime} Gaussians, respectively, where either n=n+1,n^{\prime}=n+1, or k=k+1,k^{\prime}=k+1, and either n=nn^{\prime}=n or k=k.k^{\prime}=k. Assume without loss of generality that σ12\sigma_{1}^{2} is the minimal variance in the mixtures, and that it occurs in mixture FF.

We first consider the case that there exists a component of FF^{\prime} whose mean, variance, and weight match μ1,σ12,w1\mu_{1},\sigma_{1}^{2},w_{1} to within an additive xx, where xx is chosen so that each of the first 2(n+k1)2(n+k-1) moments of any pair of Gaussians whose parameters are within xx of each other, differ by at most f(ϵ/2,n+k1)/2;f(\epsilon/2,n+k-1)/2; specifically, letting q(y)q(y) be the polynomial (dependent on n,kn,k) of Lemma 63 bounding the discrepancy in the first 2(n+k1)2(n+k-1) moments of Gaussians whose parameters differ by yy, we set xx so that q(x)=f(ϵ/2,n+k1)/2.q(x)=f(\epsilon/2,n+k-1)/2. Note that for fixed n,kn,k, xx will be polynomial in ϵ.\epsilon. Since Lemma 63 requires that σ12x,,\sigma_{1}^{2}\geq\sqrt{x},, if this is not the case, we convolve the pair of mixtures by N(0,ϵ),\mathcal{N}(0,\epsilon), which by Lemma 21 changes the disparity in low-order moments by a polynomial amount, and proceed with the chosen value of xx and the transformed pair of GMMs.

Now, consider the mixtures H,H,H,H^{\prime}, obtained from F,FF,F^{\prime} by removing the two nearly-matching Gaussian components, and rescaling the weights so that they still sum to 1. The pair H,HH,H^{\prime} will now be mixtures of k1k^{\prime}-1 and n1n^{\prime}-1 components, and will still be (ϵϵ2)(\epsilon-\epsilon^{2})-standard, and the discrepancy in their first 2(n+k1)2(n^{\prime}+k^{\prime}-1) moments is at most f(ϵ/2,n+k1)/2f(\epsilon/2,n+k-1)/2 different from the discrepancy in the pair F,FF,F^{\prime}. By our induction hypothesis, there is a discrepancy in one of the first 2(n+k3)2(n^{\prime}+k^{\prime}-3) moments of at least f(ϵ/2,n+k3)f(\epsilon/2,n+k-3) and thus the original pair F,FF,F^{\prime} will have discrepancy in moments at least half of this, which is still poly(ϵ),poly(\epsilon), for any fixed n,kn,k.

Appendix C The Basic Univariate Algorithm

In this section we formally state the Basic Univariate Algorithm, and prove its correctness. In particular, we will prove the following corollary to the polynomially robust identifiability of GMMs (Theorem 4).

Corollary 5. Suppose we are given access to independent samples from a GMM

with mean 1 and variance in [1/2,2],[1/2,2], where wiϵw_{i}\geq\epsilon, and μiμj+σi2σj2ϵ|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon. The Basic Univariate Algorithm, for any fixed kk, has runtime at most polyk(1ϵ,1δ)poly_{k}(\frac{1}{\epsilon},\frac{1}{\delta}) samples and with probability at least 1δ1-\delta will output mixture parameters w^i,μ^i,σi^2\hat{w}_{i},\hat{\mu}_{i},\hat{\sigma_{i}}^{2}, so that there is a permutation π:[k][k]\pi:[k]\rightarrow[k] and

Our proof of the above Corollary will consist of three parts; first, we will show that for any αϵ,\alpha\leq\epsilon, a there is some polynomial pp such that p(α,ϵ)p(\alpha,\epsilon) samples suffices to guarantee that with probability at least 1δ,1-\delta, the first 4k24k-2 sample moments will all be within α\alpha of the corresponding true moments. Next, we show that it suffices to perform brute-force search over a polynomially-fine mesh of parameters in order to ensure that at least one point (w1^,μ1^,σ1^2,,wk^,μk^,σk^2)(\hat{w_{1}},\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{w_{k}},\hat{\mu_{k}},\hat{\sigma_{k}}^{2}) in our parameter-mesh will have the first 4k24k-2 moments that are each within α\alpha from the true moments. Finally, we will use Theorem 4 to conclude that the recovered parameter set (μ1^,σ1^2,,μk^,σk^2)(\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{\mu_{k}},\hat{\sigma_{k}}^{2}) must be close to the true parameter set, because the first 4k24k-2 moments nearly agree. We now formalize these pieces.

Let x1,x2,,xmx_{1},x_{2},\ldots,x_{m} be independent draws from a univariate GMM FF that is in isotropic position, and each of whose components has weight at least ϵ\epsilon. With probability 1δ\geq 1-\delta,

where the hidden constant on the big-Oh depends on kk.

By Chebyshev’s inequality, with probability at most δ\delta,

We now argue that a polynomially-fine mesh suffices to guarantee that there is some parameter set in our mesh whose first 4k24k-2 moments are all close to the corresponding true moments.

Given a univariate mixture FF of kk Gaussians centered at 0 with variance at most 22, each of whose weights are at least ϵ\epsilon, such that each pair of components has parameter distance at least ϵ,\epsilon, and a target accuracy αϵ,\alpha\leq\epsilon, there exists a γ=poly(α),\gamma=poly(\alpha), and set of parameters (w1^,μ1^,σ1^2,,wk^,μk^,σk^2)(\hat{w_{1}},\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{w_{k}},\hat{\mu_{k}},\hat{\sigma_{k}}^{2}) such that each parameter is a multiple of γ,\gamma, each is bounded by 2/ϵ2/\epsilon, each weight is at least ϵ/2,\epsilon/2, each pair of components has parameter distance at least ϵ/2\epsilon/2, and the first 4k24k-2 moments of FF are within α\alpha of the corresponding moments of F^,\hat{F}, the mixture corresponding to the recovered parameters.

Consider the parameter set obtained by rounding the true parameter set, excluding the weights, to the nearest multiple of γ.\gamma. For each weight wiw_{i}, we set wi^\hat{w_{i}} to be either the multiple of γ\gamma just above, or just below wiw_{i}, ensuring that iwi^=1,\sum_{i}\hat{w_{i}}=1, which can clearly be down. That the rounded mixture has component weights at least ϵ/2,\epsilon/2, pairwise parameter distances at least ϵ/2,\epsilon/2, and values bounded in magnitude by 2/ϵ2/\epsilon is obvious. We now analyze how much the rounding has effected the moments.

From Claim 65, the ithi^{th} moment of each component is just some polynomial in μ,σ2,\mu,\sigma^{2}, which is a polynomial of degree at most ii, with coefficients bounded in magnitude by (i+2)!(i+2)! Thus changing the mean or variance by at most γ\gamma will change the ithi^{th} moment by at most

Thus if we used the true mixing weights, the error in each moment of the entire mixture would be at most kk times this. To conclude, note that for each mixing weight wjwj^γ,|w_{j}-\hat{w_{j}}|\leq\gamma, and since, as noted in the proof of the previous lemma, each moment is at most O(ϵi)O(\epsilon^{-i}) (where the hidden constant depends on ii), thus the rounding of the weight will contribute at most an extra O(γϵi).O(\gamma\epsilon^{-i}). Adding these bounds together, we get that each of the first 4k24k-2 moments of F^\hat{F} can be off from the true ones by at most k(O(γϵ4k+2)+2(4k2)2(4k)!ϵ4k+3γ=O(γϵ4k+2),k(O(\gamma\epsilon^{-4k+2})+2(4k-2)^{2}(4k)!\epsilon^{-4k+3}\gamma=O(\gamma\epsilon^{-4k+2}), where the hidden constant depends on kk. Thus letting γ=ckα4k1,\gamma=c_{k}\alpha^{4k-1}, where the constant ckc_{k} depends on kk suffices to ensure that all moments are within α\alpha of their true values. ∎

We now piece together the above two lemmas to prove Corollary 5.

Proof of Corollary 5: Given a desired moment accuracy αϵ,\alpha\leq\epsilon, by applying a union bound to Lemma 22, O(αϵ8kδ2)O(\alpha\epsilon^{-8k}\delta^{-2}) samples suffices to guarantee that with probability at least 1δ,1-\delta, the first 4k24k-2 sample moments are within α\alpha from the true moments. Thus with at least probability 1δ,1-\delta, by Lemma 23, our polynomial mesh of parameters suffices to recover a set of parameters (w1^,μ1^,σ1^2,,wk^,μk^,σk^2)(\hat{w_{1}},\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{w_{k}},\hat{\mu_{k}},\hat{\sigma_{k}}^{2}) whose weights and pairwise parameter-distances are at least ϵ/2,\epsilon/2, and whose first 4k24k-2 sample moments will all be within 2α2\alpha from the sample moments, and hence within 3α3\alpha from the true moments.

To conclude, note that the pair of mixtures F,F^,F,\hat{F}, after rescaling by at most (ϵ/2)1/2(\epsilon/2)^{1/2} so as to ensure each component in the mixture has variance at most 1 (which scales the kthk^{th} moments by (ϵ/2)k/2(\epsilon/2)^{k/2}), satisfies the first three conditions of being ϵ/2\epsilon/2-standard, and thus, if the first 4k24k-2 moments (after rescaling) agree to within (ϵ/2)2k1(ϵ/2)O(k),(\epsilon/2)^{2k-1}\cdot(\epsilon/2)^{O(k)}, Theorem 4 guarantees that the recovered parameters must be accurate to within ϵ\epsilon (where the first O(k)O(k) in the exponent is from Theorem 4). Thus setting 3α(ϵ/2)O(k)=polyk(ϵ)3\alpha\leq(\epsilon/2)^{O(k)}=poly_{k}(\epsilon) will ensure that with the desired high probability, the recovered parameters are ϵ/2\epsilon/2 accurate. \Box

Appendix D The General Univariate Algorithm

Suppose that FF, GG and HH are GMM of k1k2k3k_{1}\leq k_{2}\leq k_{3} Gaussians respectively. If (G,π1)Dϵ(F)(G,\pi_{1})\in\mathcal{D}_{\epsilon}(F) and (H,π2)Dϵ(G)(H,\pi_{2})\in\mathcal{D}_{\epsilon}(G), then (H,π2π1)DO(k1)ϵ(F)(H,\pi_{2}\circ\pi_{1})\in\mathcal{D}_{O(k_{1})\epsilon}(F).

Note that π1:[k1][k2]\pi_{1}:[k_{1}]\rightarrow[k_{2}] and π2:[k2][k3]\pi_{2}:[k_{2}]\rightarrow[k_{3}]. Consider π3:[k1][k3]=π2π1\pi_{3}:[k_{1}]\rightarrow[k_{3}]=\pi_{2}\circ\pi_{1}. This function π3\pi_{3} is onto, because both π1\pi_{1} and π2\pi_{2} are both onto.

Also consider any jπ31(h)j\in\pi_{3}^{-1}(h) (for some h[k3]h\in[k_{3}]). In fact, let iπ21(h)i\in\pi_{2}^{-1}(h) and jπ11(i)j\in\pi_{1}^{-1}(i). Then because parameter distance is a distance (i.e. satisfies triangle-inequality):

because (G,π1)Dϵ(F)(G,\pi_{1})\in\mathcal{D}_{\epsilon}(F) and (H,π2)Dϵ(G)(H,\pi_{2})\in\mathcal{D}_{\epsilon}(G) and π2(i)=h\pi_{2}(i)=h and π1(j)=i\pi_{1}(j)=i. We write wjFw^{F}_{j} for the weight of the jthj^{th} component of FF to simplify notation, and similarly for G,HG,H. Then using this notation:

Convolving two Gaussians F1,F2F_{1},F_{2} by the same Gaussian N(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) preserves the parameter distance between F1F_{1} and F2F_{2}. Also, given an estimate F^i\hat{F}_{i} which is within DD in parameter distance from NFi\mathcal{N}\circ F_{i}, by subtracting μ\mu from the mean of F^i\hat{F}_{i} and σ2\sigma^{2} from the variance of F^i\hat{F}_{i}, we obtain an estimate for FiF_{i} which is within DD in parameter distance from FiF_{i}.

Suppose (F^,π)Dϵ(F)(\hat{F},\pi)\in\mathcal{D}_{\epsilon}(F) and that each Gaussian FiF_{i} in the mixture FF has variance at least 12\frac{1}{2}. Then D(F,F^)O(k)ϵD(F,\hat{F})\leq O(k^{\prime})\epsilon, where kk^{\prime} is the number of components in the GMM F^\hat{F}.

Let kk be the number of components in FF. Then

We can then apply Fact 25 and the assumption that each Gaussian has variance at least 12\frac{1}{2} (and if ϵ<<1\epsilon<<1) implies that F^iFj1=O(Dp(F^i,Fj))=O(ϵ)\|\hat{F}_{i}-F_{j}\|_{1}=O(D_{p}(\hat{F}_{i},F_{j}))=O(\epsilon) for all jπ1(i)j\in\pi^{-1}(i). And so D(F,F^)O(k)ϵD(F,\hat{F})\leq O(k^{\prime})\epsilon

D.2 Windows

Here we define the notion of a Window. Suppose we run the Basic Univariate Algorithm with target precision of ϵ\epsilon (and an error parameter δ\delta). Then Basic Univariate Algorithm uses at most some polynomial in 1ϵ\frac{1}{\epsilon} and 1δ\frac{1}{\delta} number of samples.

Here that we assume the Basic Univariate Algorithm run with precision ϵ\epsilon and an error parameter δ\delta requires some polynomial in 1ϵ\frac{1}{\epsilon} and 1δ\frac{1}{\delta} samples. We in fact assume that the number of samples is at most CB(ϵδ)cBC_{B}(\epsilon\delta)^{-c_{B}} for some universal constants CB,cb>0C_{B},c_{b}>0. Then we denote Q(ϵ,δ)Q(\epsilon,\delta) as 1CB(ϵδ)cB\frac{1}{C_{B}}(\epsilon\delta)^{c_{B}}.

Let Q(ϵ,δ)Q(\epsilon,\delta) be the inverse of the number of samples needed by the Basic Univariate Algorithm when given a target precision ϵ\epsilon (and an error parameter δ\delta).

We would like to define a Window to be the range of values from Q(ϵ,δ)Q(\epsilon,\delta) to ϵ\epsilon so that if all pairs of Gaussians either have parameter distance at least ϵ\epsilon, or statistical distance at most Q(ϵ,δ)Q(\epsilon,\delta) then the we can just run the Basic Univariate Algorithm and assume that the algorithm behaves as if each pair of Gaussians that is extremely close is replaced with a single (appropriately) chosen Gaussian. However, we will need some slack, and so we make the Window wider so that we can take union bounds over many different runs of the algorithm, and compose different subdivisions.

Let R(ϵ,δ)=Q(ϵ,δ)C1k4R(\epsilon,\delta)=\frac{Q(\epsilon,\delta)}{C_{1}k^{4}} and let S(ϵ,δ)=R(ϵ,δ)C2k4S(\epsilon,\delta)=\frac{R(\epsilon,\delta)}{C_{2}k^{4}} for some sufficiently large constants C1,C2C_{1},C_{2}.

Given a target precision ϵ\epsilon, we define the Window W(ϵ)W(\epsilon) at ϵ\epsilon as the range of values [R(ϵ,δ),ϵ][R(\epsilon,\delta),\epsilon].

Given a mixture of Gaussians FF, we will say that a Window W(ϵ)W(\epsilon) is good if for all iji\neq j, Dp(Fi,Fj)W(ϵ)D_{p}(F_{i},F_{j})\notin W(\epsilon).

We give a number of claims that will be useful in the case in which we have a good Window W(ϵ)W(\epsilon). So suppose that the Window W(ϵ)W(\epsilon) is good

The set of Gaussians at parameter distance at most R(ϵ,δ)R(\epsilon,\delta) is an equivalence class.

Consider Gaussians F1,F2F_{1},F_{2} and F3F_{3} such that F1F_{1} and F2F_{2} are at parameter distance at most R(ϵ,δ)R(\epsilon,\delta) and F2F_{2} and F3F_{3} are also at parameter distance at most R(ϵ,δ)R(\epsilon,\delta). Dp(F1,F3)D(F1,F2)+D(F2,F3)2R(ϵ,δ)<<ϵD_{p}(F_{1},F_{3})\leq D(F_{1},F_{2})+D(F_{2},F_{3})\leq 2R(\epsilon,\delta)<<\epsilon and since there is no pair of Gaussians with parameter distance inside the Window W(ϵ)W(\epsilon), this implies that Dp(F1,F3)R(ϵ,δ)D_{p}(F_{1},F_{3})\leq R(\epsilon,\delta). ∎

We will let E={E1,E2,...Ek}\mathcal{E}=\{\mathcal{E}_{1},\mathcal{E}_{2},...\mathcal{E}_{k^{\prime}}\} be the equivalence class of Gaussians at parameter distance at most R(ϵ,δ)R(\epsilon,\delta). We let πE:[k][k]\pi_{\mathcal{E}}:[k]\rightarrow[k^{\prime}] be the mapping function that maps a Gaussian FjF_{j} to the corresponding equivalence class Ei\mathcal{E}_{i} (i.e. πE(j)=i\pi_{\mathcal{E}}(j)=i). From this equivalence class and this mapping function, we can define a natural R(ϵ,δ)R(\epsilon,\delta)-correct subdivision of FF.

We define the natural R(ϵ,δ)R(\epsilon,\delta)-correct subdivision F^E\hat{F}^{\mathcal{E}} as a mixture of kk^{\prime} Gaussians in which F^jE\hat{F}^{\mathcal{E}}_{j} is an arbitrarily chosen representative from Ei\mathcal{E}_{i} (πE(j)=i\pi_{\mathcal{E}}(j)=i), and w^iE=jπE1(i)wj\hat{w}^{\mathcal{E}}_{i}=\sum_{j\in\pi_{\mathcal{E}}^{-1}(i)}w_{j}.

Clearly, (F^E,πE)DR(ϵ,δ)(F)(\hat{F}^{\mathcal{E}},\pi_{\mathcal{E}})\in\mathcal{D}_{R(\epsilon,\delta)}(F), and F^E\hat{F}^{\mathcal{E}} actually is an R(ϵ,δ)R(\epsilon,\delta)-correct subdivision.

Let (F^,π)DR(ϵ,δ)(F)(\hat{F},\pi)\in\mathcal{D}_{R(\epsilon,\delta)}(F), then F^EDO(k)R(ϵ,δ)(F^)\hat{F}^{\mathcal{E}}\in\mathcal{D}_{O(k)R(\epsilon,\delta)}(\hat{F}).

Let k,kk^{\prime},k^{\prime\prime} be the number of Gaussians in the GMMs F^E\hat{F}^{\mathcal{E}} and F^\hat{F} respectively. Consider any two Gaussians Fi,FjF_{i},F_{j} that are not mapped to the same equivalence class - i.e. πE(i)πE(j)\pi_{\mathcal{E}}(i)\neq\pi_{\mathcal{E}}(j). Since the Window W(ϵ)W(\epsilon) is good, this implies that Dp(Fi,Fj)ϵD_{p}(F_{i},F_{j})\geq\epsilon. So in order for F^\hat{F} to be an R(ϵ,δ)R(\epsilon,\delta)-correct subdivision, it must be the case that π(i)π(j)\pi(i)\neq\pi(j).

This means that π\pi as a partition is a refinement of the partition πE\pi_{\mathcal{E}}. Formally, there must be some function πint:[k][k]\pi_{int}:[k^{\prime\prime}]\rightarrow[k^{\prime}] such that πE=πintπ\pi_{\mathcal{E}}=\pi_{int}\circ\pi. Then it follows that (F^E,π)DO(k)R(ϵ,δ)(F^)(\hat{F}^{\mathcal{E}},\pi_{\int})\in\mathcal{D}_{O(k)R(\epsilon,\delta)}(\hat{F}): Consider any i[k]i\in[k^{\prime}].

And hπE1(i)wh=jπint1(i)hπ1(j)wh\sum_{h\in\pi_{\mathcal{E}}^{-1}(i)}w_{h}=\sum_{j\in\pi_{int}^{-1}(i)}\sum_{h\in\pi^{-1}(j)}w_{h} so this implies that

And similarly for any jπint1(i)j\in\pi_{int}^{-1}(i) let h=π1(j)h=\pi^{-1}(j),

where the last line follows because πE(h)=i\pi_{\mathcal{E}}(h)=i. ∎

Suppose we are given a mixture of Gaussians F=i=1kwiN(μi,σi2,x)F=\sum_{i=1}^{k}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x) that is in near isotropic position, where wiϵw_{i}\geq\epsilon and the Window W(ϵ)W(\epsilon) is good and suppose further that σi212\sigma_{i}^{2}\geq\frac{1}{2}. Let (F^,π)DR(ϵ,δ)(F)(\hat{F},\pi)\in\mathcal{D}_{R(\epsilon,\delta)}(F). Then with probability at least 12δ1-2\delta, the output of the Basic Univariate Algorithm is a GMM NN such that NDO(k)ϵ(F^)N\in\mathcal{D}_{O(k)\epsilon}(\hat{F}).

Let E={E1,E2,...Ek}\mathcal{E}=\{\mathcal{E}_{1},\mathcal{E}_{2},...\mathcal{E}_{k^{\prime}}\} be the equivalence class of Gaussians at parameter distance at most R(ϵ,δ)R(\epsilon,\delta) (see Claim 29), and let F^E\hat{F}^{\mathcal{E}} and πE\pi_{\mathcal{E}} be the natural R(ϵ,δ)R(\epsilon,\delta)-correct subdivision for FF and corresponding mapping function.

Let kk^{\prime\prime} be the number of components in F^\hat{F}. Then we can apply Claim 30 and this implies that F^E\hat{F}^{\mathcal{E}} is an O(k)R(ϵ,δ)O(k)R(\epsilon,\delta)-correct subdivision for F^\hat{F}.

Using Lemma 28, this implies that D(F,F^E)+D(F^E,F^)O(k2)R(ϵ,δ)D(F,\hat{F}^{\mathcal{E}})+D(\hat{F}^{\mathcal{E}},\hat{F})\leq O(k^{2})R(\epsilon,\delta). So this implies that given 1Q(ϵ,δ)\frac{1}{Q(\epsilon,\delta)} samples taken from FF when running the Basic Univariate Algorithm, with probability at least

we can assume that all samples came from F^E\hat{F}^{\mathcal{E}} (because there is an approximate between FF and F^E\hat{F}^{\mathcal{E}} that fails with probability at D(F,F^E)D(F,\hat{F}^{\mathcal{E}}) and with probability at least 1δ1-\delta this coupling will never fail, given the number of samples obtained from FF).

When the Basic Univariate Algorithm is run on F^E\hat{F}^{\mathcal{E}}, the constraints of the Basic Univariate Algorithm are met because for all iji\neq j, Dp(F^iE,F^jE)ϵD_{p}(\hat{F}^{\mathcal{E}}_{i},\hat{F}^{\mathcal{E}}_{j})\geq\epsilon because the Window W(ϵ)W(\epsilon) is good. So with probability at least 1δ1-\delta, the Basic Univariate Algorithm (when run on F^E\hat{F}^{\mathcal{E}}) will return an ϵ\epsilon-correct subdivision NN of F^E\hat{F}^{\mathcal{E}} (in fact, a stronger guarantee is true because the Basic Univariate Algorithm will actually return an estimate NN that has kk^{\prime} components, which matches the number of components in F^E\hat{F}^{\mathcal{E}}). Then we can apply Lemma 24, and NN must then be an O(k)ϵO(k)\epsilon-correct subdivision for F^\hat{F}. ∎

D.3 Reaching a Consensus

We call a sequence of GMMs, F1,F2,...FrF^{1},F^{2},...F^{r} an ϵ\epsilon-correct chain if for all i[r1]i\in[r-1], Fi+1Dϵ(Fi)F^{i+1}\in\mathcal{D}_{\epsilon}(F^{i})

Suppose we are given a mixture of Gaussians F=i=1kwiN(μi,σi2,x)F=\sum_{i=1}^{k}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x) that is in isotropic position, where wiϵw_{i}\geq\epsilon. Then the General Univariate Algorithm will return a GMM of kkk^{\prime}\leq k Gaussians F^\hat{F} such that F^\hat{F} is an ϵ\epsilon-correct subdivision of FF.

Given ϵ\epsilon, we first define a sequence of parameters where

Suppose first that each Gaussian in FF has variance at least 12\frac{1}{2}. Then in this case, the idea is to run the Basic Univariate Algorithm for a number of different precisions, each of which corresponds to a particular Window. We will choose parameters so that these Windows are disjoint, and then because a Window is bad iff there is some pair of Gaussians with parameter distance contained inside the Window, at most (k2)<k22{k\choose 2}<\frac{k^{2}}{2} Windows can be bad. So this will guarantee that a strict majority of the computations are correct.

To formalize this, given the sequence of parameters ϵ1,ϵ2,...ϵk2+1\epsilon_{1},\epsilon_{2},...\epsilon_{k^{2}+1} we define a sequence of Windows W=W(ϵ1),W(ϵ2),...W(ϵk2+1)\mathcal{W}=W(\epsilon_{1}),W(\epsilon_{2}),...W(\epsilon_{k^{2}+1}).

The sequence of Windows W\mathcal{W} is disjoint

If we consider the Window W(ϵi)W(\epsilon_{i}), the largest value contained in any Window W(ϵj)W(\epsilon_{j}) for j>ij>i is the largest value contained in the Window W(ϵi+1)W(\epsilon_{i+1}) which is ϵi+1\epsilon_{i+1}. Yet ϵi+1=S(ϵi,δ)\epsilon_{i+1}=S(\epsilon_{i},\delta) and the lower bound for the Window W(ϵi)W(\epsilon_{i}) is R(ϵi,δ)R(\epsilon_{i},\delta) and R(ϵi,δ)>>S(ϵi,δ)R(\epsilon_{i},\delta)>>S(\epsilon_{i},\delta). Similarly, the smallest value in W(ϵj)W(\epsilon_{j}) for jij\leq i is the smallest value in W(ϵi)W(\epsilon_{i}). So this implies that for any ii, the set of Windows W(ϵ1),W(ϵ2),...W(ϵi)W(\epsilon_{1}),W(\epsilon_{2}),...W(\epsilon_{i}) are separable from the set of Windows W(ϵi+1),W(ϵi+2),...W(ϵk2+1)W(\epsilon_{i+1}),W(\epsilon_{i+2}),...W(\epsilon_{k^{2}+1}) and this implies the claim. ∎

Suppose running the Basic Univariate Algorithm on Window W(ϵi)W(\epsilon_{i}) returns an estimate F^i\hat{F}^{i}.

Given any subset of indices T[k2+1]T\subset[k^{2}+1], let i1>i2>...ij>...iTi_{1}>i_{2}>...i_{j}>...i_{|T|} be the indices in TT arranged in decreasing order. We can generate a sequence of estimates F^T1,F^T2,...F^TT\hat{F}_{T}^{1},\hat{F}_{T}^{2},...\hat{F}_{T}^{|T|} in which F^Tj=F^ij\hat{F}_{T}^{j}=\hat{F}^{i_{j}}. Also let prec(F^Tj)=ϵijprec(\hat{F}_{T}^{j})=\epsilon_{i_{j}}, which corresponds to the precision of the Window that returned the estimate F^Tj=F^ij\hat{F}_{T}^{j}=\hat{F}^{i_{j}}. We call this sequence the TT-sequence of estimates.

Note that this sequence of estimates F^T1,...F^TT\hat{F}_{T}^{1},...\hat{F}_{T}^{|T|} is arranged in order of coarsening precision - i.e. prec(F^Ti)<<prec(F^Ti+1)prec(\hat{F}_{T}^{i})<<prec(\hat{F}_{T}^{i+1}).

S(prec(F^Tj),δ)prec(F^Tj1)S(prec(\hat{F}_{T}^{j}),\delta)\geq prec(\hat{F}_{T}^{j-1})

Let i1>i2>...ij>...iTi_{1}>i_{2}>...i_{j}>...i_{|T|} be the indices in TT arranged in decreasing order. So ij1>iji_{j-1}>i_{j}. Then S(prec(F^Tj),δ)=S(ϵij,δ)=ϵij+1S(prec(\hat{F}_{T}^{j}),\delta)=S(\epsilon_{i_{j}},\delta)=\epsilon_{i_{j}+1}. And because ij1ij+1i_{j-1}\geq i_{j}+1, it implies that ϵij1ϵij+1\epsilon_{i_{j-1}}\leq\epsilon_{i_{j}+1}, and this yields the claim. ∎

Let G[k2+1]G\subset[k^{2}+1] be the set of indices of Windows that are good - i.e. W(ϵi)W(\epsilon_{i}) is good iff iGi\in G. Then let F^G1,F^G2,...F^GG\hat{F}_{G}^{1},\hat{F}_{G}^{2},...\hat{F}_{G}^{|G|} be the GG-sequence of estimates. Because the sequence of Windows W\mathcal{W} is disjoint, and each pair of Gaussians (and the corresponding parameter distance) can only make a single Window bad, the set GG is a strict majority - i.e. G>[k2+1]G|G|>|[k^{2}+1]-G|.

The GG-sequence of estimates is an O(k)ϵ1O(k)\epsilon_{1}-correct chain, and F^G1\hat{F}_{G}^{1} is an O(k)ϵ1O(k)\epsilon_{1}-correct subdivision for FF.

Let ϵ1<<ϵ2<<...ϵG\epsilon^{\prime}_{1}<<\epsilon^{\prime}_{2}<<...\epsilon^{\prime}_{|G|} be the sequence of precisions given by prec(F^G1),prec(F^G2),...prec(F^G)prec(\hat{F}_{G}^{1}),prec(\hat{F}_{G}^{2}),...prec(\hat{F}^{|G|}).

Using Lemma 31, F^G1DO(k)ϵ1(F)\hat{F}_{G}^{1}\in\mathcal{D}_{O(k)\epsilon^{\prime}_{1}}(F). Because O(k)ϵ1O(k)S(ϵ2,δ)R(ϵ2,δ)O(k)\epsilon^{\prime}_{1}\leq O(k)S(\epsilon^{\prime}_{2},\delta)\leq R(\epsilon^{\prime}_{2},\delta) (using the above claim) this implies that F^G1DR(ϵ2,δ)(F)\hat{F}_{G}^{1}\in\mathcal{D}_{R(\epsilon^{\prime}_{2},\delta)}(F) and so we can apply Lemma 31 again and F^2DO(k)ϵ2(F^1)\hat{F}^{2}\in\mathcal{D}_{O(k)\epsilon^{\prime}_{2}}(\hat{F}^{1}). Continuing this argument, for all ii, F^i+1DO(k)ϵi(F^i)\hat{F}^{i+1}\in\mathcal{D}_{O(k)\epsilon^{\prime}_{i}}(\hat{F}^{i}). Since O(k)ϵiO(k)ϵGO(k)ϵ1O(k)\epsilon^{\prime}_{i}\leq O(k)\epsilon^{\prime}_{|G|}\leq O(k)\epsilon_{1}, the sequence F^1,F^2,...F^G\hat{F}^{1},\hat{F}^{2},...\hat{F}^{|G|} is an O(k)ϵ1O(k)\epsilon_{1}-correct chain. ∎

Given a subset G[k2+1]G^{\prime}\subset[k^{2}+1], we can check if the GG^{\prime}-sequence of estimates is an O(k)ϵ1O(k)\epsilon_{1}-correct chain because this property is only a function of the estimates. Then if we consider all sets in 2[k2+1]2^{[k^{2}+1]}, we will find some set G[k2+1]G^{\prime}\subset[k^{2}+1] that is a strict majority (i.e. G>[k2+1]G|G^{\prime}|>|[k^{2}+1]-G^{\prime}|) and the GG^{\prime}-sequence of estimates is an O(k)ϵ1O(k)\epsilon_{1}-correct chain. Because GG^{\prime} is a strict majority, and a strict majority GG of the Windows are good, GGG\cap G^{\prime}\neq\emptyset. Suppose that gGGg\in G\cap G^{\prime}, and let jj the value such that gg is the jthj^{th} largest index in GG^{\prime}.

Given the GG^{\prime}-sequence of estimates, we can take the sequence S=F,F^Gj,F^Gj+1,...F^GG\mathcal{S}=F,\hat{F}_{G^{\prime}}^{j},\hat{F}_{G^{\prime}}^{j+1},...\hat{F}_{G^{\prime}}^{|G^{\prime}|}. Since the index gg corresponds to a good Window (W(ϵg)W(\epsilon_{g}) is good), the computation F^Gj\hat{F}_{G^{\prime}}^{j} (which corresponds to the estimate F^g\hat{F}^{g}) is at least an O(k)ϵ1O(k)\epsilon_{1}-correct subdivision of FF. So the sequence S\mathcal{S} is an O(k)ϵ1O(k)\epsilon_{1}-correct chain. So we can apply Lemma 24, and this implies that F^GG\hat{F}_{G^{\prime}}^{|G^{\prime}|} (i.e. the last estimate in the sequence S\mathcal{S}) is an (Ck)k2+1ϵ1(Ck)^{k^{2}+1}\epsilon_{1}-correct subdivision for FF. Since (Ck)k2+1ϵ1ϵ(Ck)^{k^{2}+1}\epsilon_{1}\leq\epsilon, this implies that F^GG\hat{F}_{G^{\prime}}^{|G^{\prime}|} is an ϵ\epsilon-correct subdivision for FF.

However, we have assumed thus far in the proof of this theorem that each Gaussian has variance at least 12\frac{1}{2}. So given samples from FF, we can add random noise to each sample. We add Gaussian noise of variance 12\frac{1}{2} and mean , and this corresponds to convolving the original distribution FF by N(0,12)\mathcal{N}(0,\frac{1}{2}) to obtain a new distribution FF^{\prime}. Then this distribution FF^{\prime} has each Gaussian with variance at least 12\frac{1}{2} and is also in nearly isotropic position - because the original mixture FF was in isotropic position, and convolving by N(0,12)\mathcal{N}(0,\frac{1}{2}) just adds 12\frac{1}{2} to the variance of the mixture (var(F)=12+var(F)var(F^{\prime})=\frac{1}{2}+var(F)).

Using the above argument, we can recover an estimate F^GG\hat{F}_{G^{\prime}}^{|G^{\prime}|} that is an ϵ\epsilon-correct subdivision for FF^{\prime}. We can subtract 12\frac{1}{2} from the variance of each component in F^GG\hat{F}_{G^{\prime}}^{|G^{\prime}|}, and then using Claim 27 this resulting mixture NN will be an ϵ\epsilon-correct subdivision for FF.

Appendix E Exponential Dependence on k𝑘k is Inevitable

We restate the main proposition that we prove in this section:

Proposition 15. There exists a pair D1,D2D_{1},D_{2} of 1/(4k2+2)1/(4k^{2}+2)-standard distributions that are each mixtures of k2+1k^{2}+1 Gaussians such that

The following lemma will be helpful in the proof of correctness of our construction.

Let Fk(x)=cki=1πe(i/k)2N(i/k,1/2,x),F_{k}(x)=c_{k}\sum_{i=-\infty}^{\infty}\frac{1}{\sqrt{\pi}}e^{-(i/k)^{2}}\mathcal{N}(i/k,1/2,x), where ckc_{k} is a constant chosen so as to make FkF_{k} a distribution.

The probability density function Fk(x)F_{k}(x) can be rewritten as Fk(x)=(ckC1/k(x)N(0,1/2,x))N(0,1/2,x),F_{k}(x)=\left(c_{k}C_{1/k}(x)\mathcal{N}(0,1/2,x)\right)\circ\mathcal{N}(0,1/2,x), where C1/k(x)C_{1/k}(x) denotes the infinite comb function, consisting of delta functions spaced a distance 1/k1/k apart, and \circ denotes convolution. Considering the Fourier transform, we see that

It is now easy to see that why the lemma should be true, since the transformed comb has delta functions spaced at a distance kk apart, and we’re convolving by a Gaussian of variance 2 (essentially yielding nonoverlapping Gaussians with centers at multiples of kk) , and then multiplying by a Gaussian of variance 2. The final multiplication will nearly kill off all the Gaussians except the one centered at 0, yielding a Gaussian with variance 11 centered at the origin, whose inverse transform will yield a Gaussian of variance 1, as claimed.

To make the details rigorous, observe that the total Fourier mass of F(Fk)\mathcal{F}(F_{k}) that ends up within the interval [k/2,k/2][-k/2,k/2] contributed by the delta functions aside from the one at the origin, even before the final multiplication by N(0,2),\mathcal{N}(0,2), is bounded by the following:

Additionally, this L1L_{1} fourier mass is an upper bound on the L2L_{2} Fourier mass. The total L1L_{1} Fourier mass (which bounds the L2L_{2} mass) outside the interval [k/2,k/2][-k/2,k/2] contributed by the delta functions aside from the one at the origin is bounded by

From Plancherel’s Theorem: FkF_{k}, the inverse transform of F(F)\mathcal{F}(F), is a distribution, whose L2L_{2} distance from a single Gaussian (possibly scaled) of variance 1 is at most 8ek2/8.8e^{-k^{2}/8}. To translate this L2L_{2} distance to L1L_{1} distance, note that the contributions to the L1L_{1} norm from outside the interval [k,k][-k,k] is bounded by 4kN(0,1,x)dx41k2πek2/2.4\int_{k}^{\infty}\mathcal{N}(0,1,x)dx\leq 4\frac{1}{k\sqrt{2\pi}}e^{-k^{2}/2}. Since the magnitude of the derivative of Fkckk122πN(0,1)F_{k}-c_{k}k\frac{1}{2\sqrt{2\pi}}\mathcal{N}(0,1), is at most 2 and the value of Fk(x)ckk122πN(0,1,x)F_{k}(x)-c_{k}k\frac{1}{2\sqrt{2\pi}}\mathcal{N}(0,1,x) is close to at the endpoints of the interval [k,k][-k,k], we have

which, combined with the above bounds on the L2L_{2} distance, yields maxx[k,k](Fk(x)ckk122πN(0,1,x))(72ek2/8)1/3.\max_{x\in[-k,k]}(|F_{k}(x)-c_{k}k\frac{1}{2\sqrt{2\pi}}\mathcal{N}(0,1,x)|)\leq(72e^{-k^{2}/8})^{1/3}. Thus we have

The lemma follows from the additional observation that

where the minimization is taken to be over all functions that are probability density functions. ∎

Proof of Proposition 15: We will construct a pair of 1/(4k2+2)1/(4k^{2}+2)-standard distributions, D1,D2D_{1},D_{2}, that are mixtures of k2+1k^{2}+1 Gaussians, whose statistical distance is inverse exponential in kk. Let

where ckc_{k}^{\prime} is a constant chosen so as to make cki=k2k2N(0,1/2,i/k)N(i/k,1/2)c_{k}^{\prime}\sum_{i=-k^{2}}^{k^{2}}\mathcal{N}(0,1/2,i/k)\mathcal{N}(i/k,1/2) a distribution. Clearly the pair of distributions is 1/(4k2+2)1/(4k^{2}+2)-standard, since all weights are at least 1/(4k2+2),1/(4k^{2}+2), and the Gaussian component of D1D_{1} centered at can not be paired with any component of D2D_{2} without having a discrepancy in parameters of at least 1/2k.1/2k.

We now argue that D1,D2D_{1},D_{2} are statistically close. Let D2=cki=k2k2N(0,1/2,i/k)N(i/k,1/2).D_{2}^{\prime}=c_{k}^{\prime}\sum_{i=-k^{2}}^{k^{2}}\mathcal{N}(0,1/2,i/k)\mathcal{N}(i/k,1/2). Note that kFk(x)dxkN(0,1/2,x)2maxy(N(0,1/2,y))dx22kπek22ek2,\int_{k}^{\infty}F_{k}(x)dx\leq\int_{k}^{\infty}\mathcal{N}(0,1/2,x)2\max_{y}(\mathcal{N}(0,1/2,y))dx\leq\frac{2\sqrt{2}}{k\sqrt{\pi}}e^{-k^{2}}\leq 2e^{-k^{2}}, and thus D2Fk18ek2,||D_{2}^{\prime}-F_{k}||_{1}\leq 8e^{-k^{2}}, and our claim follows from Lemma 36. \Box

Appendix F Partition Pursuit

We first need to ensure that if we consider two directions r,rx,yr,r_{x,y} that are ϵ2\epsilon_{2}-close, the parameters of a component in Pu[F]P_{u}[F] cannot change too much as we vary uu from rr to rx,yr_{x,y}.

Given a mixture of kk nn-Dimensional Gaussians F=iwiFiF=\sum_{i}w_{i}F_{i} that is in isotropic position and is ϵ\epsilon-statistically learnable, for all ii, μi,Σi21ϵ\|\mu_{i}\|,\|\Sigma_{i}\|_{2}\leq\frac{1}{\epsilon}.

For all i,ji,j s.t. μiμj1ϵ\|\mu_{i}-\mu_{j}\|\leq\frac{1}{\epsilon} because if we project onto the direction μ1μ2μ1μ2\frac{\mu_{1}-\mu_{2}}{\|\mu_{1}-\mu_{2}\|} the variance of the mixture FF is 11 and is also at least wiwjμiμj2w_{i}w_{j}\|\mu_{i}-\mu_{j}\|^{2}, and this implies that μiμj1ϵ\|\mu_{i}-\mu_{j}\|\leq\frac{1}{\epsilon}. Yet the convex hull of μi\mu_{i} for all ii contains the origin and so

Similarly, for any i[k]i\in[k], if we choose uu corresponding to the direction of the maximum eigenvector of Σi\Sigma_{i},

and so Σi21ϵ\|\Sigma_{i}\|_{2}\leq\frac{1}{\epsilon}. ∎

Suppose FF is an nn-dimensional GMM that is ϵ\epsilon-statistically learnable.

Let F^u,F^v\hat{F}^{u},\hat{F}^{v} be univariate mixtures of Gaussians. Then we call components F^au,F^bv\hat{F}^{u}_{a},\hat{F}^{v}_{b} paired estimates if there is some πu,πv\pi_{u},\pi_{v} and i[k]i\in[k] such that πu(i)=a,πv(i)=b\pi_{u}(i)=a,\pi_{v}(i)=b and (F^u,πu)Dϵ1(Pu[F])(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F]) and (F^v,πv)Dϵ1(Pv[F])(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F]).

Let (F^u,πu)Dϵ1(Pu[F])(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F]) and (F^v,πv)Dϵ1(Pv[F])(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F]), then for every component F^au\hat{F}^{u}_{a} in F^u\hat{F}^{u}, there is some component F^bv\hat{F}^{v}_{b} such that the components F^au,F^bv\hat{F}^{u}_{a},\hat{F}^{v}_{b} are paired estimates.

This follows because πu\pi_{u} is onto. ∎

Suppose u,vu,v are ϵ2\epsilon_{2}-close (i.e. uvϵ2\|u-v\|\leq\epsilon_{2}), and let F^au,\hat{F}^{u}_{a}, and F^bv\hat{F}^{v}_{b} be paired estimates.

Dp(F^au,F^bv)2ϵ1+4ϵ2ϵD_{p}(\hat{F}^{u}_{a},\hat{F}^{v}_{b})\leq 2\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}.

πu,πv\pi_{u},\pi_{v} and i[k]i\in[k] such that πu(i)=a,πv(i)=b\pi_{u}(i)=a,\pi_{v}(i)=b and (F^u,πu)Dϵ1(Pu[F])(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F]) and (F^v,πv)Dϵ1(Pv[F])(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F]). Then

Then this implies that Dp(Pu[Fi],Pv[Fi])μiuv+2Σi2ϵ2+Σi2ϵ22D_{p}(P_{u}[F_{i}],P_{v}[F_{i}])\leq\|\mu_{i}\|\|u-v\|+2\|\Sigma_{i}\|_{2}\epsilon_{2}+\|\Sigma_{i}\|_{2}\epsilon_{2}^{2} and if we apply Claim 37, this is at most 4ϵ2ϵ\frac{4\epsilon_{2}}{\epsilon}, and this implies the claim. ∎

F.2 Reconstruction

Lemma 7. Let ϵ2,ϵ1>0\epsilon_{2},\epsilon_{1}>0. Suppose m0μr|m^{0}-\mu\cdot r|,mijμrij|{m}^{ij}-\mu\cdot r^{ij}|, v0rTΣr|v^{0}-r^{T}\Sigma r|,vij(rij)TΣrij|v^{ij}-(r^{ij})^{T}\Sigma r^{ij}| are all at most ϵ1\epsilon_{1}. Then Solve outputs μ^Rn\hat{\mu}\in{\bf R}^{n} and Σ^Rn×n\hat{\Sigma}\in{\bf R}^{n\times n} such that μ^μ<ϵ1nϵ2\|\hat{\mu}-\mu\|<\frac{\epsilon_{1}\sqrt{n}}{\epsilon_{2}}, and Σ^ΣF6nϵ1ϵ22\|\hat{\Sigma}-\Sigma\|_{F}\leq\frac{6n\epsilon_{1}}{\epsilon_{2}^{2}}. Furthermore, Σ^0\hat{\Sigma}\succeq 0 and Σ^\hat{\Sigma} is symmetric.

We will again need the notion of a Window:

Given a target additive error ϵ\epsilon, we call a Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}) well-separated if the following conditions hold:

max(ϵ1nϵ2,6nϵ1ϵ22)ϵ\max(\frac{\epsilon_{1}\sqrt{n}}{\epsilon_{2}},\frac{6n\epsilon_{1}}{\epsilon_{2}^{2}})\leq\epsilon

ϵ2ϵ+ϵ1<<ϵ3\frac{\epsilon_{2}}{\epsilon}+\epsilon_{1}<<\epsilon_{3}

ϵ2ϵ+ϵ1+ϵ3<<ϵ4\frac{\epsilon_{2}}{\epsilon}+\epsilon_{1}+\epsilon_{3}<<\epsilon_{4}

Given any univariate estimate F^\hat{F} that (weakly) satisfies a Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}), the set of components of F^\hat{F} with parameter distance at most ϵ1\epsilon_{1} is an equivalence class.

Let u,vu,v be two directions that are ϵ2\epsilon_{2}-close - i.e. uvϵ2\|u-v\|\leq\epsilon_{2}. Suppose that (F^u,πu)Dϵ1(Pu[F])(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F]) and (F^v,πv)Dϵ1(Pv[F])(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F]). Suppose further that F^u\hat{F}^{u} and F^v\hat{F}^{v} (weakly) satisfy the Window (ϵ1,ϵ2,ϵ3,ϵ4)(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}). Let Eu={E1u,E2u,...Eku}\mathcal{E}^{u}=\{\mathcal{E}^{u}_{1},\mathcal{E}^{u}_{2},...\mathcal{E}^{u}_{k^{\prime}}\} and Ev={E1v,E2v,...Ekv}\mathcal{E}^{v}=\{\mathcal{E}^{v}_{1},\mathcal{E}^{v}_{2},...\mathcal{E}^{v}_{k^{\prime\prime}}\} be the equivalence classes of components of F^u,F^v\hat{F}^{u},\hat{F}^{v} respectively at parameter distance at most ϵ1\epsilon_{1}.

Then k=kk^{\prime}=k^{\prime\prime}, and there is a permutation πu,v:[k][k]\pi_{u,v}:[k^{\prime}]\rightarrow[k^{\prime\prime}] such that Pu[Fj]P_{u}[F_{j}] is mapped to the equivalence class Ehu\mathcal{E}^{u}_{h} by the mapping πu\pi_{u} iff Pv[Fj]P_{v}[F_{j}] is mapped to the equivalence Eπu,v(h)v\mathcal{E}^{v}_{\pi_{u,v}(h)} by the mapping πv\pi_{v}. Also we can construct πu,v\pi_{u,v} from the estimates F^u,F^v\hat{F}^{u},\hat{F}^{v}.

To establish this claim, consider two distinct equivalence classes Eau,Ebu\mathcal{E}^{u}_{a},\mathcal{E}^{u}_{b}, and let F^au,F^bu\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}} be arbitrary representative. For each component F^au\hat{F}^{u}_{a^{\prime}} in F^u\hat{F}^{u}, there is some component Pu[Fi]P_{u}[F_{i}] in Pu[F]P_{u}[F] that is mapped by πu\pi_{u} to F^au\hat{F}^{u}_{a^{\prime}}. Then let Pu[Fi],Pu[Fj]P_{u}[F_{i}],P_{u}[F_{j}] be mapped to F^au,F^bu\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}} respectively - i.e. πu(i)=a,πv(j)=b\pi_{u}(i)=a^{\prime},\pi_{v}(j)=b^{\prime}. Then since F^u\hat{F}^{u} (weakly) satisfies the Window WW, we have that Dp(F^au,F^bu)ϵ3D_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}})\geq\epsilon_{3}.

Suppose that Pv[Fi],Pv[Fj]P_{v}[F_{i}],P_{v}[F_{j}] are are mapped to F^cv,F^dv\hat{F}^{v}_{c^{\prime}},\hat{F}^{v}_{d^{\prime}} and these two components are in the same equivalence class in the mixture F^v\hat{F}^{v}. Then Dp(F^cv,F^dv)ϵ1D_{p}(\hat{F}^{v}_{c^{\prime}},\hat{F}^{v}_{d^{\prime}})\leq\epsilon_{1}. Yet F^av,F^cv\hat{F}^{v}_{a^{\prime}},\hat{F}^{v}_{c^{\prime}} are paired estimates so using Claim 39, Dp(F^av,F^cv)2ϵ1+4ϵ2ϵD_{p}(\hat{F}^{v}_{a^{\prime}},\hat{F}^{v}_{c^{\prime}})\leq 2\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}, and similarly for F^bv,F^dv\hat{F}^{v}_{b^{\prime}},\hat{F}^{v}_{d^{\prime}}. Then Dp(F^au,F^bu)ϵ1+4ϵ1+8ϵ2ϵD_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}})\leq\epsilon_{1}+4\epsilon_{1}+\frac{8\epsilon_{2}}{\epsilon} using the triangle inequality, but this contradicts the above implication that Dp(F^au,F^bu)ϵ3D_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}})\geq\epsilon_{3} because ϵ3>>ϵ1+ϵ2ϵ\epsilon_{3}>>\epsilon_{1}+\frac{\epsilon_{2}}{\epsilon}.

This implies that every every two components in F^u\hat{F}^{u} that are in a different equivalence classes must be each paired to to two components in F^v\hat{F}^{v} that are also in a different equivalence class. The claim is symmetric w.r.t. u,vu,v, so this implies that F^u,F^v\hat{F}^{u},\hat{F}^{v} have the same number of equivalence classes.

And also consider any component F^au\hat{F}^{u}_{a}. Using Claim 38, there is some component F^bv\hat{F}^{v}_{b} so that F^au,F^bv\hat{F}^{u}_{a},\hat{F}^{v}_{b} are paired estimates. Then using Claim 39, Dp(F^au,F^bv)2ϵ1+4ϵ2ϵD_{p}(\hat{F}^{u}_{a},\hat{F}^{v}_{b})\leq 2\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}. Yet for any component F^cv\hat{F}^{v}_{c} that is not in the same equivalence class as F^bv\hat{F}^{v}_{b},

where the last line follows because F^v\hat{F}^{v} (weakly) satisfies the Window WW. So we can construct πu,v\pi_{u,v} given just F^u,F^v\hat{F}^{u},\hat{F}^{v} because for any pair of equivalence classes Eiu,Ejv\mathcal{E}^{u}_{i},\mathcal{E}^{v}_{j}, if there is a pair of Gaussians that are paired estimates, the parameter distance between any representative from Eiu\mathcal{E}^{u}_{i} to any representative from Ejv\mathcal{E}^{v}_{j} must be at most 4ϵ1+4ϵ2ϵ4\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}. Yet if there is no such pair of Gaussians, one from each equivalence class, that are paired estimates, the parameter distance between any representative from Eiu\mathcal{E}^{u}_{i} to any representative from Ejv\mathcal{E}^{v}_{j} is at least ϵ32ϵ14ϵ2ϵ\epsilon_{3}-2\epsilon_{1}-\frac{4\epsilon_{2}}{\epsilon} so we can distinguish these two cases because ϵ3>>ϵ1+ϵ2ϵ\epsilon_{3}>>\epsilon_{1}+\frac{\epsilon_{2}}{\epsilon}. ∎

Let W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}) be a well-separated window. Suppose for some root direction rr, and n2n^{2} ϵ2\epsilon_{2}-close-by directions rx,yr_{x,y} as in the Partition Pursuit Algorithm, we run the General Univariate Algorithm with precision ϵ1\epsilon_{1} and for each run we get an estimate F^x,y\hat{F}^{x,y} that (weakly) satisfies the Window WW. Then suppose we run Solve given the directions r,rx,yr,r_{x,y} and the estimate F^x,y\hat{F}^{x,y}.

Solve returns an nn-dimensional estimate F^\hat{F} that is an ϵ\epsilon-correct subdivision of FF.

We can apply Lemma 41 and find a partition of all equivalence classes that arise in any estimate in any direction, into sets Eh={E1h,E2h,...En2h}\mathcal{E}^{h}=\{\mathcal{E}^{h}_{1},\mathcal{E}^{h}_{2},...\mathcal{E}^{h}_{n^{2}}\} with the property that for all FiF_{i}, there is some hh such that in each direction rx,yr_{x,y}, FiF_{i} is mapped some equivalence class in Eh\mathcal{E}^{h}. Suppose in direction rx,yr_{x,y}, FiF_{i} is mapped to the equivalence class Ejh\mathcal{E}^{h}_{j}. Then we can take an arbitrary F^jh\hat{F}^{h}_{j} in this set, and use these parameters as an estimate for the projected mean and projected variance of Prx,y[Fi]P_{r_{x,y}}[F_{i}] and these estimates will be 2ϵ12\epsilon_{1} close in parameter distance to the actual values. So we can apply Lemma 7, and the component F^h\hat{F}_{h} of the estimate F^\hat{F} output from Solve that has parameter distance at most ϵ\epsilon to FiF_{i}. So for every component FiF_{i}, there will be some estimate F^h\hat{F}_{h} output from Solve that has parameter distance at most ϵ\epsilon to FiF_{i}. Additionally, for every set of equivalence classes Eh\mathcal{E}^{h}, there is some component FiF_{i} with the property that in each direction rx,yr_{x,y}, FiF_{i} is mapped some equivalence class in Eh\mathcal{E}^{h}. So the mapping from a component FiF_{i} to an estimate F^h\hat{F}_{h} that is ϵ\epsilon-close in parameter distance, will be onto. Lastly, given any partition into sets E1,E2,...Ek\mathcal{E}^{1},\mathcal{E}^{2},...\mathcal{E}^{k^{\prime}}, we can choose the weight w^h\hat{w}_{h} to be the sum of the estimated weights in any equivalence class Ejh\mathcal{E}^{h}_{j} in the set, and because the General Univariate Algorithm returns an ϵ1\epsilon_{1}-correct subdivision, this aggregate weight w^h\hat{w}_{h} will be within an additive kϵ1k\epsilon_{1} of the actual aggregate weight of the components FiF_{i} that are ϵ\epsilon-close in parameter distance to F^h\hat{F}_{h}. ∎

F.3 Observed Components

Given precision ϵ1\epsilon_{1} (given to the General Univariate Algorithm), we say that the number of observed pairs in the estimate F^\hat{F} returned is the maximum value of (k2){k^{\prime\prime}\choose 2} such that there is a subset of kk^{\prime\prime} components of F^\hat{F} with the property that every pair is at parameter distance >ϵ1>\epsilon_{1}. And we will say that the number of observed components is kk^{\prime\prime}.

Suppose we are given any well-separated Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}), and an estimate F^\hat{F} that (weakly) satisfies the Window WW. Suppose further that the set of equivalence classes E1,E2,...Ek\mathcal{E}_{1},\mathcal{E}_{2},...\mathcal{E}_{k^{\prime}} (of components in F^\hat{F} at parameter distance at most ϵ1\epsilon_{1})has kk^{\prime} elements.

The number of observed components is kk^{\prime}.

So let u,vu,v be two directions that are ϵ2\epsilon_{2}-close (i.e. uvϵ2\|u-v\|\leq\epsilon_{2}), and let F^u,F^v\hat{F}^{u},\hat{F}^{v} be the estimates returned by the General Univariate Algorithm when given target precision ϵ1\epsilon_{1}, for the directions uu, vv respectively. Suppose further that F^u\hat{F}^{u} (strongly) satisfies the Window WW.

Then the estimate F^v\hat{F}^{v} will either (weakly) satisfy the Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}), or the number of observed pairs in F^v\hat{F}^{v} is strictly more than the number observed in F^u\hat{F}^{u}.

Since the estimate F^u\hat{F}^{u} (strongly) satisfies the Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}), it also (weakly) satisfies this Window. So we can apply Claim 43 and this implies that there are kk^{\prime} observed components in the estimate F^u\hat{F}^{u} (if there are kk^{\prime} equivalence classes of components in F^u\hat{F}^{u} at parameter distance at most ϵ1\epsilon_{1}).

Let F^cv,F^dv\hat{F}^{v}_{c},\hat{F}^{v}_{d} be two arbitrary components in F^v\hat{F}^{v}. We can apply Claim 38 to get two components F^au,F^bu\hat{F}^{u}_{a},\hat{F}^{u}_{b} in F^u\hat{F}^{u} such that F^au\hat{F}^{u}_{a} and F^cv\hat{F}^{v}_{c} are paired estimates, and similarly F^bu\hat{F}^{u}_{b} and F^dv\hat{F}^{v}_{d} are also paired estimates.

Suppose F^au,F^bu\hat{F}^{u}_{a},\hat{F}^{u}_{b} are not in the same equivalence class in F^u\hat{F}^{u}. This implies that Dp(F^au,F^bu)ϵ4D_{p}(\hat{F}^{u}_{a},\hat{F}^{u}_{b})\geq\epsilon_{4} because F^u\hat{F}^{u} (strongly) satisfies the Window WW. Using Claim 39, we get that

so this implies that the parameter distance Dp(F^cv,F^dv)D_{p}(\hat{F}^{v}_{c},\hat{F}^{v}_{d}) does not contribute to F^v\hat{F}^{v} not (weakly) satisfying WW.

So suppose F^au,F^bu\hat{F}^{u}_{a},\hat{F}^{u}_{b} are in the same equivalence class in F^u\hat{F}^{u}. Then using Claim 39, we get that

because D(F^au,F^bu)ϵ1D(\hat{F}^{u}_{a},\hat{F}^{u}_{b})\leq\epsilon_{1}.

This implies that the only way that the Window WW could be not (weakly) satisfied if there is some pair F^cv,F^dv\hat{F}^{v}_{c},\hat{F}^{v}_{d} for which the paired estimates of each are in the same equivalence class in F^u\hat{F}^{u}, and yet Dp(F^cv,F^dv)>ϵ1D_{p}(\hat{F}^{v}_{c},\hat{F}^{v}_{d})>\epsilon_{1}. So for each other equivalence class in F^u\hat{F}^{u} (other than the one that F^au,F^bu\hat{F}^{u}_{a},\hat{F}^{u}_{b} are in), we can select a representative component F^eu\hat{F}^{u}_{e}, and for each one we apply Claim 38 and find a corresponding component F^ev\hat{F}^{v}_{e^{\prime}}. If we take this set, and F^cv,F^dv\hat{F}^{v}_{c},\hat{F}^{v}_{d} this is a set of k+1k^{\prime}+1 components, and using the above argument all pairs of distances are at least ϵ3>>ϵ1\epsilon_{3}>>\epsilon_{1}, except for the pair Dp(F^cv,F^dv)D_{p}(\hat{F}^{v}_{c},\hat{F}^{v}_{d}) which is still >ϵ1>\epsilon_{1}, so we have k+1k^{\prime}+1 observed components in F^v\hat{F}^{v} if F^v\hat{F}^{v} does not (weakly) satisfy the Window WW. ∎

F.4 Partition Pursuit

Theorem 8. Given an ϵ\epsilon-statistically learnable GMM FF in isotropic position, the Partition Pursuit Algorithm will recover an ϵ\epsilon-correct sub-division F^\hat{F} and if FF has more than one component, F^\hat{F} also has more than one component.

Given an ϵ\epsilon-statistically learnable GMM FF in nn dimensions (and in isotropic position), we can project onto a direction rr chosen uniformly at random. Using Lemma 13, we can instantiate the Partition Pursuit Algorithm with a Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4}) with ϵ4=poly(ϵ,1n)\epsilon_{4}=poly(\epsilon,\frac{1}{n}) so that there is at least one pair of Gaussians (with high probability) that when projected onto rr are at parameter distance at least ϵ4\epsilon_{4}. So when we run the General Univariate Algorithm after projecting onto the direction rr, the estimate returned F^r\hat{F}^{r} will have at least two components in order for it to be an ϵ1\epsilon_{1} correct subdivision for Pr[F]P_{r}[F].

If the estimate F^r\hat{F}^{r} returned by the General Univariate Algorithm does not (strongly) satisfy the Window WW, we can perform a shifting operation on the Window WW to obtain a new Window W=(ϵ1,ϵ2,ϵ3,ϵ1)W^{\prime}=(\epsilon_{1}^{\prime},\epsilon_{2}^{\prime},\epsilon_{3}^{\prime},\epsilon_{1}) so that WW^{\prime} is also well-separated and the number of pairwise components observed has strictly increased. So eventually we can find a Window W=(ϵ1,ϵ2,ϵ3,ϵ4)W^{\prime}=(\epsilon^{\prime}_{1},\epsilon^{\prime}_{2},\epsilon^{\prime}_{3},\epsilon^{\prime}_{4}) such that the estimate F^r\hat{F}^{r} returned by General Univariate Algorithm run with target precision ϵ1\epsilon_{1} (strongly) satisfies the Window. Because the number of observed components strictly increases each time we perform a shifting operation, the number of times that we must slide the Window is at most kk. And each time we slide a Window, the parameters of the new Window are polynomially related to the parameters in the old Window. So the precision ϵ1\epsilon^{\prime}_{1} of this Window will be some polynomial in the original precision ϵ1\epsilon_{1}.

So the total number of times that we need to slide the Window is at most kk, and this implies that the parameters we need remain polynomially lower-bounded in ϵ,1n\epsilon,\frac{1}{n}. And when we need to perform no more slides, we have reached a root direction rr such that the estimate returned by the General Univariate Algorithm is (strongly) consistent with the Window WW^{\prime}, and for each direction ri,jr_{i,j} the estimate returned by the General Univariate Algorithm (weakly) satisfies the Window WW^{\prime} as well.

Using Claim 42, this implies that the output of our algorithm is an nn-dimensional ϵ\epsilon-correct sub-division F^\hat{F} for FF. ∎

Appendix G Clustering and Recursion

Suppose the estimate F^\hat{F} returned by the Partition Pursuit Algorithm is an ϵ1\epsilon_{1}-correct subdivision for FF, but is not a good estimate in terms of statistical distance. The only way that this can happen is if there is some component of FF which has a co-variance matrix Σi\Sigma_{i} that has a very small eigenvalue. In this case, we can use this direction (i.e. the eigenvector corresponding to this eigenvalue) to cluster samples from FF into two sets, and proceed in each set by induction.

In this section, we give some simple claims that will be useful building blocks for deciding how to cluster. Specifically, we will need to choose some clustering scheme for samples coming from FF, so that there is some bi-partition of the components of FF into S[k]S\subset[k] and [k]S[k]-S such that any sample generated from FiF_{i} (iSi\in S) has a negligible probability of being mis-clustered.

Given a set of kk points x1,x2,...xkx_{1},x_{2},...x_{k}\in\Re on the line and the maximum distance between any pair is Δ\Delta. Then there is a bi-partition A{x1,x2,...xk}A\subset\{x_{1},x_{2},...x_{k}\}, B={x1,x2,...xk}AB=\{x_{1},x_{2},...x_{k}\}-A such that D(A,B)Δ2k2D(A,B)\geq\frac{\Delta}{2^{k-2}} (and A,BA,B\neq\emptyset) and diam(A),diam(B)Δ(12k1)diam(A),diam(B)\leq\Delta(1-2^{k-1}).

Assume that x1x_{1} is at least as small as any other value in the set, and assume that x2x_{2} is at least as large as any other value in the set. Then set A2={x1},B2={x2}A_{2}=\{x_{1}\},B_{2}=\{x_{2}\}. Clearly D(A2,B2)ΔD(A_{2},B_{2})\geq\Delta. Consider the point x3x_{3}. Either D(A2,x3)D(A_{2},x_{3}) or D(B2,xe)D(B_{2},x_{e}) must be at least Δ2\frac{\Delta}{2}, using the triangle inequality (because D(A2,B2)ΔD(A_{2},B_{2})\geq\Delta). Add the point x3x_{3} to the side that it is closest to, and the resulting subsets A3,B3A_{3},B_{3} are at distance at least Δ2\frac{\Delta}{2}. Iterating this procedure yields two subset Ak,BkA_{k},B_{k} that are disjoint, have D(Ak,Bk)Δ2k2D(A_{k},B_{k})\geq\frac{\Delta}{2^{k-2}} and AkBk={x1,x2,...xk}A_{k}\cup B_{k}=\{x_{1},x_{2},...x_{k}\}. Also diam(Ak)=maxxiAkxix1x2D(Ak,Bk)x1Δ(12k1)diam(A_{k})=\max_{x_{i}\in A_{k}}x_{i}-x_{1}\leq x_{2}-D(A_{k},B_{k})-x_{1}\leq\Delta(1-2^{k-1}), and similarly for BkB_{k}. So take A=Ak,B=BkA=A_{k},B=B_{k}, and this implies the claim. ∎

Given a set of kk points x1,x2,...xk+x_{1},x_{2},...x_{k}\in\Re^{+} on the line that are strictly positive s.t. the maximum ratio of any two points in the set is C>1C>1. Then there is a bi-partition A{x1,x2,...xk}A\subset\{x_{1},x_{2},...x_{k}\}, B={x1,x2,...xk}AB=\{x_{1},x_{2},...x_{k}\}-A such that for all xiA,xjBx_{i}\in A,x_{j}\in B,

(and A,BA,B\neq\emptyset) and also for all xi,xjAx_{i},x_{j}\in A, xixjC112k\frac{x_{i}}{x_{j}}\leq C^{1-\frac{1}{2^{k}}} and also for all xi,xjBx_{i},x_{j}\in B, xixjC112k\frac{x_{i}}{x_{j}}\leq C^{1-\frac{1}{2^{k}}}.

Let y1,y2,...yky_{1},y_{2},...y_{k}\in\Re be the logarithm of each point xix_{i} - i.e. yi=logxiy_{i}=\log x_{i}. Then the maximum distance between any two points in y1,y2,...yky_{1},y_{2},...y_{k} is maxi,jlogxilogxj=maxi,jxixj=logC\max_{i,j}\log x_{i}-\log x_{j}=\max_{i,j}\frac{x_{i}}{x_{j}}=\log C. So let Δ=logC\Delta=\log C and apply Claim 45 to the set y1,y2,..yky_{1},y_{2},..y_{k}. Then we get a bipartition A,BA^{\prime},B^{\prime} of y1,y2,...yky_{1},y_{2},...y_{k} and let A,BA,B be the corresponding bi-partition of x1,x2,...xkx_{1},x_{2},...x_{k} - i.e. xiAx_{i}\in A iff yiAy_{i}\in A^{\prime}.

Then minyiA,yjByiyjΔ2k1\min_{y_{i}\in A^{\prime},y_{j}\in B^{\prime}}y_{i}-y_{j}\geq\frac{\Delta}{2^{k-1}} and yiyj=logxixjy_{i}-y_{j}=\log\frac{x_{i}}{x_{j}}. So this implies that

Also from Claim 45, we have that maxyi,yjAyiyjΔ(112k1)\max_{y_{i},y_{j}\in A^{\prime}}y_{i}-y_{j}\geq\Delta(1-\frac{1}{2^{k-1}}) and yiyj=logxixjy_{i}-y_{j}=\log\frac{x_{i}}{x_{j}} and so

Let F^\hat{F} be a mixture of nn-dimensional Gaussians s.t. F^\hat{F} is an ϵ1\epsilon_{1}-correct sub-division for FF. Also we assume that FF is in isotropic position.

Let FF be an ϵ\epsilon-statistically learnable distribution in isotropic position. Let (F^,π)Dϵ1(F)(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F). Then for any direction rr, var(Pr[F^])1k2O(ϵ1ϵ2)var(P_{r}[\hat{F}])\geq 1-k^{2}O(\frac{\epsilon_{1}}{\epsilon^{2}})

Let μ=iwiμi,μ^=iw^iμ^i\mu=\sum_{i}w_{i}\mu_{i},\hat{\mu}=\sum_{i}\hat{w}_{i}\hat{\mu}_{i}. We can apply Claim 37 to get that μμ^ϵ1+kO(ϵ1ϵ)=O(kϵ1ϵ)\|\mu-\hat{\mu}\|\leq\epsilon_{1}+kO(\frac{\epsilon_{1}}{\epsilon})=O(\frac{k\epsilon_{1}}{\epsilon}). Also using Claim 37, we obtain Σi21ϵ\|\Sigma_{i}\|_{2}\leq\frac{1}{\epsilon} and Σ^π(i)21ϵ+ϵ1\|\hat{\Sigma}_{\pi(i)}\|_{2}\leq\frac{1}{\epsilon}+\epsilon_{1}.

Consider any symmetric matrix AA: (u+v)TA(u+v)=uTAu+2vTAu+vTAv(u+v)^{T}A(u+v)=u^{T}Au+2v^{T}Au+v^{T}Av. And so

And we can apply this equation using A=rrTA=rr^{T}, u=μ^π(i)μ^u=\hat{\mu}_{\pi(i)}-\hat{\mu} and v=μiμuv=\mu_{i}-\mu-u and note that A2=1,uO(1ϵ+kϵ1ϵ)=O(1ϵ)\|A\|_{2}=1,\|u\|\leq O(\frac{1}{\epsilon}+\frac{k\epsilon_{1}}{\epsilon})=O(\frac{1}{\epsilon}) and vO(kϵ1ϵ)\|v\|\leq O(\frac{k\epsilon_{1}}{\epsilon}). Then this implies that (rT(μiμ))2(rT(μ^π(i)μ^))2+O(kϵ1ϵ2)(r^{T}(\mu_{i}-\mu))^{2}\leq(r^{T}(\hat{\mu}_{\pi(i)}-\hat{\mu}))^{2}+O(k\frac{\epsilon_{1}}{\epsilon^{2}}). Then if we take Δ\Delta to be the discrete distribution rTμir^{T}\mu_{i} with probability wiw_{i}, and similarly Δ^\hat{\Delta} to be the discrete distribution rTμ^ir^{T}\hat{\mu}_{i} with probability w^i\hat{w}_{i}, var(Δ^)var(Δ)O(k2ϵ1ϵ2)var(\hat{\Delta})\geq var(\Delta)-O(k^{2}\frac{\epsilon_{1}}{\epsilon^{2}}).

Also rT(ΣiΣ^π(i))rΣiΣ^π(i)F1ϵ|r^{T}(\Sigma_{i}-\hat{\Sigma}_{\pi(i)})r|\leq\|\Sigma_{i}-\hat{\Sigma}_{\pi(i)}\|_{F}\leq\frac{1}{\epsilon}. These facts are enough to be able to apply Fact 57 to get that var(Pr[F^])var(Pr[F])O(k2ϵ1ϵ2)var(P_{r}[\hat{F}])\geq var(P_{r}[F])-O(k^{2}\frac{\epsilon_{1}}{\epsilon^{2}})

G.2 How to Cluster

We will call A,BnA,B\subset\Re^{n} a clustering scheme if AB=A\cap B=\emptyset

For AnA\subset\Re^{n}, we will write P[Fi,A]P[F_{i},A] to denote PrxFi[xA]Pr_{x\sim F_{i}}[x\in A] - i.e. the probability that a randomly chosen sample from FiF_{i} is in the set AA.

Let (F^,π)Dϵ1(F)(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F). Suppose also that F^\hat{F} is a mixture of kk^{\prime} components.

Lemma 9. Suppose that for some direction vv, for all ii: vTΣ^ivϵ2v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{2}, for ϵ1ϵ22ϵ3\epsilon_{1}\leq\frac{\sqrt{\epsilon_{2}}}{2\epsilon_{3}}. If there is some bi-partition S[k]S\subset[k^{\prime}] s.t. iS,j[k]SvTμ^ivTμ^j3ϵ2ϵ3\forall_{i\in S,j\in[k^{\prime}]-S}|v^{T}\hat{\mu}_{i}-v^{T}\hat{\mu}_{j}|\geq\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}} then there is a clustering scheme (A,B)(A,B) (based only on F^\hat{F}) so that for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i), P[Fi,A]1ϵ3P[F_{i},A]\geq 1-\epsilon_{3} and for all iS,jπ1(i)i\notin S,j\in\pi^{-1}(i), Pr[Fi,B]1ϵ3Pr[F_{i},B]\geq 1-\epsilon_{3}.

For each ii, consider the interval Ii=[vTμ^iϵ2ϵ3,vTμ^i+ϵ2ϵ3]I_{i}=[v^{T}\hat{\mu}_{i}-\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}},v^{T}\hat{\mu}_{i}+\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}}]. Then we will choose A={xnvTxiSIi}A=\{x\in\Re^{n}|v^{T}x\in\cup_{i\in S}I_{i}\} and similarly we choose B={xnvTxiSIi}B=\{x\in\Re^{n}|v^{T}x\in\cup_{i\notin S}I_{i}\}.

We first demonstrate that AB=A\cap B=\emptyset. Because of how A,BA,B are defined, this condition is equivalent to the condition that Ai=iSIiA_{i}=\cup_{i\in S}I_{i} and Bi=iSIiB_{i}=\cup_{i\notin S}I_{i} be disjoint. (Ai,BiA_{i},B_{i}\subset\Re and AiBi=A_{i}\cap B_{i}=\emptyset). So consider any two intervals Ii,IjI_{i},I_{j} for iS,jSi\in S,j\notin S. Then because i,ji,j are on different sides of the bipartition S,[k]SS,[k^{\prime}]-S, we get that vTμ^ivTμ^j3ϵ2ϵ3|v^{T}\hat{\mu}_{i}-v^{T}\hat{\mu}_{j}|\geq\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}} so Ii,IjI_{i},I_{j} are in fact disjoint. This implies Ai,BiA_{i},B_{i} are disjoint, and this implies that A,BA,B are disjoint.

Since the standard deviation of FjF_{j} in the direction of vv is at most 2ϵ2\sqrt{2\epsilon_{2}}, points outside Iπ(j)I_{\pi(j)} are at least 1/(2ϵ3)1/(2\epsilon_{3}) standard deviations from their true mean. Using the fact that, for a one-dimensional Gaussian random variable, the probability of being at least ss standard deviations from the mean is at most 2es2/2/(2πs)1/s2e^{-s^{2}/2}/(\sqrt{2\pi}s)\leq 1/s, we get that the probability that xx sampled from FiF_{i} is outside the range [vTμ^iϵ2ϵ3,vTμ^i+ϵ2ϵ3][v^{T}\hat{\mu}_{i}-\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}},v^{T}\hat{\mu}_{i}+\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}}] is at most ϵ3\epsilon_{3}. And this implies the lemma. ∎

Let (F^,π)Dϵ1(F)(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F). Suppose also that F^\hat{F} is a mixture of kk^{\prime} components.

Lemma 10. Suppose that for some direction vv and some i[k]i\in[k^{\prime}] such that: vTΣ^ivϵmv^{T}\hat{\Sigma}_{i}v\leq\epsilon_{m}, for ϵm>>ϵ1\epsilon_{m}>>\epsilon_{1}. If there is some bi-partition S[k]S\subset[k^{\prime}] s.t.

(and ϵt<<ϵ33\epsilon_{t}<<\epsilon_{3}^{3}) then there is a clustering scheme A,BA,B such that for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i), P[Fi,A]1ϵ3P[F_{i},A]\geq 1-\epsilon_{3} and for all iS,jπ1(i)i\notin S,j\in\pi^{-1}(i), Pr[Fi,B]1ϵ3Pr[F_{i},B]\geq 1-\epsilon_{3}.

Let T=[k]ST=[k^{\prime}]-S. Let σ^S=miniSvTΣ^iv\hat{\sigma}_{S}=\min_{i\in S}v^{T}\hat{\Sigma}_{i}v, σ^T=maxjTvTΣ^jv\hat{\sigma}_{T}=\max_{j\in T}v^{T}\hat{\Sigma}_{j}v. So we are given that σ^Smax(σ^T,ϵm)1ϵt\frac{\hat{\sigma}_{S}}{\max(\hat{\sigma}_{T},\epsilon_{m})}\geq\frac{1}{\epsilon_{t}}.

Let Bv=iTIiB_{v}=\cup_{i\in T}I_{i} where Ii=[vTμ^imax(σ^T,ϵm)ϵ3ϵ1,vTμ^i+max(σ^T,ϵm)ϵ3+ϵ1]I_{i}=[v^{T}\hat{\mu}_{i}-\frac{\sqrt{\max(\hat{\sigma}_{T},\epsilon_{m})}}{\epsilon_{3}}-\epsilon_{1},v^{T}\hat{\mu}_{i}+\frac{\sqrt{\max(\hat{\sigma}_{T},\epsilon_{m})}}{\epsilon_{3}}+\epsilon_{1}].

Let FjF_{j} be a component in FF s.t. π(j)=iT\pi(j)=i\in T. Then the variance of FjF_{j} in the direction vv is at most σ^T+ϵ12max(σ^T,ϵm)\hat{\sigma}_{T}+\epsilon_{1}\leq 2\max(\hat{\sigma}_{T},\epsilon_{m}) where here we have used the condition that ϵm>>ϵ1\epsilon_{m}>>\epsilon_{1}. So any point xx outside the interval Iπ(j)I_{\pi(j)} is at least 1/(2ϵ3)1/(2\epsilon_{3}) standard deviations from their true mean. Using the fact that, for a one-dimensional Gaussian random variable, the probability of being at least ss standard deviations from the mean is at most 2es2/2/(2πs)1/s2e^{-s^{2}/2}/(\sqrt{2\pi}s)\leq 1/s, we get that the probability that vTxv^{T}x (when xx is sampled from FjF_{j}) is outside the range BvB_{v} is at most ϵ3\epsilon_{3}.

We will we take as our clustering algorithm B={xnvTxBv}B=\{x\in\Re^{n}|v^{T}x\in B_{v}\} and and A=nBA=\Re^{n}-B, then clearly AB=A\cap B=\emptyset. So the above statement implies that Pr[Fj,B]1ϵ3Pr[F_{j},B]\geq 1-\epsilon_{3} for any iS,jπ1(i)i\notin S,j\in\pi^{-1}(i).

Also, for any FjF_{j} with π(j)S\pi(j)\in S, the variance when projected onto vv is at least σ^Sϵ1\hat{\sigma}_{S}-\epsilon_{1}. So the probability that a point vTxv^{T}x (where xx is sampled from FjF_{j}) is inside the range BvB_{v} is at most the measure of BvB_{v} times the maximum density of Pv[Fj]P_{v}[F_{j}]. This is at most

where the last line follows because σ^S>>ϵm>>ϵm\hat{\sigma}_{S}>>\epsilon_{m}>>\epsilon_{m} because ϵSˉϵ1\epsilon_{\bar{S}}\geq\epsilon_{1} and the ratio σ^Smax(σ^T,ϵm)1ϵt\frac{\hat{\sigma}_{S}}{\max(\hat{\sigma}_{T},\epsilon_{m})}\geq\frac{1}{\epsilon_{t}} is large, and because ϵt<<ϵ33\epsilon_{t}<<\epsilon_{3}^{3}.

So we also have that Pr[Fj,B]ϵ3Pr[F_{j},B]\leq\epsilon_{3} for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i). So Pr[Fj,A]1ϵ3Pr[F_{j},A]\geq 1-\epsilon_{3}, and this implies the lemma. ∎

G.3 Making Progress when there is a Small Variance

Lemma 11. Suppose μ^iμiϵ1\|\hat{\mu}_{i}-\mu_{i}\|\leq\epsilon_{1}, Σ^iΣiFϵ1\|\hat{\Sigma}_{i}-\Sigma_{i}\|_{F}\leq\epsilon_{1}, and w^iwiϵ1|\hat{w}_{i}-w_{i}|\leq\epsilon_{1}, if either Σi1212ϵm\|\Sigma^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}} or Σ^i1212ϵm\|\hat{\Sigma}^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}} then

Now we can describe the idea behind the hierarchical clustering. Suppose the entire algorithm on k1k-1 Gaussians requires mm samples. Then choose ϵ3=ϵδm\epsilon_{3}=\frac{\epsilon\delta}{m} so that if we take mϵ\frac{m}{\epsilon} samples in total, then each side in the bipartition that results from clustering would get at least mm samples and none of the samples obtained from the oracle are mis-clustered. Then we can run the k1k-1 Gaussian algorithm on each side of the bi-partition in order to get a statistically good estimate for the original mixture of kk Gaussians.

Given ϵ3\epsilon_{3}, choose ϵ2\epsilon_{2} s.t. ϵ2ϵ312k+1\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}}\leq\frac{1}{2^{k+1}}. Also choose ϵm<<ϵ2\epsilon_{m}<<\epsilon_{2} s.t. (ϵmϵ2)12k<<ϵ33(\frac{\epsilon_{m}}{\epsilon_{2}})^{\frac{1}{2^{k}}}<<\epsilon_{3}^{3}. Then choose ϵ1<<ϵm\epsilon_{1}<<\epsilon_{m}.

We call the set of parameters ϵ1<<ϵm<<ϵ2<<ϵ3\epsilon_{1}<<\epsilon_{m}<<\epsilon_{2}<<\epsilon_{3} good if

2nϵ1ϵm+ϵ12ϵmϵ2\frac{2n\epsilon_{1}}{\epsilon_{m}}+\frac{\epsilon_{1}^{2}}{\epsilon_{m}}\leq\epsilon^{2}

k2ϵ1ϵ2=o(1)k^{2}\frac{\epsilon_{1}}{\epsilon^{2}}=o(1)

ϵ1ϵ22ϵ3\epsilon_{1}\leq\frac{\sqrt{\epsilon_{2}}}{2\epsilon_{3}}

3ϵ2ϵ3=o(2k)\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}=o(2^{-k})

(ϵmϵ2)12k<<ϵ33(\frac{\epsilon_{m}}{\epsilon_{2}})^{\frac{1}{2^{k}}}<<\epsilon_{3}^{3}

Suppose we choose a set of good parameters ϵ1<<ϵm<<ϵ2<<ϵ3\epsilon_{1}<<\epsilon_{m}<<\epsilon_{2}<<\epsilon_{3}. Then the Hierarchical Clustering Algorithm will either return an ϵ\epsilon-close statistical estimate F^\hat{F} for FF or make progress by returning a clustering scheme.

Theorem 12. The Hierarchical Clustering Algorithm either returns an ϵ\epsilon-close statistical estimate F^\hat{F} for FF, or returns a clustering scheme A,BA,B such that there is some bipartition S[k]S\subset[k] such that for all iS,jπ1(i)i\in S,j\in\pi^{-1}(i), P[Fi,A]1ϵ3P[F_{i},A]\geq 1-\epsilon_{3} and for all iS,jπ1(i)i\notin S,j\in\pi^{-1}(i), Pr[Fi,B]1ϵ3Pr[F_{i},B]\geq 1-\epsilon_{3}. And also S,[k]SS,[k]-S are both non-emtpy.

We analyze the output of the Hierarchical Clustering Algorithm via a case analysis:

Case 1: Suppose that no Gaussian F^i\hat{F}_{i} has any variance (i.e. in any direction) that is at most ϵm\epsilon_{m}.

Suppose that no Gaussian F^i\hat{F}_{i} has any variance (i.e. in any direction) that is at most ϵm\epsilon_{m}. Then we can apply Lemma 11 and because 2nϵ1ϵm+ϵ12ϵmϵ2\frac{2n\epsilon_{1}}{\epsilon_{m}}+\frac{\epsilon_{1}^{2}}{\epsilon_{m}}\leq\epsilon^{2}, and this will imply that the estimate F^\hat{F} is statistically close to the actual mixture FF.

Case 2: So suppose there is a Gaussian F^i\hat{F}_{i} which has a variance of at most ϵm\epsilon_{m} on some direction vv.

Then using Claim 47, var(Pv[F^])1O(k2ϵ1ϵ2)var(P_{v}[\hat{F}])\geq 1-O(k^{2}\frac{\epsilon_{1}}{\epsilon^{2}}). Because the parameters are good, we know that k2ϵ1ϵ2=o(1)k^{2}\frac{\epsilon_{1}}{\epsilon^{2}}=o(1) and so var(Pv[F^])=Ω(1)var(P_{v}[\hat{F}])=\Omega(1). Suppose that for all F^j\hat{F}_{j}, Dp(Pv[F^i],Pv[F^j])=o(1)D_{p}(P_{v}[\hat{F}_{i}],P_{v}[\hat{F}_{j}])=o(1). In this case, we could apply Fact 57 and jw^jvTΣ^jv=o(1)\sum_{j}\hat{w}_{j}v^{T}\hat{\Sigma}_{j}v=o(1) and similarly var(Δ^)var(\hat{\Delta}) (where Δ^\hat{\Delta} is the discrete distribution on \Re which takes value vTμ^jv^{T}\hat{\mu}_{j} with probability w^j\hat{w}_{j}) will be upper bounded by maxj(vT(μ^iμ^j))2=o(1)\max_{j}(v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j}))^{2}=o(1). So if for all F^j\hat{F}_{j}, Dp(Pv[F^i],Pv[F^j])=o(1)D_{p}(P_{v}[\hat{F}_{i}],P_{v}[\hat{F}_{j}])=o(1), we would have var(Pv[F^])=o(1)var(P_{v}[\hat{F}])=o(1) which is not possible, hence there must be some other Gaussian F^j\hat{F}_{j} s.t. Dp(Pv[F^i],Pv[F^j])=Ω(1)D_{p}(P_{v}[\hat{F}_{i}],P_{v}[\hat{F}_{j}])=\Omega(1).

Case 2a: Suppose that each Gaussian F^h\hat{F}_{h} has projected variance vTΣ^hvϵ2v^{T}\hat{\Sigma}_{h}v\leq\epsilon_{2}, and there is a Gaussian F^j\hat{F}_{j} s.t. the difference in projected means vT(μ^iμ^j)=Ω(1)|v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j})|=\Omega(1).

In this case, we can apply Claim 45 to get a bipartition S[k]S^{\prime}\subset[k^{\prime}] (let T=[k]ST^{\prime}=[k^{\prime}]-S^{\prime}) such that S,TS^{\prime},T^{\prime}\neq\emptyset and such that for all iS,jTi\in S^{\prime},j\in T^{\prime}, vT(μ^iμ^j)Ω(2k)|v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j})|\geq\Omega(2^{-k}). Because the parameters are good, we have that 3ϵ2ϵ3=o(2k)\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}=o(2^{-k}). Then we can apply Lemma 9 to obtain a clustering so that each successive point sampled from the oracle has probability at most ϵ3\epsilon_{3} of being mis-clustered, as desired. And since both S,TS^{\prime},T^{\prime} are non-empty, this clustering scheme returned by Lemma 9 has the property that for either side of the clustering scheme, there is some component FiF_{i} in the original mixture that is mapped to that side w.h.p.

Case 2b: Either there is some Gaussian F^h\hat{F}_{h} which has projected variance vTΣ^hvϵ2v^{T}\hat{\Sigma}_{h}v\geq\epsilon_{2}, or for all Gaussians F^j\hat{F}_{j} (jij\neq i) the difference in projected means vT(μ^iμ^j)=o(1)|v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j})|=o(1).

Either case implies that there is some Gaussian F^h\hat{F}_{h} such that when projected onto vv, F^h\hat{F}_{h} has variance at least ϵ2\epsilon_{2}. In the first case, this is directly true. In the second case, (if we let Δ^\hat{\Delta} be the discrete distribution on \Re which takes value vTμ^jv^{T}\hat{\mu}_{j} with probability w^j\hat{w}_{j}), var(Δ^)=o(1)var(\hat{\Delta})=o(1). And using Claim 47 and Fact 57, then there must be some component F^h\hat{F}_{h} with vTF^hv=Ω(1)>>ϵ2v^{T}\hat{F}_{h}v=\Omega(1)>>\epsilon_{2}.

So let F^h\hat{F}_{h} be the component for which vTF^hvv^{T}\hat{F}_{h}v is the largest (and is at least ϵ2\epsilon_{2}).

We can do the following: Let A1[k]={i[k]vTΣ^ivϵm}A_{1}\subset[k^{\prime}]=\{i\in[k^{\prime}]|v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{m}\}. Let B1=[k]A1B_{1}=[k^{\prime}]-A_{1}, which is necessarily non-empty because hB1h\in B_{1}. Then take B2={ϵm}{vTΣ^iviB1}B_{2}=\{\epsilon_{m}\}\cup\{v^{T}\hat{\Sigma}_{i}v|i\in B_{1}\} and we can apply Claim 46 to get a bi-partition A3,B3A_{3},B_{3} of B2B_{2} with the property that ϵmA3\epsilon_{m}\in A_{3}, both A3,B3A_{3},B_{3} are non-empty and (choosing C=ϵ2ϵmC=\frac{\epsilon_{2}}{\epsilon_{m}} in Claim 46 and C12k>>1ϵ33C^{\frac{1}{2^{k}}}>>\frac{1}{\epsilon_{3}^{3}}) the ratio min(B3)max(A3)1ϵt>>1ϵ33\frac{\min(B_{3})}{\max(A_{3})}\geq\frac{1}{\epsilon_{t}}>>\frac{1}{\epsilon_{3}^{3}}.

Then every projected variance vTΣ^ivv^{T}\hat{\Sigma}_{i}v is in the set A1A3B3A_{1}\cup A_{3}\cup B_{3}. So we can take AA to be the set of indices i[k]i\in[k^{\prime}] such that vTΣ^ivA1A3v^{T}\hat{\Sigma}_{i}v\in A_{1}\cup A_{3} and similarly we take BB to be the set of indices i[k]i\in[k^{\prime}] such that vTΣ^ivB3v^{T}\hat{\Sigma}_{i}v\in B_{3}. Then A,BA,B is a bipartition of [k][k^{\prime}].

Also miniBvTΣ^ivmax(ϵm,maxiAvTΣ^iv)=min(B3)max(A3)1ϵt>>1ϵ33\frac{\min_{i\in B}v^{T}\hat{\Sigma}_{i}v}{\max(\epsilon_{m},\max_{i\in A}v^{T}\hat{\Sigma}_{i}v)}=\frac{\min(B_{3})}{\max(A_{3})}\geq\frac{1}{\epsilon_{t}}>>\frac{1}{\epsilon_{3}^{3}}. And then we can apply Lemma 10 and this yields a clustering so that each successive point sampled from the oracle has probability at most ϵ3\epsilon_{3} of being mis-clustered, as desired. Note that iAi\in A, and hBh\in B, so both of the sides of this clustering scheme are non-empty (for either side of the clustering scheme, there is some component FiF_{i} in the original mixture that is mapped to that side w.h.p.).

This completes the description of the Hierarchical Clustering Algorithm.

G.4 Recursion

Appendix H The Isotropic Projection Lemma for k𝑘k Gaussians

Lemma 13. [Isotropic Projection Lemma] Given a mixture of kk nn-Dimensional Gaussians F=iwiFiF=\sum_{i}w_{i}F_{i} that is in isotropic position and is ϵ\epsilon-statistically learnable, with probability 1δ\geq 1-\delta over a randomly chosen direction uu, there is some pair of Gaussians Fi,FjF_{i},F_{j} s.t. Dp(Pu[Fi],Pu[Fj])ϵ5δ250n2D_{p}(P_{u}[F_{i}],P_{u}[F_{j}])\geq\frac{\epsilon^{5}\delta^{2}}{50n^{2}}.

Let ϵ1=ϵ5δ2100n2\epsilon_{1}=\frac{\epsilon^{5}\delta^{2}}{100n^{2}}, and ϵ2=4ϵ1ϵ\epsilon_{2}=\frac{4\epsilon_{1}}{\epsilon}

Case 1: μiμj>t\|\mu_{i}-\mu_{j}\|>t for some i,j[k]i,j\in[k]. In this case, by Lemma 50, with probability 1δ\geq 1-\delta, u(μiμj)δt/n=2ϵ1|u\cdot(\mu_{i}-\mu_{j})|\geq\delta t/\sqrt{n}=2\epsilon_{1}, as desired.

Case 2: μiμjt\|\mu_{i}-\mu_{j}\|\leq t for all i,j[k]i,j\in[k]. By Lemma 48, with probability 1δ\geq 1-\delta, for some hh,

If u(μiμj)2ϵ1|u\cdot(\mu_{i}-\mu_{j})|\geq 2\epsilon_{1}, then we are done. If not, then u(μiμj)2ϵ1|u\cdot(\mu_{i}-\mu_{j})|\leq 2\epsilon_{1} for all i,j[k]i,j\in[k]. Then using Fact 57, var(Δ)+jwjuTΣju=1var(\Delta)+\sum_{j}w_{j}u^{T}\Sigma_{j}u=1 where Δ\Delta is the discrete distribution on points in 11-dimension which is uTμju^{T}\mu_{j} with probability wjw_{j}. The variance of this mixture Δ\Delta is upper bounded by maxi,juTμiuTμj2\max_{i,j}|u^{T}\mu_{i}-u^{T}\mu_{j}|^{2} which is at most 4ϵ122ϵ14\epsilon_{1}^{2}\leq 2\epsilon_{1}.

Let ϵ,δ>0\epsilon,\delta>0, t(0,ϵ2)t\in(0,\epsilon^{2}). Let FF be an ϵ\epsilon-statistically learnable distribution in isotropic position. Suppose for all i,j[k]i,j\in[k] that μiμjt\|\mu_{i}-\mu_{j}\|\leq t. Then, for uniformly random rr,

We can apply Lemma 52 and then apply Lemma 51. So with probability at least 1δ1-\delta, there is some ii s.t. uTΣiu[1c,1+c]u^{T}\Sigma_{i}u\notin[1-c,1+c] for c==δ2a4n,a=ϵ3t23nc==\frac{\delta^{2}a}{4n},a=\frac{\epsilon^{3}-t^{2}}{3n}. If uTΣiu<1cu^{T}\Sigma_{i}u<1-c then we are done. If instead uTΣiu>1+cu^{T}\Sigma_{i}u>1+c, we can apply Fact 57 which implies that jwjuTΣju1\sum_{j}w_{j}u^{T}\Sigma_{j}u\leq 1 and we have that wiuTΣiu>wi(1+c)w_{i}u^{T}\Sigma_{i}u>w_{i}(1+c). We can apply Claim 49 which implies that there is then some jij\neq i s.t. uTΣju<1ϵcu^{T}\Sigma_{j}u<1-\epsilon c which implies the lemma. ∎

Suppose w1(1+α)+w2(1β)1w_{1}(1+\alpha)+w_{2}(1-\beta)\leq 1, w1,w2ϵ0w_{1},w_{2}\geq\epsilon\geq 0, w1+w2=1w_{1}+w_{2}=1 and α>0\alpha>0. Then, βϵα\beta\geq\epsilon\alpha.

For any μi,μjRn,δ>0\mu_{i},\mu_{j}\in{\bf R}^{n},\delta>0, over uniformly random unit vectors uu,

Suppose the mixture F=iwiFiF=\sum_{i}w_{i}F_{i} is in isotropic position and is ϵ\epsilon-statistically learnable, and that for all i,j[k]i,j\in[k], μiμjt\|\mu_{i}-\mu_{j}\|\leq t. Then maxi{ Σi12 }1+a,a=ϵ3t23n.\max_{i}\{~{}\|\Sigma_{i}^{-1}\|_{2}~{}\}\geq 1+a,\quad a=\frac{\epsilon^{3}-t^{2}}{3n}.

By Fact 58, the squared variation distance between FiF_{i} and FjF_{j} is,

Where λ1,,λn>0\lambda_{1},\ldots,\lambda_{n}>0 are the eigenvalues of Σi1Σj\Sigma_{i}^{-1}\Sigma_{j}. Suppose (μ1μ2)TΣi1(μ1μ2)t2ϵ(\mu_{1}-\mu_{2})^{T}\Sigma_{i}^{-1}(\mu_{1}-\mu_{2})\geq\frac{t^{2}}{\epsilon}, then this implies Σi121ϵ\|\Sigma_{i}^{-1}\|_{2}\geq\frac{1}{\epsilon} because μ1μ2t\|\mu_{1}-\mu_{2}\|\leq t, and we would be done in this case. If not, then from the above equation we get

In particular, there must be some eigenvalue λ\lambda, such that, λ+1/λ22n (ϵ2t2ϵ)=6aϵ2.\lambda+1/\lambda-2\geq\frac{2}{n}\ (\epsilon^{2}-\frac{t^{2}}{\epsilon})=\frac{6a}{\epsilon^{2}}. Let vv be a unit (eigen)vector corresponding to λ\lambda, i.e., v=λΣi1Σjvv=\lambda\Sigma_{i}^{-1}\Sigma_{j}v. Then we have that vTΣiv=λvTΣjvv^{T}\Sigma_{i}v=\lambda v^{T}\Sigma_{j}v and this yields

Since one of the two terms in parentheses above must be at least 3a/ϵ23a/\epsilon^{2}, WLOG, we can take vTΣivvTΣjv1+3a/ϵ2\frac{v^{T}\Sigma_{i}v}{v^{T}\Sigma_{j}v}\geq 1+3a/\epsilon^{2}. This means that the numerator or denominator is bounded from 1. We can break this into two cases.

Case 1: vTΣjv<1/(1+a)v^{T}\Sigma_{j}v<1/(1+a). This establishes the lemma immediately.

Case 2: vTΣiv(1+3a/ϵ2)/(1+a)=1+(3/ϵ21)a/(1+a)1+(3/ϵ21)a/2v^{T}\Sigma_{i}v\geq(1+3a/\epsilon^{2})/(1+a)=1+(3/\epsilon^{2}-1)a/(1+a)\geq 1+(3/\epsilon^{2}-1)a/2. By Claim 49, since hwhvTΣhv1\sum_{h}w_{h}v^{T}\Sigma_{h}v\leq 1, we have there is some g[k]g\in[k], gig\neq i such that

This means that Σg121/(1a)1+a\|\Sigma_{g}^{-1}\|_{2}\geq 1/(1-a)\geq 1+a.

Appendix I Approximate Isotropic Position

Let F1=N(μ1,Σ1)F_{1}=\mathcal{N}(\mu_{1},\Sigma_{1}). Let m=O(n4ln1δϵ4c)m=O(\frac{n^{4}\ln\frac{1}{\delta}}{\epsilon^{4c}}). Then given mm samples from F1F_{1}, x1,x2,...xmx_{1},x_{2},...x_{m} compute μ^1=1mixi\hat{\mu}_{1}=\frac{1}{m}\sum_{i}x_{i} and Σ^1=1mixixiTμ^1μ^1T\hat{\Sigma}_{1}=\frac{1}{m}\sum_{i}x_{i}x_{i}^{T}-\hat{\mu}_{1}\hat{\mu}_{1}^{T}. Let F^1=N(μ^1,Σ^1)\hat{F}_{1}=\mathcal{N}(\hat{\mu}_{1},\hat{\Sigma}_{1}). Then with probability at least 1δ1-\delta, D(F1,F^1)O(ϵc)D(F_{1},\hat{F}_{1})\leq O(\epsilon^{c}).

For m=O(n4lnkδϵ4c+1)m=O(\frac{n^{4}\ln\frac{k}{\delta}}{\epsilon^{4c+1}}), D(F,F^),maxiD(Fi,F^i),maxiwiw^iO(ϵc)D(F,\hat{F}),\max_{i}D(F_{i},\hat{F}_{i}),\max_{i}|w_{i}-\hat{w}_{i}|\leq O(\epsilon^{c}), with probability at least 1δ1-\delta.

because mΩ(logkδϵ2c)m\geq\Omega(\frac{\log\frac{k}{\delta}}{\epsilon^{2c}}).

So each ii receives at least Ω(ϵm)=Ω(n4lnkδϵ4c)\Omega(\epsilon m)=\Omega(\frac{n^{4}\ln\frac{k}{\delta}}{\epsilon^{4c}}) samples, so using Theorem 53, D(Fi,F^i)O(ϵc)D(F_{i},\hat{F}_{i})\leq O(\epsilon^{c}) with probability at least 1δ4k1-\frac{\delta}{4k}.

Then D(F,F^)maxiD(Fi,F^i)+iwiw^i=O(ϵc)D(F,\hat{F})\leq\max_{i}D(F_{i},\hat{F}_{i})+\sum_{i}|w_{i}-\hat{w}_{i}|=O(\epsilon^{c}) and the total probability of any bad event occurring is at most δ\delta so this implies the corollary. ∎

ExF^=1mixiE_{x\sim\hat{F}}=\frac{1}{m}\sum_{i}x_{i} and ExF[xxT]=1mixixiTE_{x\sim F}[xx^{T}]=\frac{1}{m}\sum_{i}x_{i}x_{i}^{T}

Given an ϵ\epsilon^{\prime}-statistically learnable distribution (for ϵϵ\epsilon^{\prime}\geq\epsilon) FF, given m=O(n4lnkδϵ5)m=O(\frac{n^{4}\ln\frac{k}{\delta}}{\epsilon^{5}}) samples from FF, one can compute a transformation T^\hat{T} such that there is ϵO(ϵ)\epsilon^{\prime}-O(\epsilon)-statistically learnable distribution F^\hat{F} s.t. with probability at least 1δ1-\delta

computing an γ\gamma-close estimate for F^\hat{F} is also an γ+O(ϵ)\gamma+O(\epsilon)-close statistical estimate for FF

a transformation T^\hat{T} places F^\hat{F} in exactly isotropic position

T^\hat{T} can be computed from just the sample points x1,x2,...xmx_{1},x_{2},...x_{m}

Appendix J Basic Properties of Gaussians

In this section we state many useful basic facts about univariate Gaussian distributions that are used throughout this paper.

Given a discrete distribution on points in 11-dimension, Δ\Delta, we will define var(Δ)var(\Delta) to be the variance of this distribution.

Given a GMM of 11-dimensional Gaussians, F=iwiN(μi,σi2)F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2}),

where Δ\Delta is the discrete distribution on points in 11-dimension corresponding to selecting each μi\mu_{i} with probability wiw_{i}.

Also ExF[x]=iwiμi=ExΔ[x]E_{x\sim F}[x]=\sum_{i}w_{i}\mu_{i}=E_{x\sim\Delta}[x] and combining these equations yields:

Let F1=N(μ1,Σ1)F_{1}=\mathcal{N}(\mu_{1},\Sigma_{1}) and F2=N(μ2,Σ2)F_{2}=\mathcal{N}(\mu_{2},\Sigma_{2}) be two nn-dimensional Gaussian distributions. Let λ1,,λn>0\lambda_{1},\ldots,\lambda_{n}>0 be the eigenvalues of Σ11Σ2\Sigma_{1}^{-1}\Sigma_{2}. Then the variation distance between them satisfies,

It is easy to verify that argmaxσ2N(0,σ2,γ)=γ2,\text{argmax}_{\sigma^{2}}\mathcal{N}(0,\sigma^{2},\gamma)=\gamma^{2}, from which the fact follows. ∎

Either μγ/2,\mu\geq\gamma/2, or σ2γ/2\sigma^{2}\geq\gamma/2. In the first case, using Fact 59,

[Lemma 29 from ] Given σ22,\sigma^{2}\leq 2,

Using Lemma 61, the above bound follows by a change of variables and induction. Note that the constant inside the O()O() depends (exponentially) on ii. ∎

Given μ,μ,σ2,σ2\mu,\mu^{\prime},\sigma^{2},\sigma^{\prime 2} such that μ,μ<c|\mu|,|\mu^{\prime}|<c and ϵ1/3σ2,σ22\epsilon^{1/3}\leq\sigma^{2},\sigma^{\prime 2}\leq 2 and μμ+σ2σ2ϵ,|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|\leq\epsilon, (and we also assume that ϵc2=o(1)\epsilon c^{2}=o(1) and c1c\geq 1) then

Consider the interval I=[2c,2c]I=[-2c,2c]. Then in order to bound max(N(μ,σ2,x)N(μ,σ2,x))\max(|\mathcal{N}(\mu,\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|) over II, we first bound max(N(μ,σ2,x)N(μ,σ2,x))\max(|\mathcal{N}(\mu,\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x)|) over II and next we bound max(N(μ,σ2,x)N(μ,σ2,x))\max(|\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|) over II.

We prove this claim in two parts: first we bound maxxIN(μ,σ2,x)N(μ,σ2,x)\max_{x\in I}|\mathcal{N}(\mu,\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x)|:

Next, we bound the term maxxI(N(μ,σ2,x)N(μ,σ2,x))\max_{x\in I}(|\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|). We accomplish this by bounding both maxxI(N(μ,σ2,x)N(μ,σ2,x)\max_{x\in I}(\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x) and maxxI(N(μ,σ2,x)N(μ,σ2,x))\max_{x\in I}(\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x)). Assume that σ2σ2\sigma^{2}\geq\sigma^{\prime 2}. Then it follows that: max(N(μ,σ2,x)N(μ,σ2,x))=N(μ,σ2,μ)N(μ,σ2,μ)=12π[1σ1σ]\max(\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x))=\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},\mu^{\prime})-\mathcal{N}(\mu^{\prime},\sigma^{2},\mu^{\prime})=\frac{1}{\sqrt{2\pi}}[\frac{1}{\sigma^{\prime}}-\frac{1}{\sigma}] because N(μ,σ2,x)\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x) decreases at a faster rate than N(μ,σ2,x)\mathcal{N}(\mu^{\prime},\sigma^{2},x) whenever N(μ,σ2,x)>N(μ,σ2,x)\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)>\mathcal{N}(\mu^{\prime},\sigma^{2},x). Also using the restriction that σ2,σ2ϵ1/3\sigma^{\prime 2},\sigma^{2}\geq\epsilon^{1/3} yields [1σ1σ]1σO(ϵσ2)O(ϵ)[\frac{1}{\sigma^{\prime}}-\frac{1}{\sigma}]\leq\frac{1}{\sigma}O(\frac{\epsilon}{\sigma^{2}})\leq O(\sqrt{\epsilon}).

Lastly, we bound the term maxxIN(μ,σ2,x)N(μ,σ2,x)\max_{x\in I}\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x):

Thus these bounds imply that maxxI(N(μ,σ2,x)N(μ,σ2,x))=O(c2ϵ1/6)\max_{x\in I}(|\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|)=O(c^{2}\epsilon^{1/6})

So we can use the Claim 64 to conclude that

And we can use Corollary 62 to conclude that

The kthk^{th} raw moment of a univariate Gaussian, Mk(N(μ,σ2))=i=0kciμiσ2(ki),M_{k}(\mathcal{N}(\mu,\sigma^{2}))=\sum_{i=0}^{k}c_{i}\mu^{i}\sigma^{2(k-i)}, where ci(k+2)!.|c_{i}|\leq(k+2)!.

Consider the moment generating function MX(t)=etμ+σ2t2/2.M_{X}(t)=e^{t\mu+\sigma^{2}t^{2}/2}. We claim that diMX(t)dti=polyi(μ,σ,t)MX(t),\frac{d^{i}M-X(t)}{dt^{i}}=poly_{i}(\mu,\sigma,t)\cdot M_{X}(t), where polyi(μ,σ,t)poly_{i}(\mu,\sigma,t) is a polynomial of μ,σ2,t\mu,\sigma^{2},t, whose degree when viewed as a polynomial over tt is at most ii, whose degree when viewed as a polynomial over μ,σ2\mu,\sigma^{2} is at most ii, and whose coefficients are bounded in magnitude by i!.i!. We prove this by induction, with the base case i=1i=1 being trivial. Assuming the statement holds for some value i1,i\geq 1, we have

Thus polyi+1(μ,σ,t)=polyi(μ,σ,t)(2σ2t+μ)+dpolyi(μ,σ,t)dt.poly_{i+1}(\mu,\sigma,t)=poly_{i}(\mu,\sigma,t)(2\sigma^{2}t+\mu)+\frac{dpoly_{i}(\mu,\sigma,t)}{dt}. Clearly degt(polyi+1(μ,σ,t))=i+1,deg_{t}(poly_{i+1}(\mu,\sigma,t))=i+1, and the degree in terms of μ\mu and σ2\sigma^{2} increases by at most one. To get from polyipoly_{i} to polyi+1,poly_{i+1}, each coefficient is multiplied by 2 in the first product, and multiplied by at most ii in the second term because of the differentiation. Thus if cc is the maximum magnitude of a coefficient of polyi,poly_{i}, the maximum magnitude of a coefficient of polyi+1poly_{i+1} will be at most (2+i)c,(2+i)c, from which the claim follows. ∎