Settling the Polynomial Learnability of Mixtures of Gaussians

Ankur Moitra, Gregory Valiant

Introduction

Given access to random samples generated from a mixture of (multivariate) Gaussians, the algorithmic problem of learning the parameters of the underlying distribution is of fundamental importance in physics, biology, geology, and the social sciences: any area in which such finite mixture models arise. Starting with Dasgupta, a series of works in theoretical computer science has sought to find (or disprove the existence of) an efficient algorithm for this task. In this paper, we settle the polynomial-time learnability of mixtures of Gaussians, giving an algorithm that uses a polynomial amount of data and estimates the components at an inverse polynomial rate under provably minimal assumptions on the mixture (specifically, that the mixing weights and the statistical distance between the components are bounded away from zero). As a corollary, our efficient learning algorithm can be employed to yield the first provably efficient algorithm for near-optimal clustering and density estimation, without any restrictions on the Gaussian mixture. Finally, we note that the runtime and data requirements of our algorithm are exponential in the number of Gaussian components; however, as we show in Section 6, this exponential dependence is necessary. In the remainder of this section, we briefly summarize previous work on this problem, formally state our main result, and then discuss the differences between learning mixtures of $2$ Gaussians and mixtures of many Gaussians, which motivates the high-level outline of our algorithm presented in Section 2. We first define a Gaussian Mixture Model (GMM).

The most popular solution for recovering reasonable estimates of the components of GMMs in practice is the EM algorithm given by Dempster, Laird and Rubin. This algorithm is a local-search heuristic that converges to a set of parameters that locally maximizes the probability of generating the observed samples. However, the EM algorithm is a heuristic only, and makes no guarantees about converging to an estimate that is close to the true parameters. Worse still, the EM algorithm (even for univariate mixtures of just two Gaussians) has been observed to converge very slowly (see Redner and Walker for a thorough treatment).

In order to even hope for an algorithm (not necessarily even polynomial time), we would need a uniqueness property: two distinct mixtures of Gaussians must have different probability density functions. Teicher demonstrated that a mixture of Gaussians can be uniquely identified (up to a relabeling of components) by considering the probability density function at points sufficiently far from the centers (in the tails). However, such a result sheds little light on the rate of convergence of an estimator: if distinguishing Gaussian mixtures really required analyzing the tails of the distribution, then we would require an enormous number of data samples!

Dasgupta introduced theoretical computer science to the algorithmic problem of provably recovering good estimates for the parameters in polynomial time (and a polynomial number of samples). His technique is based on projecting data down to a randomly chosen low-dimensional subspace and then finding an accurate clustering. Given enough accurately clustered points, the empirical means and co-variances of these points will be a good estimate for the actual parameters. Arora and Kannan extended these ideas to work in the much more general setting in which the co-variances of each Gaussian component could be arbitrary, and not necessarily almost spherical as in . Yet both of these techniques are based on the concentration of distances (under random projections), and consequently required that the centers of the components be separated by at least $\sqrt{n}$ times the largest variance. Vempala and Wang and Achlioptas and McSherry introduced the use of spectral techniques, and were able to overcome this barrier (of relying on distance concentration) by choosing a subspace on which to project based on large principal components. Brubaker and Vempala later gave the first affine-invariant algorithm for learning mixtures of Gaussians, and these ideas proved to be central in subsequent work.

Yet all of these approaches for provably learning good estimates require, at the very least, that the statistical overlap (i.e. one minus the statistical distance) between each pair of components be smaller than some constant (in some cases, it is even required that the statistical overlap be exponentially small). Recently, Feldman et al. gave a polynomial time algorithm for the related problem of density estimation (without any separation condition) for the special case of axis-aligned GMMs (GMMs where each component has principal axes aligned with the coordinate axes). Also without any separation requirements, Belkin and Sinha showed that one can efficiently learn GMMs in the special case that all components are identical spherical Gaussians. Most similar to the present work is the recent work of Kalai et al., which gave a learning algorithm for the case of mixtures of two arbitrary Gaussians with provably minimal assumptions.

2 Main Results

In this section we state our main results. To motivate these results, we first state three obvious lower bounds for recovering the parameters of a GMM $F=\sum_{i=1}^{k}w_{i}F_{i},$ which motivate our definition of $\epsilon$ -statistically learnable. We provide a formal definition of statistical distance in Section 2.1.

Permuting the order of the components does not change the resulting density, thus at best the hope is to recover the parameter set, $\{(w_{1},\mu_{1},\Sigma_{1}),\ldots,(w_{k},\mu_{k},\Sigma_{k})\}.$

We require at least $\Omega(1/\min_{i}(w_{i}))$ samples to estimate the parameters, since we require this number of samples to ensure that we have seen, with reasonable probability, at least one sample from each component.

If $F_{i}=F_{j}$ , then it is impossible to accurately estimate $w_{i},$ and in general we require at least $\Omega(1/D(F_{i},F_{j}))$ samples to estimate $w_{i}$ , where $D(F_{i},F_{j})$ denotes the statistical distance between the two distributions.

We call a GMM $F=\sum_{i}w_{i}F_{i}$ $\epsilon$ -statistically learnable if $\min_{i}w_{i}\geq\epsilon$ and $\min_{i\neq j}D(F_{i},F_{j})\geq\epsilon$ .
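To make the definition concrete, the following is a minimal sketch of a checker for the univariate case. It assumes the statistical distance is half the $L_{1}$ distance between densities and approximates the integral with a midpoint rule; the function names, the integration range of ten standard deviations, and the step count are illustrative choices, not part of the paper.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def statistical_distance(mu1, var1, mu2, var2, steps=20000):
    """Approximate D = 1/2 * integral |f - g| with a midpoint rule (assumed setup)."""
    spread = 10.0 * math.sqrt(max(var1, var2))
    lo = min(mu1, mu2) - spread
    hi = max(mu1, mu2) + spread
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(normal_pdf(x, mu1, var1) - normal_pdf(x, mu2, var2))
    return 0.5 * total * h

def is_statistically_learnable(weights, components, eps):
    """Check min_i w_i >= eps and min_{i != j} D(F_i, F_j) >= eps.

    `components` is a list of (mean, variance) pairs.
    """
    if min(weights) < eps:
        return False
    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if statistical_distance(*components[i], *components[j]) < eps:
                return False
    return True

# Two well-separated unit-variance components with equal weights.
print(is_statistically_learnable([0.5, 0.5], [(0.0, 1.0), (2.0, 1.0)], 0.1))
```

The fixed grid is only adequate when the two variances are of comparable size; a component with a much smaller variance than the other would need a finer grid near its mean.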

We now consider what it means to “accurately recover the mixture components”.

Given two $n$ -dimensional GMMs of $k$ Gaussians, $F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\Sigma_{i})$ and $\hat{F}=\sum_{i}\hat{w}_{i}\mathcal{N}(\hat{\mu}_{i},\hat{\Sigma}_{i})$ , we call $\hat{F}$ an $\epsilon$ -close estimate for $F$ if there is a permutation $\pi:[k]\rightarrow[k]$ such that for all $i\in[k]$

$D(\mathcal{N}(\mu_{i},\Sigma_{i}),\mathcal{N}(\hat{\mu}_{\pi(i)},\hat{\Sigma}_{\pi(i)}))\leq\epsilon$
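As an illustration of the role of the permutation $\pi$, here is a small sketch that searches for a valid matching by brute force. To keep the distance exact it is restricted to univariate components sharing a common variance, for which the statistical distance (taken as half the $L_{1}$ distance) has the closed form $\mathrm{erf}(|\mu_{1}-\mu_{2}|/(2\sqrt{2}\sigma))$. This restriction and the function names are assumptions made for the example only.

```python
import math
from itertools import permutations

def equal_variance_distance(mu1, mu2, var):
    """Exact statistical distance between N(mu1, var) and N(mu2, var)."""
    return math.erf(abs(mu1 - mu2) / (2.0 * math.sqrt(2.0 * var)))

def find_matching(true_means, est_means, var, eps):
    """Return a permutation pi with D(F_i, F_hat_pi(i)) <= eps for all i, else None."""
    k = len(true_means)
    for pi in permutations(range(k)):
        if all(equal_variance_distance(true_means[i], est_means[pi[i]], var) <= eps
               for i in range(k)):
            return pi
    return None

# The estimate lists the components in the opposite order from the truth.
print(find_matching([0.0, 5.0], [5.05, 0.02], var=1.0, eps=0.05))
```

Trying all $k!$ permutations is acceptable here only because $k$ is treated as a fixed constant throughout.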

Note that the above definition of an $\epsilon$ -close estimate is affine invariant. This is more natural than defining a good estimate in terms of additive errors, since in general, even estimating the mean of an arbitrary Gaussian to some fixed additive precision is impossible without restrictions on the covariance, as scaling the data will scale the error linearly. We can now state our main theorem:

Given any $n$ dimensional mixture of $k$ Gaussians $F$ that is $\epsilon$ -statistically learnable, we can output an $\epsilon$ -close estimate $\hat{F}$ and the running time and data requirements of our algorithm (for any fixed $k$ ) are polynomial in $n$ , and $\frac{1}{\epsilon}$ .

The guarantee in the main theorem implies that the estimated parameters are off by an additive $O(\epsilon\sigma^{2}_{max})$ , where $\sigma^{2}_{max}$ is the largest (projected) variance of any Gaussian in any direction.

Throughout this paper, we favor clarity of proof and exposition above optimization of runtime. Since our main goal is to show that these problems can be solved in polynomial time, we make very little effort to optimize the exponent. Our algorithms are polynomial in the dimension, inverse of the success probability, and inverse of the target accuracy for any fixed number of Gaussians, $k$ . The dependency on $k$ , however, is severe: the degree of our polynomials is linear in $k$ . In Section 6, we give a natural construction of two GMMs $F,F^{\prime}$ of $k$ components that are each $1/k$ -statistically learnable, satisfy $D(F,F^{\prime})\leq e^{-k}$ , but $F$ is not even a $1/4$ -close estimate of $F^{\prime}$ . Thus we require an exponential in $k$ number of samples to even distinguish these two mixtures, demonstrating that the exponential dependency on $k$ in our learning algorithms is inevitable.

There exist two GMMs $F,F^{\prime}$ of $k$ components each that satisfy the following properties:

$F,F^{\prime}$ are $1/k$ -statistically learnable.

$F$ is not a $1/4$ -close estimate of $F^{\prime}$ .

3 Applications

We can leverage our main theorem to show that we can efficiently perform density estimation for arbitrary GMMs. For density estimation—as opposed to parameter recovery—we only care to recover a distribution that is similar to the GMM, without worrying about matching each component; in particular, if the true weight of one of the components is negligible, we can simply disregard that component with negligible effect on the statistical distance; if two components are nearly identical in statistical distance, we can simply regard them as being merged into one component. For these reasons, we can perform density estimation efficiently without the restriction to $\epsilon$ -statistically learnable distributions, that was required for Theorem 1.

For any $n\geq 1,$ $\epsilon,\delta>0,$ and any $n$ -dimensional GMM $F=\sum_{i=1}^{k}w_{i}F_{i},$ given access to independent samples from $F$ , there is an algorithm that outputs $\hat{F}=\sum_{i=1}^{k}\hat{w_{i}}\hat{F_{i}}$ such that with probability at least $1-\delta$ over the randomization in the algorithm and in selecting the samples, $D(F,\hat{F})\leq\epsilon.$ Additionally, the runtime and number of samples are bounded by $poly(n,1/\epsilon,1/\delta).$

The proof of this corollary follows immediately from combining our main theorem with the arguments in Appendix D. In fact, an approach almost identical to how we construct the General Univariate Algorithm from the Basic Univariate Algorithm (again in Appendix D) will work: we can run our main algorithm with many different parameter ranges so that most estimates are correct, and determine a consensus among the estimates so that we can recover a good statistical approximation to $F$ without any assumptions on the mixture, not even $\epsilon$ -statistical learnability.

For any $n\geq 1,$ $\epsilon,\delta>0,$ and any $n$ -dimensional $\epsilon$ -statistically learnable GMM $F=\sum_{i=1}^{k}w_{i}F_{i},$ given access to independent samples from $F$ , there is an algorithm that outputs a classifier $C_{F}$ such that with probability at least $1-\delta$ over the randomization in the algorithm and in selecting the samples, the error of $C_{F}$ is at most $\epsilon$ larger than the error of any classifier $C^{\prime}$ . Additionally, the runtime and number of samples used are bounded by $poly(n,1/\epsilon,1/\delta).$

The proof of this corollary follows immediately from our main theorem (although here we do need the assumption of $\epsilon$ -statistical learnability).

4 Comparing Learning Two Gaussians to Learning Many

This work leverages several key ideas initially presented in which were used to show that learning mixtures of two arbitrary Gaussians can be done efficiently. Nevertheless, additional high-level insights, and technical details were required to extend the previous work to give an efficient learning algorithm for an arbitrary mixture of many Gaussians. In this section we briefly summarize the algorithm for learning mixtures of two Gaussians given in , and then describe the hurdles to extending it to the general case. This discussion will provide insights and motivate the high-level structure of the algorithm presented in this paper, as well as clarify which components of the proof are new, and which are straight-forward adaptations of ideas from .

Throughout this discussion, it will be helpful to refer to parameters $\epsilon_{1},\epsilon_{2},\epsilon_{3},$ which are polynomially related to each other, and satisfy $\epsilon_{1}<<\epsilon_{2}<<\epsilon_{3}$ .

There are three key components to the proof that mixtures of two Gaussians can be learned efficiently: the 1-d Learnability Lemma, the Random Projection Lemma, and the Parameter Recovery Lemma. The 1-d Learnability Lemma states that given a mixture of two univariate Gaussians whose two components have nonnegligible statistical distance, one can efficiently recover accurate estimates of the parameters of the mixture. It is worth noting that in the univariate case, saying that the statistical distance between two Gaussians is non-negligible is roughly equivalent (polynomially related) to saying that the two sets of parameters are non-negligibly different, i.e., the parameter distance, $|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|,$ is non-negligible. The Random Projection Lemma states that, given an $n$ -dimensional mixture of two Gaussians which is in isotropic position and whose components have nonnegligible statistical distance, with high probability over the choice of a random unit vector $r,$ the projection of the mixture onto $r$ will yield a univariate mixture of two Gaussians that have nonnegligible statistical distance (say $\epsilon_{3}$ ). The final component, the Parameter Recovery Lemma, states that, given a Gaussian $G$ in $n$ dimensions, if one has extremely accurate estimates (say to within some $\epsilon_{1}$ ) of the mean and variance of $G$ projected onto $n^{2}$ sufficiently distinct directions (directions that differ by at least $\epsilon_{2}>>\epsilon_{1}$ ) one can accurately recover the parameters of $G$ .

Given these three pieces, the high-level algorithm for learning mixtures of two Gaussians is straight-forward:

Pick $n^{2}$ vectors $r_{1},\ldots,r_{n^{2}},$ that are “close” to $r$ , say $|r_{i}-r|\approx\epsilon_{2}.$

For each $i=1,\ldots,n^{2},$ learn extremely accurate (to accuracy $\epsilon_{1}<<\epsilon_{2}$ ) univariate parameters $w_{i},\mu_{i},\sigma_{i},\mu^{\prime}_{i},\sigma^{\prime}_{i}$ for the projection of the mixture onto the vector $r_{i}$ .

Since $|r_{i}-r_{j}|\approx\epsilon_{2},$ it is not hard to show that with high probability, $|\mu_{i}-\mu_{j}|<<\epsilon_{3},|\sigma_{i}-\sigma_{j}|<<\epsilon_{3}$ and by the Random Projection Lemma, $||(\mu_{i},\sigma_{i})-(\mu^{\prime}_{i},\sigma^{\prime}_{i})||>>\epsilon_{3}$ thus it will be easy to accurately match up which parameters come from which component in the different projections, and we can apply the Parameter Recovery Lemma to each of the two components.
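The projection and matching steps above can be sketched as follows. This is a simplified illustration: it uses the exact projected parameters $(\mu\cdot r,\ r^{T}\Sigma r)$ in place of learned univariate estimates, and a fixed base direction in place of a random one so that the run is reproducible. The example mixture, the value of `eps2`, and all function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def project(mu, sigma, r):
    """Mean and variance of N(mu, sigma) projected onto direction r."""
    return dot(mu, r), dot(r, [dot(row, r) for row in sigma])

def nearby_direction(r, eps2, rng):
    """A unit vector at distance roughly eps2 from the unit vector r."""
    g = [rng.gauss(0.0, 1.0) for _ in r]
    c = dot(g, r)
    perp = unit([a - c * b for a, b in zip(g, r)])  # drop the part along r
    return unit([a + eps2 * b for a, b in zip(r, perp)])

def match_to_reference(reference, estimates):
    """Order two (mean, variance) pairs so that they line up with `reference`."""
    def cost(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    keep = cost(reference[0], estimates[0]) + cost(reference[1], estimates[1])
    swap = cost(reference[0], estimates[1]) + cost(reference[1], estimates[0])
    return list(estimates) if keep <= swap else [estimates[1], estimates[0]]

# Hypothetical mixture of two Gaussians in n = 3 dimensions.
mu_a, sigma_a = [1.0, 0.0, -1.0], [[1.0, 0.2, 0.0], [0.2, 0.5, 0.0], [0.0, 0.0, 2.0]]
mu_b, sigma_b = [-1.0, 0.5, 1.0], [[0.3, 0.0, 0.0], [0.0, 1.5, 0.1], [0.0, 0.1, 0.4]]

rng = random.Random(0)
eps2 = 0.01
r = unit([1.0, 1.0, 1.0])  # fixed here; the paper picks r at random
reference = [project(mu_a, sigma_a, r), project(mu_b, sigma_b, r)]

results = []
for _ in range(len(r) ** 2):  # n^2 nearby directions
    r_i = nearby_direction(r, eps2, rng)
    truth = [project(mu_a, sigma_a, r_i), project(mu_b, sigma_b, r_i)]
    shuffled = truth[:]
    rng.shuffle(shuffled)  # a univariate learner returns components unlabeled
    results.append((truth, match_to_reference(reference, shuffled)))
```

Because the projected parameters move by only about `eps2` between nearby directions while the two components differ by much more, the nearest-parameter rule recovers the labels in every projection of this example.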

Some of the above ideas are immediately applicable to the problem of learning mixtures of many Gaussians: we can clearly use the Parameter Recovery Lemma without modification. Additionally, we prove a generalization of the 1-d Learnability Lemma for mixtures of arbitrary numbers of Gaussians, provided each component has non-negligible statistical distance (which, while technically tedious, employs the key idea from of “deconvolving” by a suitably chosen Gaussian—see Appendix B). Given this extension, if we were given a mixture of $k$ Gaussians in isotropic position, and were guaranteed that the projection onto some vector $r$ resulted in a univariate mixture of Gaussians for which all pairs of components either had reasonably different means or reasonably different variances, then we could piece together the parts more-or-less as in the 2-Gaussians case.

Unfortunately, however, the Random Projection Lemma ceases to hold in the general setting. There exist mixtures of just three Gaussians with significant pairwise statistical distances, that are in isotropic position, but have the property that with extremely high probability over choices of random unit vector $r$ , the projection of the mixture onto $r$ yields a distribution that is extremely close to a univariate mixture of two Gaussians. This observation would foil the approach employed in the case of just two Gaussians! Another difficulty is that if we take $n^{2}$ slightly different projections of our mixture of $k$ Gaussians, then it is possible that in some of the projections we see what looks like a mixture of $k^{\prime}<k$ univariate Gaussians, and in some other projections we see what looks like a mixture of $k^{\prime\prime}$ univariate Gaussians. How do we match up estimates from projections onto different directions when the number of Gaussians in the estimate can differ? Or what if each projection results in an estimate that is a mixture of $k^{\prime}<k$ Gaussians? Then how can we recover an $n$ -dimensional estimate that is a mixture of $k$ Gaussians?

Outline and Definitions

We now discuss the high-level structure of our learning algorithm, building from the intuition given in the preceding section. At the highest level, our learning algorithm has the following form: Given access to samples from a mixture of $k$ Gaussians,

Learn the parameters of some mixture of $k^{\prime}\leq k$ Gaussians, where each learned Gaussian component roughly corresponds to one or more of the Gaussians in the original mixture.

If $k^{\prime}<k$ , for each of the $k^{\prime}$ components recovered in the previous step, examine it closely and figure out whether it corresponds to a single Gaussian component of the original mixture, or whether it is a mixture of several of the original components (in which case we will then need to learn the parameters of these sub-components).

To accomplish the first step, we will require accurate parameters of the projection of each of the $k^{\prime}$ “clusters” of components, onto $n^{2}$ univariate projections. To do this, we employ a robust univariate algorithm which, given access to samples from a univariate GMM, essentially searches for some target resolution window $(w_{1},w_{2})$ with $w_{1}<<w_{2},$ such that the GMM is very close ( $w_{1}$ -close) to a GMM of $k^{\prime}\leq k$ statistically very distinct components (each pair of components is at least $w_{2}$ far apart).

Given our robust univariate algorithm, we embark on a partition pursuit where we try to find $n^{2}$ vectors that yield consistent and compatible univariate parameter sets–in particular, we require that each of the $n^{2}$ univariate projections yields parameters that satisfy three conditions: 1) they have the same number of components, 2) the recovered parameters are much more precise than the distances between the $n^{2}$ projections, and 3) that the distance between the components is large enough so as to ensure an accurate matching of the components in the different projections.

Finally, given the ability to accurately recover $k^{\prime}\leq k$ high-dimensional Gaussians, where each learned Gaussian component roughly corresponds to one or more of the Gaussians in the original mixture, we want to be able to examine each recovered component, and determine whether it corresponds to a single component of the original mixture, or a set of original components. We first claim that, with high probability, the only way a subset of original components will end up being grouped into a single recovered component is if the covariance of the mixture of that subset of components has a very small minimum eigenvalue. The existence of such an eigenvalue implies that we can accurately cluster the given sample points (whose covariance, recall, is roughly 1). Thus, given a recovered set of $k^{\prime}<k$ parameters, we examine one of these $k^{\prime}$ components; if the minimum eigenvalue is sufficiently small, we project the set of data samples onto the corresponding eigenvector, and then partition the sample points into two clusters (provided the eigenvalue is sufficiently small, since the overall mixture is in roughly isotropic position, we cluster so as to almost exactly respect some partition of the original components). Given the set of sample points corresponding (roughly) to the recovered component that had small eigenvalue, we simply re-scale the data so that this subsample is now in isotropic position, and recursively run the entire algorithm on this rescaled subsample of the data, which, as we argue, consists of a mixture of $k^{\prime\prime}<k$ components of the original mixture, with high probability. We call this clustering step hierarchical clustering.

We give a detailed summary in Appendix A of the main elements of each of these three main components: the robust univariate algorithm, partition pursuit, and hierarchical clustering.

Given two probability distributions $f(x),g(x)$ on $\Re^{n}$ we can define the statistical distance between these distributions as

$D(f(x),g(x))=\frac{1}{2}\int_{\Re^{n}}|f(x)-g(x)|\,dx$

We will also be interested in a related notion of the parameter distance between two univariate Gaussians:

Given two univariate Gaussians, $F_{1}=\mathcal{N}(\mu_{1},\sigma_{1}^{2}),F_{2}=\mathcal{N}(\mu_{2},\sigma_{2}^{2})$ we define the parameter distance as

$D_{p}(F_{1},F_{2})=|\mu_{1}-\mu_{2}|+|\sigma_{1}^{2}-\sigma_{2}^{2}|$

In general, the parameter distance and the statistical distance between two univariate Gaussians can be unrelated. There are pairs of univariate Gaussians with arbitrarily small parameter distance, and yet statistical distance close to $1$ , and there are pairs of univariate Gaussians with arbitrarily small statistical distance, and yet arbitrarily large parameter distances. But these scenarios can only occur if the variances can be arbitrarily small or arbitrarily large. In many instances in this paper, we will have reasonable upper and lower bounds on the variances and this will allow us to move back and forth from statistical distance and parameter distance, but we will highlight when we are doing so and note why we are able to assume an upper and lower bound on variance in that particular situation.
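Both scenarios can be checked numerically. The sketch below takes the parameter distance to be $|\mu_{1}-\mu_{2}|+|\sigma_{1}^{2}-\sigma_{2}^{2}|$ and the statistical distance to be half the $L_{1}$ distance, and uses two closed forms that hold only in special cases: equal variances, and equal means. The specific numbers are illustrative choices.

```python
import math

def parameter_distance(mu1, var1, mu2, var2):
    return abs(mu1 - mu2) + abs(var1 - var2)

def distance_equal_variance(mu1, mu2, var):
    """Exact statistical distance between N(mu1, var) and N(mu2, var)."""
    return math.erf(abs(mu1 - mu2) / (2.0 * math.sqrt(2.0 * var)))

def distance_equal_mean(var1, var2):
    """Exact statistical distance between N(0, var1) and N(0, var2)."""
    lo, hi = min(var1, var2), max(var1, var2)
    if lo == hi:
        return 0.0
    # The two densities cross at +/- c; the narrow one dominates on (-c, c).
    c = math.sqrt(lo * hi * math.log(hi / lo) / (hi - lo))
    return math.erf(c / math.sqrt(2.0 * lo)) - math.erf(c / math.sqrt(2.0 * hi))

# Tiny variances: parameters nearly equal, distributions nearly disjoint.
small_param = parameter_distance(0.0, 1e-6, 0.01, 1e-6)
small_stat = distance_equal_variance(0.0, 0.01, 1e-6)

# Huge variances: parameters far apart, distributions nearly identical.
large_param = parameter_distance(0.0, 1e8, 0.0, 1e8 + 1e4)
large_stat = distance_equal_mean(1e8, 1e8 + 1e4)

print(small_param, small_stat)
print(large_param, large_stat)
```

In the first pair the means differ by ten standard deviations even though the parameter distance is only $0.01$; in the second pair the variances differ by $10^{4}$, which is a relative change of only $10^{-4}$.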

As we noted, there are $\epsilon$ -statistically learnable mixtures of three Gaussians that are in isotropic position, but for which with overwhelming probability over a random direction $r$ , in the projection onto $r$ , there will be some pair of univariate Gaussians that are arbitrarily close in parameter distance. In these cases, our univariate algorithm may not return an estimate with three components, but will return a mixture which has only two components but is still a good estimate for the parameters of the projected mixture. To formalize this notion, we introduce what we call an $\epsilon$ -correct sub-division.

Given a GMM of $k$ Gaussians, $F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2})$ and a GMM of $k^{\prime}\leq k$ Gaussians $\hat{F}=\sum_{i}\hat{w}_{i}\mathcal{N}(\hat{\mu}_{i},\hat{\sigma}_{i}^{2})$ , we call $\hat{F}$ an $\epsilon$ -correct subdivision of $F$ if there is a function $\pi:[k]\rightarrow[k^{\prime}]$ that is onto and

$\forall_{j\in[k^{\prime}]}|\sum_{i|\pi(i)=j}w_{i}-\hat{w}_{j}|\leq\epsilon$

$\forall_{i\in[k]}D_{p}(F_{i},\hat{F}_{\pi(i)})\leq\epsilon$

When considering high-dimensional mixtures, we replace the above parameter distance by $\|\mu_{i}-\hat{\mu}_{\pi(i)}\|+\|\Sigma_{i}-\hat{\Sigma}_{\pi(i)}\|_{F}\leq\epsilon,$ where $\|\cdot\|_{F}$ denotes the Frobenius norm.

Notationally, we will write $(\hat{F},\pi)\in\mathcal{D}_{\epsilon}(F)$ as shorthand for the statement that $\hat{F}$ is an $\epsilon$ -correct subdivision for $F$ and $\pi$ is the (onto) function from $[k]$ to $[k^{\prime}]$ that groups $F$ into $\hat{F}$ as above.

Note that this definition, unlike the definition for $\epsilon$ -close estimate, uses parameter distance as opposed to statistical distance. This is critical: our univariate algorithm will only be able to return an estimate that is an $\epsilon$ -correct subdivision when the notion of “close” is parameter distance, and not statistical distance. In general there could be a component of the univariate mixture of arbitrarily small variance, and we will only be able to match such a component up to an additive guarantee, which implies nothing about the statistical distance between our estimate and the actual component.
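For concreteness, the following sketch checks whether a univariate estimate is an $\epsilon$ -correct subdivision by enumerating every onto map $\pi:[k]\rightarrow[k^{\prime}]$ and testing the weight and parameter-distance conditions. Components are represented as (mean, variance) pairs and indices are zero-based; the function names and the example mixture are illustrative assumptions.

```python
from itertools import product

def parameter_distance(a, b):
    """|mu - mu'| + |var - var'| for two (mean, variance) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def find_subdivision(weights, comps, est_weights, est_comps, eps):
    """Return an onto map pi (as a tuple) witnessing the definition, else None."""
    k, k_prime = len(comps), len(est_comps)
    for pi in product(range(k_prime), repeat=k):
        if set(pi) != set(range(k_prime)):
            continue  # pi must be onto
        weights_ok = all(
            abs(sum(w for w, j in zip(weights, pi) if j == target)
                - est_weights[target]) <= eps
            for target in range(k_prime))
        params_ok = all(
            parameter_distance(comps[i], est_comps[pi[i]]) <= eps
            for i in range(k))
        if weights_ok and params_ok:
            return pi
    return None

# Three true components, two of which are nearly identical and get merged.
pi = find_subdivision(
    weights=[0.3, 0.3, 0.4],
    comps=[(0.0, 1.0), (0.001, 1.0), (3.0, 0.5)],
    est_weights=[0.6, 0.4],
    est_comps=[(0.0005, 1.0), (3.0, 0.5)],
    eps=0.01,
)
print(pi)
```

The search visits $k^{\prime k}$ candidate maps, which is a constant for fixed $k$ but would be impractical for large mixtures.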

A Robust Univariate Algorithm

In this section, we give a learning algorithm for univariate mixtures of Gaussians that will be the building block for our learning algorithm in $n$ -dimensions. Unlike in the case of , our univariate algorithm will not necessarily be given a mixture of Gaussians for which all pairwise parameter distances are reasonably large. Instead, it could happen that we are given a mixture of (say) three Gaussians so that some pair has arbitrarily small parameter distance.

In the case in which we are guaranteed that all pairwise parameter distances are reasonably large, we can iterate the technical ideas in to give an inductive proof that a simple brute force search algorithm will return good estimates. We call this algorithm the Basic Univariate Algorithm. From this, we build a General Univariate Algorithm that will return a good estimate regardless of the parameter distances, although in order to do so we will need to relax the notion of a good estimate to something weaker: the algorithm returns an $\epsilon$ -correct subdivision.

In this section, we show that we can efficiently learn the parameters of univariate mixtures of Gaussians, provided that the components of the mixture have nonnegligible pairwise parameter distances. We refer to this algorithm as the Basic Univariate Algorithm. Such an algorithm will follow easily from Theorem 4, the polynomially robust identifiability of univariate mixtures. Throughout this section we will consider two univariate mixtures of Gaussians:

$F(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ and $F^{\prime}(x)=\sum_{i=1}^{k}w^{\prime}_{i}\mathcal{N}(\mu^{\prime}_{i},\sigma_{i}^{\prime 2},x)$

We will call the pair $F,F^{\prime}$ $\epsilon$ -standard if $\sigma_{i}^{2},\sigma_{i}^{\prime 2}\leq 1$ and if $\epsilon$ satisfies:

$|\mu_{i}|,|\mu^{\prime}_{i}|\leq\frac{1}{\epsilon}$

$|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon$ and $|\mu^{\prime}_{i}-\mu^{\prime}_{j}|+|\sigma_{i}^{\prime 2}-\sigma_{j}^{\prime 2}|\geq\epsilon$ for all $i\neq j$

$\epsilon\leq\min_{\pi}\sum_{i}\left(|w_{i}-w_{\pi(i)}^{\prime}|+|\mu_{i}-\mu_{\pi(i)}^{\prime}|+|\sigma_{i}^{2}-\sigma_{\pi(i)}^{\prime 2}|\right)$ , where the minimization is taken over all mappings $\pi:\{1,\ldots,n\}\rightarrow\{1,\ldots,k\}.$

There is a constant $c>0$ such that, for any $\epsilon$ -standard $F,F^{\prime}$ and any $\epsilon<c$ ,

While the dependency on $k$ in Theorem 4 is very bad, as we show in Section 6, this exponential dependency on $k$ is necessary. Specifically, we give a construction of two $1/k$ -standard distributions whose statistical distance is $O(e^{-k}).$

Given the polynomially robust identifiability guaranteed by the above theorem, and simple concentration bounds on the $i^{th}$ sample moment, it is easy to see that a brute-force search over a set of candidate parameter sets will yield an efficient algorithm that recovers the parameters for univariate mixtures of Gaussians whose components have pairwise parameter distance at least $\epsilon$ : roughly, the Basic Univariate Algorithm will take a polynomial number of samples, compute the first $4k-2$ sample moments, and compare those with the first $4k-2$ moments of each of the candidate parameter sets. The algorithm then returns the parameter set whose moments most closely match the sample moments. Theorem 4 guarantees that if the first $4k-2$ sample moments closely match those of the chosen parameter set, then the parameter set must be nearly accurate. To conclude the proof, we argue that a polynomial-sized set of candidate parameters suffices to guarantee that at least one set of parameters will yield moments sufficiently close to the sample moments, which, by simple concentration bounds, will be close to the true moments of the GMM. We state the corollary below, and defer the details of the algorithm and the proof of its correctness to Appendix C.
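The moment-matching search described above can be sketched for $k=2$, where $4k-2=6$ moments are compared. This is an illustration only and not the algorithm of Appendix C: the candidate grids, the maximum-gap comparison, and the function names are assumptions. The raw moments of a Gaussian are computed with the standard recurrence $M_{m}=\mu M_{m-1}+(m-1)\sigma^{2}M_{m-2}$.

```python
import itertools
import math
import random

def gaussian_raw_moments(mu, var, count):
    """E[X^m] for m = 1..count with X ~ N(mu, var)."""
    moments = [1.0, mu]
    for m in range(2, count + 1):
        moments.append(mu * moments[m - 1] + (m - 1) * var * moments[m - 2])
    return moments[1:]

def mixture_moments(weights, comps, count):
    per_comp = [gaussian_raw_moments(mu, var, count) for mu, var in comps]
    return [sum(w * mom[m] for w, mom in zip(weights, per_comp))
            for m in range(count)]

def sample_moments(samples, count):
    return [sum(x ** m for x in samples) / len(samples)
            for m in range(1, count + 1)]

def basic_univariate_search(target_moments, weight_grid, mean_grid, var_grid):
    """Brute-force search over two-component candidates (hypothetical grids)."""
    count = len(target_moments)
    best, best_gap = None, float("inf")
    for w, mu1, mu2, v1, v2 in itertools.product(
            weight_grid, mean_grid, mean_grid, var_grid, var_grid):
        if mu1 > mu2:
            continue  # skip relabelings of the same mixture
        cand = mixture_moments([w, 1.0 - w], [(mu1, v1), (mu2, v2)], count)
        gap = max(abs(a - b) for a, b in zip(cand, target_moments))
        if gap < best_gap:
            best, best_gap = (w, (mu1, v1), (mu2, v2)), gap
    return best, best_gap

# Draw samples from 0.3 * N(-1, 0.5) + 0.7 * N(1, 1) and search the grid.
rng = random.Random(1)
samples = [rng.gauss(-1.0, math.sqrt(0.5)) if rng.random() < 0.3
           else rng.gauss(1.0, 1.0) for _ in range(20000)]
estimate, gap = basic_univariate_search(
    sample_moments(samples, 6),
    weight_grid=[0.3, 0.5, 0.7],
    mean_grid=[-2.0, -1.0, 0.0, 1.0, 2.0],
    var_grid=[0.5, 1.0, 1.5],
)
print(estimate, gap)
```

Higher moments concentrate more slowly than lower ones, so in practice the number of samples and the grid resolution would both have to be chosen as functions of $\epsilon$ and $k$, as the analysis in the paper does.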

Suppose we are given access to independent samples from a GMM $\sum_{i=1}^{k}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ with mean 0 and variance in the interval $[1/2,2],$ where $w_{i}\geq\epsilon$ , and $|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon$ . There exists a Basic Univariate Algorithm that, for any fixed $k$ , has runtime and sample complexity at most $poly(\frac{1}{\epsilon},\frac{1}{\delta})$ and with probability at least $1-\delta$ will output mixture parameters $\hat{w}_{i},\hat{\mu}_{i},\hat{\sigma_{i}}^{2}$ , so that there is a permutation $\pi:[k]\rightarrow[k]$ and

2 The General Univariate Algorithm

In this section we seek to extend the Basic Univariate Algorithm of Corollary 5 to the general setting of a univariate mixture of $k$ Gaussians without any requirements that the components have significant pair-wise parameter distances. In particular, given some target accuracy $\epsilon,$ and access to independent samples from a mixture $F$ of $k$ univariate Gaussians, we want to efficiently compute a mixture $F^{\prime}$ of $k^{\prime}\leq k$ Gaussians that is an $\epsilon$ -correct subdivision of $F.$

There is a General Univariate Algorithm which, given $\epsilon,\delta>0$ , and access to a GMM of $k$ Gaussians, $F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2})$ that is in near isotropic position and satisfies $w_{i}\geq\epsilon$ , will run in time polynomial in $1/\epsilon$ and $1/\delta,$ and will return with probability at least $1-\delta$ a GMM of $k^{\prime}\leq k$ Gaussians $\hat{F}$ that is an $\epsilon$ -correct subdivision of $F$ .

The critical insight in building up such a General Univariate Algorithm is that if two components are actually close enough (in statistical distance), then the Basic Univariate Algorithm could never tell these two components apart from a single (appropriately) chosen Gaussian, because this algorithm only requires a polynomial number of samples. So given a target precision $\epsilon_{1}$ for the Basic Univariate Algorithm, there is some window that describes whether or not the algorithm will work correctly. If all pairwise parameter distances are either sufficiently large or sufficiently small, then the Basic Univariate Algorithm will function as if it were given sample access to a mixture that actually does meet the requirements of the algorithm. So as long as no parameter distance falls inside a particular window (which characterizes whether or not the algorithm will behave properly), the algorithm will return a correct computation.

However, when there is some parameter distance that falls inside the Basic Univariate Algorithm’s window, we are not guaranteed that the Basic Univariate Algorithm will fail safely. The idea, then, is to use many disjoint windows (each of which corresponds to running the Basic Univariate Algorithm with some target precision). If we choose enough such windows, each pairwise parameter distance can only corrupt a single run of the Basic Univariate Algorithm so a majority of the computations will be correct. We will never know which computations resulted from cases when no parameter distance fell inside the corresponding window, but we will be able to define a notion of consensus among these different runs of the Basic Univariate Algorithm so that a majority of the runs will agree, and any run which agrees with some computation that was correct will also be close to correct.
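The consensus idea can be illustrated with a deliberately simplified sketch. It is not the procedure of Appendix D: here two runs are said to agree when they return the same number of components and, after sorting, matching components are within a tolerance `tol` in parameter distance. Both this agreement rule and the function names are assumptions made for the example.

```python
def agree(est_a, est_b, tol):
    """Two estimates (lists of (mean, variance) pairs) agree if they match up to tol."""
    if len(est_a) != len(est_b):
        return False
    return all(abs(ma - mb) + abs(va - vb) <= tol
               for (ma, va), (mb, vb) in zip(sorted(est_a), sorted(est_b)))

def consensus(estimates, tol):
    """Return the first estimate that agrees with a strict majority of the runs."""
    for candidate in estimates:
        votes = sum(agree(candidate, other, tol) for other in estimates)
        if 2 * votes > len(estimates):
            return candidate
    return None

# Five hypothetical runs at different target precisions; two were corrupted.
runs = [
    [(0.0, 1.0), (3.0, 0.5)],
    [(0.001, 1.0), (3.0, 0.5)],
    [(1.5, 2.0)],
    [(0.0, 1.001), (3.001, 0.5)],
    [(0.0, 1.0), (2.0, 0.7), (3.0, 0.5)],
]
print(consensus(runs, tol=0.01))
```

The point of the majority requirement is that any run agreeing with a correct run is itself close to correct, so the returned estimate is reliable even though the correct runs cannot be identified individually.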

We defer the algorithm and proof of correctness to Appendix D.

Partition Pursuit

In this section we demonstrate how to use the General Univariate Algorithm to obtain good additive approximations in $n$ -dimensions. Roughly, we will project the $n$ -dimensional mixture $F$ onto many close-by directions, and run the General Univariate Algorithm on each projection. This is also how the algorithm in is able to recover good additive estimates in $n$ -dimensions. However we will have to cope with the additional complication that our univariate algorithm (the General Univariate Algorithm) does not necessarily return an estimate that is a mixture of $k$ Gaussians.

We explain in detail how the algorithm in is able to obtain additive approximation guarantees in $n$ -dimensions, building on a univariate algorithm for learning mixtures of two Gaussians. Let $\epsilon_{3}>>\epsilon_{2}>>\epsilon_{1}$ . Given any $\epsilon$ -statistically learnable mixture of two Gaussians in $n$ -dimensions, with high probability, for a direction $r$ chosen uniformly at random the parameter distance between the two Gaussians in $P_{r}[F]$ will be at least $\epsilon_{3}$ . Then given such a direction $r$ , we can choose $n^{2}$ different directions $r_{x,y}$ , each of which is $\epsilon_{2}$ -close to $r$ (i.e. $\|r-r_{x,y}\|\approx\epsilon_{2}$ ). The mean and variance of a component in $P_{u}[F]$ change continuously as we vary the direction $u$ from $r$ to $r_{x,y}$ , and this implies that for $\epsilon_{2}<<\epsilon_{3}$ , we will be able to consistently pair up the estimates recovered from each projection, so that for each Gaussian we have $n^{2}$ different estimates, in different directions, of the projected mean and variance. Each of these estimates is accurate to within $\epsilon_{1}$ (i.e. this is the target precision that is given to the univariate algorithm). For any Gaussian, an estimate of the projected mean and the projected variance for a direction $r$ gives a linear constraint on the mean vector $\mu$ and the co-variance matrix $\Sigma$ . As a result, if $\epsilon_{1}<<\epsilon_{2}$ then the precision is much finer than the condition number of this system of linear constraints on $\mu,\Sigma$ and this yields an accurate estimate in $n$ -dimensions.

Let $\epsilon_{2},\epsilon_{1}>0$ . Suppose $|m^{0}-\mu\cdot r|$ , $|{m}^{ij}-\mu\cdot r^{ij}|$ , $|v^{0}-r^{T}\Sigma r|$ , $|v^{ij}-(r^{ij})^{T}\Sigma r^{ij}|$ are all at most $\epsilon_{1}$ . Then Solve outputs $\hat{\mu}\in{\bf R}^{n}$ and $\hat{\Sigma}\in{\bf R}^{n\times n}$ such that $\|\hat{\mu}-\mu\|<\frac{\epsilon_{1}\sqrt{n}}{\epsilon_{2}}$ , and $\|\hat{\Sigma}-\Sigma\|_{F}\leq\frac{6n\epsilon_{1}}{\epsilon_{2}^{2}}$ . Furthermore, $\hat{\Sigma}\succeq 0$ and $\hat{\Sigma}$ is symmetric.

The algorithm to which this lemma refers is given in Appendix F.2.
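To see where the factors $\frac{\epsilon_{1}}{\epsilon_{2}}$ and $\frac{\epsilon_{1}}{\epsilon_{2}^{2}}$ come from, the following sketch recovers $\mu$ and $\Sigma$ from projected means and variances by first and second differences. It is not the Solve procedure of Appendix F.2: the axis-aligned perturbations $r+\epsilon_{2}e_{i}$ and $r+\epsilon_{2}(e_{i}+e_{j})$ , the use of unnormalized directions, and the names `shifted` and `solve_sketch` are simplifying assumptions, and no step is taken here to enforce $\hat{\Sigma}\succeq 0$ .

```python
def project_mean(mu, u):
    """Projected mean: mu . u (linear in u)."""
    return sum(m * x for m, x in zip(mu, u))


def project_var(sigma, u):
    """Projected variance: u^T Sigma u (quadratic in u)."""
    n = len(u)
    return sum(u[a] * sigma[a][b] * u[b] for a in range(n) for b in range(n))


def shifted(r, eps2, *coords):
    """Return r with eps2 added to each listed coordinate (unnormalized)."""
    u = list(r)
    for c in coords:
        u[c] += eps2
    return u


def solve_sketch(mean_of, var_of, r, eps2):
    """Recover (mu, Sigma) from oracles for projected means and variances.
    Each measurement error eps1 is amplified by about 1/eps2 in mu and
    about 1/eps2^2 in Sigma, because of the divisions below."""
    n = len(r)
    m0, v0 = mean_of(r), var_of(r)
    mu_hat = [(mean_of(shifted(r, eps2, i)) - m0) / eps2 for i in range(n)]
    v1 = [var_of(shifted(r, eps2, i)) for i in range(n)]
    sigma_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            vij = var_of(shifted(r, eps2, i, j))
            # Second difference of a quadratic form isolates Sigma[i][j].
            s = (vij - v1[i] - v1[j] + v0) / (2 * eps2 ** 2)
            sigma_hat[i][j] = sigma_hat[j][i] = s
    return mu_hat, sigma_hat


# Hypothetical ground truth, queried here without noise.
mu = [1.0, -2.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
mu_hat, sigma_hat = solve_sketch(
    lambda u: project_mean(mu, u),
    lambda u: project_var(sigma, u),
    r=[0.6, 0.8],
    eps2=0.1,
)
```

With exact projections the recovery is exact up to floating-point error; replacing the oracles by estimates that are accurate to $\epsilon_{1}$ perturbs each entry of $\hat{\mu}$ by at most $2\epsilon_{1}/\epsilon_{2}$ and each entry of $\hat{\Sigma}$ by at most $2\epsilon_{1}/\epsilon_{2}^{2}$ in this simplified setup.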

However, the General Univariate Algorithm does not always return a mixture of $k$ Gaussians, and can in fact return a mixture $\hat{F}^{u}$ of $k^{\prime}<k$ Gaussians provided that this mixture is still an $\epsilon_{1}$ -correct subdivision of $P_{u}[F]$ (for some direction $u$ ). But then what happens if we consider two close-by directions, $u$ and $v$ and the number of Gaussians in the estimate $\hat{F}^{u}$ is different from the number of Gaussians in the estimate $\hat{F}^{v}$ ?

The key insight is that if we choose some direction $r$ , and close-by directions $r_{x,y}$ , if any estimate returned for $r_{x,y}$ has more components than the estimate returned for the direction $r$ , then we have made progress because we have identified another Gaussian in the original mixture $F$ . So here, rather than trying to use this estimate for $r_{x,y}$ , we just start the algorithm over using $r_{x,y}$ as the original direction, and considering $n^{2}$ close-by directions.

The additional complication is that we must make sure every time we see a different number of components, that we’ve made progress. We can do so by maintaining a Window from $\epsilon_{1}$ to $\epsilon_{3}$ , and we say that a Window is satisfied if the estimate $\hat{F}^{r}$ returned for some direction $r$ has all pairs of Gaussians either at parameter distance at least $\epsilon_{3}$ , or at most the precision $\epsilon_{1}$ of the General Univariate Algorithm. Then if we consider close-by directions $r_{x,y}$ (that are $\epsilon_{2}$ -close to $r$ , for $\epsilon_{1}<<\epsilon_{2}<<\epsilon_{3}$ ), we can ensure that whenever we see a different number of components in the estimate corresponding to some direction $r_{x,y}$ , there are more components. When we see more components, we may need to shift the Window $W$ to a Window $W^{\prime}$ so that in this new direction $r_{x,y}$ , the Window $W^{\prime}$ is satisfied. We take $r_{x,y}$ as the new base direction. But we have made progress because we have identified a new component in the mixture.

We state our main theorem in this section, and defer the algorithm and proof to Appendix F.

Given an $\epsilon$ -statistically learnable GMM $F$ in isotropic position, the Partition Pursuit Algorithm will recover an $\epsilon$ -correct sub-division $\hat{F}$ and if $F$ has more than one component, $\hat{F}$ also has more than one component.

Clustering and Recursion

In this section, we give an efficient algorithm for learning an estimate $\hat{F}$ that is $\epsilon$ -close to the actual mixture $F$ . Partition Pursuit assumes that the mixture $F$ is in isotropic position, and even though $F$ is not necessarily in isotropic position, we will be able to get around this hurdle by first taking enough samples to compute a transformation that places the mixture $F$ in nearly isotropic position and then applying this transformation to each sample from the oracle. The main technical challenge in this section is actually what to do when the mixture $\hat{F}$ returned by Partition Pursuit is a good additive approximation to $F$ (i.e. it is an $\epsilon_{1}$ -correct subdivision with $\epsilon_{1}<<\epsilon$ ), but is not $\epsilon$ -close to the mixture $F$ . This can only happen if there is a component in $F$ that has a very small variance in some direction. Consider for example, two Gaussians in one dimension $\mathcal{N}(0,\gamma)$ and $\mathcal{N}(0,\gamma+\epsilon_{1})$ . Even if $\epsilon_{1}$ is very small, if $\gamma$ is much smaller, then the statistical distance between these two Gaussians can be arbitrarily close to $1$ .
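The example of $\mathcal{N}(0,\gamma)$ and $\mathcal{N}(0,\gamma+\epsilon_{1})$ can be checked numerically. The sketch below assumes that statistical distance means total variation distance and uses the closed form obtained from the point where the two zero-mean densities cross; the function name `tv_zero_mean` and the specific values of $\gamma$ and $\epsilon_{1}$ are illustrative choices.

```python
import math


def tv_zero_mean(var_a, var_b):
    """Total variation distance between N(0, var_a) and N(0, var_b)."""
    a, b = sorted((var_a, var_b))
    if a == b:
        return 0.0
    # The two densities cross at |x| = x_star; the narrower Gaussian is
    # larger inside [-x_star, x_star] and smaller outside.
    x_star = math.sqrt(a * b * math.log(b / a) / (b - a))
    return math.erf(x_star / math.sqrt(2 * a)) - math.erf(x_star / math.sqrt(2 * b))


eps1 = 1e-3
far = tv_zero_mean(1e-12, 1e-12 + eps1)  # gamma much smaller than eps1
near = tv_zero_mean(1.0, 1.0 + eps1)     # gamma much larger than eps1
```

The same additive error $\epsilon_{1}$ in the variance gives a distance close to $1$ in the first case and close to $0$ in the second, which is why a good additive approximation does not by itself imply a good statistical estimate.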

So the high-level idea is that if the estimate $\hat{F}$ returned by Partition Pursuit is not $\epsilon$ -close to $F$ (but $\hat{F}$ is an $\epsilon_{1}$ -correct subdivision of $F$ for $\epsilon_{1}<<\epsilon$ ), then it must be the case that some component $F_{i}$ of $F$ has a co-variance matrix $\Sigma_{i}$ so that for some direction $v$ , $v^{T}\Sigma_{i}v$ is very small. Then we can use this direction $v$ to still make progress: If we project the mixture $F$ onto $v$ , we will be able to cluster accurately. There will be some partition of the Gaussians in $F$ into two disjoint, non-empty sets of components $S,T$ and some clustering scheme that can accurately cluster points sampled from $F$ into points that originated from a component in $S$ and points that originated from a component in $T$ . So we can hope to accurately cluster enough points sampled from $F$ into sets of points that originated from $S$ and sets of points that originated from $T$ , and then we can run our learning algorithm (with a smaller maximum of at most $k-1$ components) on each set of points. By induction, this learning algorithm will return close estimates, and if we take a convex combination of these estimates we obtain a new estimate $\hat{F}^{\prime}$ that is $\epsilon$ -close to $F$ . The main technical challenge is in showing that if there is some component of $F$ with a small enough variance in some direction $v$ , then we can accurately cluster points sampled from $F$ . Given this, our main result follows almost immediately from an inductive argument.

2 How to Cluster

Here we formalize the notion of a clustering scheme. Additionally, we state the key lemmas that will be useful in showing that if $\hat{F}$ is not an $\epsilon$ -close estimate to $F$ , then we can use $\hat{F}$ to construct a good clustering scheme that makes progress on our learning problem.

We will call $A,B\subset\Re^{n}$ a clustering scheme if $A\cap B=\emptyset$ .

For $A\subset\Re^{n}$ , we will write $P[F_{i},A]$ to denote $Pr_{x\sim F_{i}}[x\in A]$ - i.e. the probability that a randomly chosen sample from $F_{i}$ is in the set $A$ .

If we have a direction $v$ and some component $\hat{F}_{i}$ which has small variance in direction $v$ , we want to use this direction to cluster accurately. The intuition is clearest in the case of mixtures of two Gaussians: Suppose one of the components, say $\hat{F}_{1}$ , has small variance in direction $v$ . If the entire mixture is in isotropic position, then the variance of the mixture when projected onto direction $v$ is $1$ . This can only happen if either the difference in projected means $|v^{T}(\hat{\mu}_{1}-\hat{\mu}_{2})|$ is large or the variance of $\hat{F}_{2}$ in direction $v$ is large. In the first case, we can choose an interval around each projected (estimate) mean $v^{T}\hat{\mu}_{1}$ and $v^{T}\hat{\mu}_{2}$ so that with high probability, any point sampled from $F_{1}$ is contained in the interval around $v^{T}\hat{\mu}_{1}$ and similarly for $F_{2}$ . If, instead, the variance of $F_{2}$ when projected onto $v$ is large, then again a small interval around the point $v^{T}\hat{\mu}_{1}$ will contain most samples from $F_{1}$ , but because the maximum density of $v^{T}F_{2}$ is never large and the interval around $v^{T}\hat{\mu}_{1}$ is not too large either, most samples from $F_{2}$ will not be contained in the interval. This idea is the basis of our clustering lemmas. There will be additional complications when the mixture contains more than two Gaussians, but the intuition is much the same.
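Both cases of this thought exercise can be checked in one dimension with a small sketch. The helper names `interval_mass` and `interval_around`, the choice of an interval of four standard deviations, and the specific means and variances are assumptions made for illustration; the actual clustering schemes and their parameters are those of the lemmas below.

```python
import math


def interval_mass(mean, var, lo, hi):
    """Probability that a N(mean, var) sample lands in [lo, hi]."""
    s = math.sqrt(2 * var)
    return 0.5 * (math.erf((hi - mean) / s) - math.erf((lo - mean) / s))


def interval_around(mean, var, width_in_std=4.0):
    """Interval centered at mean with half-width width_in_std * sqrt(var)."""
    half = width_in_std * math.sqrt(var)
    return mean - half, mean + half


# Case 1: both projected variances are small and the projected means are
# well separated.
lo, hi = interval_around(0.0, 1e-4)
case1_own = interval_mass(0.0, 1e-4, lo, hi)
case1_other = interval_mass(1.0, 1e-4, lo, hi)

# Case 2: the projected means coincide, but the second component has a
# large projected variance, so its density is never large.
lo, hi = interval_around(0.0, 1e-6)
case2_own = interval_mass(0.0, 1e-6, lo, hi)
case2_other = interval_mass(0.0, 2.0, lo, hi)
```

In both cases the interval captures almost all of the mass of the small-variance component and almost none of the mass of the other component, so membership in the interval serves as the set $A$ and its complement as the set $B$ .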

Let $(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F)$ . Suppose also that $\hat{F}$ is a mixture of $k^{\prime}$ components.

Suppose that for some direction $v$ , for all $i$ : $v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{2}$ , for $\epsilon_{1}\leq\frac{\sqrt{\epsilon_{2}}}{2\epsilon_{3}}$ . If there is some bi-partition $S\subset[k^{\prime}]$ s.t. $\forall_{i\in S,j\in[k^{\prime}]-S}|v^{T}\hat{\mu}_{i}-v^{T}\hat{\mu}_{j}|\geq\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}$ then there is a clustering scheme $(A,B)$ (based only on $\hat{F}$ ) so that for all $i\in S,j\in\pi^{-1}(i)$ , $P[F_{i},A]\geq 1-\epsilon_{3}$ and for all $i\notin S,j\in\pi^{-1}(i)$ , $P[F_{i},B]\geq 1-\epsilon_{3}$ .

This lemma corresponds to the first case in the above thought exercise when there is some bi-partition of the components so that all pairs of projected means across the bi-partition are reasonably separated.

Suppose that for some direction $v$ and some $i\in[k^{\prime}]$ , $v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{m}$ , for $\epsilon_{m}>>\epsilon_{1}$ . If there is some bi-partition $S\subset[k^{\prime}]$ s.t.

(and $\epsilon_{t}<<\epsilon_{3}^{3}$ ) then there is a clustering scheme $(A,B)$ such that for all $i\in S,j\in\pi^{-1}(i)$ , $P[F_{i},A]\geq 1-\epsilon_{3}$ and for all $i\notin S,j\in\pi^{-1}(i)$ , $P[F_{i},B]\geq 1-\epsilon_{3}$ .

This lemma corresponds to the second case, when there is some bi-partition of the components so that one side of the bi-partition has projected variances that are much larger than the other.

The proofs of these lemmas, along with additional technical details, are given in Appendix G.2.

3 Making Progress when there is a Small Variance

We state a lemma from which formalizes the intuition that if there is no component in $\hat{F}$ with small variance in any direction, then $\hat{F}$ is a good statistical estimate of $F$ :

Suppose $\|\hat{\mu}_{i}-\mu_{i}\|\leq\epsilon_{1}$ , $\|\hat{\Sigma}_{i}-\Sigma_{i}\|_{F}\leq\epsilon_{1}$ , and $|\hat{w}_{i}-w_{i}|\leq\epsilon_{1}$ , if either $\|\Sigma^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}}$ or $\|\hat{\Sigma}^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}}$ then

We will use this lemma as a building block to prove:

The Hierarchical Clustering Algorithm either returns an $\epsilon$ -close statistical estimate $\hat{F}$ for $F$ , or returns a clustering scheme $(A,B)$ such that there is some bipartition $S\subset[k]$ such that for all $i\in S,j\in\pi^{-1}(i)$ , $P[F_{i},A]\geq 1-\epsilon_{3}$ and for all $i\notin S,j\in\pi^{-1}(i)$ , $P[F_{i},B]\geq 1-\epsilon_{3}$ . Moreover, $S$ and $[k]-S$ are both non-empty.

We defer the algorithm and the proof of correctness to Appendix G.

4 Recursion

[Isotropic Projection Lemma] Given a mixture of $k$ $n$ -Dimensional Gaussians $F=\sum_{i}w_{i}F_{i}$ that is in isotropic position and is $\epsilon$ -statistically learnable, with probability $\geq 1-\delta$ over a randomly chosen direction $u$ , there is some pair of Gaussians $F_{i},F_{j}$ s.t. $D_{p}(P_{u}[F_{i}],P_{u}[F_{j}])\geq\frac{\epsilon^{5}\delta^{2}}{50n^{2}}$ .

We defer a proof of this lemma to Appendix H.

Let $H_{a}(\epsilon,\delta,k),H_{i}(\epsilon,\delta,k)$ be the inverse of the number of samples needed by the High Dimensional Anisotropic Algorithm and the High Dimensional Isotropic Algorithm respectively when given target precision $\epsilon$ (and access to an $\epsilon$ -statistically learnable distribution), an upper bound $k$ on the number of Gaussians, and an error parameter $\delta$ .

Given $k,\epsilon$ , and a mixture of at most $k$ Gaussians $F$ that is $\epsilon$ -statistically learnable, the High Dimensional Anisotropic Algorithm returns an estimate $\hat{F}$ that is $\epsilon$ -close to the actual mixture $F$ .

We prove this theorem by induction. Let $\epsilon_{k-1}=H_{a}(\frac{\epsilon}{2},\delta,k-1)$ .

We assume by induction that both the High Dimensional Isotropic Algorithm and the High Dimensional Anisotropic Algorithm return an $\epsilon$ -close estimate for all values of $k^{\prime}\leq k-1$ . We then consider both algorithms for the case of $k$ :

Consider the High Dimensional Isotropic Algorithm which is given $k,\epsilon$ , and a mixture of at most $k$ Gaussians $F$ that is $\epsilon$ -statistically learnable and is in isotropic position: We first run the Hierarchical Clustering Algorithm with parameters $\epsilon,\delta,\epsilon_{3},k$ where $\epsilon_{3}=\frac{1}{2}\epsilon\epsilon_{k-1}\delta$ . If this algorithm returns an estimate $\hat{F}$ , we can return this estimate and it is guaranteed to be $\epsilon$ -close to the actual mixture.

Note that if the number of components in $F$ is $1$ , then the Hierarchical Clustering Algorithm will necessarily return an estimate $\hat{F}$ , because there is no partitioning scheme that partitions $F$ into two subsets of components that are both non-empty. This establishes the base case in the inductive argument.

Otherwise the output of the Hierarchical Clustering Algorithm is a clustering scheme $(A,B)$ with the property that there is some partition $S,T$ of the Gaussians in $F$ ( $S,T\neq\emptyset$ ) and for all $i\in S$ , $Pr_{x\sim F_{i}}[x\in A]\geq 1-\epsilon_{3}$ , and for all $j\in T$ , $Pr_{x\sim F_{j}}[x\in B]\geq 1-\epsilon_{3}$ . Let $F_{S},F_{T}$ be the (re-weighted) mixtures that result from placing every component in $S$ from $F$ into $F_{S}$ , and every component in $T$ from $F$ into $F_{T}$ . Note that $F_{S},F_{T}$ are still $\epsilon$ -statistically learnable, but may not be in isotropic position any longer.

All samples $x_{1},x_{2},...,x_{m}$ are either in $A$ or $B$ .

The number of samples in $A$ and the number of samples in $B$ will each be at least $\frac{1}{\epsilon_{k-1}}$ .

All samples are clustered correctly - i.e. if $x_{i}\in A$ , then $x_{i}$ was generated by some Gaussian $F_{j}$ with $j\in S$ and if $x_{i}\in B$ , then $x_{i}$ was generated by some Gaussian $F_{j}$ with $j\in T$ .

Let $X_{S},X_{T}$ be the samples from $x_{1},x_{2},...,x_{m}$ that are in $A,B$ respectively. We can then run the High Dimensional Anisotropic Algorithm with parameters $\frac{\epsilon}{2},\delta,k-1$ on each set $X_{S}$ and $X_{T}$ . Let the algorithm return the mixtures $\hat{F}_{A},\hat{F}_{B}$ respectively. We return a convex combination of these mixtures, $\hat{F}=\frac{|X_{S}|}{m}\hat{F}_{A}+\frac{|X_{T}|}{m}\hat{F}_{B}$ . The estimates $\hat{F}_{A},\hat{F}_{B}$ are $\epsilon$ -close estimates to $F_{S},F_{T}$ respectively. We can write $F=w_{A}F_{A}+w_{B}F_{B}$ , and with high probability $\frac{|X_{S}|}{m}$ , $\frac{|X_{T}|}{m}$ will be close to $w_{A},w_{B}$ respectively. Then this implies that $\hat{F}$ is $\epsilon$ -close to $F$ . Thus by induction, the output of the High Dimensional Isotropic Algorithm is an estimate $\hat{F}$ that is $\epsilon$ -close to $F$ .
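The final combination step, $\hat{F}=\frac{|X_{S}|}{m}\hat{F}_{A}+\frac{|X_{T}|}{m}\hat{F}_{B}$ , amounts to rescaling the mixing weights of the two recursive estimates by the empirical cluster fractions. A minimal sketch follows; the representation of a mixture as a list of (weight, mean, covariance) triples, the name `combine`, and the sample counts are assumptions made for illustration.

```python
def combine(est_a, est_b, n_s, n_t):
    """Convex combination of two mixture estimates, weighted by the number
    of samples that fell in each cluster. Each estimate is a list of
    (weight, mean, covariance) triples whose weights sum to one."""
    m = n_s + n_t
    scaled_a = [(w * n_s / m, mu, cov) for w, mu, cov in est_a]
    scaled_b = [(w * n_t / m, mu, cov) for w, mu, cov in est_b]
    return scaled_a + scaled_b


identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical outputs of the two recursive calls.
est_a = [(0.5, [0.0, 0.0], identity), (0.5, [3.0, 0.0], identity)]
est_b = [(1.0, [0.0, 9.0], identity)]
combined = combine(est_a, est_b, n_s=600, n_t=400)
```

Because each recursive estimate has weights summing to one, the combined weights again sum to one, so the result is a valid mixture.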

We need to also verify by induction that the output of the High Dimensional Anisotropic Algorithm is also an $\epsilon$ -close estimate to $F$ . So suppose that the input to the High Dimensional Anisotropic Algorithm is a mixture of at most $k$ Gaussians, that is $\epsilon$ -statistically learnable and is not necessarily in isotropic position.

Exponential Dependence on $k$ is Inevitable

In this section, we present a lower bound, showing that the inverse exponential dependency on the number of Gaussian components in each mixture is necessary, even for mixtures in just one dimension. We show this by giving a simple construction of two 1-dimensional distributions, $D_{1},D_{2}$ that are $1/(2m)$ -standard. Specifically, they are mixtures of at most $m$ Gaussians, such that the weights of all components of each mixture are at least $1/(2m)$ , and the parameter distance between the pair of distributions is at least $1/(2m),$ but $||D_{1}-D_{2}||_{1}\leq e^{-m/30},$ for sufficiently large $m$ . The construction hinges on the inverse exponential (in $k\approx\sqrt{m}$ ) statistical distance between $\mathcal{N}(0,2),$ and the mixtures of infinitely many Gaussians of unit variance whose components are centered at multiples of $1/k$ , with the weight assigned to the component centered at $i/k$ being given by $N(0,1,i/k).$ Verifying that this is true is a straight-forward exercise in Fourier analysis. The final construction truncates the mixture of infinitely many Gaussians by removing all the components with centers a distance greater than $k$ from 0. This truncation clearly has negligibly small effect on the distribution. Finally, we alter the pair of distributions by adding to both distributions, Gaussian components of equal weight with centers at $-k^{2}/k,(-k^{2}+1)/k,\ldots,k,$ which ensures that in the final pair of distributions, all components have significant weight.

There exists a pair $D_{1},D_{2}$ of $1/(4k^{2}+2)$ -standard distributions that are each mixtures of $k^{2}+1$ Gaussians such that

We can define $F_{k}^{N}(x)=c_{k}\sum_{i=-N}^{N}\frac{1}{\sqrt{\pi}}e^{-(i/k)^{2}}\mathcal{N}(i/k,1/2,x)$ . In Figure 1a we plot $F_{k}^{N}$ for $k=2,N=2$ together with each of its components, and in Figure 1b we plot each component of $F_{k}^{N}$ for $k=4,N=8$ .
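The rapid decay of the distance between $F_{k}^{N}$ and a single Gaussian can be observed numerically. The sketch below makes two assumptions that are our reading of the construction rather than statements taken from it: $c_{k}$ is taken to be the constant that makes the mixing weights sum to one, and the comparison target is $\mathcal{N}(0,1)$ , since weights proportional to the $\mathcal{N}(0,1/2)$ density combined with components of variance $1/2$ approximate the convolution of two $\mathcal{N}(0,1/2)$ densities. The function names, the integration grid, and the choices of $N$ are illustrative.

```python
import math


def normal_pdf(mean, var, x):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def grid_mixture(k, n_terms):
    """Components of F_k^N as (weight, mean, variance) triples, with c_k
    chosen so that the weights sum to one (an assumption)."""
    raw = [
        (math.exp(-((i / k) ** 2)) / math.sqrt(math.pi), i / k)
        for i in range(-n_terms, n_terms + 1)
    ]
    c_k = 1.0 / sum(w for w, _ in raw)
    return [(c_k * w, mean, 0.5) for w, mean in raw]


def l1_to_standard_normal(k, n_terms, half_width=10.0, step=0.01):
    """Riemann-sum approximation of the L1 distance between F_k^N and N(0, 1)."""
    mixture = grid_mixture(k, n_terms)
    steps = int(round(2 * half_width / step))
    total = 0.0
    for t in range(steps + 1):
        x = -half_width + t * step
        f = sum(w * normal_pdf(mu, var, x) for w, mu, var in mixture)
        total += abs(f - normal_pdf(0.0, 1.0, x)) * step
    return total


gap_k1 = l1_to_standard_normal(k=1, n_terms=8)
gap_k2 = l1_to_standard_normal(k=2, n_terms=16)
```

Doubling $k$ from $1$ to $2$ already shrinks the computed distance by several orders of magnitude, in line with the inverse exponential behaviour that the Fourier-analytic argument establishes.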

Conclusions

We give an estimator that converges to the true distribution at an inverse polynomial rate, and this result has implications for polynomial-time clustering and density estimation. A natural question is: “What is the optimal rate of convergence?” This question is wide open, and all we can say for certain is that the rate of convergence is at worst polynomial in the dimension and the inverse of the desired accuracy, and exponential in the number of components. We made no attempt here to optimize the constants in the exponent of the rate of convergence and even if we had, the theoretical runtime would still be extremely impractical. This, however, raises the practically relevant question of whether aspects of our algorithm can be combined with existing heuristics that seem to perform well in most applications. For example, the brute-force-search component of our univariate algorithm is clearly expensive; perhaps employing existing heuristics (such as the EM algorithm) for the univariate problems, in conjunction with aspects of our dimension-reduction machinery might yield improved efficiency on real-world instances.

Additionally, we note that much of the machinery we developed, from the “deconvolution” argument for the polynomially robust identifiability to the partition pursuit and hierarchical clustering for the dimension reduction arguments, seems to be relatively general and robust. We suspect that such tools could be applied to yield corresponding results for other families of distributions.

Acknowledgements

We are grateful to Paul Valiant, for suggesting the lower-bound construction of Section 6 and many helpful discussions throughout; and are indebted to Adam Tauman Kalai for introducing us to this beautiful line of research, and for all his guidance, encouragement and deep insights about mixtures of Gaussians.

References

Appendix A In-Depth Outline

To start, suppose that we are given access to independent samples from a mixture of $k$ Gaussians, and given a unit vector $r$ with the following promise: for each pair $G_{1},G_{2}$ of components, in the projection of the mixture onto $r$ , either the projections of $G_{1}$ and $G_{2}$ have reasonably different parameters ( $>\epsilon_{2}$ ), or their projections are so close that our algorithm could never tell them apart from a single Gaussian (parameter distance at most $\epsilon_{0}<<\epsilon_{1},$ where $\epsilon_{1}<<\epsilon_{2}$ is the desired accuracy of the 1-d parameter learning algorithm). In this case, our 1-d parameter recovery algorithm will perform correctly, and return some $\epsilon_{1}$ -accurate parameters for a mixture of $k^{\prime}\leq k$ components.

Thus in general, for a given desired accuracy $\epsilon_{1}$ , there is some critical window, namely $[\epsilon_{0},\epsilon_{2}],$ associated with the 1-d learning algorithm that determines if it will function correctly. In a given projection, as long as no pair of components have parameter distances that fall within this window, then any pair of Gaussians is either reasonably different in parameters, or so close in parameters that the algorithm will never be able to tell the difference.

In this way, if an algorithm designer is told the parameters of a given mixture of $k$ Gaussians, he could construct an algorithm that would have been able to find some of these parameters. The algorithm would project onto a random direction $r$ , and based on the pairwise parameter distances between the univariate Gaussians, there will be some window (i.e. some choice of a target precision with which to run the algorithm), bounded below by some polynomial in the desired output accuracy, so that the algorithm would function correctly. The problem is that while there is always some window that would work for any mixture of $k$ univariate Gaussians, we don’t know what window to use, and in general if we run the algorithm on a bad window, we aren’t guaranteed that the algorithm will fail in a safe way.

To get around this, we run the 1-d Learning Algorithm many times on different windows that do not intersect. Because there are only $k$ univariate Gaussians, and thus at most $k^{2}$ different distances between component parameters in any given projection, at most $k^{2}$ of these windows can be corrupted. If we choose sufficiently many windows (but still a polynomial number), a majority of the windows will yield correct parameters. Even though we can never determine which windows were good and which were bad, we can return the parameters generated by some window in consensus with a majority, and in this way, regardless of whether the window was good or bad, it is in consensus with a good window and must also be close to the correct parameters.
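The counting argument can be made concrete with a short sketch. The geometric windows $[2^{-(t+1)},2^{-t})$ , the choice of $2k^{2}+1$ windows, the name `clean_windows`, and the sample distances are hypothetical; the windows used by the algorithm are defined in terms of the target precisions and need not have this form.

```python
def clean_windows(windows, distances):
    """Return the windows (lo, hi) that contain none of the pairwise
    parameter distances. Since the windows are disjoint, each distance
    can corrupt at most one of them."""
    return [
        (lo, hi)
        for lo, hi in windows
        if not any(lo <= d < hi for d in distances)
    ]


k = 3
# Hypothetical disjoint windows; with 2*k^2 + 1 of them, even k^2
# corrupted windows leave a strict majority uncorrupted.
windows = [(2.0 ** -(t + 1), 2.0 ** -t) for t in range(2 * k * k + 1)]
# Hypothetical pairwise parameter distances in one projection.
distances = [0.3, 0.07, 0.0004]
good = clean_windows(windows, distances)
```

In this example three of the nineteen windows are corrupted, so any run that agrees with a majority of runs must agree with at least one run on an uncorrupted window.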

It is important to stress that even after the above consensus is conducted on a given projection, we still cannot be guaranteed that our univariate algorithm returns a mixture of $k$ Gaussians. Instead, it will return some mixture of $k^{\prime}<k$ Gaussians, where an element in the mixture might correspond to (say) a pair of Gaussians in the original mixture that were too close to differentiate in the given projection.

A.2 Partition Pursuit

This brings us to the second obstacle outlined in Section 1.4: in order to recover the $n$ -dimensional parameters, we will need estimates of the parameters of the Gaussians when projected on many different directions. But, as mentioned above, the univariate algorithm will not necessarily return a mixture of $k$ Gaussians, and even if we choose a direction $r^{\prime}$ that is sufficiently close to $r$ (but still $|r-r^{\prime}|>>\epsilon_{1},$ the accuracy of the 1-d algorithm), it may be the case that the univariate algorithm for direction $r$ returned $k^{\prime}$ Gaussians, and the univariate algorithm for direction $r^{\prime}$ returned $k^{\prime\prime}\neq k^{\prime}$ Gaussians. How do we pair up these estimates in a consistent manner?

The key insight is that we are actually making progress if we see more Gaussians when projecting onto a different direction. If we choose a new direction, and we see a mixture of Gaussians with more components, we should backtrack and start over as if this was the direction we originally chose. We may have to slide the window corresponding to our 1-d algorithm and learn at a finer precision than what we chose previously, but this finer precision will still be polynomially bounded. Effectively, we are clustering the Gaussian components into clusters with the property that the components of each cluster are indistinguishable in each of the one-dimensional projections that we have considered. In order to make this idea work properly, we also need to ensure that we maintain a minimum parameter distance between all Gaussian clusters that we have seen (i.e. this distance is much larger than our 1-d accuracy $\epsilon_{1}$ ), so that when we choose a new direction $r^{\prime}$ sufficiently close to $r,$ a Gaussian component cannot switch clusters. Thus at each stage, each cluster of Gaussians either continues to be a cluster, or it gets partitioned into several clusters of Gaussians.

A.3 Hierarchical Clustering

The final obstacle outlined in Section 1.4 can be addressed easily via an accurate clustering of the input samples together with a $k$ -Gaussian analog of the Random Projection Lemma. Intuitively, the only way that a set of high-dimensional Gaussians with significant statistical distance, when projected onto a random vector, will appear nearly identical is if the re-weighted mixture of the Gaussians in this set is very far from isotropic position. This motivates the hope that if we have recovered some mixture of $k^{\prime}<k$ components, then it must be the case that whichever of these components contains multiple original Gaussians has covariance matrix very far from isotropic. Thus such a component must have at least one very small eigenvalue. Given the eigenvector corresponding to such an eigenvalue, we should be able to very accurately cluster the sample points into some partition of the original Gaussians. This motivates the following slightly more specific version of the high-level algorithm approach: Given that we have recovered parameters for a mixture of $k^{\prime}<k$ components:

Learn the parameters of some mixture of $k^{\prime}\leq k$ Gaussians, where each learned Gaussian component corresponds to one or more of the Gaussians in the original mixture.

If $k^{\prime}<k$ , for each of the $k^{\prime}$ components recovered in the previous step:

If the $i^{th}$ component has covariance matrix “not too far” from isotropic, then conclude that it corresponds to a single Gaussian in the original mixture.

Otherwise, there is a very small eigenvalue of the covariance matrix, so project the sample points onto the corresponding eigenvector, and accurately cluster the sample points that come from this component.

Given the sample points corresponding to one of the components, rescale these data points so this component (which was very far from isotropic) is now in isotropic position, and repeat the entire algorithm on this sub-mixture.

The final observation that guarantees that our algorithm will make progress with every iteration, and thus terminate after a polynomial number of steps is the following analog of the Random Projection Lemma for the $k$ -Gaussians setting. Given a mixture of $k$ Gaussians in isotropic position, with high probability over random unit vectors $r$ , there will be some pair of projected Gaussians whose parameters are reasonably different. Thus, in every projection, we will, with high probability, see what appears to be a mixture of at least two components.

Appendix B Polynomially Robust Identifiability

We now sketch the rough outline of the proof of Theorem 4. While there are considerable technical details, the main proof ideas are identical to those used in to prove the analogous theorem in the case that $n=k=2.$

Our proof will be via induction on $\max(n,k)$ . We start by considering the constituent Gaussian of minimal variance in the mixtures. Assume without loss of generality that this minimum variance component is the first component of $F,$ and denote it by $N_{1}$ . If there is no component of $F^{\prime}$ whose mean, variance, and mixing weight very closely match those of $N_{1}$ , then we argue that there is a significant disparity in the low order moments of $F$ and $F^{\prime}$ , no matter what the other Gaussian components are. (This argument is rather involved, and we will give the high-level sketch in the next paragraph.) If there is a component $N_{1}^{\prime}$ of $F^{\prime}$ whose mean, variance, and mixing weight very closely match those of $N_{1}$ , then we argue that we can remove $N_{1}$ from $F$ and $N_{1}^{\prime}$ from $F^{\prime}$ with only negligible effect on the discrepancy in the low-order moments. More formally, let $H$ be the mixture of $n-1$ Gaussians obtained by removing $N_{1}$ from $F$ , and rescaling the weights so as to sum to one, and define $H^{\prime},$ a mixture of $k-1$ Gaussians, analogously. Then, assuming that $N_{1}$ and $N_{1}^{\prime}$ are very similar, the disparity in the low-order moments of $H$ and $H^{\prime}$ is almost the same as the disparity in low-order moments of $F$ and $F^{\prime}$ . We can then apply the induction hypothesis to the mixtures $H$ and $H^{\prime}$ .

We now return to the problem of showing that if the skinniest Gaussian in $F$ cannot be paired with a component of $F^{\prime}$ with similar mean, variance, and weight, then there must be a polynomially-significant discrepancy in the low-order moments of $F$ and $F^{\prime}$ . This step relies on “deconvolving” by a Gaussian with an appropriately chosen variance (this corresponds to running the heat equation in reverse for a suitable amount of time). We define the operation of deconvolving by a Gaussian of variance $\alpha$ as $\mathcal{F}_{\alpha}$ ; applying this operator to a mixture of Gaussians has a particularly simple effect: subtract $\alpha$ from the variance of each Gaussian in the mixture (assuming that each constituent Gaussian has variance at least $\alpha$ ). If $\alpha$ is negative, this is just convolution.

Let $F(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ be the probability density function of a mixture of Gaussian distributions, and for any $\alpha<\min_{i}\sigma_{i}^{2},$ define $\mathcal{F}_{\alpha}(F)(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2}-\alpha,x)$ .

The key step will be to show that if the skinniest Gaussian in either of the two mixtures cannot be paired with a nearly identical Gaussian in the other mixture, then there is some $\alpha$ for which the resulting mixtures, after applying the operation $\mathcal{F}_{\alpha}$ , have large statistical distance. Intuitively, this deconvolution operation allows us to isolate Gaussians in each mixture and then we can reason about the statistical distance between the two mixtures locally, without worrying about the other Gaussians in the mixture.
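On the level of parameters, the operation $\mathcal{F}_{\alpha}$ is easy to state in code. The following sketch represents a univariate mixture as a list of (weight, mean, variance) triples; this representation, the names `deconvolve` and `mixture_variance`, and the sample mixture are assumptions made for illustration.

```python
def deconvolve(mixture, alpha):
    """Apply F_alpha to a list of (weight, mean, variance) triples by
    subtracting alpha from every variance. A negative alpha corresponds
    to convolution with a Gaussian of variance -alpha."""
    if any(var - alpha <= 0 for _, _, var in mixture):
        raise ValueError("alpha must be below the smallest variance")
    return [(w, mu, var - alpha) for w, mu, var in mixture]


def mixture_variance(mixture):
    """Variance of the mixture distribution (weights assumed to sum to one)."""
    mean = sum(w * mu for w, mu, _ in mixture)
    return sum(w * (var + (mu - mean) ** 2) for w, mu, var in mixture)


# Hypothetical two-component mixture.
mix = [(0.3, -1.0, 0.2), (0.7, 0.5, 0.6)]
sharpened = deconvolve(mix, 0.15)
restored = deconvolve(sharpened, -0.15)
```

Deconvolving by $\alpha$ leaves the weights and means unchanged and lowers the variance of the whole mixture by exactly $\alpha$ , while making the skinniest component relatively much narrower than the others; convolving by the same amount undoes the operation.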

Given this statistical distance between the transformed pair of mixtures, we use the fact that there are relatively few zero-crossings in the difference in probability density functions of two mixtures of Gaussians (Proposition 19) to show that this statistical distance gives rise to a discrepancy in at least one of the low-order moments of the pair of transformed distributions. To complete the argument, we then show that applying this transform to a pair of distributions, while certainly not preserving statistical distance, roughly preserves the combined disparity between the low-order moments of the pair of distributions. The complete proof can be found in Appendix B.

B.2 Theorem 4

In this section we give the complete proof of the polynomially robust identifiability of univariate mixtures of $k$ Gaussians (Theorem 4). For convenience, we restate the theorem and all necessary definitions. We make frequent reference to the simple properties of Gaussians and tail bounds provided in Appendix J. Throughout this section we will consider two univariate mixtures of Gaussians:

$F(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ and $F^{\prime}(x)=\sum_{i=1}^{k}w_{i}^{\prime}\mathcal{N}(\mu_{i}^{\prime},\sigma_{i}^{\prime 2},x).$

Definition 6. We will call the pair $F,F^{\prime}$ $\epsilon$ -standard if $\sigma_{i}^{2},\sigma_{i}^{\prime 2}\leq 1$ and if $\epsilon$ satisfies:

$|\mu_{i}|,|\mu^{\prime}_{i}|\leq\frac{1}{\epsilon}$

$|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon$ and $|\mu^{\prime}_{i}-\mu^{\prime}_{j}|+|\sigma_{i}^{\prime 2}-\sigma_{j}^{\prime 2}|\geq\epsilon$ for all $i\neq j$

Theorem 4. There is a constant $c>0$ such that, for any $\epsilon$ -standard $F,F^{\prime}$ and any $\epsilon<c$ ,

The following definition of the deconvolution operation will be central to our proof of Theorem 4:

Definition 10. Let $F(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ be the probability density function of a mixture of Gaussian distributions, and for any $\alpha<\min_{i}\sigma_{i}^{2},$ define

$\mathcal{F}_{\alpha}(F)(x)=\sum_{i=1}^{n}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2}-\alpha,x).$

The following lemma argues that if the skinniest Gaussian in mixture $F$ cannot be matched with a sufficiently similar component in the mixture $F^{\prime}$ , then there is some $\alpha$ , possibly negative, such that $\max_{x}|\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)|$ is significant. Furthermore, every component in the transformed mixtures has variance that is not too small.

Lemma 16. Suppose $F,F^{\prime}$ are $\epsilon$ -standard. Suppose without loss of generality that the Gaussian of minimal variance is $\mathcal{N}(\mu_{1},\sigma_{1}^{2}),$ and there is some $\gamma$ satisfying $\epsilon/4>\gamma>0$ such that for all $i,$ at least one of the following holds:

$|\sigma_{1}^{2}-\sigma_{i}^{\prime 2}|>\gamma^{8}$

Then there is some $\alpha$ such that either

$\max_{x}(|\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)|)\geq\frac{1}{2\gamma\sqrt{2\pi}}$ and the minimum variance in any component of $\mathcal{F}_{\alpha}(F)$ or $\mathcal{F}_{\alpha}(F^{\prime})$ is at least $\gamma^{4},$

$\max_{x}(|\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)|)\geq\frac{1}{2\gamma^{8}\sqrt{2\pi}}$ and the minimum variance in any component of $\mathcal{F}_{\alpha}(F)$ or $\mathcal{F}_{\alpha}(F^{\prime})$ is at least $\gamma^{18}.$

We start by considering the case when there is no Gaussian in $F^{\prime}$ that matches both the mean and variance to within $\gamma^{8}.$ Consider applying $\mathcal{F}_{\sigma_{1}^{2}-\gamma^{18}}$ . $\mathcal{F}_{\sigma_{1}^{2}-\gamma^{18}}(F)(\mu_{1})\geq\epsilon\mathcal{N}(0,\gamma^{18},0)=\frac{\epsilon}{\gamma^{9}\sqrt{2\pi}}.$ Next, by Corollary 60,

Next, consider the case where we have at least one Gaussian component of $F^{\prime}$ that matches both $\mu_{1}$ and $\sigma_{1}^{2}$ to within $\gamma^{8},$ but whose weight differs from $w_{1}$ by at least $\gamma.$ By the definition of $\epsilon$ -standard, there can be at most one such Gaussian component, say the $i^{th}$ . If $w_{1}>w_{i}^{\prime},$ then $\mathcal{F}_{\sigma_{1}^{2}-\gamma^{4}}(F)(\mu_{1})-\mathcal{F}_{\sigma_{1}^{2}-\gamma^{4}}(F^{\prime})(\mu_{1})\geq\frac{1}{\gamma\sqrt{2\pi}}-\frac{2}{\epsilon\sqrt{2\pi e}},$ where the second term is a bound on the contribution of the other Gaussian components, using the fact that $F,F^{\prime}$ are $\epsilon$ -standard and Corollary 60. Since $\gamma<\epsilon/4,$ this quantity is at least $\frac{1}{2\gamma\sqrt{2\pi}}.$

If $w_{1}\leq w_{i}^{\prime},$ then consider applying $\mathcal{F}_{\sigma_{1}^{2}-\gamma^{4}}$ to the pair of distributions. Using the fact that $\frac{1}{\sqrt{x+x^{2}}}\geq\frac{1-x}{\sqrt{x}},$ we have

Claim 17. Let $f(x^{*})\geq M$ for some $x^{*}\in(0,r)$ and suppose that $f(x)\geq 0$ on $(0,r)$ and $f(0)=f(r)=0$ . Suppose also that $|f^{\prime}(x)|\leq m$ everywhere. Then $\int_{0}^{r}f(x)dx\geq\frac{M^{2}}{m}$

Consider the continuous function $g(x)$ that is defined to be $0$ for $x\in[0,x^{*}-M/m]\cup[x^{*}+M/m,r],$ and has slope $m$ on the interval $(x^{*}-M/m,x^{*}),$ and slope $-m$ on the interval $(x^{*},x^{*}+M/m).$ Clearly $f(x)\geq g(x)$ for $x\in(0,r),$ and thus

$\int_{0}^{r}f(x)dx\geq\int_{0}^{r}g(x)dx=\frac{M^{2}}{m}.$
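The bound of the claim, namely that the integral is at least the area $M^{2}/m$ of a triangle of height $M$ with side slopes $\pm m$ , can be sanity-checked numerically. The sketch below uses $f(x)=\sin(x)$ on $(0,\pi)$ as a hypothetical test function (it is not taken from the paper), for which $M=1$ and $m=1$ .

```python
import math

def integrate(f, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def triangle_lower_bound(M, m):
    """Area of the triangle of height M whose sides have slopes +m and -m."""
    return M * M / m

# Hypothetical test function: f(x) = sin(x) on (0, pi). It vanishes at both
# endpoints, is nonnegative in between, peaks at M = 1, and |f'(x)| <= m = 1.
area = integrate(math.sin, 0.0, math.pi)
bound = triangle_lower_bound(M=1.0, m=1.0)
```

Here the true integral is $2$ while the triangle bound is $1$ , so the inequality holds with room to spare; the claim only asserts the lower bound.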

The above claim together with Lemma 16 yields the following

For $\alpha,\gamma$ as defined in Lemma 16,

Let $f(x)=\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)$ ; then $f(x^{*})\geq M$ for $M=\Omega(\frac{1}{\gamma})$ and for some $x^{*}$ contained in an interval $I$ in which $f(x)$ does not change sign. Similarly, because the minimum variance in any component of $\mathcal{F}_{\alpha}(F)$ or $\mathcal{F}_{\alpha}(F^{\prime})$ is at least $\gamma^{18}$ , this implies that $f^{\prime}(x)=O(\frac{1}{\gamma^{18}})=m$ . So we can apply Claim 17 using $m,M$ and get that $\int_{I}f(x)\geq\Omega(\gamma^{18})$ and this implies the corollary. ∎

We use the following proposition from prior work, which shows that $\mathcal{F}_{\alpha}(F)(x)-\mathcal{F}_{\alpha}(F^{\prime})(x)$ has few zeros.

[Prop. 7 from .] Given $f(x)=\sum_{i=1}^{m}a_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x),$ the linear combination of $m$ one-dimensional Gaussian probability density functions, such that $\sigma_{i}^{2}\neq\sigma_{j}^{2}$ for $i\neq j$ , assuming that not all the $a_{i}$ ’s are zero, the number of solutions to $f(x)=0$ is at most $2(m-1)$ .
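The zero-count bound can be probed numerically by counting sign changes of a linear combination of Gaussian densities on a fine grid. This is only an illustrative check under assumed example coefficients; a grid can miss zeros, so the count is a lower bound on the true number of zeros, and the function name is hypothetical.

```python
import math

def gaussian_pdf(mu, var, x):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def count_sign_changes(terms, lo, hi, steps=20000):
    """Count sign changes of sum_i a_i N(mu_i, var_i, x) on a grid over [lo, hi].

    terms is a list of (a_i, mu_i, var_i) triples with distinct variances.
    """
    changes = 0
    prev = 0
    for s in range(steps + 1):
        x = lo + (hi - lo) * s / steps
        val = sum(a * gaussian_pdf(mu, var, x) for a, mu, var in terms)
        sign = (val > 0) - (val < 0)
        if sign != 0:
            if prev != 0 and sign != prev:
                changes += 1
            prev = sign
    return changes

# m = 2 terms with distinct variances: the proposition allows at most
# 2 * (m - 1) = 2 zeros, and N(0,1) - N(0,4) attains exactly two.
two_terms = count_sign_changes([(1.0, 0.0, 1.0), (-1.0, 0.0, 4.0)], -10.0, 10.0)

# m = 3 terms: at most 2 * (m - 1) = 4 zeros are allowed.
three_terms = count_sign_changes(
    [(0.5, -3.0, 0.5), (-1.0, 0.0, 1.0), (0.5, 3.0, 2.0)], -15.0, 15.0
)
```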

Suppose that $D(F,F^{\prime})\geq\Omega(\gamma^{18})$ , that the minimum variance in any component of $F,F^{\prime}$ is at least $\gamma^{18}$ , that $F,F^{\prime}$ are mixtures of $n$ and $k$ Gaussians respectively, and that the mean of each component of $F$ and $F^{\prime}$ is at most $\frac{1}{\gamma}$ in magnitude. Then there is some moment $i\in[2(n+k-1)]$ s.t. $|E_{F}[x^{i}]-E_{F^{\prime}}[x^{i}]|\geq\Omega(\gamma^{c})$ for some constant $c=c(n,k)$ that depends on $n,k$ .

Using Proposition 19, there are at most $2(n+k-1)$ zero crossings of the function $f(x)=F(x)-F^{\prime}(x)$ . Consider the interval $I=[\frac{-2}{\gamma},\frac{2}{\gamma}]$ . Using Corollary 62, the contribution to $E_{F}[x^{i}]$ of $\Re-I$ is at most $O(\gamma^{-i}e^{-\frac{1}{8\gamma^{2}}})$ , and for sufficiently small $\gamma$ , this is negligible.

Because $D(F,F^{\prime})\geq\Omega(\gamma^{18})$ and the fact that there are at most $2(n+k-1)$ zero crossings of the function $f(x)$ , there must be some interval $J$ for which $f(x)$ does not change signs and $\int_{J}|f(x)|dx\geq\Omega(\frac{\gamma^{18}}{2n+2k})$ . Choose $p(x)=\pm\Pi_{z_{i}}(x-z_{i})$ , where the product is over all zeros $z_{i}\in I$ of $f$ , and choose the sign so that the sign of $p(x)$ matches that of $f(x)$ on $J\cup I=J^{\prime}$ . Then $\int_{J^{\prime}}p(x)|f(x)|dx\geq|\int_{J}p(x)f(x)dx|-\int_{\Re-I}|p(x)f(x)|dx\geq|\int_{J}p(x)f(x)dx|-O(\gamma^{-i-2(n+k-1)}e^{-\frac{1}{8\gamma^{2}}})$ because each coefficient in $p(x)$ is bounded by $\frac{1}{\gamma^{2(n+k-1)}}$ . Let $J^{\prime\prime}\subset J$ be the interval $[a+\delta,b-\delta]\subset J=[a,b]$ .

Then $|\int_{J^{\prime\prime}}p(x)f(x)dx|\geq\delta^{2(n+k-1)}|\int_{J^{\prime\prime}}f(x)dx|$ and $|\int_{J^{\prime\prime}}f(x)dx|\geq|\int_{J}f(x)dx|-O(\frac{\delta^{2}}{\gamma^{18}})$ because the derivative of $f(x)$ is bounded by $O(\frac{1}{\gamma^{18}})$ , and $f(a)=f(b)=0$ . So choosing $\delta=O(\gamma^{19})$ yields that $|\int_{J^{\prime\prime}}f(x)dx|\geq\Omega(\gamma^{18})$ (where the constant hidden in $O()$ depends on $n,k$ ).

So this implies that $|\int_{J^{\prime\prime}}p(x)f(x)dx|\geq\Omega(\gamma^{c(n+k-1)})$ for some constant $c$ (that does not depend on $n,k$ ). Using the fact that the coefficients of $p(x)$ are bounded by $\frac{1}{\gamma^{2(n+k-1)}}$ , this implies that there is some $i\in[2(n+k-1)]$ such that $|\int_{J^{\prime\prime}}x^{i}f(x)dx|\geq\Omega(\gamma^{c^{\prime}(n+k-1)})$ for some constant $c^{\prime}$ that does not depend on $n,k$ .

Then using the bound of $O(\gamma^{-i-2(n+k-1)}e^{-\frac{1}{8\gamma^{2}}})$ for $E_{\Re-I}[p(x)f(x)]$ , for sufficiently small $\gamma$ this implies that $|E_{F}[x^{i}]-E_{F^{\prime}}[x^{i}]|\geq\Omega(\gamma^{c^{\prime\prime\prime}(n+k-1)})$ ∎

Unfortunately, the transformation $\mathcal{F}_{\alpha}$ does not preserve the statistical distance between two distributions. However, we show that it, at least roughly, preserves (up to a polynomial) the disparity in low-order moments of the distributions.

Lemma 21. [Lemma 6 from .] Suppose that each constituent Gaussian in $F$ or $F^{\prime}$ has variance in the interval $[\alpha,1]$ . Then

The proof of the above lemma follows easily from the observation that the moments of $F$ and $\mathcal{F}_{\alpha}(F)$ are related by a simple linear transformation, which can also be viewed as a recurrence relation for Hermite polynomials.
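One way to make this linear relation concrete is sketched below. It relies on two standard facts that are assumptions of this sketch rather than statements quoted from the paper: the raw moments of $\mathcal{N}(\mu,\sigma^{2})$ satisfy $M_{i}=\mu M_{i-1}+(i-1)\sigma^{2}M_{i-2}$ , and convolving with $\mathcal{N}(0,\alpha)$ mixes moments through binomial coefficients and the even moments $(2j-1)!!\,\alpha^{j}$ of the noise; replacing $\alpha$ by $-\alpha$ gives the deconvolved moments. All function names are hypothetical.

```python
from math import comb

def gaussian_moments(mu, var, max_order):
    """Raw moments E[x^i], i = 0..max_order, of N(mu, var)."""
    moments = [1.0, mu]
    for i in range(2, max_order + 1):
        moments.append(mu * moments[i - 1] + (i - 1) * var * moments[i - 2])
    return moments[: max_order + 1]

def mixture_moments(mixture, max_order):
    """Raw moments of a mixture given as (weight, mean, variance) triples."""
    per_component = [(w, gaussian_moments(mu, var, max_order)) for w, mu, var in mixture]
    return [sum(w * m[i] for w, m in per_component) for i in range(max_order + 1)]

def odd_double_factorial(j):
    """(2j - 1)!!, with the convention that the empty product is 1."""
    result = 1
    for t in range(1, 2 * j, 2):
        result *= t
    return result

def deconvolved_moments(moments, alpha):
    """Moments of F_alpha(F), computed linearly from the moments of F."""
    out = []
    for i in range(len(moments)):
        total = 0.0
        for j in range(i // 2 + 1):
            coeff = comb(i, 2 * j) * odd_double_factorial(j) * (-alpha) ** j
            total += coeff * moments[i - 2 * j]
        out.append(total)
    return out

# Hypothetical mixture: compare the linear map against direct computation
# on the mixture whose variances have been reduced by alpha.
F = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.5)]
alpha = 0.4
direct = mixture_moments([(w, mu, var - alpha) for w, mu, var in F], 6)
via_linear_map = deconvolved_moments(mixture_moments(F, 6), alpha)
```

The two lists agree up to floating-point error, and the map is triangular: the $i^{th}$ transformed moment depends only on moments of order at most $i$ .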

Proof of Theorem 4: The base case for our induction is when $n=k=1,$ and follows from the fact that given parameters $\mu,\mu^{\prime},\sigma^{2},\sigma^{\prime 2},$ such that $\sigma^{2},\sigma^{\prime 2}\leq 1,$ and $|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|\geq\epsilon,$ then one of the first two moments of $\mathcal{N}(\mu,\sigma^{2})$ differs from that of $\mathcal{N}(\mu^{\prime},\sigma^{\prime 2})$ by at least $\epsilon/2.$

For the induction step, assume that for all pairs of $\epsilon$ -standard mixtures of $n,$ and $k$ Gaussians, respectively, one of the first $2(n+k-1)$ moments differs by at least $f(\epsilon,n+k)$ . Consider $\epsilon$ -standard mixtures $F,F^{\prime},$ mixtures of $n^{\prime},k^{\prime}$ Gaussians, respectively, where either $n^{\prime}=n+1,$ or $k^{\prime}=k+1,$ and either $n^{\prime}=n$ or $k^{\prime}=k.$ Assume without loss of generality that $\sigma_{1}^{2}$ is the minimal variance in the mixtures, and that it occurs in mixture $F$ .

We first consider the case that there exists a component of $F^{\prime}$ whose mean, variance, and weight match $\mu_{1},\sigma_{1}^{2},w_{1}$ to within an additive $x$ , where $x$ is chosen so that each of the first $2(n+k-1)$ moments of any pair of Gaussians whose parameters are within $x$ of each other, differ by at most $f(\epsilon/2,n+k-1)/2;$ specifically, letting $q(y)$ be the polynomial (dependent on $n,k$ ) of Lemma 63 bounding the discrepancy in the first $2(n+k-1)$ moments of Gaussians whose parameters differ by $y$ , we set $x$ so that $q(x)=f(\epsilon/2,n+k-1)/2.$ Note that for fixed $n,k$ , $x$ will be polynomial in $\epsilon.$ Since Lemma 63 requires that $\sigma_{1}^{2}\geq\sqrt{x},$ if this is not the case, we convolve the pair of mixtures by $\mathcal{N}(0,\epsilon),$ which by Lemma 21 changes the disparity in low-order moments by a polynomial amount, and proceed with the chosen value of $x$ and the transformed pair of GMMs.

Now, consider the mixtures $H,H^{\prime},$ obtained from $F,F^{\prime}$ by removing the two nearly-matching Gaussian components, and rescaling the weights so that they still sum to 1. The pair $H,H^{\prime}$ will now be mixtures of $n^{\prime}-1$ and $k^{\prime}-1$ components, respectively, and will still be $(\epsilon-\epsilon^{2})$ -standard, and the discrepancy in their first $2(n^{\prime}+k^{\prime}-1)$ moments is at most $f(\epsilon/2,n+k-1)/2$ different from the discrepancy in the pair $F,F^{\prime}$ . By our induction hypothesis, there is a discrepancy in one of the first $2(n^{\prime}+k^{\prime}-3)$ moments of at least $f(\epsilon/2,n+k-3)$ and thus the original pair $F,F^{\prime}$ will have discrepancy in moments at least half of this, which is still $poly(\epsilon),$ for any fixed $n,k$ .

Appendix C The Basic Univariate Algorithm

In this section we formally state the Basic Univariate Algorithm, and prove its correctness. In particular, we will prove the following corollary to the polynomially robust identifiability of GMMs (Theorem 4).

Corollary 5. Suppose we are given access to independent samples from a GMM

with mean 1 and variance in $[1/2,2],$ where $w_{i}\geq\epsilon$ , and $|\mu_{i}-\mu_{j}|+|\sigma_{i}^{2}-\sigma_{j}^{2}|\geq\epsilon$ . The Basic Univariate Algorithm, for any fixed $k$ , has runtime and sample complexity at most $poly_{k}(\frac{1}{\epsilon},\frac{1}{\delta})$ and with probability at least $1-\delta$ will output mixture parameters $\hat{w}_{i},\hat{\mu}_{i},\hat{\sigma_{i}}^{2}$ , so that there is a permutation $\pi:[k]\rightarrow[k]$ and

Our proof of the above Corollary will consist of three parts; first, we will show that for any $\alpha\leq\epsilon,$ there is some polynomial $p$ such that $p(\alpha,\epsilon)$ samples suffice to guarantee that with probability at least $1-\delta,$ the first $4k-2$ sample moments will all be within $\alpha$ of the corresponding true moments. Next, we show that it suffices to perform brute-force search over a polynomially-fine mesh of parameters in order to ensure that at least one point $(\hat{w_{1}},\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{w_{k}},\hat{\mu_{k}},\hat{\sigma_{k}}^{2})$ in our parameter-mesh will have the first $4k-2$ moments that are each within $\alpha$ of the true moments. Finally, we will use Theorem 4 to conclude that the recovered parameter set $(\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{\mu_{k}},\hat{\sigma_{k}}^{2})$ must be close to the true parameter set, because the first $4k-2$ moments nearly agree. We now formalize these pieces.
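The first ingredient, the empirical estimation of the first $4k-2$ raw moments, can be sketched as follows. The helper names and the (weight, mean, variance) representation are assumptions made for illustration; the brute-force mesh search itself is not shown.

```python
import random

def sample_gmm(mixture, num_samples, rng):
    """Draw samples from a univariate GMM given as (weight, mean, variance) triples."""
    weights = [w for w, _, _ in mixture]
    samples = []
    for _ in range(num_samples):
        _, mu, var = rng.choices(mixture, weights=weights, k=1)[0]
        samples.append(rng.gauss(mu, var ** 0.5))
    return samples

def sample_moments(xs, k):
    """Empirical raw moments of orders 1 .. 4k - 2 for a k-component mixture."""
    n = len(xs)
    return [sum(x ** i for x in xs) / n for i in range(1, 4 * k - 1)]

# Hypothetical usage: a 2-component mixture needs its first 4*2 - 2 = 6 moments.
rng = random.Random(0)
draws = sample_gmm([(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)], 1000, rng)
empirical = sample_moments(draws, k=2)
```

A candidate parameter set from the mesh would then be accepted when its exact moments agree with `empirical` to within the chosen tolerance.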

Lemma 22. Let $x_{1},x_{2},\ldots,x_{m}$ be independent draws from a univariate GMM $F$ that is in isotropic position, and each of whose components has weight at least $\epsilon$ . With probability $\geq 1-\delta$ ,

where the hidden constant on the big-Oh depends on $k$ .

By Chebyshev’s inequality, with probability at most $\delta$ ,

We now argue that a polynomially-fine mesh suffices to guarantee that there is some parameter set in our mesh whose first $4k-2$ moments are all close to the corresponding true moments.

Lemma 23. Given a univariate mixture $F$ of $k$ Gaussians centered at 0 with variance at most $2$ , each of whose weights is at least $\epsilon$ , such that each pair of components has parameter distance at least $\epsilon,$ and a target accuracy $\alpha\leq\epsilon,$ there exists a $\gamma=poly(\alpha),$ and set of parameters $(\hat{w_{1}},\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{w_{k}},\hat{\mu_{k}},\hat{\sigma_{k}}^{2})$ such that each parameter is a multiple of $\gamma,$ each is bounded by $2/\epsilon$ , each weight is at least $\epsilon/2,$ each pair of components has parameter distance at least $\epsilon/2$ , and the first $4k-2$ moments of $F$ are within $\alpha$ of the corresponding moments of $\hat{F},$ the mixture corresponding to the recovered parameters.

Consider the parameter set obtained by rounding the true parameter set, excluding the weights, to the nearest multiple of $\gamma.$ For each weight $w_{i}$ , we set $\hat{w_{i}}$ to be either the multiple of $\gamma$ just above, or just below $w_{i}$ , ensuring that $\sum_{i}\hat{w_{i}}=1,$ which can clearly be done. That the rounded mixture has component weights at least $\epsilon/2,$ pairwise parameter distances at least $\epsilon/2,$ and values bounded in magnitude by $2/\epsilon$ is obvious. We now analyze how much the rounding has affected the moments.

From Claim 65, the $i^{th}$ moment of each component is just some polynomial in $\mu,\sigma^{2},$ which is a polynomial of degree at most $i$ , with coefficients bounded in magnitude by $(i+2)!$ Thus changing the mean or variance by at most $\gamma$ will change the $i^{th}$ moment by at most

Thus if we used the true mixing weights, the error in each moment of the entire mixture would be at most $k$ times this. To conclude, note that for each mixing weight $|w_{j}-\hat{w_{j}}|\leq\gamma,$ and, as noted in the proof of the previous lemma, each moment is at most $O(\epsilon^{-i})$ (where the hidden constant depends on $i$ ); thus the rounding of each weight will contribute at most an extra $O(\gamma\epsilon^{-i}).$ Adding these bounds together, we get that each of the first $4k-2$ moments of $\hat{F}$ can be off from the true ones by at most $k\left(O(\gamma\epsilon^{-4k+2})+2(4k-2)^{2}(4k)!\epsilon^{-4k+3}\gamma\right)=O(\gamma\epsilon^{-4k+2}),$ where the hidden constant depends on $k$ . Thus letting $\gamma=c_{k}\alpha^{4k-1},$ where the constant $c_{k}$ depends on $k$ suffices to ensure that all moments are within $\alpha$ of their true values. ∎
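The rounding step used in this proof can be sketched as follows. The sketch assumes, purely for convenience, that $1/\gamma$ is an integer and that the inputs are exact rationals whose weights sum to 1, so that weights that are multiples of $\gamma$ can sum to exactly 1; the function name is hypothetical.

```python
from fractions import Fraction
import math

def round_to_mesh(mixture, gamma):
    """Round (weight, mean, variance) triples to multiples of gamma.

    Means and variances go to the nearest multiple of gamma. Each weight goes
    to the multiple just below or just above it, chosen so the total is 1.
    """
    gamma = Fraction(gamma)
    units = 1 / gamma
    if units.denominator != 1:
        raise ValueError("this sketch assumes 1/gamma is an integer")
    means = [round(Fraction(mu) / gamma) * gamma for _, mu, _ in mixture]
    variances = [round(Fraction(var) / gamma) * gamma for _, _, var in mixture]
    scaled = [Fraction(w) / gamma for w, _, _ in mixture]
    floors = [math.floor(s) for s in scaled]
    deficit = int(units) - sum(floors)  # how many weights must be rounded up
    for idx, s in enumerate(scaled):
        if deficit == 0:
            break
        if s != floors[idx]:  # only bump weights that were strictly rounded down
            floors[idx] += 1
            deficit -= 1
    weights = [f * gamma for f in floors]
    return list(zip(weights, means, variances))

# Hypothetical input mixture and mesh width.
mix = [
    (Fraction(1, 3), Fraction(-7, 10), Fraction(123, 1000)),
    (Fraction(2, 3), Fraction(5, 4), Fraction(9, 10)),
]
mesh_point = round_to_mesh(mix, Fraction(1, 100))
```

Every rounded weight is within $\gamma$ of the original, which is the only property of the weight rounding that the moment bound above uses.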

We now piece together the above two lemmas to prove Corollary 5.

Proof of Corollary 5: Given a desired moment accuracy $\alpha\leq\epsilon,$ by applying a union bound to Lemma 22, $O(\alpha\epsilon^{-8k}\delta^{-2})$ samples suffice to guarantee that with probability at least $1-\delta,$ the first $4k-2$ sample moments are within $\alpha$ of the true moments. Thus with probability at least $1-\delta,$ by Lemma 23, our polynomial mesh of parameters suffices to recover a set of parameters $(\hat{w_{1}},\hat{\mu_{1}},\hat{\sigma_{1}}^{2},\ldots,\hat{w_{k}},\hat{\mu_{k}},\hat{\sigma_{k}}^{2})$ whose weights and pairwise parameter-distances are at least $\epsilon/2,$ and whose first $4k-2$ moments will all be within $2\alpha$ of the sample moments, and hence within $3\alpha$ of the true moments.

To conclude, note that the pair of mixtures $F,\hat{F},$ after rescaling by at most $(\epsilon/2)^{1/2}$ so as to ensure each component in the mixture has variance at most 1 (which scales the $k^{th}$ moments by $(\epsilon/2)^{k/2}$ ), satisfies the first three conditions of being $\epsilon/2$ -standard, and thus, if the first $4k-2$ moments (after rescaling) agree to within $(\epsilon/2)^{2k-1}\cdot(\epsilon/2)^{O(k)},$ Theorem 4 guarantees that the recovered parameters must be accurate to within $\epsilon$ (where the first $O(k)$ in the exponent is from Theorem 4). Thus setting $3\alpha\leq(\epsilon/2)^{O(k)}=poly_{k}(\epsilon)$ will ensure that with the desired high probability, the recovered parameters are $\epsilon/2$ accurate. $\Box$

Appendix D The General Univariate Algorithm

Lemma 24. Suppose that $F$ , $G$ and $H$ are GMMs of $k_{1}\geq k_{2}\geq k_{3}$ Gaussians respectively. If $(G,\pi_{1})\in\mathcal{D}_{\epsilon}(F)$ and $(H,\pi_{2})\in\mathcal{D}_{\epsilon}(G)$ , then $(H,\pi_{2}\circ\pi_{1})\in\mathcal{D}_{O(k_{1})\epsilon}(F)$ .

Note that $\pi_{1}:[k_{1}]\rightarrow[k_{2}]$ and $\pi_{2}:[k_{2}]\rightarrow[k_{3}]$ . Consider $\pi_{3}=\pi_{2}\circ\pi_{1}:[k_{1}]\rightarrow[k_{3}]$ . This function $\pi_{3}$ is onto, because $\pi_{1}$ and $\pi_{2}$ are both onto.

Also consider any $j\in\pi_{3}^{-1}(h)$ (for some $h\in[k_{3}]$ ). In fact, let $i\in\pi_{2}^{-1}(h)$ and $j\in\pi_{1}^{-1}(i)$ . Then because parameter distance is a distance (i.e. satisfies triangle-inequality):

because $(G,\pi_{1})\in\mathcal{D}_{\epsilon}(F)$ and $(H,\pi_{2})\in\mathcal{D}_{\epsilon}(G)$ and $\pi_{2}(i)=h$ and $\pi_{1}(j)=i$ . We write $w^{F}_{j}$ for the weight of the $j^{th}$ component of $F$ to simplify notation, and similarly for $G,H$ . Then using this notation:

Convolving two Gaussians $F_{1},F_{2}$ by the same Gaussian $\mathcal{N}(\mu,\sigma^{2})$ preserves the parameter distance between $F_{1}$ and $F_{2}$ . Also, given an estimate $\hat{F}_{i}$ which is within $D$ in parameter distance from $\mathcal{N}\circ F_{i}$ , by subtracting $\mu$ from the mean of $\hat{F}_{i}$ and $\sigma^{2}$ from the variance of $\hat{F}_{i}$ , we obtain an estimate for $F_{i}$ which is within $D$ in parameter distance from $F_{i}$ .

Lemma 28. Suppose $(\hat{F},\pi)\in\mathcal{D}_{\epsilon}(F)$ and that each Gaussian $F_{i}$ in the mixture $F$ has variance at least $\frac{1}{2}$ . Then $D(F,\hat{F})\leq O(k^{\prime})\epsilon$ , where $k^{\prime}$ is the number of components in the GMM $\hat{F}$ .

Let $k$ be the number of components in $F$ . Then

Applying Fact 25, together with the assumption that each Gaussian has variance at least $\frac{1}{2}$ (and that $\epsilon<<1$ ), implies that $\|\hat{F}_{i}-F_{j}\|_{1}=O(D_{p}(\hat{F}_{i},F_{j}))=O(\epsilon)$ for all $j\in\pi^{-1}(i)$ . And so $D(F,\hat{F})\leq O(k^{\prime})\epsilon$ ∎

D.2 Windows

Here we define the notion of a Window. Suppose we run the Basic Univariate Algorithm with target precision $\epsilon$ (and an error parameter $\delta$ ). Then the Basic Univariate Algorithm uses a number of samples that is at most some polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$ .

We in fact assume that the number of samples is at most $C_{B}(\epsilon\delta)^{-c_{B}}$ for some universal constants $C_{B},c_{B}>0$ , and we denote $\frac{1}{C_{B}}(\epsilon\delta)^{c_{B}}$ by $Q(\epsilon,\delta)$ .

Let $Q(\epsilon,\delta)$ be the inverse of the number of samples needed by the Basic Univariate Algorithm when given a target precision $\epsilon$ (and an error parameter $\delta$ ).

We would like to define a Window to be the range of values from $Q(\epsilon,\delta)$ to $\epsilon$ so that if all pairs of Gaussians either have parameter distance at least $\epsilon$ , or statistical distance at most $Q(\epsilon,\delta)$ then we can just run the Basic Univariate Algorithm and assume that the algorithm behaves as if each pair of Gaussians that is extremely close is replaced with a single (appropriately) chosen Gaussian. However, we will need some slack, and so we make the Window wider so that we can take union bounds over many different runs of the algorithm, and compose different subdivisions.

Let $R(\epsilon,\delta)=\frac{Q(\epsilon,\delta)}{C_{1}k^{4}}$ and let $S(\epsilon,\delta)=\frac{R(\epsilon,\delta)}{C_{2}k^{4}}$ for some sufficiently large constants $C_{1},C_{2}$ .

Given a target precision $\epsilon$ , we define the Window $W(\epsilon)$ at $\epsilon$ as the range of values $[R(\epsilon,\delta),\epsilon]$ .

Given a mixture of Gaussians $F$ , we will say that a Window $W(\epsilon)$ is good if for all $i\neq j$ , $D_{p}(F_{i},F_{j})\notin W(\epsilon)$ .
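The definitions of $Q$ , $R$ , $S$ and of a good Window can be sketched as below. The numeric constants are placeholders (the paper only requires them to be sufficiently large or universal), and the parameter distance is assumed here to be $|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|$ , matching the separation condition used earlier; Gaussians are represented as hypothetical (mean, variance) pairs.

```python
# Placeholder constants; the text leaves their values unspecified.
C_B, c_B = 1.0, 2.0
C_1, C_2 = 10.0, 10.0

def Q(eps, delta):
    """Inverse of the assumed sample bound C_B * (eps * delta) ** (-c_B)."""
    return (eps * delta) ** c_B / C_B

def R(eps, delta, k):
    return Q(eps, delta) / (C_1 * k ** 4)

def S(eps, delta, k):
    return R(eps, delta, k) / (C_2 * k ** 4)

def parameter_distance(g1, g2):
    """Assumed form of D_p for Gaussians given as (mean, variance) pairs."""
    return abs(g1[0] - g2[0]) + abs(g1[1] - g2[1])

def window_is_good(components, eps, delta):
    """True if no pairwise parameter distance lies in [R(eps, delta), eps]."""
    k = len(components)
    low = R(eps, delta, k)
    for i in range(k):
        for j in range(i + 1, k):
            if low <= parameter_distance(components[i], components[j]) <= eps:
                return False
    return True

# Hypothetical mixtures: the first has a pair at distance 0.05 inside the
# Window for eps = 0.1; the second has only far-apart or near-identical pairs.
bad_case = window_is_good([(0.0, 1.0), (0.05, 1.0), (2.0, 1.0)], 0.1, 0.1)
good_case = window_is_good([(0.0, 1.0), (1e-9, 1.0), (2.0, 1.0)], 0.1, 0.1)
```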

We give a number of claims that will be useful in the case in which we have a good Window $W(\epsilon)$ . So suppose that the Window $W(\epsilon)$ is good.

Claim 29. The relation of being at parameter distance at most $R(\epsilon,\delta)$ is an equivalence relation on the Gaussians of $F$ .

Consider Gaussians $F_{1},F_{2}$ and $F_{3}$ such that $F_{1}$ and $F_{2}$ are at parameter distance at most $R(\epsilon,\delta)$ and $F_{2}$ and $F_{3}$ are also at parameter distance at most $R(\epsilon,\delta)$ . Then $D_{p}(F_{1},F_{3})\leq D_{p}(F_{1},F_{2})+D_{p}(F_{2},F_{3})\leq 2R(\epsilon,\delta)<<\epsilon$ and since there is no pair of Gaussians with parameter distance inside the Window $W(\epsilon)$ , this implies that $D_{p}(F_{1},F_{3})\leq R(\epsilon,\delta)$ . ∎

We will let $\mathcal{E}=\{\mathcal{E}_{1},\mathcal{E}_{2},...\mathcal{E}_{k^{\prime}}\}$ be the equivalence class of Gaussians at parameter distance at most $R(\epsilon,\delta)$ . We let $\pi_{\mathcal{E}}:[k]\rightarrow[k^{\prime}]$ be the mapping function that maps a Gaussian $F_{j}$ to the corresponding equivalence class $\mathcal{E}_{i}$ (i.e. $\pi_{\mathcal{E}}(j)=i$ ). From this equivalence class and this mapping function, we can define a natural $R(\epsilon,\delta)$ -correct subdivision of $F$ .

We define the natural $R(\epsilon,\delta)$ -correct subdivision $\hat{F}^{\mathcal{E}}$ as a mixture of $k^{\prime}$ Gaussians in which $\hat{F}^{\mathcal{E}}_{i}$ is an arbitrarily chosen representative $F_{j}$ from $\mathcal{E}_{i}$ (so $\pi_{\mathcal{E}}(j)=i$ ), and $\hat{w}^{\mathcal{E}}_{i}=\sum_{j\in\pi_{\mathcal{E}}^{-1}(i)}w_{j}$ .

Clearly, $(\hat{F}^{\mathcal{E}},\pi_{\mathcal{E}})\in\mathcal{D}_{R(\epsilon,\delta)}(F)$ , and $\hat{F}^{\mathcal{E}}$ actually is an $R(\epsilon,\delta)$ -correct subdivision.
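The construction of the equivalence classes and of the natural subdivision can be sketched with a small union-find. The parameter distance is again assumed to be $|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|$ , components are hypothetical (weight, mean, variance) triples, and the representative of each class is simply the union-find root, which is one arbitrary choice among many.

```python
def parameter_distance(g1, g2):
    """Assumed form of D_p for Gaussians given as (mean, variance) pairs."""
    return abs(g1[0] - g2[0]) + abs(g1[1] - g2[1])

def natural_subdivision(components, radius):
    """Merge components at parameter distance <= radius.

    Returns (subdivision, pi), where pi[j] is the index of the class that
    component j is mapped to, and each class carries the summed weight.
    """
    k = len(components)
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(k):
        for j in range(i + 1, k):
            if parameter_distance(components[i][1:], components[j][1:]) <= radius:
                parent[find(i)] = find(j)

    roots, pi = [], []
    for j in range(k):
        r = find(j)
        if r not in roots:
            roots.append(r)
        pi.append(roots.index(r))

    subdivision = []
    for c, r in enumerate(roots):
        weight = sum(components[j][0] for j in range(k) if pi[j] == c)
        _, mu, var = components[r]  # arbitrary representative of the class
        subdivision.append((weight, mu, var))
    return subdivision, pi

# Hypothetical mixture: the first two components are nearly identical.
F = [(0.2, 0.0, 1.0), (0.3, 0.001, 1.0), (0.5, 2.0, 0.5)]
merged, mapping = natural_subdivision(F, 0.01)
```

When the Window is good, closeness is transitive, so the union-find classes coincide with the equivalence classes described in the text.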

Claim 30. Let $(\hat{F},\pi)\in\mathcal{D}_{R(\epsilon,\delta)}(F)$ ; then $\hat{F}^{\mathcal{E}}\in\mathcal{D}_{O(k)R(\epsilon,\delta)}(\hat{F})$ .

Let $k^{\prime},k^{\prime\prime}$ be the number of Gaussians in the GMMs $\hat{F}^{\mathcal{E}}$ and $\hat{F}$ respectively. Consider any two Gaussians $F_{i},F_{j}$ that are not mapped to the same equivalence class - i.e. $\pi_{\mathcal{E}}(i)\neq\pi_{\mathcal{E}}(j)$ . Since the Window $W(\epsilon)$ is good, this implies that $D_{p}(F_{i},F_{j})\geq\epsilon$ . So in order for $\hat{F}$ to be an $R(\epsilon,\delta)$ -correct subdivision, it must be the case that $\pi(i)\neq\pi(j)$ .

This means that $\pi$ as a partition is a refinement of the partition $\pi_{\mathcal{E}}$ . Formally, there must be some function $\pi_{int}:[k^{\prime\prime}]\rightarrow[k^{\prime}]$ such that $\pi_{\mathcal{E}}=\pi_{int}\circ\pi$ . Then it follows that $(\hat{F}^{\mathcal{E}},\pi_{int})\in\mathcal{D}_{O(k)R(\epsilon,\delta)}(\hat{F})$ : Consider any $i\in[k^{\prime}]$ .

And $\sum_{h\in\pi_{\mathcal{E}}^{-1}(i)}w_{h}=\sum_{j\in\pi_{int}^{-1}(i)}\sum_{h\in\pi^{-1}(j)}w_{h}$ so this implies that

And similarly for any $j\in\pi_{int}^{-1}(i)$ let $h\in\pi^{-1}(j)$ ,

where the last line follows because $\pi_{\mathcal{E}}(h)=i$ . ∎

Lemma 31. Suppose we are given a mixture of Gaussians $F=\sum_{i=1}^{k}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ that is in near isotropic position, where $w_{i}\geq\epsilon$ and the Window $W(\epsilon)$ is good and suppose further that $\sigma_{i}^{2}\geq\frac{1}{2}$ . Let $(\hat{F},\pi)\in\mathcal{D}_{R(\epsilon,\delta)}(F)$ . Then with probability at least $1-2\delta$ , the output of the Basic Univariate Algorithm is a GMM $N$ such that $N\in\mathcal{D}_{O(k)\epsilon}(\hat{F})$ .

Let $\mathcal{E}=\{\mathcal{E}_{1},\mathcal{E}_{2},...\mathcal{E}_{k^{\prime}}\}$ be the equivalence class of Gaussians at parameter distance at most $R(\epsilon,\delta)$ (see Claim 29), and let $\hat{F}^{\mathcal{E}}$ and $\pi_{\mathcal{E}}$ be the natural $R(\epsilon,\delta)$ -correct subdivision for $F$ and corresponding mapping function.

Let $k^{\prime\prime}$ be the number of components in $\hat{F}$ . Then we can apply Claim 30 and this implies that $\hat{F}^{\mathcal{E}}$ is an $O(k)R(\epsilon,\delta)$ -correct subdivision for $\hat{F}$ .

Using Lemma 28, this implies that $D(F,\hat{F}^{\mathcal{E}})+D(\hat{F}^{\mathcal{E}},\hat{F})\leq O(k^{2})R(\epsilon,\delta)$ . So this implies that given $\frac{1}{Q(\epsilon,\delta)}$ samples taken from $F$ when running the Basic Univariate Algorithm, with probability at least

we can assume that all samples came from $\hat{F}^{\mathcal{E}}$ (because there is a coupling between $F$ and $\hat{F}^{\mathcal{E}}$ that fails with probability at most $D(F,\hat{F}^{\mathcal{E}})$ and with probability at least $1-\delta$ this coupling will never fail, given the number of samples obtained from $F$ ).

When the Basic Univariate Algorithm is run on $\hat{F}^{\mathcal{E}}$ , the constraints of the Basic Univariate Algorithm are met because for all $i\neq j$ , $D_{p}(\hat{F}^{\mathcal{E}}_{i},\hat{F}^{\mathcal{E}}_{j})\geq\epsilon$ because the Window $W(\epsilon)$ is good. So with probability at least $1-\delta$ , the Basic Univariate Algorithm (when run on $\hat{F}^{\mathcal{E}}$ ) will return an $\epsilon$ -correct subdivision $N$ of $\hat{F}^{\mathcal{E}}$ (in fact, a stronger guarantee is true because the Basic Univariate Algorithm will actually return an estimate $N$ that has $k^{\prime}$ components, which matches the number of components in $\hat{F}^{\mathcal{E}}$ ). Then we can apply Lemma 24, and $N$ must then be an $O(k)\epsilon$ -correct subdivision for $\hat{F}$ . ∎

D.3 Reaching a Consensus

We call a sequence of GMMs, $F^{1},F^{2},...F^{r}$ an $\epsilon$ -correct chain if for all $i\in[r-1]$ , $F^{i+1}\in\mathcal{D}_{\epsilon}(F^{i})$

Suppose we are given a mixture of Gaussians $F=\sum_{i=1}^{k}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2},x)$ that is in isotropic position, where $w_{i}\geq\epsilon$ . Then the General Univariate Algorithm will return a GMM of $k^{\prime}\leq k$ Gaussians $\hat{F}$ such that $\hat{F}$ is an $\epsilon$ -correct subdivision of $F$ .

Given $\epsilon$ , we first define a sequence of parameters where

Suppose first that each Gaussian in $F$ has variance at least $\frac{1}{2}$ . Then in this case, the idea is to run the Basic Univariate Algorithm for a number of different precisions, each of which corresponds to a particular Window. We will choose parameters so that these Windows are disjoint, and then because a Window is bad iff there is some pair of Gaussians with parameter distance contained inside the Window, at most ${k\choose 2}<\frac{k^{2}}{2}$ Windows can be bad. So this will guarantee that a strict majority of the computations are correct.

To formalize this, given the sequence of parameters $\epsilon_{1},\epsilon_{2},...\epsilon_{k^{2}+1}$ we define a sequence of Windows $\mathcal{W}=W(\epsilon_{1}),W(\epsilon_{2}),...W(\epsilon_{k^{2}+1})$ .

The sequence of Windows $\mathcal{W}$ is disjoint

If we consider the Window $W(\epsilon_{i})$ , the largest value contained in any Window $W(\epsilon_{j})$ for $j>i$ is the largest value contained in the Window $W(\epsilon_{i+1})$ which is $\epsilon_{i+1}$ . Yet $\epsilon_{i+1}=S(\epsilon_{i},\delta)$ and the lower bound for the Window $W(\epsilon_{i})$ is $R(\epsilon_{i},\delta)$ and $R(\epsilon_{i},\delta)>>S(\epsilon_{i},\delta)$ . Similarly, the smallest value in $W(\epsilon_{j})$ for $j\leq i$ is the smallest value in $W(\epsilon_{i})$ . So this implies that for any $i$ , the set of Windows $W(\epsilon_{1}),W(\epsilon_{2}),...W(\epsilon_{i})$ are separable from the set of Windows $W(\epsilon_{i+1}),W(\epsilon_{i+2}),...W(\epsilon_{k^{2}+1})$ and this implies the claim. ∎
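The disjointness argument and the counting of bad Windows can be illustrated numerically. The sketch below uses the recurrence $\epsilon_{i+1}=S(\epsilon_{i},\delta)$ from the proof above; the constants and the list of pairwise parameter distances are placeholder assumptions.

```python
# Placeholder constants; the text only requires them to be large enough.
C_B, c_B = 1.0, 1.0
C_1, C_2 = 4.0, 4.0

def R(eps, delta, k):
    return (eps * delta) ** c_B / (C_B * C_1 * k ** 4)

def S(eps, delta, k):
    return R(eps, delta, k) / (C_2 * k ** 4)

def window_sequence(eps_1, delta, k):
    """Windows [R(eps_i), eps_i] for i = 1 .. k^2 + 1 with eps_{i+1} = S(eps_i)."""
    windows = []
    eps = eps_1
    for _ in range(k * k + 1):
        windows.append((R(eps, delta, k), eps))
        eps = S(eps, delta, k)
    return windows

def bad_window_indices(windows, distances):
    """Indices of Windows that contain at least one pairwise parameter distance."""
    return {
        i
        for i, (low, high) in enumerate(windows)
        if any(low <= d <= high for d in distances)
    }

# Hypothetical instance with k = 3 components, hence 3 pairwise distances.
k = 3
windows = window_sequence(0.01, 0.1, k)
bad = bad_window_indices(windows, [0.005, 1e-10, 2.0])
```

Because the Windows are disjoint, each pairwise distance can spoil at most one of them, so at most $\binom{k}{2}$ of the $k^{2}+1$ Windows are bad.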

Suppose running the Basic Univariate Algorithm on Window $W(\epsilon_{i})$ returns an estimate $\hat{F}^{i}$ .

Given any subset of indices $T\subset[k^{2}+1]$ , let $i_{1}>i_{2}>...i_{j}>...i_{|T|}$ be the indices in $T$ arranged in decreasing order. We can generate a sequence of estimates $\hat{F}_{T}^{1},\hat{F}_{T}^{2},...\hat{F}_{T}^{|T|}$ in which $\hat{F}_{T}^{j}=\hat{F}^{i_{j}}$ . Also let $prec(\hat{F}_{T}^{j})=\epsilon_{i_{j}}$ , which corresponds to the precision of the Window that returned the estimate $\hat{F}_{T}^{j}=\hat{F}^{i_{j}}$ . We call this sequence the $T$ -sequence of estimates.

Note that this sequence of estimates $\hat{F}_{T}^{1},...\hat{F}_{T}^{|T|}$ is arranged in order of coarsening precision - i.e. $prec(\hat{F}_{T}^{i})<<prec(\hat{F}_{T}^{i+1})$ .

$S(prec(\hat{F}_{T}^{j}),\delta)\geq prec(\hat{F}_{T}^{j-1})$

Let $i_{1}>i_{2}>...i_{j}>...i_{|T|}$ be the indices in $T$ arranged in decreasing order. So $i_{j-1}>i_{j}$ . Then $S(prec(\hat{F}_{T}^{j}),\delta)=S(\epsilon_{i_{j}},\delta)=\epsilon_{i_{j}+1}$ . And because $i_{j-1}\geq i_{j}+1$ , it implies that $\epsilon_{i_{j-1}}\leq\epsilon_{i_{j}+1}$ , and this yields the claim. ∎

Let $G\subset[k^{2}+1]$ be the set of indices of Windows that are good - i.e. $W(\epsilon_{i})$ is good iff $i\in G$ . Then let $\hat{F}_{G}^{1},\hat{F}_{G}^{2},...\hat{F}_{G}^{|G|}$ be the $G$ -sequence of estimates. Because the sequence of Windows $\mathcal{W}$ is disjoint, and each pair of Gaussians (and the corresponding parameter distance) can only make a single Window bad, the set $G$ is a strict majority - i.e. $|G|>|[k^{2}+1]-G|$ .

The $G$ -sequence of estimates is an $O(k)\epsilon_{1}$ -correct chain, and $\hat{F}_{G}^{1}$ is an $O(k)\epsilon_{1}$ -correct subdivision for $F$ .

Let $\epsilon^{\prime}_{1}<<\epsilon^{\prime}_{2}<<...\epsilon^{\prime}_{|G|}$ be the sequence of precisions given by $prec(\hat{F}_{G}^{1}),prec(\hat{F}_{G}^{2}),...prec(\hat{F}_{G}^{|G|})$ .

Using Lemma 31, $\hat{F}_{G}^{1}\in\mathcal{D}_{O(k)\epsilon^{\prime}_{1}}(F)$ . Because $O(k)\epsilon^{\prime}_{1}\leq O(k)S(\epsilon^{\prime}_{2},\delta)\leq R(\epsilon^{\prime}_{2},\delta)$ (using the above claim) this implies that $\hat{F}_{G}^{1}\in\mathcal{D}_{R(\epsilon^{\prime}_{2},\delta)}(F)$ and so we can apply Lemma 31 again and $\hat{F}^{2}\in\mathcal{D}_{O(k)\epsilon^{\prime}_{2}}(\hat{F}^{1})$ . Continuing this argument, for all $i$ , $\hat{F}^{i+1}\in\mathcal{D}_{O(k)\epsilon^{\prime}_{i}}(\hat{F}^{i})$ . Since $O(k)\epsilon^{\prime}_{i}\leq O(k)\epsilon^{\prime}_{|G|}\leq O(k)\epsilon_{1}$ , the sequence $\hat{F}^{1},\hat{F}^{2},...\hat{F}^{|G|}$ is an $O(k)\epsilon_{1}$ -correct chain. ∎

Given a subset $G^{\prime}\subset[k^{2}+1]$ , we can check if the $G^{\prime}$ -sequence of estimates is an $O(k)\epsilon_{1}$ -correct chain because this property is only a function of the estimates. Then if we consider all sets in $2^{[k^{2}+1]}$ , we will find some set $G^{\prime}\subset[k^{2}+1]$ that is a strict majority (i.e. $|G^{\prime}|>|[k^{2}+1]-G^{\prime}|$ ) and for which the $G^{\prime}$ -sequence of estimates is an $O(k)\epsilon_{1}$ -correct chain. Because $G^{\prime}$ is a strict majority, and a strict majority $G$ of the Windows are good, $G\cap G^{\prime}\neq\emptyset$ . Suppose that $g\in G\cap G^{\prime}$ , and let $j$ be the value such that $g$ is the $j^{th}$ largest index in $G^{\prime}$ .

Given the $G^{\prime}$ -sequence of estimates, we can take the sequence $\mathcal{S}=F,\hat{F}_{G^{\prime}}^{j},\hat{F}_{G^{\prime}}^{j+1},...\hat{F}_{G^{\prime}}^{|G^{\prime}|}$ . Since the index $g$ corresponds to a good Window ( $W(\epsilon_{g})$ is good), the computation $\hat{F}_{G^{\prime}}^{j}$ (which corresponds to the estimate $\hat{F}^{g}$ ) is at least an $O(k)\epsilon_{1}$ -correct subdivision of $F$ . So the sequence $\mathcal{S}$ is an $O(k)\epsilon_{1}$ -correct chain. So we can apply Lemma 24, and this implies that $\hat{F}_{G^{\prime}}^{|G^{\prime}|}$ (i.e. the last estimate in the sequence $\mathcal{S}$ ) is an $(Ck)^{k^{2}+1}\epsilon_{1}$ -correct subdivision for $F$ . Since $(Ck)^{k^{2}+1}\epsilon_{1}\leq\epsilon$ , this implies that $\hat{F}_{G^{\prime}}^{|G^{\prime}|}$ is an $\epsilon$ -correct subdivision for $F$ .
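The pigeonhole step in this argument (any two strict majorities of the same index set must share an index) can be checked exhaustively for small index sets. The following is a minimal sketch; the function names are ours and are not part of the algorithm.

```python
from itertools import combinations


def is_strict_majority(subset, m):
    """True if subset holds more than half of the m indices 1..m."""
    return len(subset) > m - len(subset)


def all_majorities_intersect(m):
    """Exhaustively check that any two strict majorities of {1..m} intersect."""
    universe = range(1, m + 1)
    majorities = [
        frozenset(c)
        for size in range(m + 1)
        for c in combinations(universe, size)
        if is_strict_majority(c, m)
    ]
    # A non-empty intersection is truthy, an empty one is falsy.
    return all(a & b for a in majorities for b in majorities)


# For k = 2 there are k^2 + 1 = 5 Windows.
print(all_majorities_intersect(5))
```

This is why it suffices to find any strict majority $G^{\prime}$ whose sequence of estimates forms a correct chain: it must contain at least one index of a good Window.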

However, we have assumed thus far in the proof of this theorem that each Gaussian has variance at least $\frac{1}{2}$ . So given samples from $F$ , we can add random noise to each sample. We add Gaussian noise of variance $\frac{1}{2}$ and mean $0$ , and this corresponds to convolving the original distribution $F$ by $\mathcal{N}(0,\frac{1}{2})$ to obtain a new distribution $F^{\prime}$ . Then this distribution $F^{\prime}$ has each Gaussian with variance at least $\frac{1}{2}$ and is also in nearly isotropic position - because the original mixture $F$ was in isotropic position, and convolving by $\mathcal{N}(0,\frac{1}{2})$ just adds $\frac{1}{2}$ to the variance of the mixture ( $var(F^{\prime})=\frac{1}{2}+var(F)$ ).

Using the above argument, we can recover an estimate $\hat{F}_{G^{\prime}}^{|G^{\prime}|}$ that is an $\epsilon$ -correct subdivision for $F^{\prime}$ . We can subtract $\frac{1}{2}$ from the variance of each component in $\hat{F}_{G^{\prime}}^{|G^{\prime}|}$ , and then using Claim 27 this resulting mixture $N$ will be an $\epsilon$ -correct subdivision for $F$ .
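The noise-and-subtract step can be illustrated numerically. The sketch below uses a hypothetical univariate mixture (weights, means and variances chosen by us so that the mixture has mean $0$ and variance $1$), adds $\mathcal{N}(0,\frac{1}{2})$ noise to each sample, and shows how $\frac{1}{2}$ is subtracted from estimated component variances afterwards. The helper names are illustrative only.

```python
import numpy as np


def sample_mixture(weights, means, variances, size, rng):
    """Draw samples from a univariate Gaussian mixture."""
    comp = rng.choice(len(weights), size=size, p=weights)
    means = np.asarray(means)
    stds = np.sqrt(np.asarray(variances))
    return rng.normal(means[comp], stds[comp])


def add_noise(samples, rng, noise_var=0.5):
    """Convolve the sampled distribution with N(0, noise_var)."""
    return samples + rng.normal(0.0, np.sqrt(noise_var), size=samples.shape)


def denoise_variances(estimated_variances, noise_var=0.5):
    """Undo the convolution on the level of component variances."""
    return [v - noise_var for v in estimated_variances]


rng = np.random.default_rng(0)
# Hypothetical mixture: mean 0, variance 1, one component far narrower than 1/2.
weights, means, variances = [0.5, 0.5], [-0.9, 0.9], [0.01, 0.37]
x = sample_mixture(weights, means, variances, 200_000, rng)
x_noisy = add_noise(x, rng)
# Population variances are 1.0 and 1.5; the printed values are sample estimates.
print(np.var(x), np.var(x_noisy))
```

Every component of the noisy mixture has variance at least $\frac{1}{2}$, while the component means and weights are unchanged.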

Appendix E Exponential Dependence on $k$ is Inevitable

We restate the main proposition that we prove in this section:

Proposition 15. There exists a pair $D_{1},D_{2}$ of $1/(4k^{2}+2)$ -standard distributions that are each mixtures of $k^{2}+1$ Gaussians such that

The following lemma will be helpful in the proof of correctness of our construction.

Let $F_{k}(x)=c_{k}\sum_{i=-\infty}^{\infty}\frac{1}{\sqrt{\pi}}e^{-(i/k)^{2}}\mathcal{N}(i/k,1/2,x),$ where $c_{k}$ is a constant chosen so as to make $F_{k}$ a distribution.

The probability density function $F_{k}(x)$ can be rewritten as $F_{k}(x)=\left(c_{k}C_{1/k}(x)\mathcal{N}(0,1/2,x)\right)\circ\mathcal{N}(0,1/2,x),$ where $C_{1/k}(x)$ denotes the infinite comb function, consisting of delta functions spaced a distance $1/k$ apart, and $\circ$ denotes convolution. Considering the Fourier transform, we see that

It is now easy to see why the lemma should be true, since the transformed comb has delta functions spaced at a distance $k$ apart, and we’re convolving by a Gaussian of variance 2 (essentially yielding nonoverlapping Gaussians with centers at multiples of $k$ ), and then multiplying by a Gaussian of variance 2. The final multiplication will nearly kill off all the Gaussians except the one centered at 0, yielding a Gaussian with variance $1$ centered at the origin, whose inverse transform will yield a Gaussian of variance 1, as claimed.

To make the details rigorous, observe that the total Fourier mass of $\mathcal{F}(F_{k})$ that ends up within the interval $[-k/2,k/2]$ contributed by the delta functions aside from the one at the origin, even before the final multiplication by $\mathcal{N}(0,2),$ is bounded by the following:

Additionally, this $L_{1}$ Fourier mass is an upper bound on the $L_{2}$ Fourier mass. The total $L_{1}$ Fourier mass (which bounds the $L_{2}$ mass) outside the interval $[-k/2,k/2]$ contributed by the delta functions aside from the one at the origin is bounded by

From Plancherel’s Theorem: $F_{k}$ , the inverse transform of $\mathcal{F}(F_{k})$ , is a distribution whose $L_{2}$ distance from a single Gaussian (possibly scaled) of variance 1 is at most $8e^{-k^{2}/8}.$ To translate this $L_{2}$ distance to $L_{1}$ distance, note that the contribution to the $L_{1}$ norm from outside the interval $[-k,k]$ is bounded by $4\int_{k}^{\infty}\mathcal{N}(0,1,x)dx\leq 4\frac{1}{k\sqrt{2\pi}}e^{-k^{2}/2}.$ Since the magnitude of the derivative of $F_{k}-c_{k}k\frac{1}{2\sqrt{2\pi}}\mathcal{N}(0,1)$ is at most 2 and the value of $F_{k}(x)-c_{k}k\frac{1}{2\sqrt{2\pi}}\mathcal{N}(0,1,x)$ is close to $0$ at the endpoints of the interval $[-k,k]$ , we have

which, combined with the above bounds on the $L_{2}$ distance, yields $\max_{x\in[-k,k]}(|F_{k}(x)-c_{k}k\frac{1}{2\sqrt{2\pi}}\mathcal{N}(0,1,x)|)\leq(72e^{-k^{2}/8})^{1/3}.$ Thus we have

The lemma follows from the additional observation that

where the minimization is taken to be over all functions that are probability density functions. ∎
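As a numerical sanity check of the lemma (not a substitute for the proof), one can evaluate a truncated version of $F_{k}$ on a grid and measure its $L_{1}$ distance from $\mathcal{N}(0,1)$. The sketch below makes several assumptions of ours: the infinite sum is truncated to centers with $|i/k|\leq 12$, the constant $c_{k}$ is replaced by explicit normalization of the weights, and the integral is approximated by a Riemann sum.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)


def comb_mixture(x, k, span=12.0):
    """Truncated F_k: centers i/k with |i/k| <= span, weights proportional to N(0, 1/2, i/k)."""
    centers = np.arange(-int(span * k), int(span * k) + 1) / k
    weights = normal_pdf(centers, 0.0, 0.5)
    weights /= weights.sum()  # plays the role of the constant c_k
    comps = normal_pdf(x[None, :], centers[:, None], 0.5)
    return (weights[:, None] * comps).sum(axis=0)


def l1_distance_to_standard_normal(k):
    """Riemann-sum approximation of the L1 distance between F_k and N(0, 1)."""
    x = np.linspace(-12.0, 12.0, 24001)
    diff = np.abs(comb_mixture(x, k) - normal_pdf(x, 0.0, 1.0))
    return diff.sum() * (x[1] - x[0])


for k in (1, 2):
    print(k, l1_distance_to_standard_normal(k))
```

The computed distance shrinks rapidly as $k$ grows, in line with the inverse exponential dependence on $k$ that the lemma asserts.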

Proof of Proposition 15: We will construct a pair of $1/(4k^{2}+2)$ -standard distributions, $D_{1},D_{2}$ , that are mixtures of $k^{2}+1$ Gaussians, whose statistical distance is inverse exponential in $k$ . Let

where $c_{k}^{\prime}$ is a constant chosen so as to make $c_{k}^{\prime}\sum_{i=-k^{2}}^{k^{2}}\mathcal{N}(0,1/2,i/k)\mathcal{N}(i/k,1/2)$ a distribution. Clearly the pair of distributions is $1/(4k^{2}+2)$ -standard, since all weights are at least $1/(4k^{2}+2),$ and the Gaussian component of $D_{1}$ centered at $0$ cannot be paired with any component of $D_{2}$ without having a discrepancy in parameters of at least $1/2k.$

We now argue that $D_{1},D_{2}$ are statistically close. Let $D_{2}^{\prime}=c_{k}^{\prime}\sum_{i=-k^{2}}^{k^{2}}\mathcal{N}(0,1/2,i/k)\mathcal{N}(i/k,1/2).$ Note that $\int_{k}^{\infty}F_{k}(x)dx\leq\int_{k}^{\infty}\mathcal{N}(0,1/2,x)2\max_{y}(\mathcal{N}(0,1/2,y))dx\leq\frac{2\sqrt{2}}{k\sqrt{\pi}}e^{-k^{2}}\leq 2e^{-k^{2}},$ and thus $||D_{2}^{\prime}-F_{k}||_{1}\leq 8e^{-k^{2}},$ and our claim follows from Lemma 36. $\Box$

Appendix F Partition Pursuit

We first need to ensure that if we consider two directions $r,r_{x,y}$ that are $\epsilon_{2}$ -close, the parameters of a component in $P_{u}[F]$ cannot change too much as we vary $u$ from $r$ to $r_{x,y}$ .

Given a mixture of $k$ $n$ -Dimensional Gaussians $F=\sum_{i}w_{i}F_{i}$ that is in isotropic position and is $\epsilon$ -statistically learnable, for all $i$ , $\|\mu_{i}\|,\|\Sigma_{i}\|_{2}\leq\frac{1}{\epsilon}$ .

For all $i,j$ we have $\|\mu_{i}-\mu_{j}\|\leq\frac{1}{\epsilon}$ : if we project onto the direction $\frac{\mu_{i}-\mu_{j}}{\|\mu_{i}-\mu_{j}\|}$ , the variance of the mixture $F$ is $1$ and is also at least $w_{i}w_{j}\|\mu_{i}-\mu_{j}\|^{2}$ , and this implies that $\|\mu_{i}-\mu_{j}\|\leq\frac{1}{\epsilon}$ . Yet the convex hull of $\mu_{i}$ for all $i$ contains the origin and so

Similarly, for any $i\in[k]$ , if we choose $u$ corresponding to the direction of the maximum eigenvector of $\Sigma_{i}$ ,

and so $\|\Sigma_{i}\|_{2}\leq\frac{1}{\epsilon}$ . ∎

Suppose $F$ is an $n$ -dimensional GMM that is $\epsilon$ -statistically learnable.

Let $\hat{F}^{u},\hat{F}^{v}$ be univariate mixtures of Gaussians. Then we call components $\hat{F}^{u}_{a},\hat{F}^{v}_{b}$ paired estimates if there is some $\pi_{u},\pi_{v}$ and $i\in[k]$ such that $\pi_{u}(i)=a,\pi_{v}(i)=b$ and $(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F])$ and $(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F])$ .

Let $(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F])$ and $(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F])$ , then for every component $\hat{F}^{u}_{a}$ in $\hat{F}^{u}$ , there is some component $\hat{F}^{v}_{b}$ such that the components $\hat{F}^{u}_{a},\hat{F}^{v}_{b}$ are paired estimates.

This follows because $\pi_{u}$ is onto. ∎

Suppose $u,v$ are $\epsilon_{2}$ -close (i.e. $\|u-v\|\leq\epsilon_{2}$ ), and let $\hat{F}^{u}_{a},$ and $\hat{F}^{v}_{b}$ be paired estimates.

$D_{p}(\hat{F}^{u}_{a},\hat{F}^{v}_{b})\leq 2\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}$ .

Since $\hat{F}^{u}_{a},\hat{F}^{v}_{b}$ are paired estimates, there are $\pi_{u},\pi_{v}$ and $i\in[k]$ such that $\pi_{u}(i)=a,\pi_{v}(i)=b$ and $(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F])$ and $(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F])$ . Then

Then this implies that $D_{p}(P_{u}[F_{i}],P_{v}[F_{i}])\leq\|\mu_{i}\|\|u-v\|+2\|\Sigma_{i}\|_{2}\epsilon_{2}+\|\Sigma_{i}\|_{2}\epsilon_{2}^{2}$ and if we apply Claim 37, this is at most $\frac{4\epsilon_{2}}{\epsilon}$ , and this implies the claim. ∎
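The perturbation bound used in this proof can be checked numerically on random instances. The sketch below tests the two ingredients separately, namely $|\mu\cdot u-\mu\cdot v|\leq\|\mu\|\|u-v\|$ and $|u^{T}\Sigma u-v^{T}\Sigma v|\leq 2\|\Sigma\|_{2}\epsilon_{2}+\|\Sigma\|_{2}\epsilon_{2}^{2}$ for unit vectors $u,v$ with $\|u-v\|=\epsilon_{2}$. How these combine into $D_{p}$ depends on the definition of parameter distance given earlier in the paper; the random test instances and function names here are our own.

```python
import numpy as np


def projected_params(mu, sigma, u):
    """Mean and variance of N(mu, sigma) after projecting onto direction u."""
    return float(mu @ u), float(u @ sigma @ u)


def claim_bounds(mu, sigma, eps2):
    """Upper bounds on the change in projected mean and projected variance."""
    op_norm = np.linalg.norm(sigma, 2)  # spectral norm
    return np.linalg.norm(mu) * eps2, 2 * op_norm * eps2 + op_norm * eps2 ** 2


def check_perturbation(trials=200, n=4, seed=0):
    """Return True if both bounds hold on every random instance."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        mu = rng.normal(size=n)
        a = rng.normal(size=(n, n))
        sigma = a @ a.T  # random positive semidefinite covariance
        u = rng.normal(size=n)
        u /= np.linalg.norm(u)
        v = u + 0.05 * rng.normal(size=n)
        v /= np.linalg.norm(v)
        eps2 = np.linalg.norm(u - v)
        mean_u, var_u = projected_params(mu, sigma, u)
        mean_v, var_v = projected_params(mu, sigma, v)
        mean_bound, var_bound = claim_bounds(mu, sigma, eps2)
        if abs(mean_u - mean_v) > mean_bound + 1e-12:
            return False
        if abs(var_u - var_v) > var_bound + 1e-12:
            return False
    return True


print(check_perturbation())
```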

F.2 Reconstruction

Lemma 7. Let $\epsilon_{2},\epsilon_{1}>0$ . Suppose $|m^{0}-\mu\cdot r|$ , $|{m}^{ij}-\mu\cdot r^{ij}|$ , $|v^{0}-r^{T}\Sigma r|$ , $|v^{ij}-(r^{ij})^{T}\Sigma r^{ij}|$ are all at most $\epsilon_{1}$ . Then Solve outputs $\hat{\mu}\in{\bf R}^{n}$ and $\hat{\Sigma}\in{\bf R}^{n\times n}$ such that $\|\hat{\mu}-\mu\|<\frac{\epsilon_{1}\sqrt{n}}{\epsilon_{2}}$ , and $\|\hat{\Sigma}-\Sigma\|_{F}\leq\frac{6n\epsilon_{1}}{\epsilon_{2}^{2}}$ . Furthermore, $\hat{\Sigma}\succeq 0$ and $\hat{\Sigma}$ is symmetric.
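To illustrate why projected means and variances in roughly $n^{2}$ nearby directions determine $\mu$ and $\Sigma$, the following sketch recovers both by least squares. It is not the Solve procedure analyzed in Lemma 7, whose definition appears elsewhere in the paper. The direction set $r+\epsilon_{2}(e_{i}+e_{j})$, the least-squares formulation and the eigenvalue clipping are assumptions of ours.

```python
import numpy as np


def nearby_directions(r, eps2):
    """Hypothetical direction set: r together with r + eps2 * (e_i + e_j) for i <= j."""
    n = len(r)
    dirs = [np.array(r, dtype=float)]
    for i in range(n):
        for j in range(i, n):
            step = np.zeros(n)
            step[i] += eps2
            step[j] += eps2
            dirs.append(r + step)
    return np.array(dirs)


def solve_from_projections(directions, proj_means, proj_vars):
    """Least-squares recovery of (mu, Sigma) from projected means and variances."""
    U = np.asarray(directions, dtype=float)
    n = U.shape[1]
    mu_hat, *_ = np.linalg.lstsq(U, np.asarray(proj_means, dtype=float), rcond=None)

    # u^T Sigma u is linear in the upper-triangular entries of a symmetric Sigma.
    iu, ju = np.triu_indices(n)
    scale = np.where(iu == ju, 1.0, 2.0)
    design = U[:, iu] * U[:, ju] * scale
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(proj_vars, dtype=float), rcond=None)

    sigma_hat = np.zeros((n, n))
    sigma_hat[iu, ju] = coeffs
    sigma_hat[ju, iu] = coeffs

    # Clip negative eigenvalues so the output is symmetric positive semidefinite.
    vals, vecs = np.linalg.eigh(sigma_hat)
    sigma_hat = (vecs * np.clip(vals, 0.0, None)) @ vecs.T
    return mu_hat, sigma_hat


# Noiseless example with hypothetical parameters.
mu = np.array([0.3, -0.2, 0.5])
sigma = np.array([[1.0, 0.2, 0.0], [0.2, 0.5, 0.1], [0.0, 0.1, 0.8]])
r = np.array([1.0, 2.0, 3.0]) / np.sqrt(14.0)
dirs = nearby_directions(r, 0.1)
proj_means = dirs @ mu
proj_vars = np.einsum("li,ij,lj->l", dirs, sigma, dirs)
mu_hat, sigma_hat = solve_from_projections(dirs, proj_means, proj_vars)
```

Because the directions differ only by $O(\epsilon_{2})$, the linear system is poorly conditioned when $\epsilon_{2}$ is small, which is consistent with the factors $\frac{1}{\epsilon_{2}}$ and $\frac{1}{\epsilon_{2}^{2}}$ in the error bounds of the lemma.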

We will again need the notion of a Window:

Given a target additive error $\epsilon$ , we call a Window $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ well-separated if the following conditions hold:

$\max(\frac{\epsilon_{1}\sqrt{n}}{\epsilon_{2}},\frac{6n\epsilon_{1}}{\epsilon_{2}^{2}})\leq\epsilon$

$\frac{\epsilon_{2}}{\epsilon}+\epsilon_{1}<<\epsilon_{3}$

$\frac{\epsilon_{2}}{\epsilon}+\epsilon_{1}+\epsilon_{3}<<\epsilon_{4}$

Given any univariate estimate $\hat{F}$ that (weakly) satisfies a Window $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ , the set of components of $\hat{F}$ with parameter distance at most $\epsilon_{1}$ is an equivalence class.
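Computationally, the equivalence classes can be obtained by linking every pair of components at parameter distance at most $\epsilon_{1}$ and taking connected components. The sketch below assumes a stand-in for $D_{p}$ (sum of absolute differences of mean and variance); the actual definition of parameter distance is the one given earlier in the paper.

```python
def parameter_distance(a, b):
    """Hypothetical stand-in for D_p on (mean, variance) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def equivalence_classes(components, eps1):
    """Group component indices; two components are linked if D_p <= eps1."""
    parent = list(range(len(components)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if parameter_distance(components[i], components[j]) <= eps1:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


estimate = [(0.0, 1.0), (0.001, 1.0), (2.0, 0.5), (2.0005, 0.5002), (5.0, 1.0)]
print(equivalence_classes(estimate, eps1=0.01))
```

When the estimate satisfies the Window, linking is already transitive, so the connected components coincide with the sets of components at pairwise distance at most $\epsilon_{1}$.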

Let $u,v$ be two directions that are $\epsilon_{2}$ -close - i.e. $\|u-v\|\leq\epsilon_{2}$ . Suppose that $(\hat{F}^{u},\pi_{u})\in\mathcal{D}_{\epsilon_{1}}(P_{u}[F])$ and $(\hat{F}^{v},\pi_{v})\in\mathcal{D}_{\epsilon_{1}}(P_{v}[F])$ . Suppose further that $\hat{F}^{u}$ and $\hat{F}^{v}$ (weakly) satisfy the Window $(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ . Let $\mathcal{E}^{u}=\{\mathcal{E}^{u}_{1},\mathcal{E}^{u}_{2},...\mathcal{E}^{u}_{k^{\prime}}\}$ and $\mathcal{E}^{v}=\{\mathcal{E}^{v}_{1},\mathcal{E}^{v}_{2},...\mathcal{E}^{v}_{k^{\prime\prime}}\}$ be the equivalence classes of components of $\hat{F}^{u},\hat{F}^{v}$ respectively at parameter distance at most $\epsilon_{1}$ .

Then $k^{\prime}=k^{\prime\prime}$ , and there is a permutation $\pi_{u,v}:[k^{\prime}]\rightarrow[k^{\prime\prime}]$ such that $P_{u}[F_{j}]$ is mapped to the equivalence class $\mathcal{E}^{u}_{h}$ by the mapping $\pi_{u}$ iff $P_{v}[F_{j}]$ is mapped to the equivalence class $\mathcal{E}^{v}_{\pi_{u,v}(h)}$ by the mapping $\pi_{v}$ . Also we can construct $\pi_{u,v}$ from the estimates $\hat{F}^{u},\hat{F}^{v}$ .

To establish this claim, consider two distinct equivalence classes $\mathcal{E}^{u}_{a},\mathcal{E}^{u}_{b}$ , and let $\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}}$ be arbitrary representatives. For each component $\hat{F}^{u}_{a^{\prime}}$ in $\hat{F}^{u}$ , there is some component $P_{u}[F_{i}]$ in $P_{u}[F]$ that is mapped by $\pi_{u}$ to $\hat{F}^{u}_{a^{\prime}}$ . Then let $P_{u}[F_{i}],P_{u}[F_{j}]$ be mapped to $\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}}$ respectively - i.e. $\pi_{u}(i)=a^{\prime},\pi_{u}(j)=b^{\prime}$ . Then since $\hat{F}^{u}$ (weakly) satisfies the Window $W$ , we have that $D_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}})\geq\epsilon_{3}$ .

Suppose that $P_{v}[F_{i}],P_{v}[F_{j}]$ are mapped to $\hat{F}^{v}_{c^{\prime}},\hat{F}^{v}_{d^{\prime}}$ and these two components are in the same equivalence class in the mixture $\hat{F}^{v}$ . Then $D_{p}(\hat{F}^{v}_{c^{\prime}},\hat{F}^{v}_{d^{\prime}})\leq\epsilon_{1}$ . Yet $\hat{F}^{u}_{a^{\prime}},\hat{F}^{v}_{c^{\prime}}$ are paired estimates so using Claim 39, $D_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{v}_{c^{\prime}})\leq 2\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}$ , and similarly for $\hat{F}^{u}_{b^{\prime}},\hat{F}^{v}_{d^{\prime}}$ . Then $D_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}})\leq\epsilon_{1}+4\epsilon_{1}+\frac{8\epsilon_{2}}{\epsilon}$ using the triangle inequality, but this contradicts the above implication that $D_{p}(\hat{F}^{u}_{a^{\prime}},\hat{F}^{u}_{b^{\prime}})\geq\epsilon_{3}$ because $\epsilon_{3}>>\epsilon_{1}+\frac{\epsilon_{2}}{\epsilon}$ .

This implies that any two components in $\hat{F}^{u}$ that are in different equivalence classes must be paired to two components in $\hat{F}^{v}$ that are also in different equivalence classes. The claim is symmetric w.r.t. $u,v$ , so this implies that $\hat{F}^{u},\hat{F}^{v}$ have the same number of equivalence classes.

And also consider any component $\hat{F}^{u}_{a}$ . Using Claim 38, there is some component $\hat{F}^{v}_{b}$ so that $\hat{F}^{u}_{a},\hat{F}^{v}_{b}$ are paired estimates. Then using Claim 39, $D_{p}(\hat{F}^{u}_{a},\hat{F}^{v}_{b})\leq 2\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}$ . Yet for any component $\hat{F}^{v}_{c}$ that is not in the same equivalence class as $\hat{F}^{v}_{b}$ ,

where the last line follows because $\hat{F}^{v}$ (weakly) satisfies the Window $W$ . So we can construct $\pi_{u,v}$ given just $\hat{F}^{u},\hat{F}^{v}$ because for any pair of equivalence classes $\mathcal{E}^{u}_{i},\mathcal{E}^{v}_{j}$ , if there is a pair of Gaussians that are paired estimates, the parameter distance between any representative from $\mathcal{E}^{u}_{i}$ to any representative from $\mathcal{E}^{v}_{j}$ must be at most $4\epsilon_{1}+\frac{4\epsilon_{2}}{\epsilon}$ . Yet if there is no such pair of Gaussians, one from each equivalence class, that are paired estimates, the parameter distance between any representative from $\mathcal{E}^{u}_{i}$ to any representative from $\mathcal{E}^{v}_{j}$ is at least $\epsilon_{3}-2\epsilon_{1}-\frac{4\epsilon_{2}}{\epsilon}$ so we can distinguish these two cases because $\epsilon_{3}>>\epsilon_{1}+\frac{\epsilon_{2}}{\epsilon}$ . ∎

Let $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ be a well-separated window. Suppose for some root direction $r$ , and $n^{2}$ $\epsilon_{2}$ -close-by directions $r_{x,y}$ as in the Partition Pursuit Algorithm, we run the General Univariate Algorithm with precision $\epsilon_{1}$ and for each run we get an estimate $\hat{F}^{x,y}$ that (weakly) satisfies the Window $W$ . Then suppose we run Solve given the directions $r,r_{x,y}$ and the estimate $\hat{F}^{x,y}$ .

Solve returns an $n$ -dimensional estimate $\hat{F}$ that is an $\epsilon$ -correct subdivision of $F$ .

We can apply Lemma 41 and find a partition of all equivalence classes that arise in any estimate in any direction, into sets $\mathcal{E}^{h}=\{\mathcal{E}^{h}_{1},\mathcal{E}^{h}_{2},...\mathcal{E}^{h}_{n^{2}}\}$ with the property that for all $F_{i}$ , there is some $h$ such that in each direction $r_{x,y}$ , $F_{i}$ is mapped to some equivalence class in $\mathcal{E}^{h}$ . Suppose in direction $r_{x,y}$ , $F_{i}$ is mapped to the equivalence class $\mathcal{E}^{h}_{j}$ . Then we can take an arbitrary $\hat{F}^{h}_{j}$ in this set, and use these parameters as an estimate for the projected mean and projected variance of $P_{r_{x,y}}[F_{i}]$ and these estimates will be $2\epsilon_{1}$ close in parameter distance to the actual values. So we can apply Lemma 7, and the component $\hat{F}_{h}$ of the estimate $\hat{F}$ output from Solve has parameter distance at most $\epsilon$ to $F_{i}$ . So for every component $F_{i}$ , there will be some estimate $\hat{F}_{h}$ output from Solve that has parameter distance at most $\epsilon$ to $F_{i}$ . Additionally, for every set of equivalence classes $\mathcal{E}^{h}$ , there is some component $F_{i}$ with the property that in each direction $r_{x,y}$ , $F_{i}$ is mapped to some equivalence class in $\mathcal{E}^{h}$ . So the mapping from a component $F_{i}$ to an estimate $\hat{F}_{h}$ that is $\epsilon$ -close in parameter distance will be onto. Lastly, given any partition into sets $\mathcal{E}^{1},\mathcal{E}^{2},...\mathcal{E}^{k^{\prime}}$ , we can choose the weight $\hat{w}_{h}$ to be the sum of the estimated weights in any equivalence class $\mathcal{E}^{h}_{j}$ in the set, and because the General Univariate Algorithm returns an $\epsilon_{1}$ -correct subdivision, this aggregate weight $\hat{w}_{h}$ will be within an additive $k\epsilon_{1}$ of the actual aggregate weight of the components $F_{i}$ that are $\epsilon$ -close in parameter distance to $\hat{F}_{h}$ . ∎

F.3 Observed Components

Given precision $\epsilon_{1}$ (given to the General Univariate Algorithm), we say that the number of observed pairs in the estimate $\hat{F}$ returned is the maximum value of ${k^{\prime\prime}\choose 2}$ such that there is a subset of $k^{\prime\prime}$ components of $\hat{F}$ with the property that every pair is at parameter distance $>\epsilon_{1}$ . And we will say that the number of observed components is $k^{\prime\prime}$ .
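The definition above can be evaluated directly by brute force over subsets, which is affordable because an estimate has only a small number of components. As before, the parameter distance used in this sketch is a hypothetical stand-in for $D_{p}$, and the function names are ours.

```python
from itertools import combinations
from math import comb


def parameter_distance(a, b):
    """Hypothetical stand-in for D_p on (mean, variance) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def observed_components(components, eps1):
    """Size of the largest subset whose members are pairwise more than eps1 apart."""
    for size in range(len(components), 0, -1):
        for subset in combinations(components, size):
            if all(parameter_distance(a, b) > eps1
                   for a, b in combinations(subset, 2)):
                return size
    return 0


def observed_pairs(components, eps1):
    """Number of observed pairs: (k'' choose 2) for k'' observed components."""
    return comb(observed_components(components, eps1), 2)


estimate = [(0.0, 1.0), (0.001, 1.0), (2.0, 0.5), (2.0005, 0.5002), (5.0, 1.0)]
print(observed_components(estimate, 0.01), observed_pairs(estimate, 0.01))
```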

Suppose we are given any well-separated Window $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ , and an estimate $\hat{F}$ that (weakly) satisfies the Window $W$ . Suppose further that the set of equivalence classes $\mathcal{E}_{1},\mathcal{E}_{2},...\mathcal{E}_{k^{\prime}}$ (of components in $\hat{F}$ at parameter distance at most $\epsilon_{1}$ ) has $k^{\prime}$ elements.

The number of observed components is $k^{\prime}$ .

So let $u,v$ be two directions that are $\epsilon_{2}$ -close (i.e. $\|u-v\|\leq\epsilon_{2}$ ), and let $\hat{F}^{u},\hat{F}^{v}$ be the estimates returned by the General Univariate Algorithm when given target precision $\epsilon_{1}$ , for the directions $u$ , $v$ respectively. Suppose further that $\hat{F}^{u}$ (strongly) satisfies the Window $W$ .

Then the estimate $\hat{F}^{v}$ will either (weakly) satisfy the Window $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ , or the number of observed pairs in $\hat{F}^{v}$ is strictly more than the number observed in $\hat{F}^{u}$ .

Since the estimate $\hat{F}^{u}$ (strongly) satisfies the Window $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ , it also (weakly) satisfies this Window. So we can apply Claim 43 and this implies that there are $k^{\prime}$ observed components in the estimate $\hat{F}^{u}$ (if there are $k^{\prime}$ equivalence classes of components in $\hat{F}^{u}$ at parameter distance at most $\epsilon_{1}$ ).

Let $\hat{F}^{v}_{c},\hat{F}^{v}_{d}$ be two arbitrary components in $\hat{F}^{v}$ . We can apply Claim 38 to get two components $\hat{F}^{u}_{a},\hat{F}^{u}_{b}$ in $\hat{F}^{u}$ such that $\hat{F}^{u}_{a}$ and $\hat{F}^{v}_{c}$ are paired estimates, and similarly $\hat{F}^{u}_{b}$ and $\hat{F}^{v}_{d}$ are also paired estimates.

Suppose $\hat{F}^{u}_{a},\hat{F}^{u}_{b}$ are not in the same equivalence class in $\hat{F}^{u}$ . This implies that $D_{p}(\hat{F}^{u}_{a},\hat{F}^{u}_{b})\geq\epsilon_{4}$ because $\hat{F}^{u}$ (strongly) satisfies the Window $W$ . Using Claim 39, we get that

so this implies that the parameter distance $D_{p}(\hat{F}^{v}_{c},\hat{F}^{v}_{d})$ does not contribute to $\hat{F}^{v}$ not (weakly) satisfying $W$ .

So suppose $\hat{F}^{u}_{a},\hat{F}^{u}_{b}$ are in the same equivalence class in $\hat{F}^{u}$ . Then using Claim 39, we get that

because $D_{p}(\hat{F}^{u}_{a},\hat{F}^{u}_{b})\leq\epsilon_{1}$ .

This implies that the only way that the Window $W$ could fail to be (weakly) satisfied is if there is some pair $\hat{F}^{v}_{c},\hat{F}^{v}_{d}$ for which the paired estimates of each are in the same equivalence class in $\hat{F}^{u}$ , and yet $D_{p}(\hat{F}^{v}_{c},\hat{F}^{v}_{d})>\epsilon_{1}$ . So for each other equivalence class in $\hat{F}^{u}$ (other than the one that $\hat{F}^{u}_{a},\hat{F}^{u}_{b}$ are in), we can select a representative component $\hat{F}^{u}_{e}$ , and for each one we apply Claim 38 and find a corresponding component $\hat{F}^{v}_{e^{\prime}}$ . If we take this set together with $\hat{F}^{v}_{c},\hat{F}^{v}_{d}$ , this is a set of $k^{\prime}+1$ components, and using the above argument all pairs of distances are at least $\epsilon_{3}>>\epsilon_{1}$ , except for the pair $D_{p}(\hat{F}^{v}_{c},\hat{F}^{v}_{d})$ which is still $>\epsilon_{1}$ , so we have $k^{\prime}+1$ observed components in $\hat{F}^{v}$ if $\hat{F}^{v}$ does not (weakly) satisfy the Window $W$ . ∎

F.4 Partition Pursuit

Theorem 8. Given an $\epsilon$ -statistically learnable GMM $F$ in isotropic position, the Partition Pursuit Algorithm will recover an $\epsilon$ -correct sub-division $\hat{F}$ and if $F$ has more than one component, $\hat{F}$ also has more than one component.

Given an $\epsilon$ -statistically learnable GMM $F$ in $n$ dimensions (and in isotropic position), we can project onto a direction $r$ chosen uniformly at random. Using Lemma 13, we can instantiate the Partition Pursuit Algorithm with a Window $W=(\epsilon_{1},\epsilon_{2},\epsilon_{3},\epsilon_{4})$ with $\epsilon_{4}=poly(\epsilon,\frac{1}{n})$ so that there is at least one pair of Gaussians (with high probability) that when projected onto $r$ are at parameter distance at least $\epsilon_{4}$ . So when we run the General Univariate Algorithm after projecting onto the direction $r$ , the estimate returned $\hat{F}^{r}$ will have at least two components in order for it to be an $\epsilon_{1}$ correct subdivision for $P_{r}[F]$ .

If the estimate $\hat{F}^{r}$ returned by the General Univariate Algorithm does not (strongly) satisfy the Window $W$ , we can perform a shifting operation on the Window $W$ to obtain a new Window $W^{\prime}=(\epsilon_{1}^{\prime},\epsilon_{2}^{\prime},\epsilon_{3}^{\prime},\epsilon_{1})$ so that $W^{\prime}$ is also well-separated and the number of pairwise components observed has strictly increased. So eventually we can find a Window $W^{\prime}=(\epsilon^{\prime}_{1},\epsilon^{\prime}_{2},\epsilon^{\prime}_{3},\epsilon^{\prime}_{4})$ such that the estimate $\hat{F}^{r}$ returned by General Univariate Algorithm run with target precision $\epsilon_{1}$ (strongly) satisfies the Window. Because the number of observed components strictly increases each time we perform a shifting operation, the number of times that we must slide the Window is at most $k$ . And each time we slide a Window, the parameters of the new Window are polynomially related to the parameters in the old Window. So the precision $\epsilon^{\prime}_{1}$ of this Window will be some polynomial in the original precision $\epsilon_{1}$ .

So the total number of times that we need to slide the Window is at most $k$ , and this implies that the parameters we need remain polynomially lower-bounded in $\epsilon,\frac{1}{n}$ . And when we need to perform no more slides, we have reached a root direction $r$ such that the estimate returned by the General Univariate Algorithm is (strongly) consistent with the Window $W^{\prime}$ , and for each direction $r_{i,j}$ the estimate returned by the General Univariate Algorithm (weakly) satisfies the Window $W^{\prime}$ as well.

Using Claim 42, this implies that the output of our algorithm is an $n$ -dimensional $\epsilon$ -correct sub-division $\hat{F}$ for $F$ . ∎

Appendix G Clustering and Recursion

Suppose the estimate $\hat{F}$ returned by the Partition Pursuit Algorithm is an $\epsilon_{1}$ -correct subdivision for $F$ , but is not a good estimate in terms of statistical distance. The only way that this can happen is if there is some component of $F$ which has a co-variance matrix $\Sigma_{i}$ that has a very small eigenvalue. In this case, we can use this direction (i.e. the eigenvector corresponding to this eigenvalue) to cluster samples from $F$ into two sets, and proceed in each set by induction.

In this section, we give some simple claims that will be useful building blocks for deciding how to cluster. Specifically, we will need to choose some clustering scheme for samples coming from $F$ , so that there is some bi-partition of the components of $F$ into $S\subset[k]$ and $[k]-S$ such that any sample generated from $F_{i}$ ( $i\in S$ ) has a negligible probability of being mis-clustered.

Suppose we are given a set of $k$ points $x_{1},x_{2},...x_{k}\in\Re$ on the line, and the maximum distance between any pair is $\Delta$ . Then there is a bi-partition $A\subset\{x_{1},x_{2},...x_{k}\}$ , $B=\{x_{1},x_{2},...x_{k}\}-A$ such that $D(A,B)\geq\frac{\Delta}{2^{k-2}}$ (and $A,B\neq\emptyset$ ) and $diam(A),diam(B)\leq\Delta(1-\frac{1}{2^{k-2}})$ .

Assume that $x_{1}$ is at least as small as any other value in the set, and assume that $x_{2}$ is at least as large as any other value in the set. Then set $A_{2}=\{x_{1}\},B_{2}=\{x_{2}\}$ . Clearly $D(A_{2},B_{2})\geq\Delta$ . Consider the point $x_{3}$ . Either $D(A_{2},x_{3})$ or $D(B_{2},x_{3})$ must be at least $\frac{\Delta}{2}$ , using the triangle inequality (because $D(A_{2},B_{2})\geq\Delta$ ). Add the point $x_{3}$ to the side that it is closest to, and the resulting subsets $A_{3},B_{3}$ are at distance at least $\frac{\Delta}{2}$ . Iterating this procedure yields two subsets $A_{k},B_{k}$ that are disjoint, have $D(A_{k},B_{k})\geq\frac{\Delta}{2^{k-2}}$ and $A_{k}\cup B_{k}=\{x_{1},x_{2},...x_{k}\}$ . Also $diam(A_{k})=\max_{x_{i}\in A_{k}}x_{i}-x_{1}\leq x_{2}-D(A_{k},B_{k})-x_{1}\leq\Delta(1-\frac{1}{2^{k-2}})$ , and similarly for $B_{k}$ . So take $A=A_{k},B=B_{k}$ , and this implies the claim. ∎
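The greedy procedure in this proof is short enough to state as code. The sketch below seeds the two sides with the minimum and maximum point and then adds each remaining point to the side it is closer to. Processing the remaining points in sorted order and breaking ties toward $A$ are choices of ours; the proof allows any order.

```python
def bipartition(points):
    """Greedy split: seed with min and max, add each other point to the closer side."""
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    side_a, side_b = [pts[0]], [pts[-1]]
    for x in pts[1:-1]:
        dist_a = min(abs(x - a) for a in side_a)
        dist_b = min(abs(x - b) for b in side_b)
        (side_a if dist_a <= dist_b else side_b).append(x)
    return side_a, side_b


def set_distance(side_a, side_b):
    """Smallest distance between a point of one side and a point of the other."""
    return min(abs(a - b) for a in side_a for b in side_b)


points = [0, 1, 4, 9, 10]
A, B = bipartition(points)
delta = max(points) - min(points)
k = len(points)
# The guarantee of the claim: separation at least delta / 2^(k-2).
print(sorted(A), sorted(B), set_distance(A, B) >= delta / 2 ** (k - 2))
```

Each insertion can at worst halve the gap between the two sides, which is where the factor $2^{k-2}$ comes from.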

Given a set of $k$ points $x_{1},x_{2},...x_{k}\in\Re^{+}$ on the line that are strictly positive s.t. the maximum ratio of any two points in the set is $C>1$ . Then there is a bi-partition $A\subset\{x_{1},x_{2},...x_{k}\}$ , $B=\{x_{1},x_{2},...x_{k}\}-A$ such that for all $x_{i}\in A,x_{j}\in B$ ,

(and $A,B\neq\emptyset$ ) and also for all $x_{i},x_{j}\in A$ , $\frac{x_{i}}{x_{j}}\leq C^{1-\frac{1}{2^{k}}}$ and also for all $x_{i},x_{j}\in B$ , $\frac{x_{i}}{x_{j}}\leq C^{1-\frac{1}{2^{k}}}$ .

Let $y_{1},y_{2},...y_{k}\in\Re$ be the logarithm of each point $x_{i}$ - i.e. $y_{i}=\log x_{i}$ . Then the maximum distance between any two points in $y_{1},y_{2},...y_{k}$ is $\max_{i,j}\log x_{i}-\log x_{j}=\max_{i,j}\log\frac{x_{i}}{x_{j}}=\log C$ . So let $\Delta=\log C$ and apply Claim 45 to the set $y_{1},y_{2},...y_{k}$ . Then we get a bi-partition $A^{\prime},B^{\prime}$ of $y_{1},y_{2},...y_{k}$ and let $A,B$ be the corresponding bi-partition of $x_{1},x_{2},...x_{k}$ - i.e. $x_{i}\in A$ iff $y_{i}\in A^{\prime}$ .

Then $\min_{y_{i}\in A^{\prime},y_{j}\in B^{\prime}}y_{i}-y_{j}\geq\frac{\Delta}{2^{k-1}}$ and $y_{i}-y_{j}=\log\frac{x_{i}}{x_{j}}$ . So this implies that

Also from Claim 45, we have that $\max_{y_{i},y_{j}\in A^{\prime}}y_{i}-y_{j}\leq\Delta(1-\frac{1}{2^{k-1}})$ and $y_{i}-y_{j}=\log\frac{x_{i}}{x_{j}}$ and so

Let $\hat{F}$ be a mixture of $n$ -dimensional Gaussians s.t. $\hat{F}$ is an $\epsilon_{1}$ -correct sub-division for $F$ . Also we assume that $F$ is in isotropic position.

Let $F$ be an $\epsilon$ -statistically learnable distribution in isotropic position. Let $(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F)$ . Then for any direction $r$ , $var(P_{r}[\hat{F}])\geq 1-k^{2}O(\frac{\epsilon_{1}}{\epsilon^{2}})$

Let $\mu=\sum_{i}w_{i}\mu_{i},\hat{\mu}=\sum_{i}\hat{w}_{i}\hat{\mu}_{i}$ . We can apply Claim 37 to get that $\|\mu-\hat{\mu}\|\leq\epsilon_{1}+kO(\frac{\epsilon_{1}}{\epsilon})=O(\frac{k\epsilon_{1}}{\epsilon})$ . Also using Claim 37, we obtain $\|\Sigma_{i}\|_{2}\leq\frac{1}{\epsilon}$ and $\|\hat{\Sigma}_{\pi(i)}\|_{2}\leq\frac{1}{\epsilon}+\epsilon_{1}$ .

Consider any symmetric matrix $A$ : $(u+v)^{T}A(u+v)=u^{T}Au+2v^{T}Au+v^{T}Av$ . And so

And we can apply this equation using $A=rr^{T}$ , $u=\hat{\mu}_{\pi(i)}-\hat{\mu}$ and $v=\mu_{i}-\mu-u$ and note that $\|A\|_{2}=1,\|u\|\leq O(\frac{1}{\epsilon}+\frac{k\epsilon_{1}}{\epsilon})=O(\frac{1}{\epsilon})$ and $\|v\|\leq O(\frac{k\epsilon_{1}}{\epsilon})$ . Then this implies that $(r^{T}(\mu_{i}-\mu))^{2}\leq(r^{T}(\hat{\mu}_{\pi(i)}-\hat{\mu}))^{2}+O(k\frac{\epsilon_{1}}{\epsilon^{2}})$ . Then if we take $\Delta$ to be the discrete distribution $r^{T}\mu_{i}$ with probability $w_{i}$ , and similarly $\hat{\Delta}$ to be the discrete distribution $r^{T}\hat{\mu}_{i}$ with probability $\hat{w}_{i}$ , $var(\hat{\Delta})\geq var(\Delta)-O(k^{2}\frac{\epsilon_{1}}{\epsilon^{2}})$ .

Also $|r^{T}(\Sigma_{i}-\hat{\Sigma}_{\pi(i)})r|\leq\|\Sigma_{i}-\hat{\Sigma}_{\pi(i)}\|_{F}\leq\frac{1}{\epsilon}$ . These facts are enough to be able to apply Fact 57 to get that $var(P_{r}[\hat{F}])\geq var(P_{r}[F])-O(k^{2}\frac{\epsilon_{1}}{\epsilon^{2}})$ ∎

G.2 How to Cluster

We will call $A,B\subset\Re^{n}$ a clustering scheme if $A\cap B=\emptyset$

For $A\subset\Re^{n}$ , we will write $P[F_{i},A]$ to denote $Pr_{x\sim F_{i}}[x\in A]$ - i.e. the probability that a randomly chosen sample from $F_{i}$ is in the set $A$ .

Let $(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F)$ . Suppose also that $\hat{F}$ is a mixture of $k^{\prime}$ components.

Lemma 9. Suppose that for some direction $v$ , for all $i$ : $v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{2}$ , for $\epsilon_{1}\leq\frac{\sqrt{\epsilon_{2}}}{2\epsilon_{3}}$ . If there is some bi-partition $S\subset[k^{\prime}]$ s.t. $\forall_{i\in S,j\in[k^{\prime}]-S}|v^{T}\hat{\mu}_{i}-v^{T}\hat{\mu}_{j}|\geq\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}$ then there is a clustering scheme $(A,B)$ (based only on $\hat{F}$ ) so that for all $i\in S,j\in\pi^{-1}(i)$ , $P[F_{j},A]\geq 1-\epsilon_{3}$ and for all $i\notin S,j\in\pi^{-1}(i)$ , $P[F_{j},B]\geq 1-\epsilon_{3}$ .

For each $i$ , consider the interval $I_{i}=[v^{T}\hat{\mu}_{i}-\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}},v^{T}\hat{\mu}_{i}+\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}}]$ . Then we will choose $A=\{x\in\Re^{n}|v^{T}x\in\cup_{i\in S}I_{i}\}$ and similarly we choose $B=\{x\in\Re^{n}|v^{T}x\in\cup_{i\notin S}I_{i}\}$ .

We first demonstrate that $A\cap B=\emptyset$ . Because of how $A,B$ are defined, this condition is equivalent to the condition that $A_{i}=\cup_{i\in S}I_{i}$ and $B_{i}=\cup_{i\notin S}I_{i}$ be disjoint. ( $A_{i},B_{i}\subset\Re$ and $A_{i}\cap B_{i}=\emptyset$ ). So consider any two intervals $I_{i},I_{j}$ for $i\in S,j\notin S$ . Then because $i,j$ are on different sides of the bipartition $S,[k^{\prime}]-S$ , we get that $|v^{T}\hat{\mu}_{i}-v^{T}\hat{\mu}_{j}|\geq\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}$ so $I_{i},I_{j}$ are in fact disjoint. This implies $A_{i},B_{i}$ are disjoint, and this implies that $A,B$ are disjoint.

Since the standard deviation of $F_{j}$ in the direction of $v$ is at most $\sqrt{2\epsilon_{2}}$ , points outside $I_{\pi(j)}$ are at least $1/(2\epsilon_{3})$ standard deviations from their true mean. Using the fact that, for a one-dimensional Gaussian random variable, the probability of being at least $s$ standard deviations from the mean is at most $2e^{-s^{2}/2}/(\sqrt{2\pi}s)\leq 1/s$ , we get that the probability that $x$ sampled from $F_{i}$ is outside the range $[v^{T}\hat{\mu}_{i}-\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}},v^{T}\hat{\mu}_{i}+\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}}]$ is at most $\epsilon_{3}$ . And this implies the lemma. ∎
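The clustering scheme constructed in this proof depends only on the projected estimated means $v^{T}\hat{\mu}_{i}$, the bi-partition $S$, and the parameters $\epsilon_{2},\epsilon_{3}$. A minimal sketch follows; the function names, the string labels and the example values are ours.

```python
def is_separated(proj_means, S, eps2, eps3):
    """Check the hypothesis: means across the bi-partition differ by >= 3*sqrt(eps2)/eps3."""
    gap = 3 * eps2 ** 0.5 / eps3
    return all(
        abs(mi - mj) >= gap
        for i, mi in enumerate(proj_means) if i in S
        for j, mj in enumerate(proj_means) if j not in S
    )


def make_clustering_scheme(proj_means, S, eps2, eps3):
    """Return a classifier for projected samples v^T x using the intervals I_i."""
    half_width = eps2 ** 0.5 / eps3

    def classify(projection):
        for i, m in enumerate(proj_means):
            if abs(projection - m) <= half_width:
                # Disjointness of A and B means at most one side can match.
                return "A" if i in S else "B"
        return None  # the sample falls in neither A nor B

    return classify


# Hypothetical projected means of three estimated components.
proj_means = [0.0, 0.1, 5.0]
S = {0, 1}
classify = make_clustering_scheme(proj_means, S, eps2=1e-4, eps3=0.1)
print(is_separated(proj_means, S, 1e-4, 0.1),
      classify(0.05), classify(4.95), classify(2.0))
```

Samples that land in neither set are simply not assigned to a cluster; the lemma only requires that each component places almost all of its mass on the correct side.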

Let $(\hat{F},\pi)\in\mathcal{D}_{\epsilon_{1}}(F)$ . Suppose also that $\hat{F}$ is a mixture of $k^{\prime}$ components.

Lemma 10. Suppose that for some direction $v$ and some $i\in[k^{\prime}]$ such that: $v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{m}$ , for $\epsilon_{m}>>\epsilon_{1}$ . If there is some bi-partition $S\subset[k^{\prime}]$ s.t.

Let $T=[k^{\prime}]-S$ . Let $\hat{\sigma}_{S}=\min_{i\in S}v^{T}\hat{\Sigma}_{i}v$ , $\hat{\sigma}_{T}=\max_{j\in T}v^{T}\hat{\Sigma}_{j}v$ . So we are given that $\frac{\hat{\sigma}_{S}}{\max(\hat{\sigma}_{T},\epsilon_{m})}\geq\frac{1}{\epsilon_{t}}$ .

Let $B_{v}=\cup_{i\in T}I_{i}$ where $I_{i}=[v^{T}\hat{\mu}_{i}-\frac{\sqrt{\max(\hat{\sigma}_{T},\epsilon_{m})}}{\epsilon_{3}}-\epsilon_{1},v^{T}\hat{\mu}_{i}+\frac{\sqrt{\max(\hat{\sigma}_{T},\epsilon_{m})}}{\epsilon_{3}}+\epsilon_{1}]$ .

Let $F_{j}$ be a component in $F$ s.t. $\pi(j)=i\in T$ . Then the variance of $F_{j}$ in the direction $v$ is at most $\hat{\sigma}_{T}+\epsilon_{1}\leq 2\max(\hat{\sigma}_{T},\epsilon_{m})$ where here we have used the condition that $\epsilon_{m}>>\epsilon_{1}$ . So any point $x$ outside the interval $I_{\pi(j)}$ is at least $1/(2\epsilon_{3})$ standard deviations from their true mean. Using the fact that, for a one-dimensional Gaussian random variable, the probability of being at least $s$ standard deviations from the mean is at most $2e^{-s^{2}/2}/(\sqrt{2\pi}s)\leq 1/s$ , we get that the probability that $v^{T}x$ (when $x$ is sampled from $F_{j}$ ) is outside the range $B_{v}$ is at most $\epsilon_{3}$ .

We take as our clustering scheme $B=\{x\in\Re^{n}|v^{T}x\in B_{v}\}$ and $A=\Re^{n}-B$ , so clearly $A\cap B=\emptyset$ . The above statement then implies that $Pr[F_{j},B]\geq 1-\epsilon_{3}$ for any $i\notin S,j\in\pi^{-1}(i)$ .

Also, for any $F_{j}$ with $\pi(j)\in S$ , the variance when projected onto $v$ is at least $\hat{\sigma}_{S}-\epsilon_{1}$ . So the probability that a point $v^{T}x$ (where $x$ is sampled from $F_{j}$ ) is inside the range $B_{v}$ is at most the measure of $B_{v}$ times the maximum density of $P_{v}[F_{j}]$ . This is at most

where the last line follows because $\hat{\sigma}_{S}>>\epsilon_{m}>>\epsilon_{1}$ (since $\epsilon_{\bar{S}}\geq\epsilon_{1}$ and the ratio $\frac{\hat{\sigma}_{S}}{\max(\hat{\sigma}_{T},\epsilon_{m})}\geq\frac{1}{\epsilon_{t}}$ is large), and because $\epsilon_{t}<<\epsilon_{3}^{3}$ .

So we also have that $Pr[F_{j},B]\leq\epsilon_{3}$ for all $i\in S,j\in\pi^{-1}(i)$ . So $Pr[F_{j},A]\geq 1-\epsilon_{3}$ , and this implies the lemma. ∎
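The clustering scheme of Lemma 10 (the set $B_{v}$ built around the small-variance side $T$ , with $A$ as the complement) can be sketched as follows. This is a minimal illustration: the name `make_variance_clustering` is hypothetical, and the projected variances $v^{T}\hat{\Sigma}_{i}v$ are assumed to be precomputed and passed in as `proj_vars`.

```python
import math

def make_variance_clustering(v, mu_hats, proj_vars, T, eps1, eps3, eps_m):
    """Classifier for B = {x : v^T x in B_v}, A = complement (hypothetical helper).

    proj_vars[i] is assumed to hold v^T Sigma_hat_i v for estimated component i;
    T is the set of indices on the small-variance side of the bipartition.
    """
    sigma_T = max(proj_vars[i] for i in T)
    half_width = math.sqrt(max(sigma_T, eps_m)) / eps3 + eps1
    centers = [sum(a * b for a, b in zip(v, mu_hats[i])) for i in sorted(T)]

    def classify(x):
        proj = sum(a * b for a, b in zip(v, x))  # v^T x
        inside = any(abs(proj - c) <= half_width for c in centers)
        return "B" if inside else "A"

    return classify

# Component 0 is very wide along v, component 1 is very narrow.
classify = make_variance_clustering(
    v=(1.0, 0.0), mu_hats=[(0.0, 0.0), (0.2, 0.0)],
    proj_vars=[100.0, 1e-6], T={1}, eps1=1e-9, eps3=0.1, eps_m=1e-6)
print(classify((0.205, 0.0)), classify((3.0, 0.0)))
```

The wide component puts little mass on the short intervals making up $B_{v}$ , which is why its samples land in $A$ with high probability.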

G.3 Making Progress when there is a Small Variance

Lemma 11. Suppose $\|\hat{\mu}_{i}-\mu_{i}\|\leq\epsilon_{1}$ , $\|\hat{\Sigma}_{i}-\Sigma_{i}\|_{F}\leq\epsilon_{1}$ , and $|\hat{w}_{i}-w_{i}|\leq\epsilon_{1}$ . If either $\|\Sigma^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}}$ or $\|\hat{\Sigma}^{-1}_{i}\|_{2}\leq\frac{1}{2\epsilon_{m}}$ then

Now we can describe the idea behind the hierarchical clustering. Suppose the entire algorithm on $k-1$ Gaussians requires $m$ samples. Then choose $\epsilon_{3}=\frac{\epsilon\delta}{m}$ so that if we take $\frac{m}{\epsilon}$ samples in total, then each side in the bipartition that results from clustering would get at least $m$ samples and none of the samples obtained from the oracle are mis-clustered. Then we can run the $k-1$ Gaussian algorithm on each side of the bi-partition in order to get a statistically good estimate for the original mixture of $k$ Gaussians.
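The arithmetic behind the choice $\epsilon_{3}=\frac{\epsilon\delta}{m}$ is a union bound: with $\frac{m}{\epsilon}$ samples, each mis-clustered with probability at most $\epsilon_{3}$ , the probability that any sample is mis-clustered is at most $\frac{m}{\epsilon}\cdot\epsilon_{3}=\delta$ . A small sketch of this bookkeeping follows; the function name `clustering_budget` and the example numbers are hypothetical.

```python
def clustering_budget(m, eps, delta):
    """Return (eps3, total_samples, union_bound) for the recursion step.

    m     : samples needed by the (k-1)-Gaussian algorithm
    eps   : target accuracy
    delta : allowed failure probability
    """
    eps3 = eps * delta / m               # per-sample mis-clustering probability
    total_samples = m / eps              # samples requested from the oracle
    union_bound = total_samples * eps3   # bound on Pr[some sample is mis-clustered]
    return eps3, total_samples, union_bound

eps3, total, failure = clustering_budget(m=1000, eps=0.1, delta=0.05)
print(eps3, total, failure)
```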

Given $\epsilon_{3}$ , choose $\epsilon_{2}$ s.t. $\frac{\sqrt{\epsilon_{2}}}{\epsilon_{3}}\leq\frac{1}{2^{k+1}}$ . Also choose $\epsilon_{m}<<\epsilon_{2}$ s.t. $(\frac{\epsilon_{m}}{\epsilon_{2}})^{\frac{1}{2^{k}}}<<\epsilon_{3}^{3}$ . Then choose $\epsilon_{1}<<\epsilon_{m}$ .

We call the set of parameters $\epsilon_{1}<<\epsilon_{m}<<\epsilon_{2}<<\epsilon_{3}$ good if

$\frac{2n\epsilon_{1}}{\epsilon_{m}}+\frac{\epsilon_{1}^{2}}{\epsilon_{m}}\leq\epsilon^{2}$

$k^{2}\frac{\epsilon_{1}}{\epsilon^{2}}=o(1)$

$\epsilon_{1}\leq\frac{\sqrt{\epsilon_{2}}}{2\epsilon_{3}}$

$\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}=o(2^{-k})$

$(\frac{\epsilon_{m}}{\epsilon_{2}})^{\frac{1}{2^{k}}}<<\epsilon_{3}^{3}$
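The five conditions above can be collected into a single check. Since $o(\cdot)$ and $<<$ are asymptotic statements, the sketch below makes the loud assumption that they can be replaced by "smaller by a factor of `slack`", where `slack` is a hypothetical tuning constant that does not appear in the paper; the function name `is_good` and the dictionary keys are likewise illustrative.

```python
import math

def is_good(eps, eps1, eps_m, eps2, eps3, n, k, slack=10.0):
    """Check the 'good parameters' conditions, with o(.) and << replaced by
    'smaller by a factor of slack' (an assumption made for this sketch)."""
    checks = {
        # 2 n eps1 / eps_m + eps1^2 / eps_m <= eps^2
        "lemma11": 2 * n * eps1 / eps_m + eps1 ** 2 / eps_m <= eps ** 2,
        # k^2 eps1 / eps^2 = o(1)
        "variance": k ** 2 * eps1 / eps ** 2 <= 1.0 / slack,
        # eps1 <= sqrt(eps2) / (2 eps3)
        "lemma9_eps1": eps1 <= math.sqrt(eps2) / (2 * eps3),
        # 3 sqrt(eps2) / eps3 = o(2^{-k})
        "lemma9_gap": 3 * math.sqrt(eps2) / eps3 <= 2.0 ** (-k) / slack,
        # (eps_m / eps2)^{1 / 2^k} << eps3^3
        "lemma10_ratio": (eps_m / eps2) ** (1.0 / 2 ** k) <= eps3 ** 3 / slack,
    }
    return all(checks.values()), checks

ok, detail = is_good(eps=0.1, eps1=1e-27, eps_m=4e-24, eps2=4e-7,
                     eps3=0.1, n=2, k=2)
print(ok, detail)
```

Even for $k=2$ the parameters must be separated by many orders of magnitude, reflecting the $\frac{1}{2^{k}}$ exponent in the last condition.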

Suppose we choose a set of good parameters $\epsilon_{1}<<\epsilon_{m}<<\epsilon_{2}<<\epsilon_{3}$ . Then the Hierarchical Clustering Algorithm will either return an $\epsilon$ -close statistical estimate $\hat{F}$ for $F$ or make progress by returning a clustering scheme.

Theorem 12. The Hierarchical Clustering Algorithm either returns an $\epsilon$ -close statistical estimate $\hat{F}$ for $F$ , or returns a clustering scheme $A,B$ such that there is some bipartition $S\subset[k]$ such that for all $i\in S,j\in\pi^{-1}(i)$ , $Pr[F_{j},A]\geq 1-\epsilon_{3}$ and for all $i\notin S,j\in\pi^{-1}(i)$ , $Pr[F_{j},B]\geq 1-\epsilon_{3}$ . Moreover, $S$ and $[k]-S$ are both non-empty.

We analyze the output of the Hierarchical Clustering Algorithm via a case analysis:

Case 1: Suppose that no Gaussian $\hat{F}_{i}$ has any variance (i.e. in any direction) that is at most $\epsilon_{m}$ .

Then we can apply Lemma 11, and because $\frac{2n\epsilon_{1}}{\epsilon_{m}}+\frac{\epsilon_{1}^{2}}{\epsilon_{m}}\leq\epsilon^{2}$ , this implies that the estimate $\hat{F}$ is statistically close to the actual mixture $F$ .

Case 2: So suppose there is a Gaussian $\hat{F}_{i}$ which has a variance of at most $\epsilon_{m}$ on some direction $v$ .

Then using Claim 47, $var(P_{v}[\hat{F}])\geq 1-O(k^{2}\frac{\epsilon_{1}}{\epsilon^{2}})$ . Because the parameters are good, we know that $k^{2}\frac{\epsilon_{1}}{\epsilon^{2}}=o(1)$ and so $var(P_{v}[\hat{F}])=\Omega(1)$ . Suppose that for all $\hat{F}_{j}$ , $D_{p}(P_{v}[\hat{F}_{i}],P_{v}[\hat{F}_{j}])=o(1)$ . In this case, we could apply Fact 57 and $\sum_{j}\hat{w}_{j}v^{T}\hat{\Sigma}_{j}v=o(1)$ and similarly $var(\hat{\Delta})$ (where $\hat{\Delta}$ is the discrete distribution on $\Re$ which takes value $v^{T}\hat{\mu}_{j}$ with probability $\hat{w}_{j}$ ) will be upper bounded by $\max_{j}(v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j}))^{2}=o(1)$ . So if for all $\hat{F}_{j}$ , $D_{p}(P_{v}[\hat{F}_{i}],P_{v}[\hat{F}_{j}])=o(1)$ , we would have $var(P_{v}[\hat{F}])=o(1)$ which is not possible, hence there must be some other Gaussian $\hat{F}_{j}$ s.t. $D_{p}(P_{v}[\hat{F}_{i}],P_{v}[\hat{F}_{j}])=\Omega(1)$ .

Case 2a: Suppose that each Gaussian $\hat{F}_{h}$ has projected variance $v^{T}\hat{\Sigma}_{h}v\leq\epsilon_{2}$ , and there is a Gaussian $\hat{F}_{j}$ s.t. the difference in projected means $|v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j})|=\Omega(1)$ .

In this case, we can apply Claim 45 to get a bipartition $S^{\prime}\subset[k^{\prime}]$ (let $T^{\prime}=[k^{\prime}]-S^{\prime}$ ) such that $S^{\prime},T^{\prime}\neq\emptyset$ and such that for all $i\in S^{\prime},j\in T^{\prime}$ , $|v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j})|\geq\Omega(2^{-k})$ . Because the parameters are good, we have that $\frac{3\sqrt{\epsilon_{2}}}{\epsilon_{3}}=o(2^{-k})$ . Then we can apply Lemma 9 to obtain a clustering so that each successive point sampled from the oracle has probability at most $\epsilon_{3}$ of being mis-clustered, as desired. And since both $S^{\prime},T^{\prime}$ are non-empty, this clustering scheme returned by Lemma 9 has the property that for either side of the clustering scheme, there is some component $F_{i}$ in the original mixture that is mapped to that side w.h.p.

Case 2b: Either there is some Gaussian $\hat{F}_{h}$ which has projected variance $v^{T}\hat{\Sigma}_{h}v\geq\epsilon_{2}$ , or for all Gaussians $\hat{F}_{j}$ ( $j\neq i$ ) the difference in projected means $|v^{T}(\hat{\mu}_{i}-\hat{\mu}_{j})|=o(1)$ .

Either case implies that there is some Gaussian $\hat{F}_{h}$ such that when projected onto $v$ , $\hat{F}_{h}$ has variance at least $\epsilon_{2}$ . In the first case, this is directly true. In the second case, if we let $\hat{\Delta}$ be the discrete distribution on $\Re$ which takes value $v^{T}\hat{\mu}_{j}$ with probability $\hat{w}_{j}$ , then $var(\hat{\Delta})=o(1)$ , and using Claim 47 and Fact 57 there must be some component $\hat{F}_{h}$ with $v^{T}\hat{\Sigma}_{h}v=\Omega(1)>>\epsilon_{2}$ .

So let $\hat{F}_{h}$ be the component for which $v^{T}\hat{\Sigma}_{h}v$ is the largest (and is at least $\epsilon_{2}$ ).

We can do the following: Let $A_{1}=\{i\in[k^{\prime}]|v^{T}\hat{\Sigma}_{i}v\leq\epsilon_{m}\}\subset[k^{\prime}]$ . Let $B_{1}=[k^{\prime}]-A_{1}$ , which is necessarily non-empty because $h\in B_{1}$ . Then take $B_{2}=\{\epsilon_{m}\}\cup\{v^{T}\hat{\Sigma}_{i}v|i\in B_{1}\}$ and we can apply Claim 46 to get a bi-partition $A_{3},B_{3}$ of $B_{2}$ with the property that $\epsilon_{m}\in A_{3}$ , both $A_{3},B_{3}$ are non-empty and (choosing $C=\frac{\epsilon_{2}}{\epsilon_{m}}$ in Claim 46 and $C^{\frac{1}{2^{k}}}>>\frac{1}{\epsilon_{3}^{3}}$ ) the ratio $\frac{\min(B_{3})}{\max(A_{3})}\geq\frac{1}{\epsilon_{t}}>>\frac{1}{\epsilon_{3}^{3}}$ .

Then every projected variance $v^{T}\hat{\Sigma}_{i}v$ is in the set $A_{1}\cup A_{3}\cup B_{3}$ . So we can take $A$ to be the set of indices $i\in[k^{\prime}]$ such that $v^{T}\hat{\Sigma}_{i}v\in A_{1}\cup A_{3}$ and similarly we take $B$ to be the set of indices $i\in[k^{\prime}]$ such that $v^{T}\hat{\Sigma}_{i}v\in B_{3}$ . Then $A,B$ is a bipartition of $[k^{\prime}]$ .

Also $\frac{\min_{i\in B}v^{T}\hat{\Sigma}_{i}v}{\max(\epsilon_{m},\max_{i\in A}v^{T}\hat{\Sigma}_{i}v)}=\frac{\min(B_{3})}{\max(A_{3})}\geq\frac{1}{\epsilon_{t}}>>\frac{1}{\epsilon_{3}^{3}}$ . And then we can apply Lemma 10 and this yields a clustering so that each successive point sampled from the oracle has probability at most $\epsilon_{3}$ of being mis-clustered, as desired. Note that $i\in A$ , and $h\in B$ , so both of the sides of this clustering scheme are non-empty (for either side of the clustering scheme, there is some component $F_{i}$ in the original mixture that is mapped to that side w.h.p.).
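The split of the variance values $B_{2}$ into a low side and a high side with a large ratio between them can be illustrated by a simple pigeonhole argument: if $r$ positive values span a ratio $C$ , then some pair of consecutive sorted values has ratio at least $C^{1/(r-1)}$ . The sketch below implements this stand-in. It is not Claim 46 itself, whose statement is not reproduced in this section, and the function name `split_at_largest_ratio` is hypothetical.

```python
def split_at_largest_ratio(values):
    """Split positive values at the largest ratio between consecutive sorted
    values. Returns (low, high, ratio) with ratio = min(high) / max(low).
    Assumes at least two values, all strictly positive."""
    vals = sorted(values)
    best = max(range(len(vals) - 1), key=lambda i: vals[i + 1] / vals[i])
    low, high = vals[:best + 1], vals[best + 1:]
    return low, high, high[0] / low[-1]

# eps_m = 1e-12 together with three projected variances larger than eps_m.
low, high, ratio = split_at_largest_ratio([1e-12, 1e-4, 1e-11, 1e-5])
print(low, high, ratio)
```

Because $\epsilon_{m}$ is the smallest element of $B_{2}$ , it always lands on the low side of such a split, matching the requirement $\epsilon_{m}\in A_{3}$ .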

This completes the description of the Hierarchical Clustering Algorithm.

G.4 Recursion

Appendix H The Isotropic Projection Lemma for $k$ Gaussians

Lemma 13. [Isotropic Projection Lemma] Given a mixture of $k$ $n$ -Dimensional Gaussians $F=\sum_{i}w_{i}F_{i}$ that is in isotropic position and is $\epsilon$ -statistically learnable, with probability $\geq 1-\delta$ over a randomly chosen direction $u$ , there is some pair of Gaussians $F_{i},F_{j}$ s.t. $D_{p}(P_{u}[F_{i}],P_{u}[F_{j}])\geq\frac{\epsilon^{5}\delta^{2}}{50n^{2}}$ .

Let $\epsilon_{1}=\frac{\epsilon^{5}\delta^{2}}{100n^{2}}$ , and $\epsilon_{2}=\frac{4\epsilon_{1}}{\epsilon}$

Case 1: $\|\mu_{i}-\mu_{j}\|>t$ for some $i,j\in[k]$ . In this case, by Lemma 50, with probability $\geq 1-\delta$ , $|u\cdot(\mu_{i}-\mu_{j})|\geq\delta t/\sqrt{n}=2\epsilon_{1}$ , as desired.
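The bound used in Case 1, that a uniformly random unit vector $u$ satisfies $|u\cdot d|\geq\delta\|d\|/\sqrt{n}$ with probability at least $1-\delta$ , can be checked by simulation. The following is a Monte Carlo sketch only: the statement of Lemma 50 is not reproduced in this section, so the form tested here is inferred from how it is applied above, and all function names are hypothetical.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def small_projection_rate(d, delta, trials=20000, seed=0):
    """Empirical frequency of the bad event |u . d| < delta * ||d|| / sqrt(n)."""
    rng = random.Random(seed)
    n = len(d)
    d_norm = math.sqrt(sum(x * x for x in d))
    threshold = delta * d_norm / math.sqrt(n)
    bad = 0
    for _ in range(trials):
        u = random_unit_vector(n, rng)
        if abs(sum(a * b for a, b in zip(u, d))) < threshold:
            bad += 1
    return bad / trials

# d plays the role of mu_i - mu_j; the bad event should have frequency <= delta.
rate = small_projection_rate([1.0, -2.0, 0.5, 3.0, 0.0, 1.5], delta=0.2)
print(rate)
```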

Case 2: $\|\mu_{i}-\mu_{j}\|\leq t$ for all $i,j\in[k]$ . By Lemma 48, with probability $\geq 1-\delta$ , for some $h$ ,

If $|u\cdot(\mu_{i}-\mu_{j})|\geq 2\epsilon_{1}$ , then we are done. If not, then $|u\cdot(\mu_{i}-\mu_{j})|\leq 2\epsilon_{1}$ for all $i,j\in[k]$ . Then using Fact 57, $var(\Delta)+\sum_{j}w_{j}u^{T}\Sigma_{j}u=1$ where $\Delta$ is the discrete distribution on points in $1$ -dimension which is $u^{T}\mu_{j}$ with probability $w_{j}$ . The variance of this mixture $\Delta$ is upper bounded by $\max_{i,j}|u^{T}\mu_{i}-u^{T}\mu_{j}|^{2}$ which is at most $4\epsilon_{1}^{2}\leq 2\epsilon_{1}$ .

Let $\epsilon,\delta>0$ , $t\in(0,\epsilon^{2})$ . Let $F$ be an $\epsilon$ -statistically learnable distribution in isotropic position. Suppose for all $i,j\in[k]$ that $\|\mu_{i}-\mu_{j}\|\leq t$ . Then, for uniformly random $r$ ,

We can apply Lemma 52 and then apply Lemma 51. So with probability at least $1-\delta$ , there is some $i$ s.t. $u^{T}\Sigma_{i}u\notin[1-c,1+c]$ for $c=\frac{\delta^{2}a}{4n},a=\frac{\epsilon^{3}-t^{2}}{3n}$ . If $u^{T}\Sigma_{i}u<1-c$ then we are done. If instead $u^{T}\Sigma_{i}u>1+c$ , we can apply Fact 57 which implies that $\sum_{j}w_{j}u^{T}\Sigma_{j}u\leq 1$ and we have that $w_{i}u^{T}\Sigma_{i}u>w_{i}(1+c)$ . We can apply Claim 49 which implies that there is then some $j\neq i$ s.t. $u^{T}\Sigma_{j}u<1-\epsilon c$ which implies the lemma. ∎

Suppose $w_{1}(1+\alpha)+w_{2}(1-\beta)\leq 1$ , $w_{1},w_{2}\geq\epsilon\geq 0$ , $w_{1}+w_{2}=1$ and $\alpha>0$ . Then, $\beta\geq\epsilon\alpha$ .
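This claim follows from rearranging the constraint: since $w_{1}+w_{2}=1$ , the hypothesis reads $w_{1}\alpha\leq w_{2}\beta$ , so $\beta\geq\frac{w_{1}}{w_{2}}\alpha\geq\epsilon\alpha$ because $w_{1}\geq\epsilon$ and $w_{2}\leq 1$ . A randomized sanity check of the claim is sketched below; the function names and sampling ranges are hypothetical choices.

```python
import random

def smallest_feasible_beta(w1, alpha):
    """Smallest beta with w1*(1+alpha) + w2*(1-beta) <= 1, where w2 = 1 - w1."""
    w2 = 1.0 - w1
    return w1 * alpha / w2

def check_claim(eps, trials=10000, seed=0):
    """Return True if beta >= eps * alpha held in every sampled instance.
    Assumes 0 < eps <= 1/2 so that w1, w2 >= eps is satisfiable."""
    rng = random.Random(seed)
    for _ in range(trials):
        w1 = rng.uniform(eps, 1.0 - eps)
        alpha = rng.uniform(1e-6, 10.0)
        if smallest_feasible_beta(w1, alpha) < eps * alpha:
            return False
    return True

print(check_claim(0.05), check_claim(0.3))
```

Checking the smallest feasible $\beta$ suffices, since any larger $\beta$ satisfies the conclusion as well.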

For any $\mu_{i},\mu_{j}\in{\bf R}^{n},\delta>0$ , over uniformly random unit vectors $u$ ,

Suppose the mixture $F=\sum_{i}w_{i}F_{i}$ is in isotropic position and is $\epsilon$ -statistically learnable, and that for all $i,j\in[k]$ , $\|\mu_{i}-\mu_{j}\|\leq t$ . Then $\max_{i}\{~{}\|\Sigma_{i}^{-1}\|_{2}~{}\}\geq 1+a,\quad a=\frac{\epsilon^{3}-t^{2}}{3n}.$

By Fact 58, the squared variation distance between $F_{i}$ and $F_{j}$ is,

Where $\lambda_{1},\ldots,\lambda_{n}>0$ are the eigenvalues of $\Sigma_{i}^{-1}\Sigma_{j}$ . Suppose $(\mu_{1}-\mu_{2})^{T}\Sigma_{i}^{-1}(\mu_{1}-\mu_{2})\geq\frac{t^{2}}{\epsilon}$ , then this implies $\|\Sigma_{i}^{-1}\|_{2}\geq\frac{1}{\epsilon}$ because $\|\mu_{1}-\mu_{2}\|\leq t$ , and we would be done in this case. If not, then from the above equation we get

In particular, there must be some eigenvalue $\lambda$ , such that, $\lambda+1/\lambda-2\geq\frac{2}{n}\ (\epsilon^{2}-\frac{t^{2}}{\epsilon})=\frac{6a}{\epsilon^{2}}.$ Let $v$ be a unit (eigen)vector corresponding to $\lambda$ , i.e., $v=\lambda\Sigma_{i}^{-1}\Sigma_{j}v$ . Then we have that $v^{T}\Sigma_{i}v=\lambda v^{T}\Sigma_{j}v$ and this yields

Since one of the two terms in parentheses above must be at least $3a/\epsilon^{2}$ , WLOG, we can take $\frac{v^{T}\Sigma_{i}v}{v^{T}\Sigma_{j}v}\geq 1+3a/\epsilon^{2}$ . This means that either the numerator or the denominator is bounded away from 1. We can break this into two cases.

Case 1: $v^{T}\Sigma_{j}v<1/(1+a)$ . This establishes the lemma immediately.

Case 2: $v^{T}\Sigma_{i}v\geq(1+3a/\epsilon^{2})/(1+a)=1+(3/\epsilon^{2}-1)a/(1+a)\geq 1+(3/\epsilon^{2}-1)a/2$ . By Claim 49, since $\sum_{h}w_{h}v^{T}\Sigma_{h}v\leq 1$ , we have there is some $g\in[k]$ , $g\neq i$ such that

This means that $\|\Sigma_{g}^{-1}\|_{2}\geq 1/(1-a)\geq 1+a$ .

Appendix I Approximate Isotropic Position

Let $F_{1}=\mathcal{N}(\mu_{1},\Sigma_{1})$ . Let $m=O(\frac{n^{4}\ln\frac{1}{\delta}}{\epsilon^{4c}})$ . Then given $m$ samples from $F_{1}$ , $x_{1},x_{2},...x_{m}$ compute $\hat{\mu}_{1}=\frac{1}{m}\sum_{i}x_{i}$ and $\hat{\Sigma}_{1}=\frac{1}{m}\sum_{i}x_{i}x_{i}^{T}-\hat{\mu}_{1}\hat{\mu}_{1}^{T}$ . Let $\hat{F}_{1}=\mathcal{N}(\hat{\mu}_{1},\hat{\Sigma}_{1})$ . Then with probability at least $1-\delta$ , $D(F_{1},\hat{F}_{1})\leq O(\epsilon^{c})$ .
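The estimators $\hat{\mu}_{1}$ and $\hat{\Sigma}_{1}$ defined above are the empirical mean and the empirical second moment minus the outer product of the mean. A minimal pure-Python sketch follows; the function name `empirical_gaussian` and the example distribution (independent coordinates with standard deviations 2 and 0.5) are hypothetical.

```python
import random

def empirical_gaussian(samples):
    """Return (mu_hat, Sigma_hat) with
    mu_hat = (1/m) sum_i x_i and
    Sigma_hat = (1/m) sum_i x_i x_i^T - mu_hat mu_hat^T."""
    m, n = len(samples), len(samples[0])
    mu = [sum(x[a] for x in samples) / m for a in range(n)]
    sigma = [[sum(x[a] * x[b] for x in samples) / m - mu[a] * mu[b]
              for b in range(n)] for a in range(n)]
    return mu, sigma

# Draw samples with independent coordinates and estimate the parameters.
rng = random.Random(0)
samples = [(rng.gauss(1.0, 2.0), rng.gauss(-1.0, 0.5)) for _ in range(5000)]
mu_hat, sigma_hat = empirical_gaussian(samples)
print(mu_hat, sigma_hat)
```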

For $m=O(\frac{n^{4}\ln\frac{k}{\delta}}{\epsilon^{4c+1}})$ , $D(F,\hat{F}),\max_{i}D(F_{i},\hat{F}_{i}),\max_{i}|w_{i}-\hat{w}_{i}|\leq O(\epsilon^{c})$ , with probability at least $1-\delta$ .

because $m\geq\Omega(\frac{\log\frac{k}{\delta}}{\epsilon^{2c}})$ .

So each $i$ receives at least $\Omega(\epsilon m)=\Omega(\frac{n^{4}\ln\frac{k}{\delta}}{\epsilon^{4c}})$ samples, so using Theorem 53, $D(F_{i},\hat{F}_{i})\leq O(\epsilon^{c})$ with probability at least $1-\frac{\delta}{4k}$ .

Then $D(F,\hat{F})\leq\max_{i}D(F_{i},\hat{F}_{i})+\sum_{i}|w_{i}-\hat{w}_{i}|=O(\epsilon^{c})$ and the total probability of any bad event occurring is at most $\delta$ so this implies the corollary. ∎

$E_{x\sim\hat{F}}[x]=\frac{1}{m}\sum_{i}x_{i}$ and $E_{x\sim\hat{F}}[xx^{T}]=\frac{1}{m}\sum_{i}x_{i}x_{i}^{T}$

Given an $\epsilon^{\prime}$ -statistically learnable distribution $F$ (for $\epsilon^{\prime}\geq\epsilon$ ) and $m=O(\frac{n^{4}\ln\frac{k}{\delta}}{\epsilon^{5}})$ samples from $F$ , one can compute a transformation $\hat{T}$ such that there is an $\epsilon^{\prime}-O(\epsilon)$ -statistically learnable distribution $\hat{F}$ s.t. with probability at least $1-\delta$

a $\gamma$ -close estimate for $\hat{F}$ is also a $\gamma+O(\epsilon)$ -close statistical estimate for $F$

a transformation $\hat{T}$ places $\hat{F}$ in exactly isotropic position

$\hat{T}$ can be computed from just the sample points $x_{1},x_{2},...x_{m}$
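One way to realize a transformation computed from the sample points alone is to whiten with the empirical mean and covariance. The sketch below is an assumption-laden illustration rather than the paper's $\hat{T}$ , which is not spelled out in this section: it uses a Cholesky factor $L$ of the empirical covariance and maps $x\mapsto L^{-1}(x-\hat{\mu})$ , after which the sample has mean zero and identity covariance up to floating-point error. All function names are hypothetical.

```python
import math

def empirical_gaussian(samples):
    """Empirical mean and covariance (second moment minus outer product of mean)."""
    m, n = len(samples), len(samples[0])
    mu = [sum(x[a] for x in samples) / m for a in range(n)]
    sigma = [[sum(x[a] * x[b] for x in samples) / m - mu[a] * mu[b]
              for b in range(n)] for a in range(n)]
    return mu, sigma

def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def whiten(x, mu, L):
    """Return y = L^{-1} (x - mu) by forward substitution."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = (x[i] - mu[i] - sum(L[i][p] * y[p] for p in range(i))) / L[i][i]
    return y

samples = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 3.0), (1.0, 1.0)]
mu_hat, sigma_hat = empirical_gaussian(samples)
L = cholesky(sigma_hat)
whitened = [whiten(x, mu_hat, L) for x in samples]
print(empirical_gaussian(whitened))
```

Any matrix $W$ with $W\hat{\Sigma}W^{T}=I$ would serve equally well here; the Cholesky factor is chosen only because it is easy to compute without external libraries.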

Appendix J Basic Properties of Gaussians

In this section we state many useful basic facts about univariate Gaussian distributions that are used throughout this paper.

Given a discrete distribution on points in $1$ -dimension, $\Delta$ , we will define $var(\Delta)$ to be the variance of this distribution.

Given a GMM of $1$ -dimensional Gaussians, $F=\sum_{i}w_{i}\mathcal{N}(\mu_{i},\sigma_{i}^{2})$ ,

where $\Delta$ is the discrete distribution on points in $1$ -dimension corresponding to selecting each $\mu_{i}$ with probability $w_{i}$ .

Also $E_{x\sim F}[x]=\sum_{i}w_{i}\mu_{i}=E_{x\sim\Delta}[x]$ and combining these equations yields:
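The way this fact is applied elsewhere in the paper is as a variance decomposition: the variance of the mixture equals $\sum_{i}w_{i}\sigma_{i}^{2}+var(\Delta)$ . A small numerical check of that decomposition is sketched below, using only $E[x^{2}]=\sigma^{2}+\mu^{2}$ for a univariate Gaussian; the function name and the example mixture are hypothetical.

```python
def mixture_variance_decomposition(weights, means, variances):
    """Return (total_variance, within, between) for a 1-D Gaussian mixture.

    within  = sum_i w_i sigma_i^2
    between = var(Delta), with Delta taking value mu_i with probability w_i
    """
    mean = sum(w * mu for w, mu in zip(weights, means))
    second_moment = sum(w * (s2 + mu * mu)
                        for w, mu, s2 in zip(weights, means, variances))
    total = second_moment - mean ** 2
    within = sum(w * s2 for w, s2 in zip(weights, variances))
    between = sum(w * (mu - mean) ** 2 for w, mu in zip(weights, means))
    return total, within, between

total, within, between = mixture_variance_decomposition(
    [0.3, 0.7], [-1.0, 2.0], [0.5, 1.5])
print(total, within + between)
```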

Let $F_{1}=\mathcal{N}(\mu_{1},\Sigma_{1})$ and $F_{2}=\mathcal{N}(\mu_{2},\Sigma_{2})$ be two $n$ -dimensional Gaussian distributions. Let $\lambda_{1},\ldots,\lambda_{n}>0$ be the eigenvalues of $\Sigma_{1}^{-1}\Sigma_{2}$ . Then the variation distance between them satisfies,

It is easy to verify that $\text{argmax}_{\sigma^{2}}\mathcal{N}(0,\sigma^{2},\gamma)=\gamma^{2},$ from which the fact follows. ∎
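The maximizer can be confirmed with a coarse grid search over $\sigma^{2}$ . The sketch below assumes $\mathcal{N}(\mu,\sigma^{2},x)$ denotes the density of $\mathcal{N}(\mu,\sigma^{2})$ evaluated at $x$ , as the notation in this section suggests; the function names and the grid range $(0,4\gamma^{2}]$ are hypothetical choices.

```python
import math

def normal_pdf(mu, var, x):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def argmax_variance(gamma, grid_size=20001):
    """Grid search for the variance maximizing the density of N(0, var) at gamma."""
    best_var, best_val = None, -1.0
    for j in range(1, grid_size + 1):
        var = 4.0 * gamma * gamma * j / grid_size
        val = normal_pdf(0.0, var, gamma)
        if val > best_val:
            best_var, best_val = var, val
    return best_var

for gamma in (0.1, 1.0, 3.0):
    print(gamma, argmax_variance(gamma), gamma ** 2)
```

Setting the derivative of the log-density with respect to $\sigma^{2}$ to zero gives $-\frac{1}{2\sigma^{2}}+\frac{\gamma^{2}}{2\sigma^{4}}=0$ , i.e. $\sigma^{2}=\gamma^{2}$ , which agrees with the grid search.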

Either $\mu\geq\gamma/2,$ or $\sigma^{2}\geq\gamma/2$ . In the first case, using Fact 59,

[Lemma 29 from ] Given $\sigma^{2}\leq 2,$

Using Lemma 61, the above bound follows by a change of variables and induction. Note that the constant inside the $O()$ depends (exponentially) on $i$ . ∎

Given $\mu,\mu^{\prime},\sigma^{2},\sigma^{\prime 2}$ such that $|\mu|,|\mu^{\prime}|<c$ and $\epsilon^{1/3}\leq\sigma^{2},\sigma^{\prime 2}\leq 2$ and $|\mu-\mu^{\prime}|+|\sigma^{2}-\sigma^{\prime 2}|\leq\epsilon,$ (and we also assume that $\epsilon c^{2}=o(1)$ and $c\geq 1$ ) then

Consider the interval $I=[-2c,2c]$ . Then in order to bound $\max(|\mathcal{N}(\mu,\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|)$ over $I$ , we first bound $\max(|\mathcal{N}(\mu,\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x)|)$ over $I$ and next we bound $\max(|\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|)$ over $I$ .

We prove this claim in two parts: first we bound $\max_{x\in I}|\mathcal{N}(\mu,\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x)|$ :

Next, we bound the term $\max_{x\in I}(|\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|)$ . We accomplish this by bounding both $\max_{x\in I}(\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x))$ and $\max_{x\in I}(\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x))$ . Assume that $\sigma^{2}\geq\sigma^{\prime 2}$ . Then it follows that: $\max(\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)-\mathcal{N}(\mu^{\prime},\sigma^{2},x))=\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},\mu^{\prime})-\mathcal{N}(\mu^{\prime},\sigma^{2},\mu^{\prime})=\frac{1}{\sqrt{2\pi}}[\frac{1}{\sigma^{\prime}}-\frac{1}{\sigma}]$ because $\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)$ decreases at a faster rate than $\mathcal{N}(\mu^{\prime},\sigma^{2},x)$ whenever $\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)>\mathcal{N}(\mu^{\prime},\sigma^{2},x)$ . Also using the restriction that $\sigma^{\prime 2},\sigma^{2}\geq\epsilon^{1/3}$ yields $[\frac{1}{\sigma^{\prime}}-\frac{1}{\sigma}]\leq\frac{1}{\sigma}O(\frac{\epsilon}{\sigma^{2}})\leq O(\sqrt{\epsilon})$ .

Lastly, we bound the term $\max_{x\in I}\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)$ :

Thus these bounds imply that $\max_{x\in I}(|\mathcal{N}(\mu^{\prime},\sigma^{2},x)-\mathcal{N}(\mu^{\prime},\sigma^{\prime 2},x)|)=O(c^{2}\epsilon^{1/6})$ ∎

So we can use Claim 64 to conclude that

And we can use Corollary 62 to conclude that

The $k^{th}$ raw moment of a univariate Gaussian, $M_{k}(\mathcal{N}(\mu,\sigma^{2}))=\sum_{i=0}^{k}c_{i}\mu^{i}\sigma^{2(k-i)},$ where $|c_{i}|\leq(k+2)!.$

Consider the moment generating function $M_{X}(t)=e^{t\mu+\sigma^{2}t^{2}/2}.$ We claim that $\frac{d^{i}M_{X}(t)}{dt^{i}}=poly_{i}(\mu,\sigma,t)\cdot M_{X}(t),$ where $poly_{i}(\mu,\sigma,t)$ is a polynomial of $\mu,\sigma^{2},t$ , whose degree when viewed as a polynomial over $t$ is at most $i$ , whose degree when viewed as a polynomial over $\mu,\sigma^{2}$ is at most $i$ , and whose coefficients are bounded in magnitude by $i!.$ We prove this by induction, with the base case $i=1$ being trivial. Assuming the statement holds for some value $i\geq 1,$ we have

Thus $poly_{i+1}(\mu,\sigma,t)=poly_{i}(\mu,\sigma,t)(\sigma^{2}t+\mu)+\frac{dpoly_{i}(\mu,\sigma,t)}{dt}$ , since $\frac{dM_{X}(t)}{dt}=(\sigma^{2}t+\mu)M_{X}(t)$ . Clearly $deg_{t}(poly_{i+1}(\mu,\sigma,t))=i+1,$ and the degree in terms of $\mu$ and $\sigma^{2}$ increases by at most one. To get from $poly_{i}$ to $poly_{i+1},$ each coefficient receives at most two contributions from the first product, and a contribution multiplied by at most $i$ from the second term because of the differentiation. Thus if $c$ is the maximum magnitude of a coefficient of $poly_{i},$ the maximum magnitude of a coefficient of $poly_{i+1}$ will be at most $(2+i)c,$ from which the claim follows. ∎
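The coefficient bound $|c_{i}|\leq(k+2)!$ can be checked for small $k$ by expanding the raw moments explicitly. The sketch below does not follow the induction on the moment generating function given above; instead it uses the standard recurrence $M_{k}=\mu M_{k-1}+(k-1)\sigma^{2}M_{k-2}$ for Gaussian raw moments, and stores each moment as a polynomial in $\mu$ and $\sigma^{2}$ . The function name and the dictionary representation are hypothetical.

```python
import math

def raw_moment_polys(k_max):
    """Raw moments M_0..M_{k_max} of N(mu, sigma^2) as polynomials.

    Each polynomial is a dict {(a, b): coeff} meaning coeff * mu^a * (sigma^2)^b,
    built from M_k = mu * M_{k-1} + (k - 1) * sigma^2 * M_{k-2}."""
    polys = [{(0, 0): 1}, {(1, 0): 1}]
    for k in range(2, k_max + 1):
        nxt = {}
        for (a, b), c in polys[k - 1].items():      # mu * M_{k-1}
            nxt[(a + 1, b)] = nxt.get((a + 1, b), 0) + c
        for (a, b), c in polys[k - 2].items():      # (k-1) * sigma^2 * M_{k-2}
            nxt[(a, b + 1)] = nxt.get((a, b + 1), 0) + (k - 1) * c
        polys.append(nxt)
    return polys[:k_max + 1]

polys = raw_moment_polys(8)
for k, poly in enumerate(polys):
    largest = max(abs(c) for c in poly.values())
    print(k, largest, math.factorial(k + 2))
```

For example, the fourth moment comes out as $\mu^{4}+6\mu^{2}\sigma^{2}+3\sigma^{4}$ , whose largest coefficient $6$ is well below $6!=720$ .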