Near-optimal-sample estimators for spherical Gaussian mixtures

Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, Ananda Theertha Suresh

Introduction

Meaningful information often resides in high-dimensional spaces: voice signals are expressed in many frequency bands, credit ratings are influenced by multiple parameters, and document topics are manifested in the prevalence of numerous words. Some applications, such as topic modeling and genomic analysis consider data in over 1000 dimensions, .

Typically, information can be generated by different types of sources: voice is spoken by men or women, credit parameters correspond to wealthy or poor individuals, and documents address topics such as sports or politics. In such cases the overall data follow a mixture distribution .

Mixtures of high-dimensional distributions are therefore central to the understanding and processing of many natural phenomena. Methods for recovering the mixture components from the data have consequently been extensively studied by statisticians, engineers, and computer scientists.

An important and extensively studied special case of mixture distributions are spherical-Gaussians , where different coordinates have the same variance, though potentially different means. Due to their simple structure, they are easier to analyze and under a minimum-separation assumption have provably-practical algorithms for clustering and parameter estimation .

2 Sample complexity

Reducing the number of samples is of great practical significance. For example, in topic modeling every sample is a whole document, in credit analysis every sample is a person’s credit history, and in genetics, every sample is a human DNA. Hence samples can be very scarce and obtaining them can be very costly. By contrast, current CPUs run at several Giga Hertz, hence samples are typically much more scarce of a resource than time.

Note that for one-dimensional statistical problems, the need for sample-efficient algorithms has been broadly recognized. The sample complexity of many problems is known quite accurately, often to within a constant factor. For example, for discrete distributions over {1,,s}\{1{,}\ldots{,}s\}, an approach proposed in and its modifications were used in to estimate the probability multiset using Θ(s/logs)\Theta(s/\log s) samples. Learning one-dimensional mm-modal distributions over {1,,s}\{1{,}\ldots{,}s\} requires Θ(mlog(s/m)/ϵ3)\Theta(m\log(s/m)/\epsilon^{3}) samples . Similarly, one-dimensional mixtures of kk structured distributions (log-concave, monotone hazard rate, and unimodal) over {1,,s}\{1{,}\ldots{,}s\} can be learned with O(k/ϵ4){\cal O}(k/\epsilon^{4}), O(klog(s/ϵ)/ϵ4){\cal O}(k\log(s/\epsilon)/\epsilon^{4}), and O(klog(s)/ϵ4){\cal O}(k\log(s)/\epsilon^{4}) samples, respectively, and these bounds are tight up to a factor of ϵ\epsilon .

Compared to one dimensional problems, in high dimensions there is a polynomial gap in the sample complexity. For example, for learning spherical Gaussian mixtures, the number of samples required by previous algorithms is O(d12){\cal O}(d^{12}) for k=2k=2 components, and increased exponentially with kk . In this paper we bridge this gap, by constructing near-linear sample complexity estimators.

3 Previous and new results

Our main contribution is PAC learning dd dimensional Gaussian mixtures with near-linear samples. We show few auxiliary results for one-dimensional Gaussians.

time. Observe that recent algorithms typically construct the covariance matrix , hence require nd2\geq nd^{2} time. In that sense, for small values of kk, the time complexity we derive is comparable to the best such algorithms can hope for. Observe also that the exponential dependence on kk is of the form d^{2}\Big{(}\frac{k^{7}}{\epsilon^{3}}\log\frac{d}{\delta}\Big{)}^{k^{2}}, which is significantly lower than the dO(k3)d^{{\cal O}(k^{3})} dependence in previous results.

By contrast, Theorem 2 shows that PAC learning kk-component spherical Gaussian mixtures require Ω(dk/ϵ2)\Omega(dk/\epsilon^{2}) samples for any algorithm, hence our distribution learning algorithms are nearly sample optimal. In addition, their time complexity significantly improves on previously known ones.

3.2 One-dimensional Gaussian mixtures

Independently and around the same time as this work showed that mixtures of two one-dimensional Gaussians can be learnt with O~(ϵ2)\widetilde{{\cal O}}(\epsilon^{-2}) samples and in time O(ϵ7.01){\cal O}(\epsilon^{-7.01}). We provide a natural estimator for learning mixtures of kk one dimensional Gaussians using some basic properties of Gaussian distributions and show that mixture of any kk-one dimensional Gaussians can be learnt with O~(kϵ2)\widetilde{{\cal O}}(k\epsilon^{-2}) samples and in time \widetilde{{\cal O}}\left(\bigl{(}\frac{k}{\epsilon}\bigr{)}^{3k+1}\right).

4 The approach and technical contributions

The popular Scheffe estimator takes a collection F{\cal F} of distributions and uses O(logF){\cal O}(\log|{\cal F}|) independent samples from an underlying distribution f\mathbf{f} to find a distribution in F{\cal F} whose distance from f\mathbf{f} is at most a constant factor larger than that of the distribution in F{\cal F} that is closet to ff . In Lemma 1, we lower the time complexity of the Scheffe algorithm from O(F2){\cal O}(|{\cal F}|^{2}) time to O~(F)\widetilde{{\cal O}}(|{\cal F}|), helping us reduce the time complexity of our algorithms.

Our goal is therefore to construct a small class of distributions that is ϵ\epsilon-close to any possible underlying distribution. For simplicity, consider spherical Gaussians with the same variance and means bounded by BB. Take the collection of all distributions derived by quantizing the means of all components in all coordinates to ϵm\epsilon_{m} accuracy, and quantizing the weights to ϵw\epsilon_{w} accuracy. It can be shown that to get distance ϵ\epsilon from the underlying distribution, it suffices to take ϵm,ϵw1/polyϵ(dk)\epsilon_{m},\epsilon_{w}\leq 1/{\rm{poly}}_{\epsilon}(dk). There are at most \bigl{(}\frac{B}{\epsilon_{m}}\bigr{)}^{dk}\cdot\bigl{(}\frac{1}{\epsilon_{w}}\bigr{)}^{k}=2^{\widetilde{{\cal O}}_{\epsilon}(dk)} possible combinations of the kk mean vectors and weights. Hence Scheffe implies an exponential-time algorithm with sample complexity O~(dk)\widetilde{{\cal O}}(dk).

To reduce the dependence on dd, one can approximate the span of the kk mean vectors. This reduces the problem from dd to kk dimensions, allowing us to consider a distribution collection of size 2O(k2)2^{{\cal O}(k^{2})}, with Scheffe sample complexity of just O(k2){\cal O}(k^{2}). constructs the sample correlation matrix and uses kk of its columns to approximate the span of mean vectors. This approach requires the kk columns of the sample correlation matrix to be very close to the actual correlation matrix, and thus requires a lot more samples.

We derive a spectral algorithm that uses the top kk eigenvectors of the sample covariance matrix to approximate the span of the kk mean vectors. Since we use the entire covariance matrix instead of just kk columns, a weaker concentration is sufficient and we gain on the sample complexity.

Using recent tools from non-asymptotic random matrix theory , we show that the approximation of the span of the means converges in O~(d)\widetilde{{\cal O}}(d) samples. This result allows us to address most “reasonable” distributions, but still there are some “corner cases” that need to be analyzed separately. To address them, we modify some known clustering algorithms such as single-linkage, and spectral projections. While the basic algorithms were known before, our contribution here, which takes a fair bit of effort and space, is to show that judicious modifications of the algorithms and rigorous statistical analysis yield polynomial time algorithms with near optimal sample complexity.

Our approach applies most directly to mixtures of spherical Gaussians. We provide a simple and practical recursive clustering and spectral algorithm that estimates all such distributions in Ok(dlog2d){\cal O}_{k}(d\log^{2}d) samples.

The paper is organized as follows. In Section 2, we introduce notations, describe results on the Scheffe estimator, and state a lower bound. In Section 3, we present the algorithm for kk-spherical Gaussians. In Section 4 we show a simple learning algorithm for one-dimensional Gaussian mixtures. To preserve readability, most of the technical details and proofs are given in the appendix.

Preliminaries

2 Selection from a pool of distributions

Let ϵ>0\epsilon>0. For some constant cc, given \frac{c}{\epsilon^{2}}{\log\bigl{(}\frac{|{\cal F}|}{\delta}\bigr{)}} independent samples from a distribution f\mathbf{f}, with probability 1δ\geq 1-\delta, the output f^\hat{\mathbf{f}} of modified scheffe D(f^,f)1000max(ϵ,D(f,F)).D({\hat{\mathbf{f}}},{\mathbf{f}})\leq 1000\max(\epsilon,D({\mathbf{f}},{{\cal F}})). Furthermore, the algorithm runs in time {\cal O}\bigl{(}\frac{|{\cal F}|T\log(|{\cal F}|/\delta)}{\epsilon^{2}}\bigr{)}.

We therefore find a small class F{\cal F} with at least one distribution close to the underlying mixture. For our problem of estimating kk component mixtures in dd-dimensions, T=O(dk)T={\cal O}(dk) and F=O~k,ϵ(d2)|{\cal F}|=\widetilde{{\cal O}}_{k,\epsilon}(d^{2}). Note that we have not optimized the constant 10001000 in the above lemma.

3 Lower bound

Using Fano’s inequality, we show an information theoretic lower bound of Ω(dk/ϵ2)\Omega(dk/\epsilon^{2}) samples to learn kk-component dd-dimensional mixtures of spherical Gaussians for any algorithm. More precisely,

Mixtures in d𝑑d dimensions

We now concentrate on estimating the means. As stated in the introduction, given the span of the mean vectors μi{\boldsymbol{\mu}_{i}}, we can grid the kk dimensional span to the required accuracy ϵg\epsilon_{g} and use Scheffe, to obtain a polynomial time algorithm. One of the natural and well-used methods to estimate the span of mean vectors is using the correlation matrix . Consider the correlation-type matrix,

In expectation, the fraction of terms from pi\mathbf{p}_{i} is wiw_{i}. Furthermore for a sample XX from a particular component jj,

Therefore, as nn\to\infty, the matrix SS converges to j=1kwjμjμjt\sum^{k}_{j=1}w_{j}{\boldsymbol{\mu}_{j}}{\boldsymbol{\mu}_{j}}^{t}, and its top kk eigenvectors span of means.

While the above intuition is well understood, the number of samples necessary for convergence is not well studied. Ideally, irrespective of the values of the means, we wish O~(d)\widetilde{{\cal O}}(d) samples to be sufficient for the convergence. However this is not true, as we demonstrate by a simple example.

Consider the special case, d=1d=1, k=2k=2, σ2=1\sigma^{2}=1, w1=w2=1/2w_{1}=w_{2}=1/2, and the difference of means μ1μ2=L|\mu_{1}-\mu_{2}|=L for a large L1L\gg 1. Given this prior information, one can estimate the the average of the mixture, that yields μ1+μ22\frac{\mu_{1}+\mu_{2}}{2}. Solving equations obtained by μ1+μ2\mu_{1}+\mu_{2} and μ1μ2=L\mu_{1}-\mu_{2}=L, yields μ1\mu_{1} and μ2\mu_{2}. The variance of the mixture is 1+L24>L241+\frac{L^{2}}{4}>\frac{L^{2}}{4}. With additional Chernoff type bounds, one can show that given nn samples the error in estimating the average is

Therefore to estimate the means to a small accuracy we need nL2n\geq L^{2}, i.e., more the separation, more samples are necessary.

A similar phenomenon happens in the convergence of the correlation matrices, where the variances of quantities of interest increases with separation. In other words, for the span to be accurate the number of samples necessary increases with the separation. To overcome this phenomenon, a natural idea is to cluster the Gaussians such that the means of components in the same cluster are close and then apply scheffe on the span within each cluster.

Even though spectral clustering algorithms are studied in , they assume that the weights are strictly bounded away from , which does not hold here. We use a simple recursive clustering algorithm that takes a cluster CC with average μ(C){\overline{\boldsymbol{\mu}}}(C). If there is a component in the cluster such that wiμiμ(C)2\sqrt{w_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2} is Ω(log(n/δ))\Omega(\log(n/\delta)), then the algorithm divides the cluster into two nonempty clusters without any mis-clustering.

For technical reasons similar to the above example, we also use a coarse clustering algorithm that ensures that the mean separation is O~(d1/4)\widetilde{{\cal O}}(d^{1/4}) within each cluster. The algorithm can be summarized as:

Variance estimation: Use first k+1k+1 samples and estimate the minimum distance among sample-pairs to estimate σ2\sigma^{2}.

Coarse clustering: Using a single-linkage algorithm, group the samples such that within each cluster formed, the mean separation is smaller than O~(d1/4)\widetilde{{\cal O}}(d^{1/4}).

Recursive clustering: As long as there is a cluster that has samples from more than one component with means far apart, (described by a condition on the norm of its covariance matrix in the algorithm) estimate its largest eigenvector and project samples of this cluster onto this eigenvector and cluster them. This hierarchical method is continued until there are clusters that contain close-by-components.

Search in the span: The resulting clusters contain components that are close-by, i.e., μiμj2<O(k3/2σ^2lognδ)\left|\left|{\boldsymbol{\mu}_{i}}-{\boldsymbol{\mu}_{j}}\right|\right|_{2}<{\cal O}(k^{3/2}{\hat{\sigma}}^{2}\log\frac{n}{\delta}). We approximate the span of means by the top k1k-1 eigenvectors and the mean vector, and perform an exhaustive search using Modified scheffe.

We now describe these steps stating the performance of each step.

2 Sketch of correctness

To simplify the bounds and expressions, we assume that d>1000d>1000 and δmin(2n2ed/10,1/3)\delta\geq\min(2n^{2}e^{-d/10},1/3). For smaller values of δ\delta, we run the algorithm with error 1/31/3 and repeat it O(log1δ){\cal O}(\log\frac{1}{\delta}) times to choose a set of candidate mixtures Fδ{\cal F}_{\delta}. By Chernoff-bound with error δ\leq\delta, Fδ{\cal F}_{\delta} contains a mixture ϵ\epsilon-close to f\mathbf{f}. Finally, we run modified scheffe on Fδ{\cal F}_{\delta} to obtain a mixture that is close to f\mathbf{f}. By the union bound and Lemma 1, the error is 2δ\leq 2\delta.

Variance estimation: Let σ^\hat{\sigma} be the variance estimate from step 1. In high dimensions, the difference between two random samples from a Gaussian concentrates. This is made precise in the next lemma which states σ^\hat{\sigma} is a good estimate of the variance. Then the following is a simple application of Gaussian tail bounds.

Given nn samples from the kk-component mixture, with probability 12δ1-2\delta,

Coarse single-linkage clustering: The second step is a single-linkage routine that clusters mixture components with far means. Single-linkage is a simple clustering scheme that starts out with each data point as a cluster, and at each step merges the two that are closest to form larger clusters. The algorithm stops when the distance between clusters is larger than a pre-specified threshold.

Suppose the samples are generated by an one-dimensional mixture of kk components that are far, then with high probability, when the algorithm generates kk clusters and all the samples within a cluster are generated by a single component. More precisely, if i,j[k]\forall i,j\in[k], μiμj=Ω(σlogn)|\mu_{i}-\mu_{j}|=\Omega(\sigma\log n), then all the nn samples concentrate around their respective means and the separation between any two samples from different components would be larger than the largest separation between any two samples from the same component. Hence for a suitable value of threshold, single-linkage correctly identifies the clusters. For dd-dimensional Gaussian mixtures a similar notion holds true, with minimum separation Ω(d1/4lognδ)\Omega(d^{1/4}\log\frac{n}{\delta}). More precisely,

After Step 2 of Learn k-sphere, with probability 12δ\geq 1-2\delta, all samples from each component will be in the same cluster and the maximum distance between two components within each cluster is \leq 10k\sigma\bigl{(}d\log\frac{n^{2}}{\delta}\bigr{)}^{1/4}.

Recursive spectral-clustering: The clusters formed at this step consists of components with mean separation O(d1/4lognδ){\cal O}(d^{1/4}\log\frac{n}{\delta}). We now recursively zoom into the clusters formed and show that it is possible to cluster the components with much smaller mean separation. Note that since the matrix is symmetric, the largest magnitude of the eigenvalue is same as the spectral norm. We first find the largest eigenvector of

which is the sample covariance matrix with its diagonal term reduced by σ^2.\hat{\sigma}^{2}. If there are two components with means far apart, then using single-linkage we divide the cluster into two. The following lemma shows that this step performs accurate clustering of components with means well separated.

Let ncdk4ϵlogn3δn\geq c\cdot\frac{dk^{4}}{\epsilon}\log\frac{n^{3}}{\delta}. After recursive clustering, with probability 14δ\geq 1-4\delta. the samples are divided into clusters such that for each component ii within any cluster CC, wiμiμ(C)225σk3logn3δ\sqrt{w_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\leq 25\sigma\sqrt{k^{3}\log\frac{n^{3}}{\delta}} . Furthermore, all the samples from one component remain in a single cluster.

Exhaustive search and Scheffe: After step 3, all clusters have a small weighted radius wiμiμ(C)225σk3logn3δ\sqrt{w_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\leq 25\sigma\sqrt{k^{3}\log\frac{n^{3}}{\delta}}, the the eigenvectors give an accurate estimate of the span of μiμ(C){\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C) within each cluster. More precisely,

Let ncdk9ϵ4log2dδn\geq c\cdot\frac{dk^{9}}{\epsilon^{4}}\log^{2}\frac{d}{\delta} for some constant cc. After step 3, with probability 17δ\geq 1-7\delta the following holds: if Cnϵ/5k|C|\geq n\epsilon/5k, then the projection of [μiμ(C)]/μiμ(C)2[{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)]/\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2} on the space orthogonal to the span of top k1k-1 eigenvectors has magnitude ϵσ82kwiμiμ(C)2\leq\frac{\epsilon\sigma}{8\sqrt{2}k\sqrt{w_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|}_{2}.

We now have accurate estimates of the spans of the clusters and each cluster has components with close means. It is now possible to grid the set of possibilities in each cluster to obtain a set of distributions such that one of them is close to the underlying. There is a trade-off between a dense grid to obtain a good estimation and the computation time required. The final step takes the sparsest grid possible to ensure an error ϵ\leq\epsilon. This is quantized below.

Let ncdk9ϵ4log2dδn\geq c\cdot\frac{dk^{9}}{\epsilon^{4}}\log^{2}\frac{d}{\delta} for some constant cc. Then Algorithm Learn k-sphere with probability 19δ\geq 1-9\delta, outputs a distribution f^\hat{\mathbf{f}} such that D(f^,f)1000ϵD({\hat{\mathbf{f}}},{\mathbf{f}})\leq 1000\epsilon. Furthermore, the algorithm runs in time {\cal O}\Big{(}n^{2}d\log n+d^{2}\Big{(}\frac{k^{7}}{\epsilon^{3}}\log\frac{d}{\delta}\Big{)}^{k^{2}}\Big{)}.

Note that the run time is calculated based on the efficient implementation of single-linkage and the exponential term is not optimized. We now study mixtures in one-dimension and provide an estimator using Modified Scheffe.

Mixtures in one dimension

Given nn independent samples x1,,xnx_{1},\ldots,x_{n} from N(μ,σ2)N(\mu,\sigma^{2}), there are two samples xj,xkx_{j},x_{k} such that xjμσ7log2/δ2n|x_{j}-\mu|\leq\sigma\frac{7\log 2/\delta}{2n} and xjxkσ2σ7log2/δ2n|x_{j}-x_{k}-\sigma|\leq 2\sigma\frac{7\log 2/\delta}{2n} with probability 1δ\geq 1-\delta.

Proof The density of N(μ,σ2)N(\mu,\sigma^{2}) is (7σ)1\geq(7\sigma)^{-1} in the interval [μ2σ,μ+2σ][\mu-\sqrt{2}\sigma,\mu+\sqrt{2}\sigma]. Therefore, the probability that a sample occurs in the interval μϵσ,μ+ϵσ\mu-\epsilon\sigma,\mu+\epsilon\sigma is 2ϵ/7\geq 2\epsilon/7. Hence, the probability that none of the nn samples occurs in [μϵσ,μ+ϵσ][\mu-\epsilon\sigma,\mu+\epsilon\sigma] is (12ϵ/7)ne2nϵ/7\leq(1-2\epsilon/7)^{n}\leq e^{-2n\epsilon/7}. If ϵ7log2/δ2n\epsilon\geq\frac{7\log 2/\delta}{2n}, then the probability that none of the samples occur in the interval is δ/2\leq\delta/2. A similar argument shows that there is a sample within interval, [μ+σϵσ,μ+σ+ϵσ][\mu+\sigma-\epsilon\sigma,\mu+\sigma+\epsilon\sigma], proving the lemma.

The above observation can be translated into selecting a pool of candidate distributions such that one of the distributions is close to the underlying distribution.

Given n120klog4kδϵn\geq\frac{120k\log\frac{4k}{\delta}}{\epsilon} samples from a mixture ff of kk Gaussians. Let S={N(xj,(xjxk)2):1j,kn}S=\{N(x_{j},(x_{j}-x_{k})^{2})\,:\,1\leq j,k\leq n\} be a set of Gaussians and W={0,ϵ2k,2ϵ2k,1}W=\{0,\frac{\epsilon}{2k},\frac{2\epsilon}{2k}\ldots,1\} be the set of weights. Let

be a set of n^{2k}\bigl{(}\frac{2k}{\epsilon}\bigr{)}^{k-1}\leq n^{3k-1} candidate mixture distributions. There exists a f^F\hat{f}\in{\cal F} such that D(f,f^)ϵD({f},{\hat{f}})\leq\epsilon.

Let f=(w1,w2,wk,p1,p2,pk)f=(w_{1},w_{2},\ldots w_{k},p_{1},p_{2},\ldots p_{k}). For f^=(w^1,w^2,,w^k1,1i=1k1w^i,p^1,p^2,p^k)\hat{f}=(\hat{w}_{1},\hat{w}_{2},\ldots,\hat{w}_{k-1},1-\sum^{k-1}_{i=1}\hat{w}_{i},\hat{p}_{1},\hat{p}_{2},\ldots\hat{p}_{k}), by the triangle inequality,

We show that there is a distribution in f^F\hat{f}\in{\cal F} such that the sum above is bounded by ϵ\epsilon. Since we quantize the grids as multiples of ϵ/2k\epsilon/2k, we consider distributions in F{\cal F} such that each w^iwiϵ/4k|\hat{w}_{i}-w_{i}|\leq\epsilon/4k, and therefore iw^iwiϵ2\sum_{i}|\hat{w}_{i}-w_{i}|\leq\frac{\epsilon}{2}.

We now show that for each pip_{i} there is a p^i\hat{p}_{i} such that wiD(pi,p^i)ϵ2kw_{i}D({p_{i}},{\hat{p}_{i}})\leq\frac{\epsilon}{2k}, thus proving that D(f,f^)ϵD({f},{\hat{f}})\leq\epsilon. If wiϵ4kw_{i}\leq\frac{\epsilon}{4k}, then wiD(pi,p^i)ϵ2kw_{i}D({p_{i}},{\hat{p}_{i}})\leq\frac{\epsilon}{2k}. Otherwise, let wi>ϵ4kw^{\prime}_{i}>\frac{\epsilon}{4k} be the fraction of samples from pip_{i}. By Lemma 9 and 14, with probability 1δ/2k\geq 1-\delta/2k,

Since wi>ϵ/4kw_{i}>\epsilon/4k, with probability 1δ/2k\geq 1-\delta/2k, wi2wiw_{i}\leq 2w^{\prime}_{i}. By the union bound with probability 1δ/k\geq 1-\delta/k, wiD(pi,p^i)60log4kδnw_{i}D({p_{i}},{\hat{p}_{i}})\leq\frac{60\log\frac{4k}{\delta}}{n}. Hence if n120klog4kδϵn\geq\frac{120k\log\frac{4k}{\delta}}{\epsilon}, the above quantity is less than ϵ/2k\epsilon/2k. The total error probability is δ\leq\delta by the union bound. ∎

Running Modified Scheffe algorithm on the above set of candidates F{\cal F} yields a mixture that is close to the underlying one. By Lemma 1 and the above lemma we get

Let ncklogkϵδϵ2n\geq c\cdot\frac{k\log\frac{k}{\epsilon\delta}}{\epsilon^{2}} for some constant cc. There is an algorithm that runs in time

and returns a mixture f^\hat{f} such that D(f,f^)1000ϵD({f},{\hat{f}})\leq 1000\epsilon with error probability 2δ\leq 2\delta.

The above bound matches the independent and contemporary result by for k=2k=2. While the process of identifying the candidate means is same for both the papers, the process of identifying the variances and proof techniques are different.

Acknowledgements

We thank Sanjoy Dasgupta, Todd Kemp, and Krishnamurthy Vishwanathan for helpful discussions.

References

Appendix A Useful tools

The Bhattacharyya parameter for two one dimensional Gaussian distributions p1=N(μ1,σ12)p_{1}=N(\mu_{1},\sigma_{1}^{2}) and p2=N(μ2,σ22)p_{2}=N(\mu_{2},\sigma_{2}^{2}) is

For Gaussian distributions the Bhattacharyya parameter is (see ), B(p1,p2)=yexB(p_{1},p_{2})=ye^{-x}, where x=(μ1μ2)2)4(σ12+σ22)x=\frac{(\mu_{1}-\mu_{2})^{2})}{4(\sigma_{1}^{2}+\sigma_{2}^{2})} and y=2σ1σ2σ12+σ22y=\sqrt{\frac{2\sigma_{1}\sigma_{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}} . Observe that

Substituting the value of xx results in the lemma. ∎

For any two Gaussian product distributions p1\mathbf{p}_{1} and p2\mathbf{p}_{2},

A.2 Concentration inequalities

We use the following concentration inequalities for Gaussian, Chi-Square, and sum of Bernoulli random variables in the rest of the paper.

For a Gaussian random variable XX with mean μ\mu and variance σ2\sigma^{2},

If Y1,Y2,YnY_{1},Y_{2},\ldots Y_{n} be nn i.i.d.Gaussian variables with mean and variance σ2\sigma^{2}, then

Furthermore for a fixed vector a\mathbf{a},

If X1,X2XnX_{1},X_{2}\ldots X_{n} are distributed according to Bernoulli pp, then with probability 1δ1-\delta,

We now state a non-asymptotic concentration inequality for random matrices that helps us bound errors in spectral algorithms.

Let y(1),y(2),,y(n)\mathbf{y}(1),\mathbf{y}(2),\ldots,\mathbf{y}(n) be generated according to N(0,Σ)N(0,\Sigma). For every ϵ(0,1)\epsilon\in(0,1) and t1t\geq 1, if n\geq c^{\prime}d\bigl{(}\frac{t}{\epsilon}\bigr{)}^{2} for some constant cc^{\prime}, then with probability 12et2n\geq 1-2e^{-t^{2}n},

A.3 Matrix eigenvalues

We now state few simple lemmas on the eigenvalues of perturbed matrices.

Let λ1AλAλdA0\lambda^{A}_{1}\geq\lambda^{A}\geq\ldots\lambda^{A}_{d}\geq 0 and λ1BλBλdB0\lambda^{B}_{1}\geq\lambda^{B}\geq\ldots\lambda^{B}_{d}\geq 0 be the eigenvalues of two symmetric matrices AA and BB respectively. If ABϵ\left|\left|A-B\right|\right|\leq\epsilon, then i\forall\,i, λiAλiBϵ|\lambda^{A}_{i}-\lambda^{B}_{i}|\leq\epsilon.

Let u1,u2,ud\mathbf{u}_{1},\mathbf{u}_{2},\ldots\mathbf{u}_{d} be a set of eigenvectors of AA that corresponds to λ1A,λ2A,λdA\lambda^{A}_{1},\lambda^{A}_{2},\ldots\lambda^{A}_{d}. Similarly let v1,v2,vd\mathbf{v}_{1},\mathbf{v}_{2},\ldots\mathbf{v}_{d} be eigenvectors of BB Consider the first eigenvalue of BB,

Now consider an i>1i>1. If λiB<λiAϵ\lambda^{B}_{i}<\lambda^{A}_{i}-\epsilon, then by definition of eigenvalues

Now consider a unit vector j=1iαjuj\sum^{i}_{j=1}\alpha_{j}\mathbf{u}_{j} in the span of u1,ui\mathbf{u}_{1},\ldots\mathbf{u}_{i}, that is orthogonal to v1,vi1\mathbf{v}_{1},\ldots\mathbf{v}_{i-1}. For this vector,

a contradiction. Hence, id\forall i\leq d, λiBλiAϵ\lambda^{B}_{i}\geq\lambda^{A}_{i}-\epsilon. The proof in the other direction is similar and omitted. ∎

Let A=i=1kηi2uiuitA=\sum^{k}_{i=1}\eta^{2}_{i}\mathbf{u}_{i}\mathbf{u}^{t}_{i} be a positive semidefinite symmetric matrix for kdk\leq d. Let u1,u2,uk\mathbf{u}_{1},\mathbf{u}_{2},\ldots\mathbf{u}_{k} span a k1k-1 dimensional space. Let B=A+RB=A+R, where Rϵ\left|\left|R\right|\right|\leq\epsilon. Let v1,v2,vk1\mathbf{v}_{1},\mathbf{v}_{2},\ldots\mathbf{v}_{k-1} be the top k1k-1 eigenvectors of BB. Then the projection of ui\mathbf{u}_{i} in space orthogonal to v1,v2,vk1\mathbf{v}_{1},\mathbf{v}_{2},\ldots\mathbf{v}_{k-1} is 2ϵηi\leq\frac{2\sqrt{\epsilon}}{\eta_{i}}.

Let ui=j=1k1αi,jvj+1j=1k1αi,j2u\mathbf{u}_{i}=\sum^{k-1}_{j=1}\alpha_{i,j}\mathbf{v}_{j}+\sqrt{1-\sum^{k-1}_{j=1}\alpha^{2}_{i,j}}\mathbf{u}^{\prime}, for a vector u\mathbf{u}^{\prime} orthogonal to v1,v2,vk1\mathbf{v}_{1},\mathbf{v}_{2},\ldots\mathbf{v}_{k-1}. We compute utAu\mathbf{u}^{\prime t}A\mathbf{u}^{\prime} in two ways. Since A=BRA=B-R,

Since u\mathbf{u}^{\prime} is orthogonal to first kk eigenvectors, we have Bu23ϵ\left|\left|B\mathbf{u}^{\prime}\right|\right|_{2}\leq 3\epsilon and hence u(BR)u4ϵ|\mathbf{u}^{\prime(}B-R)\mathbf{u}^{\prime}|\leq 4\epsilon.

We have shown that the above quantity is 4ϵ\leq 4\epsilon. Therefore \bigl{(}1-\sum^{k-1}_{j=1}\alpha^{2}_{i,j}\bigr{)}^{1/2}\leq 2\sqrt{\epsilon}/\eta_{i}. ∎

Appendix B Selection from a set of candidate distributions

We first present the algorithm Scheffe* with running time O~(F2Tn)\widetilde{{\cal O}}(|{\cal F}|^{2}Tn).

We make the following modification to the algorithm where we reduce the size of potential distributions by half in every iteration.

Algorithm modified Scheffe Input: set F{\cal F} of candidate distributions, ϵ:\epsilon: upper bound on minfiFD(f,fi)\min_{f_{i}\in{\cal F}}D({f},{f_{i}}), nn independent samples x1,,xnx_{1},\ldots,x_{n} from ff. 1. Let G=F{\cal G}={\cal F}, C{\cal C}\leftarrow\emptyset 2. Repeat until G>1|{\cal G}|>1: (a) Randomly form G/2|{\cal G}|/2 pairs of distributions in G{\cal G} and run Scheffe* on each pair using the nn samples. (b) Replace G{\cal G} with the G/2|{\cal G}|/2 winners. (c) Randomly select a set A{\cal A} of min{G,F1/3}\min\{|{\cal G}|,|{\cal F}|^{1/3}\} elements from G{\cal G}. (d) Run Scheffe* on each pair in A{\cal A} and add the distributions with most wins to C{\cal C}. 3. Run Scheffe* on C{\cal C} and output the winner

For the ease of proof, we assume that δ10logFF1/3\delta\geq\frac{10\log|{\cal F}|}{|{\cal F}|^{1/3}}. If δ<10logFF1/3\delta<\frac{10\log|{\cal F}|}{|{\cal F}|^{1/3}}, we run the algorithm with error probability 1/31/3 and repeat it O(log1δ){\cal O}(\log\frac{1}{\delta}) times to choose a set of candidate mixtures Fδ{\cal F}_{\delta}. By Chernoff-bound with error probability δ\leq\delta, Fδ{\cal F}_{\delta} contains a mixture close to ff. Finally, we run Scheffe* on Fδ{\cal F}_{\delta} to obtain a mixture that is close to ff.

For any set A{\cal A} and a distribution pp, given nn independent samples from pp the empirical probability μn(A)\mu_{n}({\cal A}) has a distribution around p(A)p({\cal A}) with standard deviation 1n\sim\frac{1}{\sqrt{n}}. Together with an observation in Scheffe estimation in one can show that if the number of samples n=O(logFδϵ2)n={\cal O}\left(\frac{\log\frac{|{\cal F}|}{\delta}}{\epsilon^{2}}\right), then Scheffe* has a guarantee 10max(ϵ,D(f,F))10\max(\epsilon,D({f},{{\cal F}})) with probability 1δ\geq 1-\delta.

Since we run Scheffe* at most F(2logF+1)|{\cal F}|(2\log|{\cal F}|+1) times, choosing δ=δ/(4FlogF+2F)\delta=\delta/(4|{\cal F}|\log|{\cal F}|+2|{\cal F}|) results in the sample complexity of

and the total error probability of δ/2\delta/2 for all runs of Scheffe* during the algorithm. The above value of nn dictates our sample complexity. We now consider the following two cases:

Appendix C Lower bound

We first show a lower bound for a single Gaussian distribution and generalize it to mixtures.

The proof is an application of the following version of Fano’s inequality . It states that we cannot simultaneously estimate all distributions in a class using nn samples if they satisfy certain conditions.

(Fano’s Inequality) Let f1,,fr+1f_{1},\ldots,f_{r+1} be a collection of distributions such that for any iji\neq j, D(fi,fj)αD({f_{i}},{f_{j}})\geq\alpha, and KL(fi,fj)βKL(f_{i},f_{j})\leq\beta. Let ff be an estimate of the underlying distribution using nn i.i.d. samples from one of the fif_{i}’s. Then,

We consider dd-dimensional spherical Gaussians with identity covariance matrix, with means along any coordinate restricted to ±cϵd\pm\frac{c\epsilon}{\sqrt{d}}. The KL divergence between two spherical Gaussians with identity covariance matrix is the squared distance between their means. Therefore, any two distributions we consider have KL distance at most

We now consider a subset of these 2d2^{d} distributions to obtain a lower bound on α\alpha. By the Gilbert-Varshamov bound, there exists a binary code with 2d/8\geq 2^{d/8} codewords of length dd and minimum distance d/8d/8. Consider one such code. Now for each codeword, map 1cϵd1\to\frac{c\epsilon}{\sqrt{d}} and 0cϵd0\to-\frac{c\epsilon}{\sqrt{d}} to obtain a distribution in our class. We consider this subset of 2d/8\geq 2^{d/8} distributions as our fif_{i}’s.

C.2 Mixtures of k𝑘k Gaussians

We now provide a lower bound on the sample complexity of learning mixtures of kk Gaussians in dd dimensions. We extend the construction for learning a single spherical Gaussian to mixtures of kk Gaussians and show a lower bound of Ω(kd/ϵ2)\Omega(kd/\epsilon^{2}) samples. We will again use Fano’s inequality over a class of 2kd/642^{kd/64} distributions as described next.

We now choose μ1,,μk\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{k}’s extremely well-separated. The class of distributions we consider will be a mixture of kk components, where the jjth component is a distribution from P{\cal P} shifted by μj\boldsymbol{\mu}_{j}. Since the μ\boldsymbol{\mu}’s will be well separated, we will use the results from last section over each component.

of kk spherical Gaussians. We consider this class of Tk=2kd/8T^{k}=2^{kd/8} distributions. By the Gilbert-Varshamov bound, for any T2T\geq 2, there is a TT-ary codes of length kk, with minimum distance k/8\geq k/8 and number of codewords 2k/8\geq 2^{k/8}. This implies that among the Tk=2dk/8T^{k}=2^{dk/8} distributions, there are 2kd/642^{kd/64} distributions such that any two tuples (i1,,ik)(i_{1},\ldots,i_{k}) and (i1,,ik)(i_{1}^{\prime},\ldots,i_{k}^{\prime}) corresponding to different distributions differ in at least k/8k/8 locations.

If we choose the μ\boldsymbol{\mu}’s well separated, the components of any mixture distribution have very little overlap. For simplicity, we choose μj\boldsymbol{\mu}_{j}’s satisfying

This implies that for jlj\neq l, PijPil1<(ϵ/2dk)10\left|\left|P_{ij}-P_{i^{\prime}l}\right|\right|_{1}<(\epsilon/2dk)^{10}. Therefore, for two different mixture distributions,

where (a)(a) follows form the fact that two mixtures have overlap only in the corresponding components, (b)(b) uses the fact that at least in k/8k/8 components ijiji_{j}\neq i_{j}^{\prime}, and then uses the lower bound from the previous section.

Now, to upper bound the KL divergence, we simply use the convexity, namely for any distributions P1PkP_{1}\ldots P_{k} and Q1QkQ_{1}\ldots Q_{k}, let Pˉ\bar{P} and Qˉ\bar{Q} be the mean distributions. Then,

By the construction and from the previous section, for any jj,

Therefore, we can take β=4c2ϵ2\beta=4c^{2}\epsilon^{2}.

which for c1=128,c=128.1c_{1}=128,c=128.1 and n<dk88ϵ2n<\frac{dk}{8^{8}\epsilon^{2}} is at least ϵ\epsilon.

Appendix D Proofs for k𝑘k spherical Gaussians

We first state a simple concentration result that helps us in other proofs.

We prove the lower bound, the proof for the upper bound is similar and omitted. Since X\mathbf{X} and Y\mathbf{Y} are Gaussians, XY\mathbf{X}-\mathbf{Y} is distributed as N(μ1μ2,2σ2)N(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2},2\sigma^{2}). Rewriting XY2\left|\left|\mathbf{X}-\mathbf{Y}\right|\right|_{2}

Furthermore (μ1μ2)Z(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})\cdot\mathbf{Z} is sum of Gaussians and hence a Gaussian distribution. It has mean and variance 2σ2μ1μ2222\sigma^{2}\left|\left|\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right|\right|^{2}_{2}. Therefore, by Lemma 15 with probability 1δ/n21-\delta/n^{2},

By the union bound with probability 12δ/n21-2\delta/n^{2},

There are (n2){n\choose 2} pairs and the lemma follows by the union bound. ∎

We show that if Equations (1) and (2) are satisfied, then the lemma holds. The error probability is that of Lemma 23 and is 2δ\leq 2\delta. Since the minimum is over k+1k+1 indices, at least two samples are from the same component. Applying Equations (1) and (2) for these two samples

Similarly by Equations (1) and (2) for any two samples X(a),X(b)\mathbf{X}(a),\mathbf{X}(b) in [k+1][k+1],

where the last inequality follows from the fact that α24αβ4β2\alpha^{2}-4\alpha\beta\geq-4\beta^{2}. The result follows from the assumption that d>20logn2/δd>20\log n^{2}/\delta.

D.2 Proof of Lemma 5

We show that if Equations (1) and (2) are satisfied, then the lemma holds. The error probability is that of Lemma 23 and is 2δ\leq 2\delta. Since Equations (1) and (2) are satisfied, by the proof of Lemma 4, σ^2σ22.5σ2log(n2/δ)d|{\hat{\sigma}}^{2}-\sigma^{2}|\leq 2.5\sigma^{2}\sqrt{\frac{\log(n^{2}/\delta)}{d}}. If two samples X(a)X(a) and X(b)X(b) are from the same component, by Lemma 23,

By Lemma 4, the above quantity is less than 2dσ^2+23σ^2dlogn2δ2d{\hat{\sigma}}^{2}+23{\hat{\sigma}}^{2}\sqrt{d\log\frac{n^{2}}{\delta}}. Hence all the samples from the same component are in a single cluster.

Suppose there are two samples from different components in a cluster, then by Equations (1) and (2),

Relating σ^2{\hat{\sigma}}^{2} and σ2\sigma^{2} using Lemma 4,

Hence \left|\left|{\boldsymbol{\mu}_{i}}-{\boldsymbol{\mu}_{j}}\right|\right|_{2}\leq 10\sigma\bigl{(}d\log\frac{n^{2}}{\delta}\bigr{)}^{1/4}. There are at most kk components; therefore, any two components within the same cluster are at a distance \leq 10k\sigma\bigl{(}d\log\frac{n^{2}}{\delta}\bigr{)}^{1/4}.

D.3 Proof of Lemma 6

The proof is involved and we show it in steps. We first show few concentration bounds which we use later to argue that the samples are clusterable when the sample covariance matrix has a large eigenvalue. Let w^i\hat{w}_{i} be the fraction of samples from component ii. Let μ^i\hat{\boldsymbol{\mu}}_{i} be the empirical average of samples from pi\mathbf{p}_{i}. Let μ^(C){\hat{\overline{\boldsymbol{\mu}}}}(C) be the empirical average of samples in cluster CC. If CC is the entire set of samples we use μ^{\hat{\overline{\boldsymbol{\mu}}}} instead of μ^(C){\hat{\overline{\boldsymbol{\mu}}}}(C). We first show a concentration inequality that we use in rest of the calculations.

Given nn samples from a kk-component Gaussian mixture with probability 12δ\geq 1-2\delta, for every component ii

The second inequality uses the fact that d20logn2/δd\geq 20\log n^{2}/\delta. For bounding the weights, observe that by Lemma 17 with probability 1δ/k\geq 1-\delta/k,

By the union bound the error probability is 2kδ/2k=δ\leq 2k\delta/2k=\delta. ∎

A simple application of triangle inequality yields the following lemma.

Given nn samples from a kk-component Gaussian mixture if Equation (3) holds, then

Given nn samples from a kk-component Gaussian mixture, if Equation (3) holds and the maximum distance between two components is \leq 10k\sigma\bigl{(}d\log\frac{n^{2}}{\delta}\bigr{)}^{1/4}, then μ^μ)2cσdklogn2δn,\left|\left|{\hat{\overline{\boldsymbol{\mu}}}}-{\overline{\boldsymbol{\mu}}})\right|\right|_{2}\leq c\sigma\sqrt{\frac{dk\log\frac{n^{2}}{\delta}}{n}}, for a constant cc.

Hence by Equation (3) and the fact that the maximum distance between two components is \leq 10k\sigma\bigl{(}d\log\frac{n^{2}}{\delta}\bigr{)}^{1/4},

For ndmax(k4,20logn2/δ,1000)n\geq d\geq\max(k^{4},20\log n^{2}/\delta,1000), we get the above term is ckdlogn2/δnσ\leq c\sqrt{\frac{kd\log n^{2}/\delta}{n}}\sigma, for some constant cc. ∎

We now make a simple observation on covariance matrices.

Given nn samples from a kk-component mixture,

Observe that for any two vectors u\mathbf{u} and v\mathbf{v},

Applying the above observation to u=μ^iμ^\mathbf{u}=\hat{\boldsymbol{\mu}}_{i}-{\hat{\overline{\boldsymbol{\mu}}}} and v=μiμ\mathbf{v}={\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}, we get

The lemma follows from triangle inequality. ∎

The following lemma immediately follows from Lemmas 26 and 27.

Given nn samples from a kk-component Gaussian mixture, if Equation (3) and the maximum distance between two components is \leq 10k\sigma\bigl{(}d\log\frac{n^{2}}{\delta}\bigr{)}^{1/4}, then

For a set of samples X(1),X(n)\mathbf{X}(1),\ldots\mathbf{X}(n) from a kk-component mixture,

where w^i\hat{w}_{i} and μ^i\hat{\boldsymbol{\mu}}_{i} are the empirical weights and averages of components ii and μ^=1ni=1nXi{\hat{\overline{\boldsymbol{\mu}}}}=\frac{1}{n}\sum^{n}_{i=1}\mathbf{X}_{i}.

First observe that for any set of points xix_{i} and their average x^\hat{x} and any value aa,

Summing over all components results in the lemma. ∎

We now bound the error in estimating the eigenvalue of the covariance matrix.

Given X(1),X(n)\mathbf{X}(1),\ldots\mathbf{X}(n), nn samples from a kk-component Gaussian mixture, if Equations (1), (2), and (3) hold, then with probability 12δ\geq 1-2\delta,

Since Equations (1), (2), and (3) hold, conditions in Lemmas 26 and 28 are satisfied. By Lemma 28,

By Lemma 29, the covariance matrix can be rewritten as

The second term in Equation (6) is bounded by Lemma 25. Hence together with the fact that d20logn2/δd\geq 20\log n^{2}/\delta we get that with probability 12δ\geq 1-2\delta, the second and third terms are bounded by O(σ2dknlogn2δ).{\cal O}\left(\sigma^{2}\sqrt{\frac{dk}{n}\log\frac{n^{2}}{\delta}}\right).

Let u\mathbf{u} be the largest eigenvector of the sample covariance matrix and ncdk2logn2δn\geq c\cdot dk^{2}\log\frac{n^{2}}{\delta}. If maxiw^iμiμ2=ασ\max_{i}\sqrt{\hat{w}_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}\right|\right|_{2}=\alpha\sigma and Equation (LABEL:eq:covconc) holds, then there exists ii such that u(μiμ)σ(α11/α)/k|\mathbf{u}\cdot({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}})|\geq\sigma(\alpha-1-1/\alpha)/\sqrt{k}.

Observe that jwjvjvjtjwjvjvjtvivi2wivi22\left|\left|\sum_{j}w_{j}\mathbf{v}_{j}\mathbf{v}^{t}_{j}\right|\right|\geq\left|\left|\sum_{j}w_{j}\mathbf{v}_{j}\mathbf{v}^{t}_{j}\frac{\mathbf{v}_{i}}{\left|\left|\mathbf{v}_{i}\right|\right|}\right|\right|_{2}\geq w_{i}\left|\left|\mathbf{v}_{i}\right|\right|^{2}_{2}. Therefore

Hence by Lemma 30 and the triangle inequality, the largest eigenvalue of the sample-covariance matrix is α2σ2c(n)\geq\alpha^{2}\sigma^{2}-c(n). Similarly by applying Lemma 30 again we get,i=1kw^i(μiμ)(μiμ)tu2α2σ22c(n).\left|\left|\sum^{k}_{i=1}\hat{w}_{i}({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}})({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}})^{t}\mathbf{u}\right|\right|_{2}\geq\alpha^{2}\sigma^{2}-2c(n). By triangle inequality and Cauchy-Schwartz inequality,

Hence kασmaxi(μiμ)uα2σ22c(n)\sqrt{k}\alpha\sigma\max_{i}|({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}})\cdot\mathbf{u}|\geq\alpha^{2}\sigma^{2}-2c(n). The lemma follows by substituting the bound on nn in c(n)c(n). ∎

We now make a simple observation on Gaussian mixtures.

The samples from a subset of components AA of the Gaussian mixture are distributed according to a Gaussian mixture of components AA with weights being wi=wi/(jAwj)w_{i}^{\prime}=w_{i}/(\sum_{j\in A}w_{j}).

Observe that we run the recursive clustering at most nn times. At every step, the underlying distribution within a cluster is a Gaussian mixture. Let Equations (1), (2) hold with probability 12δ1-2\delta. Let Equations (3) (LABEL:eq:covconc) all hold with probability 1δ\geq 1-\delta^{\prime}, where δ=δ/2n\delta^{\prime}=\delta/2n at each of nn steps. By the union bound the total error is 2δ+δ2n3δ\leq 2\delta+\delta^{\prime}\cdot 2n\leq 3\delta. Since Equations (1), (2) holds, the conditions of Lemmas 4 and 5 hold. Furthermore it can be shown that discarding at most nϵ/4kn\epsilon/4k samples at each step does not affect the calculations.

We first show that if wiμiμ(C)225k3log(n3/δ)σ\sqrt{w_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\geq 25\sqrt{k^{3}\log(n^{3}/\delta)}\sigma, then the algorithm gets into the loop. Let wiw_{i}^{\prime} be the weight of the component within the cluster and nnϵ/5kn^{\prime}\geq n\epsilon/5k be the number of samples in the cluster. Let α=25k3log(n3/δ)\alpha=25\sqrt{k^{3}\log(n^{3}/\delta)}. By Fact 32, the components in cluster CC have weight wiwiw_{i}^{\prime}\geq w_{i}. Hence wiμiμ(C)2ασ\sqrt{w_{i}^{\prime}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\geq\alpha\sigma. Since wiμiμ(C)2ασ\sqrt{w_{i}^{\prime}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\geq\alpha\sigma, and by Lemma 5 μiμ(C)10kσ(dlogn2/δ)1/4\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|\leq 10k\sigma(d\log n^{2}/\delta)^{1/4}, we have wiα2/(100k2dlogn2/δ)w_{i}^{\prime}\geq\alpha^{2}/(100k^{2}\sqrt{d\log n^{2}/\delta}). Hence by lemma 24, wiwi/2w_{i}^{\prime}\geq w_{i}/2 and w^iμiμ(C)2ασ/2\sqrt{\hat{w}_{i}^{\prime}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\geq\alpha\sigma/\sqrt{2}. Hence by Lemma 30 and triangle inequality the largest eigenvalue of S(C)S(C) is

Therefore the algorithm gets into the loop.

If nnϵ/8k2cdk2logn3δn^{\prime}\geq n\epsilon/8k^{2}\geq c\cdot dk^{2}\log\frac{n^{3}}{\delta}, then by Lemma 31, there exists a component ii such that u(μiμ(C))σ(α/212/α)/k|\mathbf{u}\cdot({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))|\geq\sigma(\alpha/\sqrt{2}-1-\sqrt{2}/\alpha)/\sqrt{k}, where u\mathbf{u} is the top eigenvector of the first nϵ/4k2n\epsilon/4k^{2} samples.

Observe that iCwiu(μiμ(C))=0\sum_{i\in C}w_{i}\mathbf{u}\cdot({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))=0 and maxiu(μiμ(C))σ(α/212/α)/k\max_{i}|\mathbf{u}\cdot({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))|\geq\sigma(\alpha/\sqrt{2}-1-\sqrt{2}/\alpha)/\sqrt{k}. Let μi{\boldsymbol{\mu}_{i}} be sorted according to their values of u(μiμ(C))\mathbf{u}\cdot({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)), then

where the last inequality follows from Lemma 4 and the fact that d20logn2/δd\geq 20\log n^{2}/\delta. For a sample from component pi\mathbf{p}_{i}, similar to the proof of Lemma 5, by Lemma 15, with probability 1δ/n2k\geq 1-\delta/n^{2}k,

where the second inequality follows from Lemma 4. Since there are two components that are far apart by 9σ^logn2δσ^\geq 9\hat{\sigma}\sqrt{\log\frac{n^{2}}{\delta}}\hat{\sigma} and the maximum distance between a sample and its mean is 2σ^log(n2k/δ)\leq 2{\hat{\sigma}}\sqrt{\log(n^{2}k/\delta)} and the algorithm divides into at-least two non-empty clusters such that no two samples from the same distribution are clustered into two clusters.

For the second part observe that by the above concentration on u\mathbf{u}, no two samples from the same component are clustered differently irrespective of the mean separation. Note that we are using the fact that each sample is clustered at most 2k2k times to get the bound on the error probability. The total error probability by the union bound is 4δ\leq 4\delta. ∎

D.4 Proof of Lemma 7

We show that if the conclusions in Lemmas 6 and 24 holds, then the lemma is satisfied. We also assume that the conclusions in Lemma 30 holds for all the clusters with error probability δ=δ/k\delta^{\prime}=\delta/k. By the union bound the total error probability is 7δ\leq 7\delta.

Let v1,v2,vk1\mathbf{v}_{1},\mathbf{v}_{2},\ldots\mathbf{v}_{k-1} be the top eigenvectors of 1CiCwi(μiμ(C))(μiμ(C))t\frac{1}{|C|}\sum_{i\in C}w_{i}({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))^{t}. Let ηi=w^iμiμ(C)2=w^inCμiμ(C)2\eta_{i}=\sqrt{\hat{w}_{i}^{\prime}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}=\sqrt{\hat{w}_{i}}\sqrt{\frac{n}{|C|}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}. Let Δi=μiμ(C))(μiμ(C))2\boldsymbol{\Delta}_{i}=\frac{{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))}{\left|\left|({\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C))\right|\right|_{2}}. Therefore,

Hence by Lemma 20, the projection of Δi\boldsymbol{\Delta}_{i} on the space orthogonal to top k1k-1 eigenvectors of S(C)S(C) is

The last inequality follows from the bound on w^i\hat{w}_{i} in Lemma 24.

D.5 Proof of Theorem 8

We show that the theorem holds if the conclusions in Lemmas 7 and 26 holds with error probability δ=δ/k\delta^{\prime}=\delta/k. Since in the proof of Lemma 7, the probability that Lemma 6 holds is included, Lemma 6 also holds with the same probability. Since there are at most kk clusters, by the union bound the total error probability is 9δ\leq 9\delta.

For every component ii, we show that there is a choice of mean vector and weight in the search step such that wiD(pi,p^i)ϵ/2kw_{i}D({\mathbf{p}_{i}},{\hat{\mathbf{p}}_{i}})\leq\epsilon/2k and wiw^iϵ/4k|w_{i}-\hat{w}_{i}|\leq\epsilon/4k. That would imply that there is a f^\hat{\mathbf{f}} during the search such that

Since the weights are gridded by ϵ/4k\epsilon/4k, there exists a w^i\hat{w}_{i} such that wiw^iϵ/4k|w_{i}-\hat{w}_{i}|\leq\epsilon/4k. We now show that there exists a choice of mean vector such that wiD(pi,p^i)ϵ/2kw_{i}D({\mathbf{p}_{i}},{\hat{\mathbf{p}}_{i}})\leq\epsilon/2k. Note that if a component has weight ϵ/4k\leq\epsilon/4k, the above inequality follows immediately. Therefore we only look at those components with wiϵ/4kw_{i}\geq\epsilon/4k, by Lemma 24, for such components w^iϵ/5k\hat{w}_{i}\geq\epsilon/5k and therefore we only look at clusters such that Cnϵ/5k|C|\geq n\epsilon/5k. By Lemmas 14 and for any ii,

Note that since we are discarding at most nϵ/8k2n\epsilon/8k^{2} random samples at each step. A total number of nϵ/8k\leq n\epsilon/8k random samples are discarded. It can be shown that this does not affect our calculations and we ignore it in this proof. By Lemma 4, the first estimate of σ2\sigma^{2} satisfies σ^2σ22.5σ2logn2/δ|{\hat{\sigma}}^{2}-\sigma^{2}|\leq 2.5\sigma^{2}\sqrt{\log n^{2}/\delta}. Hence while searching over values of σ^2\hat{\sigma}^{2}, there exist one such that σ2σ2ϵσ2/64dk2|\sigma^{\prime 2}-\sigma^{2}|\leq\epsilon\sigma^{2}/\sqrt{64dk^{2}}. Hence,

Therefore if we show that there is a mean vector μ^i\hat{\boldsymbol{\mu}}_{i} during the search such that μiμ^i2ϵσ/16k2w^i\left|\left|{\boldsymbol{\mu}_{i}}-\hat{\boldsymbol{\mu}}_{i}\right|\right|_{2}\leq\epsilon\sigma/\sqrt{16k^{2}\hat{w}_{i}}, that would prove the Lemma. By triangle inequality,

The second inequality follows from the bound on nn and the fact that Cnw^i|C|\geq n\hat{w}_{i}. Since wiϵ/4kw_{i}\geq\epsilon/4k, by Lemma 24, w^iwi/2\hat{w}_{i}\geq w_{i}/2, we have

Let u1uk1\mathbf{u}_{1}\ldots\mathbf{u}_{k-1} are the top eigenvectors the sample covariance matrix of cluster CC. We now prove that during the search, there is a vector of the form j=1k1gjϵgσ^uj\sum^{k-1}_{j=1}g_{j}\epsilon_{g}\hat{\sigma}\mathbf{u}_{j} such that μiμ(C)j=1k1gjϵgσ^uj2ϵσ8kwi\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)-\sum^{k-1}_{j=1}g_{j}\epsilon_{g}\hat{\sigma}\mathbf{u}_{j}\right|\right|_{2}\leq\frac{\epsilon\sigma}{8k\sqrt{w_{i}}}, during the search, thus proving the lemma. Let ηi=wiμiμ(C)2\eta_{i}=\sqrt{w_{i}}\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}. By Lemma 7, there are set of coefficients αi\alpha_{i} such that

where u\mathbf{u}^{\prime} is perpendicular to u1uk1\mathbf{u}_{1}\ldots\mathbf{u}_{k-1} and 1α2ϵσ/(82ηik)\sqrt{1-\left|\left|\alpha\right|\right|^{2}}\leq\epsilon\sigma/(8\sqrt{2}\eta_{i}k). Hence, we have

Since wiϵ/4kw_{i}\geq\epsilon/4k and by Lemma 6, ηi25k3σlog(n3/δ)\eta_{i}\leq 25\sqrt{k^{3}}\sigma\log(n^{3}/\delta), and μiμ(C)2100k4ϵ1σlog(n3/δ)\left|\left|{\boldsymbol{\mu}_{i}}-{\overline{\boldsymbol{\mu}}}(C)\right|\right|_{2}\leq 100\sqrt{k^{4}\epsilon^{-1}}\sigma\log(n^{3}/\delta). Therefore gj\exists g_{j} such that gjσ^αjϵgσ^|g_{j}\hat{\sigma}-\alpha_{j}|\leq\epsilon_{g}\hat{\sigma} on each eigenvector. Hence,

The last inequality follows by Lemma 4 and the fact that ϵgϵ/16k3/2\epsilon_{g}\leq\epsilon/16k^{3/2}, and hence the theorem. The run time can be easily computed by retracing the steps of the algorithm and using an efficient implementation of single-linkage.