Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians

Constantinos Daskalakis, Gautam Kamath

Introduction

Learning mixtures of Gaussian distributions is one of the most fundamental problems in Statistics, with a multitude of applications in the natural and social sciences, which has recently received considerable attention in Computer Science literature. Given independent samples from an unknown mixture of Gaussians, the task is to ‘learn’ the underlying mixture.

In one version of the problem, ‘learning’ means estimating the parameters of the mixture, that is the mixing probabilities as well as the parameters of each constituent Gaussian. The most popular heuristic for doing so is running the EM algorithm on samples from the mixture [DLR77], albeit no rigorous guarantees are known for it in general.

A line of research initiated by Dasgupta [Das99, AK01, VW02, AM05, BV08] provides rigorous guarantees under separability conditions: roughly speaking, it is assumed that the constituent Gaussians have variation distance bounded away from (indeed, in some cases, distance exponentially close to 11). This line of work was recently settled by a triplet of breakthrough results [KMV10, MV10, BS10], establishing the polynomial solvability of the problem under minimal separability conditions for the parameters to be recoverable in the first place: for any ε>0\varepsilon>0, polynomial in nn and 1/ε1/\varepsilon samples from a mixture of nn-dimensional Gaussians suffice to recover the parameters of the mixture in poly(n,1/ε){\rm poly}(n,1/\varepsilon) time.

While these results settle the polynomial solvability of the problem, they serve more as a proof of concept in that their dependence on 1/ε1/\varepsilon is quite expensive.For example, the single-dimensional algorithm in the heart of [KMV10] has sample and time complexity of Θ(1/ε300)\Theta(1/\varepsilon^{300}) and Ω(1/ε1377)\Omega(1/\varepsilon^{1377}) respectively (even though the authors most certainly did not intend to optimize their constants). Indeed, even for mixtures of two single-dimensional Gaussians, a practically efficient algorithm for this problem is unknown.

A weaker goal for the learner of Gaussian mixtures is this: given samples from an unknown mixture, find any mixture that is close to the unknown one, for some notion of closeness. This PAC-style version of the problem [KMR+94] was pursued by Feldman et al. [FOS06] who obtained efficient learning algorithms for mixtures of nn-dimensional, axis-aligned Gaussians. Given poly(n,1/ε,L){\rm poly}(n,1/\varepsilon,L) samples from such mixture, their algorithm constructs a mixture whose KL divergence to the sampled one is at most ε\varepsilon. Unfortunately, the sample and time complexity of their algorithm depends polynomially on a (priorly known) bound LL, determining the range of the means and variances of the constituent Gaussians in every dimension.In particular, it is assumed that every constituent Gaussian in every dimension has mean μ[μmax,μmax]\mu\in[-\mu_{\max},\mu_{\max}] and variance σ2[σmin2,σmax2]\sigma^{2}\in[\sigma_{\min}^{2},\sigma_{\max}^{2}] where μmaxσmax/σminL\mu_{\max}\sigma_{\max}/\sigma_{\min}\leq L. In particular, the algorithm has pseudo-polynomial dependence on LL where there shouldn’t be any dependence on LL at all [KMV10, MV10, BS10].

Inspired by this recent progress on non-properly learning single-dimensional mixtures, our goal in this paper is to provide sample-optimal algorithms that properly learn. We obtain such algorithms for mixtures of two single-dimensional Gaussians. Namely,

We note that learning a univariate mixture often lies at the heart of learning multivariate mixtures [KMV10, MV10], so it is important to understand this fundamental case.

Note that our algorithm makes no separability assumptions about the constituent Gaussians of the unknown mixture, nor does it require or depend on a bound on the mixture’s parameters. Also, because the mixture is single-dimensional it is not amenable to the techniques of [HK13]. Moreover, it is easy to see that our sample complexity is optimal up to logarithmic factors. Indeed, a Gaussian mixture can trivially simulate a Bernoulli distribution as follows. Let ZZ be a Bernoulli random variable that is with probability 1p1-p and 11 with probability pp. Clearly, ZZ can be viewed as a mixture of two Gaussian random variables of variance, which have means and 11 and are mixed with probabilities 1p1-p and pp respectively. It is known that 1/ε21/\varepsilon^{2} samples are needed to properly learn a Bernoulli distribution, hence this lower bound immediately carries over to Gaussian mixtures.

Approach.

Our algorithm is intuitively quite simple, although some care is required to make the ideas work. First, we can guess the mixing weight up to additive error O(ε)O(\varepsilon), and proceed with our algorithm pretending that our guess is correct. Every guess will result in a collection of candidate distributions, and the final step of our algorithm is a tournament that will select, from among all candidate distributions produced in the course of our algorithm, a distribution that is ε\varepsilon-close to the unknown mixture, if such a distribution exists. To do this we will make use of the following theorem which is our second main contribution in this paper. (See Theorem 19 for a precise statement.)

Informal Theorem 19. There exists an algorithm FastTournament that takes as input sample access to an unknown distribution XX and a collection of candidate hypothesis distributions H1,,HNH_{1},\ldots,H_{N}, as well as an accuracy parameter ε>0\varepsilon>0, and has the following behavior: if there exists some distribution among H1,,HNH_{1},\ldots,H_{N} that is ε\varepsilon-close to XX, then the distribution output by the algorithm is O(ε)O(\varepsilon)-close to XX. Moreover, the number of samples drawn by the algorithm from each of the distributions is O(logN/ε2)O(\log N/\varepsilon^{2}) and the running time is O(NlogN/ε2)O(N\log N/\varepsilon^{2}).

Devising a hypothesis selection algorithm with the performance guarantees of Theorem 19 requires strengthening the Scheffé-estimate-based approach described in Chapter 6 of [DL01] (see [Yat85, DL96, DL97], as well as the recent papers of Daskalakis et al. [DDS12] and Chan et al. [CDSS13]) to continuous distributions whose crossings are difficult to compute exactly as well as to sample-only access to all involved distributions, but most importantly improving the running time to almost linear in the number NN of candidate distributions. Further comparison of our new hypothesis selection algorithm to related work is provided in Section 4. It is also worth noting that the tournament based approach of [FOS06] cannot be used for our purposes in this paper as it would require a priorly known bound on the mixture’s parameters and would depend pseudopolynomially on this bound.

Tuning the number of samples according to the guessed mixing weight, we proceed to draw samples from the unknown mixture, expecting that some of these samples will fall sufficiently close to the means of the constituent Gaussians, where the closeness will depend on the number of samples drawn as well as the unknown variances. We guess which sample falls close to the mean of the constituent Gaussian that has the smaller value of σ/w\sigma/w (standard deviation to mixing weight ratio), which gives us the second parameter of the mixture. To pin down the variance of this Gaussian, we implement a natural idea. Intuitively, if we draw samples from the mixture, we expect that the constituent Gaussian with the smallest σ/w\sigma/w will determine the smallest distance among the samples. Pursuing this idea we produce a collection of variance candidates, one of which truly corresponds to the variance of this Gaussian, giving us a third parameter.

At this point, we have a complete description of one of the component Gaussians. If we could remove this component from the mixture, we would be left with the remaining unknown Gaussian. Our approach is to generate an empirical distribution of the mixture and “subtract out” the component that we already know, giving us an approximation to the unknown Gaussian. For the purposes of estimating the two parameters of this unknown Gaussian, we observe that the most traditional estimates of location and scale are unreliable, since the error in our approximation may cause probability mass to be shifted to arbitrary locations. Instead, we use robust statistics to obtain approximations to these two parameters.

The empirical distribution of the mixture is generated using the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [DKW56]. With O(1/ε2)O(1/\varepsilon^{2}) samples from an arbitrary distribution, this algorithm generates an ε\varepsilon-approximation to the distribution (with respect to the Kolmogorov metric). Since this result applies to arbitrary distributions, it generates a hypothesis that is weak, in some senses, including the choice of distance metric. In particular, the hypothesis distribution output by the DKW inequality is discrete, resulting in a total variation distance of 11 from a mixture of Gaussians (or any other continuous distribution), regardless of the accuracy parameter ε\varepsilon. Thus, we consider it to be interesting that such a weak hypothesis can be used as a tool to generate a stronger, proper hypothesis. We note that the Kolmogorov distance metric is not special here - an approximation with respect to other reasonable distance metrics may be substituted in, as long as the description of the hypothesis is efficiently manipulable in the appropriate ways.

Comparison to Prior Work on Learning Gaussian Mixtures.

In comparison to the recent breakthrough results [KMV10, MV10, BS10], our algorithm has near-optimal sample complexity and much milder running time, where these results have quite expensive dependence of both their sample and time complexity on the accuracy ε\varepsilon, even for single-dimensional mixtures.For example, compared to [KMV10] we improve by a factor of at least 150150 the exponent of both the sample and time complexity. On the other hand, our algorithm has weaker guarantees in that we properly learn but don’t do parameter estimation. In comparison to [FOS06], our algorithm requires no bounds on the parameters of the constituent Gaussians and exhibits no pseudo-polynomial dependence of the sample and time complexity on such bounds. On the other hand, we learn with respect to the total variation distance rather than the KL divergence. Finally, compared to [CDSS13, CDSS14], we properly learn while they non-properly learn and we both have near-optimal sample complexity.

Recently and independently, Acharya et al. [AJOS14a] have also provided algorithms for properly learning spherical Gaussian mixtures. Their primary focus is on the high dimensional case, aiming at a near-linear sample dependence on the dimension. Our focus is instead on optimizing the dependence of the sample and time complexity on ε\varepsilon in the one-dimensional case.

Preliminaries

The univariate half-normal distribution with parameter σ2\sigma^{2} is the distribution of Y|Y| where YY is distributed according to N(0,σ2)\mathcal{N}(0,\sigma^{2}). The CDF of the half-normal distribution is

where \mboxerf(x)\mbox{\text{e}rf}(x) is the error function, defined as

We also make use of the complement of the error function, \mboxerfc(x)\mbox{\text{e}rfc}(x), defined as \mboxerfc(x)=1\mboxerf(x)\mbox{\text{e}rfc}(x)=1-\mbox{\text{e}rf}(x).

A Gaussian mixture model (GMM) of distributions N1(μ1,σ12),,Nn(μn,σn2)\mathcal{N}_{1}(\mu_{1},\sigma_{1}^{2}),\dots,\mathcal{N}_{n}(\mu_{n},\sigma_{n}^{2}) has PDF

where iwi=1\sum_{i}w_{i}=1. These wiw_{i} are referred to as the mixing weights. Drawing a sample from a GMM can be visualized as the following process: select a single Gaussian, where the probability of selecting a Gaussian is equal to its mixing weight, and draw a sample from that Gaussian. In this paper, we consider mixtures of two Gaussians, so w2=1w1w_{2}=1-w_{1}. We will interchangeably use ww and 1w1-w in place of w1w_{1} and w2w_{2}.

The total variation distance between two probability measures PP and QQ on a σ\sigma-algebra FF is defined by

For simplicity in the exposition of our algorithm, we make the standard assumption (see, e.g., [FOS06, KMV10]) of infinite precision real arithmetic. In particular, the samples we draw from a mixture of Gaussians are real numbers, and we can do exact computations on real numbers, e.g., we can exactly evaluate the PDF of a Gaussian distribution on a real number.

The following proposition, whose proof is deferred to the appendix, provides a bound on the total variation distance between two GMMs in terms of the distance between the constituent Gaussians.

Combining these propositions, we obtain the following lemma:

2 Kolmogorov Distance

In addition to total variation distance, we will also use the Kolmogorov distance metric.

We will also use this metric to compare general functions, which may not necessarily be valid CDFs.

We have the following fact, stating that total variation distance upper bounds Kolmogorov distance [GS02].

Fortunately, it is fairly easy to learn with respect to the Kolmogorov distance, due to the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [DKW56].

3 Representing and Manipulating CDFs

Using this construction and Theorem 6, we can derive the following proposition:

Over the course of our algorithm, it will be natural to subtract out a component of a distribution.

A proof and full details are provided in Appendix D.

4 Robust Statistics

We use two well known robust statistics, the median and the interquartile range. These are suited to our application for two purposes. First, they are easy to compute with the nn-interval partition representation of a distribution. Each requires a constant number of queries of the CDF at particular values, and the cost of each query is O(logn)O(\log n). Second, they are robust to small modifications with respect to most metrics on probability distributions. In particular, we will demonstrate their robustness on Gaussians when considering distance with respect to the Kolmogorov metric.

5 Outline of the Algorithm

We can decompose our algorithm into two components: generating a collection of candidate distributions containing at least one candidate with low statistical distance to the unknown distribution (Theorem 11), and identifying such a candidate from this collection (Theorem 19).

Generation of Candidate Distributions: In Section 3, we deal with generation of candidate distributions. A candidate distribution is described by the parameter set (w^,μ^1,σ^1,μ^2,σ^2)(\hat{w},\hat{\mu}_{1},\hat{\sigma}_{1},\hat{\mu}_{2},\hat{\sigma}_{2}), which corresponds to the GMM with PDF f(x)=w^N(μ^1,σ^12,x)+(1w^)N(μ^2,σ^22,x)f(x)=\hat{w}\mathcal{N}(\hat{\mu}_{1},\hat{\sigma}_{1}^{2},x)+(1-\hat{w})\mathcal{N}(\hat{\mu}_{2},\hat{\sigma}_{2}^{2},x). As suggested by Lemma 4, if we have a candidate distribution with sufficently accurate parameters, it will have low statistical distance to the unknown distribution. Our first goal will be to generate a collection of candidates that contains at least one such candidate. Since the time complexity of our algorithm depends on the size of this collection, we wish to keep it to a minimum.

At a high level, we sequentially generate candidates for each parameter. In particular, we start by generating candidates for the mixing weight. While most of these will be inaccurate, we will guarantee to produce at least one appropriately accurate candidate w^\hat{w}^{*}. Then, for each candidate mixing weight, we will generate candidates for the mean of one of the Gaussians. We will guarantee that, out of the candidate means we generated for w^\hat{w}^{*}, it is likely that at least one candidate μ^1\hat{\mu}_{1}^{*} will be sufficiently close to the true mean for this component. The candidate means that were generated for other mixing weights have no such guarantee. We use a similar sequential approach to generate candidates for the variance of this component. Once we have a description of the first component, we simulate the process of subtracting it from the mixture, thus giving us a single Gaussian, whose parameters we can learn. We can not immediately identify which candidates have inaccurate parameters, and they serve only to inflate the size of our collection.

At a lower level, our algorithm starts by generating candidates for the mixing weight followed by generating candidates for the mean of the component with the smaller value of σiwi\frac{\sigma_{i}}{w_{i}}. Note that we do not know which of the two Gaussians this is. The solution is to branch our algorithm, where each branch assumes a correspondence to a different Gaussian. One of the two branches is guaranteed to be correct, and it will only double the number of candidate distributions. We observe that if we take nn samples from a single Gaussian, it is likely that there will exist a sample at distance O(σn)O(\frac{\sigma}{n}) from its mean. Thus, if we take Θ(1wiε)\Theta(\frac{1}{w_{i}\varepsilon}) samples from the mixture, one of them will be sufficiently close to the mean of the corresponding Gaussian. Exploiting this observation we obtain candidates for the mixing weight and the first mean as summarized by Lemma 14.

Next, we generate candidates for the variance of this Gaussian. Our specific approach is based on the observation that given nn samples from a single Gaussian, the minimum distance of a sample to the mean will likely be Θ(σn)\Theta(\frac{\sigma}{n}). In the mixture, this property will still hold for the Gaussian with the smaller σiwi\frac{\sigma_{i}}{w_{i}}, so we extract this statistic and use a grid around it to generate sufficiently accurate candidates for σi\sigma_{i}. This is Lemma 16.

At this point, we have a complete description of one of the component Gaussians. Also, we can generate an empirical distribution of the mixture, which gives an adequate approximation to the true distribution. Given these two pieces, we update the empirical distribution by removing probability mass contributed by the known component. When done carefully, we end up with an approximate description of the distribution of the unknown component. At this point, we extract the median and the interquartile range (IQR) of the resulting distribution. These statistics are robust, so they can tolerate error in our approximation. Finally, the median and IQR allow us to derive the last mean and variance of our distribution. This is Lemma 18.

Putting everything together, we obtain the following result whose proof is in Section 3.5.

Candidate Selection: In view of Theorem 11, to prove our main result it suffices to select from among the candidate mixtures some mixture that is close to the unknown mixture. In Section 4, we describe a tournament-based algorithm (Theorem 19) for identifying a candidate which has low statistical distance to the unknown mixture, concluding the proof of Theorem 1. See our discussion in Section 4 for the challenges arising in obtaining a tournament-based algorithm for continuous distributions whose crossings are difficult to compute exactly, as well as in speeding up the tournament’s running time to almost linear in the number of candidates.

Generating Candidate Distributions

By Proposition 3, if one of the Gaussians of a mixture has a negligible mixing weight, it has a negligible impact on the mixture’s statistical distance to the unknown mixture. Hence, the candidate means and variances of this Gaussian are irrelevant. This is fortunate, since if min(w,1w)<<ε\min{(w,1-w)}<<\varepsilon and we only draw O(1/ε2)O(1/\varepsilon^{2}) samples from the unknown mixture, as we are planning to do, we have no hope of seeing a sufficient number of samples from the low-weight Gaussian to perform accurate statistical tests for it. So for this section we will assume that min(w,1w)Ω(ε)\min{(w,1-w)}\geq\Omega(\varepsilon) and we will deal with the other case separately.

The first step is to generate candidates for the mixing weight. We can obtain a collection of O(1ε)O(\frac{1}{\varepsilon}) candidates containing some w^[wε,w+ε]\hat{w}^{*}\in[w-\varepsilon,w+\varepsilon] by simply taking the set {tεt[1ε]}\{t\varepsilon\mid t\in\left[\frac{1}{\varepsilon}\right]\}.

2 Generating Mean Candidates

The next step is to generate candidates for the mean corresponding to the Gaussian with the smaller value of σiwi\frac{\sigma_{i}}{w_{i}}. Note that, a priori, we do not know whether i=1i=1 or i=2i=2. We try both cases, first generating candidates assuming they correspond to μ1\mu_{1}, and then repeating with μ2\mu_{2}. This will multiply our total number of candidate distributions by a factor of 22. Without loss of essential generality, assume for this section that i=1i=1.

We want a collection of candidates containing μ^1\hat{\mu}_{1}^{*} such that μ1εσ1μ^1μ1+εσ1\mu_{1}-\varepsilon\sigma_{1}\leq\hat{\mu}_{1}^{*}\leq\mu_{1}+\varepsilon\sigma_{1}. The following propositions are straightforward and proved in Appendix C.

Fix i{1,2}i\in\{1,2\}. Given 2023wiε\frac{20\sqrt{2}}{3w_{i}\varepsilon} samples from a GMM, there will exist a sample μ^iμi±εσi\hat{\mu}_{i}^{*}\in\mu_{i}\pm\varepsilon\sigma_{i} with probability 99100\geq\frac{99}{100}.

Fix i{1,2}i\in\{1,2\}. Suppose wiεw^iwi+εw_{i}-\varepsilon\leq\hat{w}_{i}\leq w_{i}+\varepsilon, and wiεw_{i}\geq\varepsilon. Then 2w^i1wi\frac{2}{\hat{w}_{i}}\geq\frac{1}{w_{i}}.

We use these facts to design a simple algorithm: for each candidate w^1\hat{w}_{1} (from Section 3.1), take 4023w^1ε\frac{40\sqrt{2}}{3\hat{w}_{1}\varepsilon} samples from the mixture and use each of them as a candidate for μ1\mu_{1}.

We now examine how many candidate pairs (w^,μ^1)(\hat{w},\hat{\mu}_{1}) we generated. Naively, since w^i\hat{w}_{i} may be as small as O(ε)O(\varepsilon), the candidates for the mean will multiply the size of our collection by O(1ε2)O\left(\frac{1}{\varepsilon^{2}}\right). However, we note that when w^i=Ω(1)\hat{w}_{i}=\Omega(1), then the number of candidates for μi\mu_{i} is actually O(1ε)O\left(\frac{1}{\varepsilon}\right). We count the number of candidate triples (w^,μ^1)(\hat{w},\hat{\mu}_{1}), combining with previous results in the following:

Suppose we have sample access to a GMM with (unknown) parameters (w,μ1,μ2,σ1,σ2)(w,\mu_{1},\mu_{2},\sigma_{1},\sigma_{2}). Then for any ε>0\varepsilon>0 and constants cw,cm>0c_{w},c_{m}>0, using O(1ε2)O(\frac{1}{\varepsilon^{2}}) samples from the GMM, we can generate a collection of O(logε1ε2)O\left(\frac{\log{\varepsilon^{-1}}}{\varepsilon^{2}}\right) candidate pairs for (w,μ1)(w,\mu_{1}). With probability 99100\geq\frac{99}{100}, this will contain a pair (w^,μ^1)(\hat{w}^{*},\hat{\mu}_{1}^{*}) such that w^w±O(ε),\hat{w}^{*}\in w\pm O(\varepsilon), μ^1μ1±O(ε)σ1\hat{\mu}_{1}^{*}\in\mu_{1}\pm O(\varepsilon)\sigma_{1}.

The proof of the lemma is deferred to Appendix C. It implies that we can generate O(1ε2)O\left(\frac{1}{\varepsilon^{2}}\right) candidate triples, such that at least one pair simultaneously describes ww and μ1\mu_{1} to the desired accuracy.

3 Generating Candidates for a Single Variance

In this section, we generate candidates for the variance corresponding to the Gaussian with the smaller value of σiwi\frac{\sigma_{i}}{w_{i}}. We continue with our guess of whether i=1i=1 or i=2i=2 from the previous section.

Again, assume for this section that i=1i=1. The basic idea is that we will find the closest point to μ^1\hat{\mu}_{1}. We use the following property (whose proof is deferred to Appendix E) to establish a range for this distance, which we can then grid over.

We note that this lemma holds in scenarios more general than we consider here, including k>2k>2 and when samples are drawn from a distribution which is only close to a GMM, rather than exactly a GMM.

Let c1c_{1} and c2c_{2} be constants as defined in Proposition 27, and c3=c192c2c_{3}=\frac{c_{1}}{9\sqrt{2}c_{2}}. Consider a mixture of kk Gaussians ff, with components N(μ1,σ12),,N(μk,σk2)\mathcal{N}(\mu_{1},\sigma_{1}^{2}),\dots,\mathcal{N}(\mu_{k},\sigma_{k}^{2}) and weights w1,,wkw_{1},\dots,w_{k}, and let j=argminiσiwij=\arg\min_{i}\frac{\sigma_{i}}{w_{i}}. Suppose we have estimates for the weights and means for all i[1,k]i\in[1,k]:

w^i\hat{w}_{i}, such that 12w^iwi2w^i\frac{1}{2}\hat{w}_{i}\leq w_{i}\leq 2\hat{w}_{i}

μ^i\hat{\mu}_{i}, such that μ^iμic32kσj|\hat{\mu}_{i}-\mu_{i}|\leq\frac{c_{3}}{2k}\sigma_{j}

Suppose we have sample access to a GMM with parameters (w,μ1,μ2,σ1,σ2)(w,\mu_{1},\mu_{2},\sigma_{1},\sigma_{2}), where σ1wσ21w\frac{\sigma_{1}}{w}\leq\frac{\sigma_{2}}{1-w}. Furthermore, we have estimates w^w±O(ε)\hat{w}^{*}\in w\pm O(\varepsilon), μ^1μ1±O(ε)σ1\hat{\mu}_{1}^{*}\in\mu_{1}\pm O(\varepsilon)\sigma_{1}. Then for any ε>0\varepsilon>0, using O(1ε)O(\frac{1}{\varepsilon}) samples from the GMM, we can generate a collection of O(1ε)O\left(\frac{1}{\varepsilon}\right) candidates for σ1\sigma_{1}. With probability 910\geq\frac{9}{10}, this will contain a candidate σ^1\hat{\sigma}_{1}^{*} such that σ^1(1±O(ε))σ1\hat{\sigma}_{1}^{*}\in(1\pm O(\varepsilon))\sigma_{1}.

4 Learning the Last Component using Robust Statistics

At this point, our collection of candidates must contain a triple (w^,μ^1,σ^1)(\hat{w}^{*},\hat{\mu}_{1}^{*},\hat{\sigma}_{1}^{*}) which are sufficiently close to the correct parameters. Intuitively, if we could remove this component from the mixture, we would be left with a distribution corresponding to a single Gaussian, which we could learn trivially. We will formalize the notion of “component subtraction,” which will allow us to eliminate the known component and obtain a description of an approximation to the CDF for the remaining component. Using classic robust statistics (the median and the interquartile range), we can then obtain approximations to the unknown mean and variance. This has the advantage of a single additional candidate for these parameters, in comparison to O(1ε)O(\frac{1}{\varepsilon}) candidates for the previous mean and variance.

The following proposition shows that, when our candidate triple is (w,μ1,σ1)(w^{*},\mu_{1}^{*},\sigma_{1}^{*}), the distribution that we obtain after subtracting the known component out and rescaling is close to the unknown component.

Since the resulting distribution is close to the correct one, we can use robust statistics (via Lemmas 9 and 10) to recover the missing parameters. We combine this with previous details into the following Lemma, whose proof is deferred to Appendix C.

Suppose we have sample access to a GMM with parameters (w,μ1,μ2,σ1,σ2)(w,\mu_{1},\mu_{2},\sigma_{1},\sigma_{2}), where σ1wσ21w\frac{\sigma_{1}}{w}\leq\frac{\sigma_{2}}{1-w}. Furthermore, we have estimates w^w±O(ε)\hat{w}^{*}\in w\pm O(\varepsilon), μ^1μ1±O(ε)σ1\hat{\mu}_{1}^{*}\in\mu_{1}\pm O(\varepsilon)\sigma_{1}, σ^1(1±O(ε))σ1\hat{\sigma}_{1}^{*}\in(1\pm O(\varepsilon))\sigma_{1}. Then for any ε>0\varepsilon>0, using O(1ε2log1δ)O(\frac{1}{\varepsilon^{2}}\cdot\log\frac{1}{\delta}) samples from the GMM, with probability 1δ\geq 1-\delta, we can generate candidates μ^2μ2±O(ε1w)σ2\hat{\mu}_{2}^{*}\in\mu_{2}\pm O\left(\frac{\varepsilon}{1-w}\right)\sigma_{2} and σ^2(1±O(ε1w))σ2\hat{\sigma}_{2}^{*}\in\left(1\pm O\left(\frac{\varepsilon}{1-w}\right)\right)\sigma_{2}.

5 Putting It Together

Theorem 11 is obtained by combining the previous sections, and its proof is given in the appendix.

Quasilinear-Time Hypothesis Selection

The goal of this section is to present a hypothesis selection algorithm, FastTournament, which is given sample access to a target distribution XX and several hypotheses distributions H1,,HNH_{1},\ldots,H_{N}, together with an accuracy parameter ε>0\varepsilon>0, and is supposed to select a hypothesis distribution from {H1,,HN}\{H_{1},\ldots,H_{N}\}. The desired behavior is this: if at least one distribution in {H1,,HN}\{H_{1},\ldots,H_{N}\} is ε\varepsilon-close to XX in total variation distance, we want that the hypothesis distribution selected by the algorithm is O(ε)O(\varepsilon)-close to XX. We provide such an algorithm whose sample complexity is O(1ε2logN)O({1\over\varepsilon^{2}}\log N) and whose running time O(1ε2NlogN)O({1\over\varepsilon^{2}}N\log N), i.e. quasi-linear in the number of hypotheses, improving the running time of the state of the art (predominantly the Scheffé-estimate based algorithm in [DL01]) quadratically.

We develop our algorithm in full generality, assuming that we have sample access to the distributions of interest, and without making any assumptions about whether they are continuous or discrete, and whether their support is single- or multi-dimensional. All our algorithm needs is sample access to the distributions at hand, together with a way to compare the probability density/mass functions of the distributions, encapsulated in the following definition. In our definition, Hi(x)H_{i}(x) is the probability mass at xx if HiH_{i} is a discrete distribution, and the probability density at xx if HiH_{i} is a continuous distribution. We assume that H1H_{1} and H2H_{2} are either both discrete or both continuous, and that, if they are continuous, they have a density function.

Let H1H_{1} and H2H_{2} be probability distributions over some set D{\cal D}. A PDF comparator for H1,H2H_{1},H_{2} is an oracle that takes as input some xDx\in{\cal D} and outputs 11 if H1(x)>H2(x)H_{1}(x)>H_{2}(x), and otherwise.

Our hypothesis selection algorithm is summarized in the following statement:

The proof of Theorem 19 is given in Section 4.3, while the preceding sections build the required machinery for the construction.

A slight modification of our algorithm provided in Appendix G admits a worst-case running time of O(log1/δε2(NlogN+log1+γ1δ))O\left({\log{1/\delta}\over\varepsilon^{2}}\left(N\log N+\log^{1+\gamma}{1\over\delta}\right)\right), for any desired constant γ>0\gamma>0, though the approximation guarantee is weakened based on the value of γ\gamma. See Corollary 1 and its proof in Appendix G.

The skeleton of the hypothesis selection algorithm of Theorem 19 as well as the improved one of Corollary 1, is having candidate distributions compete against each other in a tournament-like fashion. This approach is quite natural and has been commonly used in the literature; see e.g. Devroye and Lugosi ([DL96, DL97] and Chapter 6 of [DL01]), Yatracos [Yat85], as well as the recent papers of Daskalakis et al. [DDS12] and Chan et al. [CDSS13]. The hypothesis selection algorithms in these works are significantly slower than ours, as their running times have quadratic dependence on the number NN of hypotheses, while our dependence is quasi-linear. Furthermore, our setting is more general than prior work, in that we only require sample access to the hypotheses and a PDF comparator. Previous algorithms required knowledge of (or ability to compute) the probability assigned by every pair of hypotheses to their Scheffé set—this is the subset of the support where one hypothesis has larger PMF/PDF than the other, which is difficult to compute in general, even given explicit descriptions of the hypotheses.

Recent independent work by Acharya et al. [AJOS14a, AJOS14b] provides a hypothesis selection algorithm, based on the Scheffé estimate in Chapter 6 of [DL01]. Their algorithm performs a number of operations that is comparable to ours. In particular, the expected running time of their algorithm is also O(NlogN/δε2)O\left({N\log{N/\delta}\over\varepsilon^{2}}\right), but our worst-case running time has better dependence on δ\delta. Our algorithm is not based on the Scheffé estimate, using instead a specialized estimator provided in Lemma 20. Their algorithm, described in terms of the Scheffé estimate, is not immediately applicable to sample-only access to the hypotheses, or to settings where the probabilities on Scheffé sets are difficult to compute.

1 Choosing Between Two Hypotheses

We start with an algorithm for choosing between two hypothesis distributions. This is an adaptation of a similar algorithm from [DDS12] to continuous distributions and sample-only access. The proof of the following is in Appendix F.

There is an algorithm ChooseHypothesis(X,H1,H2,ε,δ)(X,{H}_{1},{H}_{2},\varepsilon,\delta), which is given sample access to distributions X,H1,H2X,H_{1},H_{2} over some set D\cal D, access to a PDF comparator for H1,H2H_{1},H_{2}, an accuracy parameter ε>0\varepsilon>0, and a confidence parameter δ>0.\delta>0. The algorithm draws m=O(log(1/δ)/ε2)m=O(\log(1/\delta)/\varepsilon^{2}) samples from each of X,H1X,H_{1} and H2H_{2}, and either returns some H{H1,H2}H\in\{H_{1},H_{2}\} as the winner or declares a “draw.” The total number of operations of the algorithm is O(log(1/δ)/ε2)O(\log(1/\delta)/\varepsilon^{2}). Additionally, the output satisfies the following properties:

The analogous conclusions hold if we interchange H1H_{1} and H2H_{2} in Properties 1 and 2 above;

2 The Slow Tournament

We proceed with a hypothesis selection algorithm, SlowTournament, which has the correct behavior, but whose running time is suboptimal. Again we proceed similarly to [DDS12] making the approach robust to continuous distributions and sample-only access. SlowTournament performs pairwise comparisons between all hypotheses in H{\cal H}, using the subroutine ChooseHypothesis of Lemma 20, and outputs a hypothesis that never lost to (but potentially tied with) other hypotheses. The running time of the algorithm is quadratic in H|{\cal H}|, as all pairs of hypotheses are compared. FastTournament, described in Section 4.3, organizes the tournament in a more efficient manner, improving the running time to quasilinear. The proof of Lemma 21 can be found in Appendix F.

3 The Fast Tournament

We prove our main result of this section, providing a quasi-linear time algorithm for selecting from a collection of hypothesis distributions H{\cal H} one that is close to a target distribution XX, improving the running time of SlowTournament from Lemma 21. Intuitively, there are two cases to consider. Collection H{\cal H} is either dense or sparse in distributions that are close to XX. In the former case, we show that we can sub-sample H{\cal H} before running SlowTournament{\tt SlowTournament}. In the latter case, we show how to set-up a two-phase tournament, whose first phase eliminates all but a sub linear number of hypotheses, and whose second phase runs SlowTournament on the surviving hypotheses. Depending on the density of H{\cal H} in distributions that are close to the target distribution XX, we show that one of the aforementioned strategies is guaranteed to output a distribution that is close to XX. As we do not know a priori the density of H{\cal H} in distributions that are close to XX, and hence which of our two strategies will succeed in finding a distribution that is close to XX, we use both strategies and run a tournament among their outputs, using SlowTournament again.

Proof of Theorem 19: Let pp be the fraction of the elements of H{\cal H} that are 8ε8\varepsilon-close to XX. The value of pp is unknown to our algorithm. Regardless, we propose two strategies for selecting a distribution from H{\cal H}, one of which is guaranteed to succeed whatever the value of pp is. We assume throughout this proof that NN is larger than a sufficiently large constant, otherwise our claim follows directly from Lemma 21.

S2:

Phase 2: Given the collection HiT{\cal H}_{i_{T}} output by Phase 1, we run SlowTournament(X,HiT,ε,1/4)(X,{\cal H}_{i_{T}},\varepsilon,1/4) to select some distribution H^HiT\hat{H}\in{\cal H}_{i_{T}}. (We use fresh samples for the execution of SlowTournament.)

The number of samples drawn by S2 from each of the distributions in H{X}{\cal H}\cup\{X\} is O(1ε2logN)O({1\over\varepsilon^{2}}\log N), and the total number of operations is O(1ε2NlogN)O({1\over\varepsilon^{2}}N\log N). Moreover, if p(0,1N]p\in(0,{1\over\sqrt{N}}] and there is some distribution in H{\cal H} that is ε\varepsilon-close to XX, then the distribution H^\hat{H} output by S2 is 8ε8\varepsilon-close to XX, with probability at least 1/41/4.

Given strategies S1 and S2, we first design an algorithm which has the stated worst-case number of operations. The algorithm FastTournamentA works as follows:

Execute strategy S2 k2=log42δk_{2}=\log_{4}{2\over\delta} times, with fresh samples each time. Let H^1,,H^k2\hat{H}_{1},\ldots,\hat{H}_{k_{2}} be the distributions output by these executions.

FastTournamentA satisfies the properties described in the statement of Theorem 19, except for the bound on the expected number of operations.

Proof of Claim 3: The bounds on the number of samples and operations follow immediately from our choice of k1,k2k_{1},k_{2}, Claims 1 and 2, and Lemma 21. Let us justify the correctness of the algorithm. Suppose that there is some distribution in H{\cal H} that is ε\varepsilon-close to XX. We distinguish two cases, depending on the fraction pp of distributions in H{\cal H} that are ε\varepsilon-close to XX:

p(0,1N]p\in(0,{1\over\sqrt{N}}]: This case is analyzed analogously. With probability at least 1δ/21-\delta/2, at least one of H^1,,H^k2\hat{H}_{1},\ldots,\hat{H}_{k_{2}} is 8ε8\varepsilon-close to XX (by Claim 2). Conditioning on this, SlowTournament(X,G,64ε,δ/2)(X,{\cal G},64\varepsilon,\delta/2) outputs a distribution that is 512ε512\varepsilon-close to XX, with probability at least 1δ/21-\delta/2 (by Lemma 21). So, with overall probability at least 1δ1-\delta, the distribution output by FastTournament is 512ε512\varepsilon-close to XX.

We now describe an algorithm which has the stated expected number of operations. The algorithm FastTournamentB works as follows:

FastTournamentB satisfies the properties described in the statement of Theorem 19, except for the worst-case bound on the number of operations.

Proof of Claim 4: We note that we will first draw O(log(N3/δ)/ε2)O(\log(N^{3}/\delta)/\varepsilon^{2}) from each of X,H1,,HNX,H_{1},\dots,H_{N} and use the same samples for every execution of ChooseHypothesis to avoid blowing up the sample complexity. Using this fact, the sample complexity is as claimed.

By Claims 1 and 2 and Lemma 20, each round requires O(NlogN+NlogN/δ)O(N\log N+N\log{N/\delta}) operations. Since the expected number of rounds is O(1)O(1), we obtain the desired bound on the expected number of operations. \hfill\qed\hfill\qed

In order to obtain all the guarantees of the theorem simultaneously, our algorithm FastTournament will alternate between steps of FastTournamentA and FastTournamentB, where both algorithms are given an error parameter equal to δ2\frac{\delta}{2}. If either algorithm outputs a hypothesis, FastTournament outputs it. By union bound and Claims 3 and 4, both FastTournamentA and FastTournamentB will be correct with probability at least 1δ1-\delta. The worst-case running time is as desired by Claim 3 and since interleaving between steps of the two tournaments will multiply the number of steps by a factor of at most 22. We have the expected running time similarly, by Claim 4.

Proof of Theorem 1

Theorem 1 is an immediate consequence of Theorems 11 and 19. Namely, we run the algorithm of Theorem 11 to produce a collection of Gaussian mixtures, one of which is within ε\varepsilon of the unknown mixture FF. Then we use FastTournament of Theorem 19 to select from among the candidates a mixture that is O(ε)O(\varepsilon)-close to FF. For the execution of FastTournament, we need a PDF comparator for all pairs of candidate mixtures in our collection. Given that these are described with their parameters, our PDF comparators evaluate the densities of two given mixtures at a challenge point xx and decide which one is largest. We also need sample access to our candidate mixtures. Given a parametric description (w,μ1,σ12,μ2,σ22)(w,\mu_{1},\sigma_{1}^{2},\mu_{2},\sigma_{2}^{2}) of a mixture, we can draw a sample from it as follows: first draw a uniform $variablewhosevaluecomparedtovariable whose value compared towdetermineswhethertosamplefromdetermines whether to sample from{\cal N}(\mu_{1},\sigma_{1}^{2})oror{\cal N}(\mu_{2},\sigma_{2}^{2})inthesecondstep;forthesecondstep,usetheBoxMullertransform[BM58]toobtainsamplefromeitherin the second step; for the second step, use the Box-Muller transform [BM58] to obtain sample from either{\cal N}(\mu_{1},\sigma_{1}^{2})oror{\cal N}(\mu_{2},\sigma_{2}^{2})$ as decided in the first step.

References

Appendix A Gridding

We will encounter settings where we have bounds LL and RR on an unknown measure XX such that LXRL\leq X\leq R, and wish to obtain an estimate X^\hat{X} such that (1ε)XX^(1+ε)X(1-\varepsilon)X\leq\hat{X}\leq(1+\varepsilon)X. Gridding is a common technique to generate a list of candidates that is guaranteed to contain such an estimate.

Candidates of the form L+kεLL+k\varepsilon L define an additive grid with at most 1ε(RLL)\frac{1}{\varepsilon}\left(\frac{R-L}{L}\right) candidates.

Candidates of the form L(1+ε)kL(1+\varepsilon)^{k} define a multiplicative grid with at most 1log(1+ε)log(RL)\frac{1}{\log{(1+\varepsilon)}}\log{\left(\frac{R}{L}\right)} candidates.

We also encounter scenarios where we require an additive estimate XεX^X+εX-\varepsilon\leq\hat{X}\leq X+\varepsilon.

Candidates of the form L+kεL+k\varepsilon define an absolute additive grid with 1ε(RL)\frac{1}{\varepsilon}\left(R-L\right) candidates.

Appendix B Proofs Omitted from Section 2

again using the triangle inequality. A symmetric statement holds for the other term, giving us the desired result. \hfill\qed\hfill\qed

Combining this with the previous bounds and rescaling, we obtain the lemma statement.\hfill\qed\hfill\qed

Appendix C Proofs Omitted from Section 3

Proof of Proposition 12: The probability that a sample is from Ni\mathcal{N}_{i} is wiw_{i}. Using the CDF of the half-normal distribution, given that a sample is from Ni\mathcal{N}_{i}, the probability that it is at a distance εσi\leq\varepsilon\sigma_{i} from μi\mu_{i} is \mboxerf(ε2)\mbox{\text{e}rf}\left(\frac{\varepsilon}{\sqrt{2}}\right).

If we take a single sample from the mixture, it will satisfy the desired conditions with probability at least wi\mboxerf(ε2)w_{i}\mbox{\text{e}rf}\left(\frac{\varepsilon}{\sqrt{2}}\right). If we take 2023wiε\frac{20\sqrt{2}}{3w_{i}\varepsilon} samples from the mixture, the probability that some sample satisfies the conditions is at least

where the first inequality is by noting that \mboxerf(x)34x\mbox{\text{e}rf}(x)\geq\frac{3}{4}x for xx\in. \hfill\qed\hfill\qed

Proof of Proposition 13: wiεw_{i}\geq\varepsilon implies wiwi+ε2w_{i}\geq\frac{w_{i}+\varepsilon}{2}, and thus 2w^i2wi+ε1wi.\frac{2}{\hat{w}_{i}}\geq\frac{2}{w_{i}+\varepsilon}\geq\frac{1}{w_{i}}. \hfill\qed\hfill\qed

Proof of Lemma 14: Aside from the size of the collection, the rest of the conclusions follow from Propositions 12 and 13.

For a given w^\hat{w}, the number of candidates μ^1\hat{\mu}_{1} we consider is 4023w^ε\frac{40\sqrt{2}}{3\hat{w}\varepsilon}. We sum this over all candidates for w^\hat{w}, namely, ε,2ε,,1ε\varepsilon,2\varepsilon,\dots,1-\varepsilon, giving us

where HnH_{n} is the nnth harmonic number. \hfill\qed\hfill\qed

Proof of Lemma 16: Let YY be the nearest sample to μ^1\hat{\mu}_{1}. From Lemma 15, with probability 910\geq\frac{9}{10}, Yμ^1[c34σ1,(2+c34)σ1]|Y-\hat{\mu}_{1}|\in[\frac{c_{3}}{4}\sigma_{1},(\sqrt{2}+\frac{c_{3}}{4})\sigma_{1}].

We can generate candidates by rearranging the bounds to obtain

Applying Fact 23 and noting that RL=O(1)\frac{R}{L}=O(1), we conclude that we can grid over this range with O(1ε)O(\frac{1}{\varepsilon}) candidates.\hfill\qed\hfill\qed

The equality is a rearrangement of terms, the first inequality is the triangle inequality, the second inequality uses Fact 5, and the third and fourth inequalities use Propositions 2 and 3 respectively.\hfill\qed\hfill\qed

Proof of Theorem 11: We produce two lists of candidates corresponding to whether min(w,1w)=Ω(ε)\min{(w,1-w)}=\Omega(\varepsilon) or not:

In the first case, combining Lemmas 14, 16, and 18 and taking the Cartesian product of the resulting candidates for the mixture’s parameters, we see that we can obtain a collection of O(logε1ε3)O\left(\frac{\log\varepsilon^{-1}}{\varepsilon^{3}}\right) candidate mixtures. With probability 45\geq\frac{4}{5}, this will contain a candidate (w^,μ^1,μ^2,σ^1,σ^2)(\hat{w}^{*},\hat{\mu}_{1}^{*},\hat{\mu}_{2}^{*},\hat{\sigma}_{1}^{*},\hat{\sigma}_{2}^{*}) such that w^w±O(ε),μ^iμi±O(ε)σi\hat{w}\in w\pm O(\varepsilon),\hat{\mu}_{i}\in\mu_{i}\pm O(\varepsilon)\sigma_{i} for i=1,2i=1,2, and σ^i(1±O(ε))σi\hat{\sigma}_{i}\in(1\pm O(\varepsilon))\sigma_{i} for i=1,2i=1,2. Note that we can choose the hidden constants to be as small as necessary for Lemma 4, and thus we can obtain the desired total variation distance.

Finally, note that the number of samples that we need for the above to hold is O(1/ε2)O(1/\varepsilon^{2}). For this, it is crucial that we first draw a sufficient O(1/ε2)O(1/\varepsilon^{2}) samples from the mixture (specified by the worse requirement among Lemmas 14, 16, and 18), and then execute the candidate generation algorithm outlined in Lemmas 14, 16, and 18. In particular, we do not want to redraw samples for every branching of this algorithm.

Finally, to boost the success probability, we repeat the entire process log5δ1\log_{5}\delta^{-1} times and let our collection of candidate mixutres be the union of the collections from these repetitions. The probability that none of these collections contains a suitable candidate distribution is (15)log5δ1δ\leq\left(\frac{1}{5}\right)^{\log_{5}\delta^{-1}}\leq\delta.

Appendix D Details about Representing and Manipulating CDFs

We note that, if we are only concerned with sampling, the order of the elements of the support is irrelevant. However, we will sort the elements of the support in order to perform efficient modifications later.

At one point in our learning algorithm, we will have a candidate which correctly describes one of the two components in our mixture of Gaussians. If we could “subtract out” this component from the mixture, we would be left with a single Gaussian - in this setting, we can efficiently perform parameter estimation to learn the other component. However, if we naively subtract the probability densities, we will obtain negative probability densities, or equivalently, non-monotonically increasing CDFs. To deal with this issue, we define a process we call monotonization. Intuitively, this will shift negative probability density to locations with positive probability density. We show that this preserves Kolmogorov distance and that it can be implemented efficiently.

We argue that if a function is close in Kolmogorov distance to a monotone function, then so is its monotonization.

If F(x)G^(x)F(x)\geq\hat{G}(x), using the fact that G^(x)G(x)\hat{G}(x)\geq G(x) (due to monotonization), we can deduce F(x)G^(x)F(x)G(x)ε|F(x)-\hat{G}(x)|\leq|F(x)-G(x)|\leq\varepsilon.

If F(x)<G^(x)F(x)<\hat{G}(x), consider an infinite sequence of points {yi}\{y_{i}\} such that G(yi)G(y_{i}) becomes arbitrarily close to supyxG(x)\sup_{y\leq x}G(x). By monotonicity of FF, we have that F(x)G^(x)F(yi)G(yi)+δiε+δi|F(x)-\hat{G}(x)|\leq|F(y_{i})-G(y_{i})|+\delta_{i}\leq\varepsilon+\delta_{i}, where δi=G^(x)G(yi)\delta_{i}=|\hat{G}(x)-G(y_{i})|. Since δi\delta_{i} can be taken arbitrarily small, we have F(x)G^(x)ε|F(x)-\hat{G}(x)|\leq\varepsilon. ∎

We will need to efficiently compute the monotonization in certain settings, when subtracting one monotone function from another.

Suppose we have access to the nn-interval partition representation of a CDF FF. Given a monotone non-decreasing function GG, we can compute the nn-interval partition representation of the monotonization of FGF-G in O(n)O(n) time.

Consider the values in the nn-interval partition of FF. Between any two consecutive values v1v_{1} and v2v_{2}, FF will be flat, and since GG is monotone non-decreasing, FGF-G will be monotone non-increasing. Therefore, the monotonization of FGF-G at x[v1,v2)x\in[v_{1},v_{2}) will be the maximum of FGF-G on (,v1](-\infty,v_{1}]. The resulting monotonization will also be flat on the same intervals as FF, so we will only need to update the probability intervals to reflect this monotonization.

We will iterate over probability intervals in increasing order of their values, and describe how to update each interval. We will need to keep track of the maximum value of FGF-G seen so far. Let mm be the maximum of FGF-G for all xvx\leq v, where vv is the value associated with the last probability interval we have processed. Initially, we have the value m=0m=0. Suppose we are inspecting a probability interval with endpoints [l,r][l,r] and value vv. The left endpoint of this probability interval will become l^=m\hat{l}=m, and the right endpoint will become r^=rG(v)\hat{r}=r-G(v). If r^l^\hat{r}\leq\hat{l}, the interval is degenerate, meaning that the monotonization will flatten out the discontinuity at vv - therefore, we simply delete the interval. Otherwise, we have a proper probability interval, and we update m=r^m=\hat{r}.

This update takes constant time per interval, so the overall time required is O(n)O(n). ∎

To justify the running time of this procedure, we must also argue that the normalization can be done efficiently. To normalize the distribution (1w)H^(1-w)\hat{H}, we make another O(n)O(n) pass over the probability intervals and multiply all the endpoints by 1r\frac{1}{r^{*}}, where rr^{*} is the right endpoint of the rightmost probability interval. We note that rr^{*} will be exactly 1w1-w because the value of FwGF-wG at \infty is 1w1-w, so this process results in the distribution H^\hat{H}. \hfill\qed\hfill\qed

Appendix E Robust Estimation of Scale from a Mixture of Gaussians

In this section, we examine the following statistic:

We give an interval in which this statistic is likely to fall (Proposition 27), and examine its robustness when sampling from distributions which are statistically close to the distribution under consideration (Proposition 29). We then apply these results to mixtures of Gaussians (Proposition 30 and Lemma 15).

We show that Pr[Xj∉I]110\Pr[X_{j}\not\in I]\leq\frac{1}{10} by showing that the following two bad events are unlikely:

We have a sample which is too close to xx

Showing these events occur with low probability and combining with the union bound gives the desired result.

Let YY be the number of samples at distance <c1n<\frac{c_{1}}{n} in distance in the CDF, i.e., Y={iFX1(Xi)y<c1n}Y=|\{i\mid|F_{X}^{-1}(X_{i})-y|<\frac{c_{1}}{n}\}|. By linearity of expectation, E[Y]=2c1E[Y]=2c_{1}. By Markov’s inequality, Pr(Y>0)<2c1\Pr(Y>0)<2c_{1}. This allows us to upper bound the probability that one of our samples is too close to xx.

Let ZZ be the number of samples at distance <c2n<\frac{c_{2}}{n} in distance in the CDF, i.e., Z={iFX1(Xi)y<c2n}Z=|\{i\mid|F_{X}^{-1}(X_{i})-y|<\frac{c_{2}}{n}\}|, and let ZiZ_{i} be an indicator random variable which indicates this property for XiX_{i}. We use the second moment principle,

By linearity of expectation, E[Z]2=4c22E[Z]^{2}=4c_{2}^{2}.

And thus, Pr(Z=0)12c2+1\Pr(Z=0)\leq\frac{1}{2c_{2}+1}. This allows us to upper bound the probability that all of our samples are too far from xx.

Setting c1=140c_{1}=\frac{1}{40} and c2=192c_{2}=\frac{19}{2} gives probability <120<\frac{1}{20} for each of the bad events, resulting in a probability <110<\frac{1}{10} of either bad event by the union bound, and thus the desired result. ∎

We will need the following property of Kolmogorov distance, which states that probability mass within every interval is approximately preserved:

For an interval I=[a,b]I=[a,b], we can rewrite the property as

as desired, where the first inequality is the triangle inequality and the second inequality is due to the bound on Kolmogorov distance. ∎

The next proposition says that if we instead draw samples from a distribution which is close in total variation distance, the same property approximately holds with respect to the original distribution.

First, examine interval INI_{N}. This interval contains 2c1n2δ\frac{2c_{1}}{n}-2\delta probability measure of the distribution FXF_{X}. By Proposition 28, FX(IN)FX^(IN)2δ|F_{X}(I_{N})-F_{\hat{X}}(I_{N})|\leq 2\delta, so FX^(IN)2c1nF_{\hat{X}}(I_{N})\leq\frac{2c_{1}}{n}. One can repeat this argument to show that the amount of measure contained by FX^F_{\hat{X}} in [FX1(yc2nδ),FX1(y+c2n+δ)][F_{X}^{-1}(y-\frac{c_{2}}{n}-\delta),F_{X}^{-1}(y+\frac{c_{2}}{n}+\delta)] is 2c2n\geq\frac{2c_{2}}{n}.

As established through the proof of Proposition 27, with probability 910\geq\frac{9}{10}, there will be no samples in a window containing probability measure 2c1n\frac{2c_{1}}{n}, but there will be at least one sample in a window containing probability measure 2c2n\frac{2c_{2}}{n}. Applying the same argument to these intervals, we can arrive at the desired result. ∎

We examine this statistic for some mixture of kk Gaussians with PDF ff around the point corresponding to the mean of the component with the minimum value of σiwi\frac{\sigma_{i}}{w_{i}}. Initially, we assume that we know this location exactly and that we are taking samples according to ff exactly.

Consider a mixture of kk Gaussians with PDF ff, components N(μ1,σ12),,N(μk,σk2)\mathcal{N}(\mu_{1},\sigma_{1}^{2}),\dots,\mathcal{N}(\mu_{k},\sigma_{k}^{2}) and weights w1,,wkw_{1},\dots,w_{k}. Let j=argminiσiwij=\arg\min_{i}\frac{\sigma_{i}}{w_{i}}. If we take nn samples X1,,XnX_{1},\dots,X_{n} from the mixture (where n>3πc22wjn>\frac{3\sqrt{\pi}c_{2}}{2w_{j}}), then miniXiμj[2πc1σjkwjn,32πc2σj2wjn]\min_{i}|X_{i}-\mu_{j}|\in\left[\frac{\sqrt{2\pi}c_{1}\sigma_{j}}{kw_{j}n},\frac{3\sqrt{2\pi}c_{2}\sigma_{j}}{2w_{j}n}\right] with probability 910\geq\frac{9}{10}, where c1c_{1} and c2c_{2} are as defined in Proposition 27.

We examine the CDF of the mixture around μi\mu_{i}. Using Proposition 1 (and symmetry of a Gaussian about its mean), it is sufficient to show that

where FF is the CDF of the mixture. We show that each endpoint of the latter interval bounds the corresponding endpoint of the former interval.

First, we show c1nF(μi+2πc1σikwin)F(μi)\frac{c_{1}}{n}\geq F\left(\mu_{i}+\frac{\sqrt{2\pi}c_{1}\sigma_{i}}{kw_{i}n}\right)-F(\mu_{i}). Let I=[μi,μi+2πc1σikwin]I=\left[\mu_{i},\mu_{i}+\frac{\sqrt{2\pi}c_{1}\sigma_{i}}{kw_{i}n}\right], ff be the PDF of the mixture, and fif_{i} be the PDF of component ii of the mixture. The right-hand side of the inequality we wish to prove is equal to

where the first inequality is since the maximum of the PDF of a Gaussian is 1σ2π\frac{1}{\sigma\sqrt{2\pi}}, and the second is since σjwjσiwi\frac{\sigma_{j}}{w_{j}}\leq\frac{\sigma_{i}}{w_{i}} for all jj.

Next, we show c2nF(μi+32πc2σi2win)F(μi)\frac{c_{2}}{n}\leq F\left(\mu_{i}+\frac{3\sqrt{2\pi}c_{2}\sigma_{i}}{2w_{i}n}\right)-F(\mu_{i}). We note that the right-hand side is the probability mass contained in the interval - a lower bound for this quantity is the probability mass contributed by the particular Guassian we are examining, which is wi2\mboxerf(3πc22win)\frac{w_{i}}{2}\mbox{\text{e}rf}{\left(\frac{3\sqrt{\pi}c_{2}}{2w_{i}n}\right)}. Taking the Taylor expansion of the error function gives

if x<1x<1. Applying this here, we can lower bound the contributed probability mass by wi22π233πc22win=c2n\frac{w_{i}}{2}\frac{2}{\sqrt{\pi}}\frac{2}{3}\frac{3\sqrt{\pi}c_{2}}{2w_{i}n}=\frac{c_{2}}{n}, as desired. ∎

Finally, we deal with uncertainties in parameters and apply the robustness properties of Proposition 29 in the following lemma:

Proof of Lemma 15: We analyze the effect of each uncertainty:

First, we consider the effect of sampling from f^\hat{f}, which is δ\delta-close to ff. By using Proposition 29, we know that the nearest sample to μj\mu_{j} will be at CDF distance between c1nδc12n\frac{c_{1}}{n}-\delta\geq\frac{c_{1}}{2n} and c2n+δ3c22n\frac{c_{2}}{n}+\delta\leq\frac{3c_{2}}{2n}. We can then repeat the proof of Proposition 30 with c1c_{1} replaced by c12\frac{c_{1}}{2} and c2c_{2} replaced by 3c22\frac{3c_{2}}{2}. This gives us that miniXiμj[πc12kwjnσj,9πc222wjnσj]\min_{i}|X_{i}-\mu_{j}|\in\left[\frac{\sqrt{\pi}c_{1}}{\sqrt{2}kw_{j}n}\sigma_{j},\frac{9\sqrt{\pi}c_{2}}{2\sqrt{2}w_{j}n}\sigma_{j}\right] (where n9πc24wjn\geq\frac{9\sqrt{\pi}c_{2}}{4w_{j}}) with probability 910\geq\frac{9}{10}.

Next, substituting in the bounds 12w^jwj2w^j\frac{1}{2}\hat{w}_{j}\leq w_{j}\leq 2\hat{w}_{j}, we get miniXiμj[πc122kw^jnσj,9πc22w^jnσj]\min_{i}\left|X_{i}-\mu_{j}\right|\in\left[\frac{\sqrt{\pi}c_{1}}{2\sqrt{2}k\hat{w}_{j}n}\sigma_{j},\frac{9\sqrt{\pi}c_{2}}{\sqrt{2}\hat{w}_{j}n}\sigma_{j}\right] (where n9πc22w^jn\geq\frac{9\sqrt{\pi}c_{2}}{2\hat{w}_{j}}) with probability 910\geq\frac{9}{10}.

We use n=9πc22w^jn=\frac{9\sqrt{\pi}c_{2}}{2\hat{w}_{j}} samples to obtain: miniXiμj[c3kσj,2σj]\min_{i}\left|X_{i}-\mu_{j}\right|\in\left[\frac{c_{3}}{k}\sigma_{j},\sqrt{2}\sigma_{j}\right] with probability 910\geq\frac{9}{10}.

Finally, applying μ^jμjc32kσj|\hat{\mu}_{j}-\mu_{j}|\leq\frac{c_{3}}{2k}\sigma_{j} gives the lemma statement.

Appendix F Omitted Proofs from Section 4

Proof of Lemma 20: We set up a competition between H1H_{1} and H2H_{2}, in terms of the following subset of D\cal D:

1a. Draw m=O(log(1/δ)ε2)m=O\left({\log(1/\delta)\over\varepsilon^{2}}\right) samples s1,,sms_{1},\ldots,s_{m} from XX, and let τ^=1m{i  siW1}\hat{\tau}={1\over m}|\{i~{}|~{}s_{i}\in{\cal W}_{1}\}| be the fraction of them that fall inside W1.{\cal W}_{1}. 1b. Similarly, draw mm samples from H1H_{1}, and let p^1\hat{p}_{1} be the fraction of them that fall inside W1.{\cal W}_{1}. 1c. Finally, draw mm samples from H2H_{2}, and let p^2\hat{p}_{2} be the fraction of them that fall inside W1.{\cal W}_{1}. 2. If p^1p^26ε\hat{p}_{1}-\hat{p}_{2}\leq 6\varepsilon, declare a draw. Otherwise: 3. If τ^>p^12ε\hat{\tau}>\hat{p}_{1}-2\varepsilon, declare H1H_{1} as winner and return H1H_{1}; otherwise, 4. if τ^<p^2+2ε\hat{\tau}<\hat{p}_{2}+2\varepsilon, declare H2H_{2} as winner and return H2H_{2}; otherwise, 5. Declare a draw.

Notice that, in Steps 1a, 1b and 1c, the algorithm utilizes the PDF comparator for distributions H1H_{1} and H2H_{2}. The correctness of the algorithm is a consequence of the following claim.

Proof of Claim 5: Let τ=X(W1)\tau=X({\cal W}_{1}). The Chernoff bound (together with a union bound) imply that, with probability at least 16emε2/21-6e^{-{m\varepsilon^{2}/2}}, the following are simultaneously true: p1p^1<ε/2|p_{1}-\hat{p}_{1}|<\varepsilon/2, p2p^2<ε/2|p_{2}-\hat{p}_{2}|<\varepsilon/2, and ττ^<ε/2|\tau-\hat{\tau}|<\varepsilon/2. Conditioning on these:

Proof of Lemma 21: Draw m=O(log(2N/δ)/ε2)m=O(\log(2N/\delta)/\varepsilon^{2}) samples from each of X,H1,,HNX,H_{1},\ldots,H_{N} and, using the same samples, run

for every pair of distributions Hi,HjHH_{i},H_{j}\in{\cal H}. If there is a distribution HHH\in{\cal H} that was never a loser (but potentially tied with some distributions), output any such distribution. Otherwise, output “failure.”

Proof of Claim 1: The probability that H{\cal H}^{\prime} contains no distribution that is 8ε8\varepsilon-close to XX is at most

If H{\cal H}^{\prime} contains at least one distribution that is 8ε8\varepsilon-close to XX, then by Lemma 21 the distribution output by SlowTournament(X,H,8ε,e3)(X,{\cal H}^{\prime},8\varepsilon,e^{-3}) is 64ε64\varepsilon-close to XX with probability at least 1e31-e^{-3}. From a union bound, it follows that the distribution output by S1 is 64ε64\varepsilon-close to XX, with probability at least 12e39/101-2e^{-3}\geq 9/10. The bounds on the number of samples and operations follow from Lemma 21. \hfill\qed\hfill\qed

Proof of Claim 2: Suppose that there is some distribution HHH^{*}\in{\cal H} that is ε\varepsilon-close to XX. We first argue that with probability at least 13{1\over 3}, HHiTH^{*}\in{\cal H}_{i_{T}}. We show this in two steps:

Recall that we draw samples from X,H1,,HNX,H_{1},\ldots,H_{N} before Phase 1 begins, and reuse the same samples whenever required by some execution of ChooseHypothesis during Phase 1. Fix a realization of these samples. We can ask the question of what would happen if we executed ChooseHypothesis(X,H,Hj,ε,1/3N)(X,H^{*},{H}_{j},\varepsilon,1/3N), for some HjH{H}H_{j}\in{\cal H}\setminus\{H^{*}\} using these samples. From Lemma 20, it follows that, if HjH_{j} is farther than 8ε8\varepsilon-away from XX, then HH^{*} would be declared the winner by ChooseHypothesis(X,H,Hj,ε,1/3N)(X,H^{*},{H}_{j},\varepsilon,1/3N), with probability at least 11/3N1-1/3N. By a union bound, our samples satisfy this property simultaneously for all HjH{H}H_{j}\in{\cal H}\setminus\{H^{*}\} that are farther than 8ε8\varepsilon-away from XX, with probability at least 11/31-1/3. Henceforth, we condition that our samples have this property.

Conditioning on our samples having the property discussed in (a), we argue that HHiTH^{*}\in{\cal H}_{i_{T}} with probability at least 1/21/2 (so that, with overall probability at least 1/31/3, it holds that HHiTH^{*}\in{\cal H}_{i_{T}}). It suffices to argue that, with probability at least 1/21/2, in all iterations of Phase 1, HH^{*} is not matched with a distribution that is 8ε8\varepsilon-close to XX. This happens with probability at least:

Indeed, given the definition of pp, the probability that HH^{*} is not matched to a distribution that is 8ε8\varepsilon-close to XX is at least 1p1-p in the first iteration. If this happens, then (because of our conditioning from (a)), HH^{*} will survive this iteration. In the next iteration, the fraction of surviving distributions that are 8ε8\varepsilon-close to XX and are different than HH^{*} itself is at most 2p2p. Hence, the probability that HH^{*} is not matched to a distribution that is 8ε8\varepsilon-close to XX is at least 12p1-2p in the second iteration, etc.

Now, conditioning on HHiTH^{*}\in{\cal H}_{i_{T}}, it follows from Lemma 21 that the distribution H^\hat{H} output by SlowTournament(X,HiT,ε,1/4)(X,{\cal H}_{i_{T}},\varepsilon,1/4) is 8ε8\varepsilon-close to XX with probability at least 3/43/4.

Hence, with overall probability at least 1/41/4, the distribution output by S2 is 8ε8\varepsilon-close to XX.

The number of samples drawn from each distribution in H{X}{\cal H}\cup\{X\} is clearly O(1ε2logN)O({1\over\varepsilon^{2}}\log N), as Phase 1 draws O(1ε2logN)O({1\over\varepsilon^{2}}\log N) samples from each distribution and, by Lemma 21, Phase 2 also draws O(1ε2logN)O({1\over\varepsilon^{2}}\log N) samples from each distribution.

The total number of operations is bounded by O(1ε2NlogN).O({1\over\varepsilon^{2}}N\log N). Indeed, Phase 1 runs ChooseHypothesis O(N)O(N) times, and by Lemma 20 and our choice of 1/3N1/3N for the confidence parameter of each execution, each execution takes O(logN/ε2)O(\log N/\varepsilon^{2}) operations. So the total number of operations of Phase 1 is O(1ε2NlogN)O({1\over\varepsilon^{2}}N\log N). On the other hand, the size of HiT{\cal H}_{i_{T}} is at most 2log2N2T=2log2N2log2N28N{2^{\lceil\log_{2}N\rceil}\over 2^{T}}={2^{\lceil\log_{2}N\rceil}\over 2^{\lfloor\log_{2}{\sqrt{N}\over 2}\rfloor}}\leq 8\sqrt{N}. So by Lemma 21, Phase 2 takes O(1ε2NlogN)O({1\over\varepsilon^{2}}N\log N) operations. \hfill\qed\hfill\qed

Appendix G Faster Slow and Fast Tournaments

In this section, we describe another hypothesis selection algorithm. This algorithm is faster than SlowTournament, though at the cost of a larger constant in the approximation factor. In most reasonable parameter regimes, this algorithm is slower than FastTournament, and still has a larger constant in the approximation factor. Regardless, we go on to show how it can be used to improve upon the worst-case running time of FastTournament.

For simplicity, assume that N\sqrt{N} is integer. (If not, introduce into H{\cal H} multiple copies of an arbritrary HHH\in{\cal H} so that N\sqrt{N} becomes an integer.) Partition H{\cal H} into N\sqrt{N} subsets, H=H1H2HN{\cal H}={\cal H}_{1}\sqcup{\cal H}_{2}\sqcup\ldots\sqcup{\cal H}_{\sqrt{N}} and do the following:

Set δ=δ/2\delta^{\prime}=\delta/2, draw O(log(N/δ)/ε2)O(\log(\sqrt{N}/\delta^{\prime})/\varepsilon^{2}) samples from XX and, using the same samples, run SlowTournament(X,Hi,ε,δ)(X,{\cal H}_{i},\varepsilon,\delta^{\prime}) from Lemma 21 for each ii;

Run SlowTournament(X,W,8ε,δ)(X,{\cal W},8\varepsilon,\delta^{\prime}), where W{\cal W} are the distributions output by SlowTournament in the previous step. If W={\cal W}=\emptyset output “failure”.

So, compared to SlowTournament, SlowTournament⨂1 has the same sample complexity asymptotics and the same asymptotic guarantee for the distance from XX of the output distribution, but the exponent of NN in the running time improved from 22 to 3/23/2.

For t=2,3,t=2,3,\ldots, define SlowTournament⨂t by replacing SlowTournament by SlowTournament⨂t-1 in the code of SlowTournament⨂1. It follows from the same analysis as above that as tt increases the exponent of NN in the running time gets arbitrarily close to 11. In particular, in one step an exponent of 1+α1+\alpha becomes an exponent of 1+α/21+\alpha/2. So for some constant tt, SlowTournament⨂t will satisfy the requirements of the theorem. ∎

As a corollary, we can immediately improve the running time of FastTournament at the cost of the constant in the approximation factor. The construction and analysis is nearly identical to that of FastTournament. The sole difference is in step 3 of FastTournamentA - we replace SlowTournament with RecursiveSlowTournamentγ.