List-Decodable Robust Mean Estimation and Learning Mixtures of Spherical Gaussians

Ilias Diakonikolas, Daniel M. Kane, Alistair Stewart

Introduction

This paper is concerned with the problem of efficiently learning high-dimensional spherical Gaussians in the presence of a large fraction of corrupted data, and in the related problem of parameter estimation for mixtures of high-dimensional spherical Gaussians (henceforth, spherical GMMs). Before we state our main results, we describe and motivate these two fundamental learning problems.

The first problem we study is the following:

A few remarks are in order: We first note that we make no assumptions on the remaining (1α)(1-\alpha)-fraction of the points in TT. These points can be arbitrary and may be chosen by an adversary that is computationally unbounded and is allowed to inspect the set of good points. We will henceforth call such a set of points α\alpha-corrupted. Ideally, we would like to output a single hypothesis vector μ^{\widehat{\mu}} that is close to μ\mu (with high probability). Unfortunately, this goal is information-theoretically impossible when the fraction α\alpha of good samples is less than 1/21/2. For example, if the input distribution is a uniform mixture of 1/α1/\alpha many Gaussians whose means are pairwise far from each other, there are Θ(1/α)\Theta(1/\alpha) different valid answers and the list must by definition contain approximations to each of them. It turns out that the information-theoretically best possible size of the candidates list is s=Θ(1/α)s=\Theta(1/\alpha). Therefore, the feasible goal is to design an efficient algorithm that minimizes the Euclidean distance between the unknown μ\mu and its closest μ^i{\widehat{\mu}}_{i}.

Before we proceed with a detailed background and motivation, we point out the connection between these two problems. Intuitively, Problem 2 can be reduced to Problem 1, as follows: We can think of the samples drawn from a spherical GMM as a set of corrupted samples from a single Gaussian — where the Gaussian in question can be any of the mixture components. The output of the list decoding algorithm will produce a list of hypotheses with the guarantee that every mean vector in the mixture is relatively close to some hypothesis. If in addition the distances between the means and their closest hypotheses are substantially smaller than the distances between the means of different components, this will allow us to reliably cluster our sample points based on which hypothesis they are closest to. We can thus cluster points based on which component they came from, and then we can learn each component independently.

2 List-Decodable Robust Learning

The vast majority of efficient high-dimensional learning algorithms with provable guarantees make strong assumptions about the input data. In the context of unsupervised learning (which is the focus of this paper), the standard assumption is that the input points are independent samples drawn from a known family of generative models (e.g., a mixture of Gaussians). However, this simplifying assumption is rarely true in practice and it is important to design estimators that are robust to deviations from their model assumptions.

This phenomenon had raised the following question: Can we reconcile computational efficiency and robustness in high dimensions? Recent work in the TCS community made the first algorithmic progress on this front: Two contemporaneous works [DKK+16, LRV16] gave the first computationally efficient robust algorithms for learning high-dimensional Gaussians (and many other high-dimensional models) with error close to the information-theoretic optimum. Specifically, for the problem of robustly learning an unknown mean Gaussian N(μ,I)N(\mu,I) from an η\eta-corrupted set of samples, η<1/2\eta<1/2, we now know a polynomial-time algorithm that achieves the information-theoretically optimal error of O(η)O(\eta) [DKK+17b].

The aforementioned literature studies the setting where the fraction of corrupted data is relatively small (smaller than 1/21/2), therefore the real data is the majority of the input points. A related setting of interest focuses on the regime when the fraction α\alpha of real data is small — strictly smaller than 1/21/2. From a practical standpoint, this “large error regime” is well-motivated by a number of pressing machine learning applications (see, e.g., [CSV17, SVC16, SKL17]). From a theoretical standpoint, understanding this regime is of fundamental interest and merits investigation in its own right. A specific motivation comes from a previously observed connection to learning mixture models: Suppose we are given samples from the mixture αN(μ,I)+(1α)E\alpha\cdot N(\mu,I)+(1-\alpha)E, i.e., α\alpha-fraction of the samples are drawn from an unknown Gaussian, while the rest of the data comes from several other populations for which we have limited (or no) information. Can we approximate this “good” Gaussian component, independent of the structure of the remaining components?

The main focus of this work is on the fundamental setting where the good data comes from a Gaussian distribution. Specifically, we ask the following question:

What is the best possible error guarantee (information-theoretically) achievable for list-decodable mean estimation, when the true distribution is an unknown N(μ,I)N(\mu,I)? More importantly, what is the best error guarantee that we can achieve with a computationally efficient algorithm?

As our first main result, we essentially resolve Question 1.1.

3 Learning Mixtures of Separated Spherical Gaussians

In this paper, we focus on the natural and important case where each of the components is spherical, i.e., each covariance matrix is an unknown multiple of the identity. The majority of prior algorithmic work on this problem studied the setting where there is a minimum separation between the means of the componentsWithout any separation assumptions, it is known that the sample complexity of the problem becomes exponential in the number of components [MV10, HP15].. For the simplicity of this discussion, let us consider the case that the mixing weights are uniform (i.e., equal to 1/k1/k, where kk is the number of components) and each component has identity covariance. (We emphasize that the positive results of this paper hold for the general case of an arbitrary mixture of high-dimensional spherical Gaussians, and apply even in the presence of a small dimension-independent fraction of corrupted data.) The problem of learning separated spherical GMMs was first studied by Dasgupta [Das99], followed by a long line of works that obtained efficient algorithms under weaker separation assumptions.

Interestingly enough, until very recently, even the information-theoretic aspect of this problem was not understood. Specifically, what is the minimum separation that allows the problem to be solvable with poly(n,k,1/δ)\operatorname{poly}(n,k,1/\delta) samples? Recent work by Regev and Vijayraghavan [RV17] characterized this aspect of the problem: Specifically, [RV17] showed that the problem of learning spherical kk-GMMs (with equal weights and identity covariances) can be solved with poly(n,k,1/δ)\operatorname{poly}(n,k,1/\delta) samples if and only if the means are pairwise separated by at least Θ(logk)\Theta(\sqrt{\log k}). Unfortunately, the approach of [RV17] is non-constructive in high dimensions. Specifically, they gave a sample-efficient learning algorithm whose running time is exponential in the dimension. This motivates the following question:

Is there a poly(n,k)\operatorname{poly}(n,k) time algorithm for learning spherical kk-GMMs with separation o(k1/4)o(k^{1/4}), or better O(kε)O(k^{\varepsilon}), for any fixed ε>0\varepsilon>0? More ambitiously, is there an efficient algorithm that succeeds under the information-theoretically optimal separation?

As our second main result, we make substantial progress towards the resolution of Question 1.2.

4 Our Contributions

In this paper, we develop a set of techniques that yield new efficient algorithms with significantly better guarantees for Problems 1 and 2. Our algorithms depend in an essential way on the analysis of high degree multivariate polynomials. We obtain a detailed structural understanding of the behavior of high degree polynomials under the standard multivariate Gaussian distribution, and leverage this understanding to design our learning algorithms. More concretely, our main technical contribution is a new technique, using degree-dd multivariate polynomials, to remove outliers from high-dimensional datasets where the majority of the points are corrupted.

Our main result is an efficient algorithm for list-decodable Gaussian mean estimation with a significantly improved error guarantee:

A major complication that occurs in the regime of α<1/2\alpha<1/2 is that since fewer than half of our samples are good, the values of such a polynomial pp might concentrate in several clusters. As a consequence, we will not necessarily be able to identify which cluster contains the good samples. In order to deal with this issue, we need to develop new techniques for outlier removal that handle the setting that the good data is a small fraction of our dataset. Roughly speaking, we achieve this by performing a suitable clustering of points based on the values of pp, and returning multiple (potentially overlapping) subsets of our original dataset TT with the guarantee that at least one of them will be a cleaner version of TT. This new paradigm for performing outlier removal in the large error regime may prove useful in other contexts as well.

A crucial technical contribution of our approach is the use of degree more than one polynomials for outlier removal in this setting. The intuitive reason for using polynomials of higher degree is this: A small fraction of points that are far from the true mean in some particular direction will have a more pronounced effect on higher degree moments. Therefore, taking advantage of the information contained in higher moments should allow us to discern smaller errors in the distance from the true mean. The difficulty is that it is not clear how to algorithmically exploit the structure of higher degree moments in this setting.

The major obstacle is the following: Since we do not know the mean μ\mu of GG — this is exactly the quantity we are trying to approximate! — we are also not able to evaluate the variance Var[p(G)]\operatorname*{Var}[p(G)] of p(G)p(G). If pp was a degree-11 polynomial, this would not be a problem, as the variance Var[p(G)]\operatorname*{Var}[p(G)] does not depend on μ\mu. But for degree at least 22 polynomials, the dependence of Var[p(G)]\operatorname*{Var}[p(G)] on μ\mu becomes a fundamental difficulty. Thus, although we can potentially find polynomials with unexpectedly large empirical variance, we will have no way of knowing whether this is due to corrupted points xTx\in T (on which p(x)p(x) is abnormally far from its true mean), or due to errors in our estimation of the mean of GG causing us to underestimate the variance Var[p(G)]\operatorname*{Var}[p(G)].

In order to circumvent this difficulty, we require a number of new ideas, culminating in an algorithm that allows us to either verify that the variance of p(G)p(G) is close to what we are expecting, or to find some other polynomial that allows us to remove outliers.

We leverage the connection between list-decodable learning and learning mixture models to obtain an efficient algorithm for learning spherical GMMs under much weaker separation assumptions. Specifically, by using the algorithm of Theorem 1.3 combined with additional algorithmic ideas, we obtain our second main result:

for all iji\neq j, for C>0C>0 a sufficiently large constant, the algorithm draws poly(n,(dk/δ)d)\operatorname{poly}(n,{(dk/\delta)^{d}}) samples from FF, runs in time poly(n,(dk/δ)d)\operatorname{poly}(n,{(dk/\delta)^{d}}), and with high probability returns a list {(ui,νi,si),i[k]}\{(u_{i},\nu_{i},s_{i}),i\in[k]\}, such that the following conditions hold (up to a permutation): uiwi=O(δ)|u_{i}-w_{i}|=O(\delta), μiνi2/σi=O(δ/wi)\|\mu_{i}-\nu_{i}\|_{2}/\sigma_{i}=O(\delta/w_{i}), and siσi/σi=O(δ/wi)/n|s_{i}-\sigma_{i}|/\sigma_{i}=O(\delta/w_{i})/\sqrt{n}.

The reader is also referred to Proposition 4.3 for a more detailed statement that also allows a small, dimension-independent fraction of adversarial noise in the input samples.

We now provide an intuitive explanation of our spherical GMM learning algorithm. First, we note that we can reduce the dimension of the problem from nn down to some function of kk. When the covariance matrices of the components are nearly identical, this can be done with a twist of standard techniques. For the case of arbitrary covariances, we need to employ a few additional ingredients.

When each component has the same covariance matrix, the learning algorithm is quite simple: We start by running our list-decoding algorithm (Theorem 1.3) with appropriate parameters to get a small list of hypothesis means. We then associate each sample with the closest element of our list. At this point, we can cluster the points based on which means they are associated to and use this clustering to accurately learn the correct components.

The general case, when the covariances of the components are arbitrary, is significantly more complex. In this case, we can recover a list HH of candidate means only after first guessing the radius of the component that we are looking for. Without too much difficulty, we can find a large list of guesses and thereby produce a list of hypotheses of size poly(n/α)\operatorname{poly}(n/\alpha). However, clustering based on this list now becomes somewhat more difficult, as we do not know the radius at which to cluster. We address this issue by performing a secondary test to determine whether or not the cluster that we have found contains many points at approximately the correct distance from each other.

As mentioned in Section 1.2, even the following information-theoretic aspect of list-decodable mean estimation is open: Ignoring sample complexity and running time, how small a distance from the true mean can be achieved with poly(1/α)\operatorname{poly}(1/\alpha) many hypotheses or number of hypotheses that is only a function of α\alpha, i.e., independent of the dimension nn?

Let 0<α<1/20<\alpha<1/2. There exists an (inefficient) algorithm that given a set of α\alpha-corrupted samples from a distribution DD, where (a) DD is subgaussian with bounded variance in each direction, or (b) DD has bounded first kk moments, for even kk, outputs a list of O(1/α)O(1/\alpha) vectors one of which is within distance g(α)g(\alpha) from the mean μ\mu of DD, and g(α)=O(log(1/α))g(\alpha)=O(\sqrt{\log(1/\alpha)}) in case (a) and g(α)=Ok(α1/k)g(\alpha)=O_{k}(\alpha^{-1/k}) in case (b). Moreover, these error bounds are optimal, up to constant factors. Specifically, the error bound of (a) cannot be asymptotically improved even if D=N(μ,I)D=N(\mu,I), as long as the list size is poly(1/α)\operatorname{poly}(1/\alpha). The error bound of (b) cannot be asymptotically improved as long as the list size is only a function of α\alpha.

For the detailed statements, the reader is referred to Section 5.

We now turn to our computational lower bounds. Given Theorem 1.5, the following natural question arises: For the case of Gaussians, can we achieve the minimax bound in polynomial time? We provide evidence that this may not be possible, by proving a Statistical Query (SQ) lower bound for this problem. Recall that a Statistical Query (SQ) algorithm [Kea98] relies on an oracle that given any bounded function on a single domain element provides an estimate of the expectation of the function on a random sample from the input distribution. This is a restricted but broad class of algorithms, encompassing many algorithmic techniques in machine learning. A recent line of work [FGR+13, FPV15, FGV17, Fel17] developed a framework of proving unconditional lower bounds on the complexity of SQ algorithms for search problems over distributions.

By leveraging this framework, using the techniques of our previous work [DKS16b], we show that any SQ algorithm for list-decodable Gaussian mean estimation that guarantees error α1/d\alpha^{-1/d}, for some d2d\geq 2, requires either high accuracy queries or exponentially many queries:

Any SQ list-decodable mean estimation algorithm for GN(μ,I)G\sim N(\mu,I) that returns a list of sub-exponential size so that some element in the list is within distance O(α1/d)O(\alpha^{-1/d}) of the mean μ\mu of GG requires either queries of accuracy 2O((1/α)2/d)nΩ(d)2^{O((1/\alpha)^{2/d})}\cdot n^{-\Omega(d)} or 2nΩ(1)2^{n^{\Omega(1)}} queries.

The reader is referred to Section 5.2 for the formal statement and proof.

5 Related Work

Robust Estimation. The field of robust statistics [Tuk60, Hub64, HR09, HRRS86, RL05] studies the design of estimators that are stable to model misspecification. After several decades of investigation, the statistics community has discovered a number of estimators that are provably robust in the sense that they can tolerate a constant (less than 1/21/2) fraction of corruptions, independent of the dimension. While the information-theoretic aspects of robust estimation have been understood, the central algorithmic question — that of designing robust and computationally efficient estimators in high-dimensions — had remained open.

Recent work in computer science [DKK+16, LRV16] shed light to this question by providing the first efficient robust learning algorithms for a variety of high-dimensional distributions. Specifically, [DKK+16] gave the first robust learning algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. Subsequently, there has been a flurry of research activity on algorithmic robust high-dimensional estimation. This includes robust estimation of graphical models [DKS16a], handling a large fraction of corruptions in the list-decodable model [CSV17, SCV17], developing robust algorithms under sparsity assumptions [BDLS17], obtaining optimal error guarantees [DKK+17b], establishing computational lower bounds for robust estimation [DKS16b], establishing connections with robust supervised learning [DKS17], and designing practical algorithms for data analysis applications [DKK+17a].

Learning GMMs. A long line of work initiated by Dasgupta [Das99], see, e.g., [AK01, VW02, AM05, KSV08, BV08], provides computationally efficient algorithms for recovering the parameters of a GMM under various separation assumptions between the mixture components. More recently, efficient parameter learning algorithms were obtained [MV10, BS10, HP15] under minimal information-theoretic separation assumptions. Without separation conditions, the sample complexity of parameter estimation is known to scale exponentially with the number of components, even in one dimension [MV10, HP15]. To circumvent this information-theoretic bottleneck of parameter learning, a related line of work has studied parameter learning in a smoothed setting [HK13, GVX14, BCMV14, ABG+14, GHK15]. The related problems of density estimation and proper learning for GMMs have also been extensively studied [FOS06, SOAJ14, MV10, HP15, ADLS17, LS17]. In density estimation (resp. proper learning), the goal is to output some hypothesis (resp. GMM) that is close to the unknown mixture in total variation distance.

Most relevant to the current work is the classical work of Vempala and Wang [VW02] and the very recent work by Regev and Vijayraghavan [RV17]. Specifically, [VW02] gave an efficient algorithm that learns the parameters of spherical GMMs under the weakest separation conditions known to date. On the other hand, [RV17] characterize the separation conditions under which parameter learning for spherical GMMs can be solved with poly(n,k,1/δ)\operatorname{poly}(n,k,1/\delta) samples. Whether such a separation can be achieved with an efficient algorithm was left open in [RV17]. Our work makes substantial progress in this direction.

6 Detailed Overview of Techniques

We start by reviewing the framework of [DKK+16] for robust mean estimation in the small error regime, followed by an explanation of the main difficulties that arise in the large error regime of the current paper.

In the small error regime, the “filtering” algorithm of [DKK+16] for robust Gaussian mean estimation works by iteratively detecting and removing outliers (corrupted samples) until the empirical variance in every direction is not much larger than expected. If every direction has small empirical variance, then the true mean and the empirical mean are close to each other [DKK+16]. Otherwise, the [DKK+16] algorithm projects the input points in a direction of maximum variance and throws away those points whose projections lie unexpectedly far from the empirical median in this direction. While this iterative spectral technique for outlier removal is by now well-understood for the small error regime (and has been applied to various settings), there are two major obstacles that arise if one wants to generalize it to the large error regime, i.e., where only a small fraction α\alpha of samples are good.

The first difficulty is that even the one-dimensional version of the problem in the large error regime is non-trivial. Specifically, consider a direction vv of large empirical variance. The [DKK+16] algorithm exploits the fact that the empirical median is a robust estimator of the mean in the one-dimensional setting. In contrast, in the large error regime, it is not clear how to approximate the true mean of a one-dimensional projection. This holds for the following reason: The input distribution can simulate a mixture of 1/α1/\alpha many Gaussians whose means are far from each other, and the algorithm will have no way of knowing which is the real one. In order to get around this obstacle, we construct more elaborate outlier-removal algorithms, which we call multifilters. Roughly speaking, a multifilter can return several (potentially overlapping) subsets of the original dataset TT with the guarantee that at least one of these subsets is substantially “cleaner” than TT.

The second difficulty is somewhat harder to deal with. As already mentioned, the filtering algorithm of [DKK+16] iteratively removes outliers by looking for directions in which the empirical distribution has a substantially larger variance than it should. In the low error regime, this approach does a good job of detecting are removing the corrupted points that can move the empirical mean far from the true mean. In the large error regime, the situation is substantially different. In particular, it is entirely possible that the empirical distribution does not have abnormally large variance in any direction, while still the empirical mean is Ω(1/α)\Omega(\sqrt{1/\alpha})-far from the true mean. That is, considering the variance of one-dimensional projections of our dataset in various directions seems inadequate in order to improve the O(1/α)O(\sqrt{1/\alpha}) error bound. This obstacle is inherent: the variance of linear polynomials (projections) is not a sufficiently accurate method of detecting a small fraction of good samples being substantially displaced from the mean of the bad samples. To circumvent this obstacle, we will use higher degree polynomials, which are much more sensitive to a small fraction of points being far away from the others. In particular, our algorithms will search for degree-dd polynomials that have abnormally large expectation or variance, and use such polynomials to construct our multifilters.

We now sketch how to exploit the existence of a large variance polynomial pp to construct a multifilter. Intuitively, the existence of such a polynomial pp suggests that there are many points that are far away from other points, and therefore separating these points into (potentially overlapping) clusters should guarantee that almost all good points are in the same cluster. Unfortunately, for this idea to work, we need to know that the variance of pp on the good set of points SS is not too large. For degree-11 polynomials pp this condition holds automatically. If SS is a sufficiently large set of samples from GN(μ,I)G\sim N(\mu,I) and pp is a normalized linear form, then Var[p(S)]Var[p(G)]=1\operatorname*{Var}[p(S)]\approx\operatorname*{Var}[p(G)]=1. But if pp has degree at least 22, the variance Var[p(S)]\operatorname*{Var}[p(S)] depends on the true mean μ\mu, which unfortunately is unknown. Fortunately, there is a way to circumvent this obstacle by either producing a multifilter or verifying that the variance Var[p(G)]\operatorname*{Var}[p(G)] is not too large.

We do this as follows: Firstly, we show that the variance Var[p(G)]\operatorname*{Var}[p(G)], GN(μ,I)G\sim N(\mu,I), can be expressed as an average of pi2(μ)p_{i}^{2}(\mu) for some explicitly computable, normalized, homogeneous polynomials pip_{i} (see Lemma 3.24). We then need to algorithmically verify that the polynomials pi(μ)p_{i}(\mu) are not too large. This is difficult to do directly, so instead we replace each pip_{i} by the corresponding multilinear polynomial qiq_{i}, and note that pi(μ)p_{i}(\mu) is the average value of qiq_{i} at many independent copies of GG. If this is large, then it means that evaluating qiq_{i} at a random tuple of samples will often have larger than expected size.

This idea will allow us to produce a multifilter for the following reason: Since each qiq_{i} is multilinear, this essentially allows us to write it as a composition of linear functions. More rigorously, we use the following iterative process: We iteratively plug-in variables one at a time to qiq_{i}. If at any step the size of the resulting polynomial jumps substantially, then the fact that this size is not well-concentrated as we try different samples will allow us to produce a multifilter. The details of this argument are given in Lemma 3.27 of Section 3.7.

6.2 Learning Spherical GMMs

Since a Gaussian mixture model can simultaneously be thought of as a mixture of any one of its components with some error distribution, applying our list-decoding algorithm to samples from a GMM will return a list of hypotheses so that every mean in the mixture is close to some hypothesis in the list. We can then use this list to cluster our samples by component.

In particular, given samples from a Gaussian G=N(μ,I)G=N(\mu,I) and many possible means h1,,hmh_{1},\ldots,h_{m}, we consider the process of associating a sample xx from GG with the nearest hih_{i}. We note that xx is closer to hjh_{j} than hih_{i} if and only if its projection onto the line between them is. Now if hih_{i} is substantially closer to μ\mu than hjh_{j} is, then this requires that this projection (which is Gaussian distributed) be far from its mean, which happens with tiny probability. Thus, by a union bound, as long as our list contains some hih_{i} that is close to μ\mu, the closest hypothesis to xx with high probability is not much further. If the separation between the means in our mixture is much larger than the separation between the means and the closest hypotheses, this implies that almost all samples are associated with one of the hypotheses near its component mean, and this will allow us to cluster samples by component. This idea of clustering points based on which of a finite set they are close to is an important idea that shows up in several related contexts in this paper.

The above idea works more or less as stated for mixtures of identity covariance Gaussians, but when dealing with more general mixtures of spherical Gaussians several complications arise. Firstly, in order to run out list-decoding algorithm, we need to know (a good approximation to) the covariance matrix of each component. The other difficulty is that, in order to cluster points, we will take a set of all nearby hypotheses that have reasonable numbers of samples associated with them. The issue is that we no longer know what “nearby” means, as it should depend on the covariance matrix of the associated Gaussian.

To solve the first of these problems we use a trick that will be reused several times. We note that two samples from the same Gaussian N(μ,σ2I)N(\mu,\sigma^{2}I) have distance approximately Θ(σn)\Theta(\sigma\sqrt{n}), and that even one sample from N(μ,σ2I)N(\mu,\sigma^{2}I) is unlikely to be much closer than this to samples from different components. Therefore, by simply looking at the distance to the closest other sample gives us a constant factor approximation to the standard deviation of the corresponding component. This allows us to write down a polynomial-size list of viable hypothesis standard deviations. Running our list decoding algorithm for each standard deviation, gives us a polynomial-size list of hypothesis means.

To solve the second problem, we use the above idea to approximate the standard deviations associated to our sample points. When clustering them, we look for collections of sample points with standard deviations approximately the same σ\sigma, whose closest hypotheses are within some reasonable multiple of σ\sigma of each other. Since we are able to approximate the size of the component that our samples are coming from, we can guarantee that we aren’t accidentally merging several smaller clusters together by using the wrong radius.

One slight wrinkle with the above sketched learning algorithm is that since the number of candidate hypotheses is polynomial in nn, the separation between the components will be required to be at least log(n)\sqrt{\log(n)}. This bound is suboptimal, when nn is very large. Another issue is that the overall runtime of the learning algorithm would not be a fixed polynomial in nn, but would scale as ndn^{d}. There is a way around both these issues, by reducing to a lower dimensional problem.

In particular, standard techniques involve looking at the kk largest principle values that allow one to project onto a subspace of dimension kk without losing too much. Unfortunately, these ideas require that all of the Gaussians involved have roughly the same covariance. Fortunately, if nn is large, our ability to approximate the covariance associated to a sample by looking at its distances to other samples becomes more accurate. Using a slightly modification of this idea, we can actually break our samples into subsets so that each subset is a mixture of Gaussians of approximately the save covariance. By projecting each of these in turn, we can reduce the original problem to a poly(k)\operatorname{poly}(k) number of dimensions and eliminate this extra term.

6.3 Minimax Error Bounds

We now explain our approach to pin down the information-theoretic optimal error for the list-decodable mean estimation problem. Concretely, for the identity covariance Gaussian case we show that there is an (inefficient) algorithm that guarantees that some hypothesis is within O(log(1/α))O(\sqrt{\log(1/\alpha)}) of the true mean. The basic idea is that the true mean must have the property that there is an α\alpha-fraction of samples that are well-concentrated (in the sense of having good tail bounds in every direction) about the point. The goal of our (inefficient) algorithm will be to find a small number of balls of radius O(log(1/α))O(\sqrt{\log(1/\alpha)}) that covers the set of all such points. We show that such a set exists using the covering/packing duality. In particular, we note that if there are a large number of such sets with means far apart, we get a contradiction since the sets must be individually large but their overlaps must be pairwise small (due to concentration bounds).

Regarding lower bounds, [CSV17] showed an Ω(log(1/α))\Omega(\sqrt{\log(1/\alpha)}) error lower bound for N(μ,Σ)N(\mu,\Sigma), where Σ\Sigma is unknown and ΣI\Sigma\preceq I. We strengthen this result by showing that the Ω(log(1/α))\Omega(\sqrt{\log(1/\alpha)}) lower bound holds even for N(μ,I)N(\mu,I). We also prove matching lower bounds of Ωk(α1/k)\Omega_{k}(\alpha^{-1/k}) for distributions with bounded moments. Our proofs proceed by exhibiting distributions XX, so that XX can be written as X=αXi+(1α)EiX=\alpha X_{i}+(1-\alpha)E_{i} for many different XiX_{i} satisfying the necessary hypotheses. Then any list-decoding algorithm must return a list of hypotheses close to the mean of every XiX_{i}. If there are many such XiX_{i}’s with means pairwise separated, then the list-decoding algorithm must either return many hypotheses or have large error.

6.4 SQ Lower Bounds

Finally, we prove lower bounds for list-decoding algorithms in the Statistical Query (SQ) model. Roughly speaking, we show that any SQ algorithm must either spend ndn^{d} time or have accuracy higher than α1/2d\alpha^{-1/2d}, suggesting that our list-decoding algorithm is qualitatively tight in its tradeoff between runtime and sample complexity.

We prove these bounds using the technology developed in [DKS16b]. This basically reduces to finding a 11-dimensional distribution whose first many moments agree with the corresponding moments of a standard Gaussian. In our case, this amounts to constructing a one-dimensional distribution A=αN(α1/d,1)+(1α)EA=\alpha N(\alpha^{-1/d},1)+(1-\alpha)E, so that AA’s first dd moments agree with those of a standard Gaussian. This can be done essentially because the αN(α1/d,1)\alpha N(\alpha^{-1/d},1) part of the distribution only contributes at most a constant to any of the first dd moments. This allows us to take EE approximately Gaussian but slightly tweaked near in order to fix these first few moments.

We note however, that if we move the error component much further from , its contribution to the dthd^{th} moment becomes super-constant and thus impossible to hide. This corresponds to the fact that degree-dd moments are sufficient (and necessary) in order to detect errors of size α1/d\alpha^{-1/d}.

7 Organization

The structure of the paper is as follows: In Section 2, we provide the necessary definitions and technical facts. In Section 3, we present our list-decoding algorithm. Section 4 gives our algorithm for GMMs. Finally, our minimax error and SQ lower bounds are given in Section 5.

Definitions and Preliminaries

Our basic objects of study are the Gaussian distribution and finite mixtures of spherical Gaussians:

2 Formal Problem Definitions

We record here the formal definitions of the problems that we study. Our first problem is robust mean estimation in the list-decodable learning model. We start by defining the list-decodable model:

We say that a learning problem is (m,ε)(m,\varepsilon)-list decodably solvable if there exists an efficient algorithm that can output a set of mm hypotheses with the guarantee that at least one is accurate to within error ε\varepsilon with high probability.

Our notion of robust estimation relies on the following model of corruptions:

Given 0<α10<\alpha\leq 1 and a distribution family D\mathcal{D}, an α\alpha-corrupted set of samples TT of size mm is generated as follows: First, a set SS of αm\alpha\cdot m many samples are drawn independently from some unknown DDD\in\mathcal{D}. Then an omniscient adversary, that is allowed to inspect the set SS, adds an arbitrary set of (1α)m(1-\alpha)\cdot m many points to the set SS to obtain the set TT.

We are now ready to define the problem of list-decodable robust mean estimation:

Our main algorithmic result is for the important special case that D\mathcal{D} is the family of unknown mean known covariance Gaussian distributions. We also establish minimax bounds that apply for more general distribution families.

Our second problem is that of learning mixtures of separated spherical Gaussians:

Given a positive integer kk and samples from a spherical kk-GMM F(x)=i=1kwiN(μi,σi2I)F(x)=\sum_{i=1}^{k}w_{i}N(\mu_{i},\sigma_{i}^{2}\cdot I), we want to estimate the parameters {(wi,μi,σi),i[k]}\{(w_{i},\mu_{i},\sigma_{i}),i\in[k]\} up to a required accuracy δ\delta. More specifically, we would like to return a list {(ui,νi,si),i[k]}\{(u_{i},\nu_{i},s_{i}),i\in[k]\} so that with high probability the following holds: For some permutation πSk\pi\in\mathbf{S}_{k} we have that for all i[k]i\in[k]: wiuπ(i)δ|w_{i}-u_{\pi(i)}|\leq\delta, μiνπ(i)2/σiδ/wi\|\mu_{i}-\nu_{\pi(i)}\|_{2}/\sigma_{i}\leq\delta/w_{i}, and σisπ(i)/σi(δ/wi)/n|\sigma_{i}-s_{\pi(i)}|/\sigma_{i}\leq(\delta/w_{i})/\sqrt{n}.

3 Basics of Hermite Analysis and Concentration

We will need the following standard concentration bound for degree-dd polynomials over independent Gaussians (see, e.g., [Jan97]):

List-Decodable Robust Mean Estimation Algorithm

In this section, we prove our main algorithmic result on list-decodable mean estimation:

The key idea procedure behind our algorithm is a subroutine that given a set of samples either cleans it up producing one or two subsets at least one of which has substantially fewer errors than the original, or certifies that the mean of GG must be close to the empirical mean (Proposition 3.6). Using this subroutine, our final algorithm can be obtained by repeatedly applying the subroutine recursively to the returned sets until they produce vectors. The details of this analysis are in Section 3.3.

Before we can get into the detailed overview of this proof, it is necessary to lay out some technical groundwork. First, we will want to have a deterministic condition under which our algorithm will succeed. To that end, we introduce two important definitions. We say that a set SS is representative of GG if it behaves like a set of independent samples of GG, in particular in the sense that it is a PRG against low-degree polynomial threshold functions for GG. We also say that a larger set TT is good if (roughly speaking) an α\alpha-fraction of the elements of TT are a representative set for GG. For technical reasons, will will also want the points of TT to be not too far apart from each other. In Section 3.1, we discuss the definitions of representative and good sets and provide some basic results.

In Section 3.2, we show that given a large set of points that contain an α\alpha-fraction of good points from, one can algorithmically find O(1/α)O(1/\alpha) many subsets so that with high probability at least one of them is good (and thus can be fed into the rest of our algorithm). This would be immediate if t were it not for the requirement that the points in a good set be not too far apart. As it stands, this will require that we perform some very basic clustering algorithms.

The actual design of our multifilter involves working with several types of “pure” degree-dd polynomials and their appropriate tensors. In particular, we need to pay attention to harmonic polynomials (which behave well with respect to L2L^{2}-norms), homogeneous polynomials, and multilinear polynomials. In Section 3.5, we introduce these and give several algebraic results relating them that will be required later.

The multifilter at its base level requires a routine that given a polynomial pp, where p(T)p(T) behaves very differently from p(G)p(G), allows us to use the values of p(x)p(x) to separate the points coming from GG from the errors. The basic idea of the technique is to cluster the points p(x)p(x), for xTx\in T, and throw away points that are too far from any cluster large enough to correspond to the bulk of the values of p(G)p(G) (which must be well-concentrated), or to divide TT into two subsets with enough overlap to guarantee that any such cluster could be entirely contained on one side or the other. The details of this basic multifilter algorithm are covered in Section 3.4.

Unfortunately, the application of the multifilter in the application above has a slight catch to it. Our basic multifilter will only apply to pp if we can verify that Var[p(G)]\operatorname*{Var}[p(G)] is not too large. This would be easy to verify if we knew the mean of GG, but unfortunately, we do not and errors in our approximation may lead to Var[p(G)]\operatorname*{Var}[p(G)] being much larger than anticipated, and in fact, potentially too large to apply our filter productively. In order to correct this, we will need new techniques to either prove that Var[p(G)]\operatorname*{Var}[p(G)] is small or to find a filter in the process. Using analytic techniques, in Section 3.5 we show that the Var[p(G)]\operatorname*{Var}[p(G)] is a weighted average of squares of qi(μμT)q_{i}(\mu-\mu_{T}) for some normalized, homogeneous polynomials qiq_{i}. Thus, it suffices to verify that each qi(μμT)q_{i}(\mu-\mu_{T}) is small.

To deal with this issue, it is actually much easier to work with multilinear polynomials, and so instead we deal with multilinear polynomials rir_{i} so that ri(x,,,x)=qi(x)r_{i}(x,\ldots,,x)=q_{i}(x). We thus need to verify that ri(μμT,μμT,,μμT)r_{i}(\mu-\mu_{T},\mu-\mu_{T},\ldots,\mu-\mu_{T}) is small. The discussion of the reduction to this problem is in Section 3.8, while the techniques for verifying that the rir_{i} have small mean is in Section 3.7.

In summary, the structure of this section is as follows: In Section 3.1, we define the important notions of a representative set and a good set and prove some basic properties. In Section 3.2, we do some basic clustering and show that a good set can be extracted from a set of corrupted samples. In Section 3.3, we present the main subroutine and show how it can be used to produce our final algorithm.

The remainder of Section 3 will be spent building this subroutine. In Section 3.4, we produce our most basic tool for creating multifilters given a single polynomial where p(T)p(T) and p(G)p(G) behave substantially differently. In Section 3.5, we present some basic background on harmonic, homogeneous and multilinear polynomials and their associated tensors. In Section 3.6, we use these to produce routines to find multifilters given degree-11 polynomials or degree-22 polynomials with bounded trace norm for which p(μ)p(\mu) is too large. In Section 3.7, we leverage these results to extend to produce a similar multifilter for arbitrary multilinear polynomials. In Section 3.8, we use this to get a multifilter for degree-dd harmonic polynomials whose L2L^{2} norms are substantially larger than expected, and in Section 3.9, we combine this with spectral techniques to get the full version of our filtering procedure, thus finishing our algorithm.

1 Representative Sets and Good Sets

We note that even though the definition of “representativeness” of a set SS depends on the parameters α\alpha and dd, these quantities will be fixed for the representative set SS throughout the execution of our algorithm, and thus the dependence will be implicit. Note that as α\alpha increases or dd decreases, the representativeness condition becomes weaker. Thus, SS being representative with parameters α\alpha and dd implies that it is also representative with parameter β\beta and dd^{\prime}, for any βα\beta\geq\alpha and ddd^{\prime}\leq d.

We start by showing that a sufficiently large set of samples drawn from GN(μ,I)G\sim N(\mu,I) is representative with high probability. This fact follows from standard arguments using the VC-inequality:

for all such polynomials pp with probability at least 1τ1-\tau, if we take

Our list-decodable learning algorithm and its analysis require an appropriate notion of goodness for the corrupted set TT. At a high-level, throughout its execution, our algorithm will produce several different subsets of the original corrupted set TT it starts with, in its attempt to remove outliers. Intuitively, we want a good set TTT^{\prime}\subseteq T to have the property that a β\beta-fraction of the points in TT^{\prime}, with βα\beta\geq\alpha, come from a representative set SS. However, we also need to account for the possibility that — in the process of removing outliers — our algorithm may also remove a small number of the points in the original representative set SS. Moreover, for technical reasons, we will require that all points in a good set are contained in a ball of radius O(n)O(\sqrt{n}). This is formalized in the following definition:

All points in TT are within distance O(n)O(\sqrt{n}) of each other.

2 Naive Clustering

We note that the definition of “goodness” of a set TT depends on the parameters α\alpha and dd. The parameter α\alpha will change multiple times during the execution of the algorithm (it will increase), while the parameter dd does not increase. This justifies our choice of making α\alpha explicit in the definition of “α\alpha-good”, while keeping the dependence on dd implicit.

The additional constraint that all points in a good set are contained in a not-too-large ball means that our original set of corrupted samples TT may not form a good set. To rectify this issue, we start by performing a very basic clustering step as follows: Since an α\alpha-fraction of the points in TT are concentrated within distance O(n)O(\sqrt{n}) from the true mean μ\mu, by taking a maximal set of non-overlapping, not-too-large balls each with a large fraction of points, we are guaranteed that at least one of them will contain a good set. This is formalized in the following lemma:

Let SS be the subset of TT containing the good samples, i.e., the points that are independent samples from GG. We have that STS\subset T and S=2αT|S|={2}\alpha\cdot|T|. By Lemma 3.3, the set SS is representative with probability at least 1τ1-\tau. We henceforth condition on this event.

If all points in TT are contained in a ball of radius O(n)O(\sqrt{n}), there is nothing to prove. If this is not the case, we will show how to efficiently find a collection of at most 1/α1/\alpha many balls of radius O(n)O(\sqrt{n}), so that at least one of the balls contains at least a (1α/6e3n)(1-\alpha/{6}-e^{-{3}n})-fraction of the points in SS.

First note that, by the degree-22 Chernoff bound, we have that Pr[Gμ22C2n]exp(3n)\Pr[\|G-\mu\|_{2}^{2}\geq C^{2}n]\leq\exp(-{3}n), for a sufficiently large universal constant CC. Since Gμ22\|G-\mu\|_{2}^{2} is a degree-22 polynomial and SS is representative, Definition 3.2 implies that at least an (1exp(3n)α/6)(1-\exp(-{3}n)-\alpha/{6}) fraction of points in SS are within distance CnC\sqrt{n} of μ\mu.

Our clustering scheme works as follows: We consider a maximal set of disjoint balls of radius R=2CnR=2C\cdot\sqrt{n} centered at points of TT, such that each ball contains at least an α\alpha-fraction of the points in TT. Note that this set is non-empty: Since SS is representative, the ball of radius RR centered at any point xSx\in S with xμ2Cn\|x-\mu\|_{2}\leq C\sqrt{n} contains at least a (1e3nα/6)(1-e^{-3n}-\alpha/6)-fraction of the points in SS, and therefore at least an α\alpha-fraction of the points in TT.

Let B1,,BsB_{1},\ldots,B_{s} be the maximal set of disjoint balls described above. Since each BiB_{i} contains an α\alpha-fraction of the points in TT, there are at most s1/αs\leq 1/\alpha many such balls. Let BiB^{\prime}_{i}, i[s]i\in[s], be the ball with the same center as BiB_{i} and radius 3R=6Cn3R=6C\cdot\sqrt{n}. Consider the subsets Ti=TBiT_{i}=T\cap B^{\prime}_{i}, i[s]i\in[s]. We claim that at least one of the TiT_{i}’s is α\alpha-good.

The pseudo-code for this clustering algorithm is given below.

We now prove correctness. By definition, all the points of TiT_{i} are contained in a ball of radius 3R=6Cn3R=6C\sqrt{n}. Let xx be any point of SS within distance CnC\sqrt{n} of the mean μ\mu of GG. (Recall that, since SS is representative, most of the points of SS satisfy this property.) Since the initial set of balls B1,,BsB_{1},\ldots,B_{s} was constructed to be maximal, at least one BiB_{i} must intersect the ball BxB_{x} of radius R=2CnR=2C\sqrt{n} centered at xx. This implies that the ball BxB_{x} is contained in BiB^{\prime}_{i}. Therefore, TiT_{i} contains all of the points of SS within CnC\sqrt{n} of the mean μ\mu of GG. This completes the proof of Lemma 3.5. ∎

3 Main Multifilter Algorithm and Proof of Theorem 3.1

In this section, we describe the main subroutine that forms the core of our final algorithm. Let 0<α<1/20<\alpha<1/2 be the fraction of good samples, and let SS be the representative set contained in the initial corrupted set TT. Ideally, starting from the corrupted set TT, we would like to either certify that the true mean μ\mu of GG is close to the empirical mean μT\mu_{T} of TT, or we have an efficient method to “clean up” the set TT, i.e., obtain a set TTT^{\prime}\subset T containing a higher fraction of clean samples.

Unfortunately, such a goal seems unattainable in the setting where the fraction α\alpha of clean samples is smaller than 1/21/2 for the following reason: Suppose that α=1/k\alpha=1/k for some integer k2k\geq 2. It is quite possible that the original set of corrupted points TT is a set of samples drawn from a mixture i=1k(1/k)N(μi,I)\sum_{i=1}^{k}(1/k)\cdot N(\mu_{i},I) of kk spherical Gaussians with separated means μi\mu_{i}. In this case, the best that we can hope to achieve is break TT into several (potentially overlapping) subsets TiT_{i}, with the guarantee that at least one of the subsets TiT_{i} has a higher fraction of points from SS. As our final algorithm will then have to recurse on the TiT_{i}’s, we will need to ensure that they are not too large in order to obtain a bound on the total runtime of the algorithm.

Our main sub-routine satisfies the guarantees of the following proposition:

Moreover, with probability at least 1τ1-\tau, the output of the algorithm satisfies the following guarantees:

In case (1), if TT is α\alpha-good with respect to GN(μ,I)G\sim N(\mu,I), then we have that

In case (2), the set TT is not α\alpha-good.

In case (3), if TT is α\alpha-good then at least one TiT_{i} is αi\alpha_{i}-good with respect to GG.

The algorithm runs in time O((d+ln(1/τ))/α)dpoly(T,nd)O\left((d+\ln(1/\tau))/\alpha\right)^{d}\cdot\operatorname{poly}(|T|,n^{d}).

The proof of Proposition 3.6 requires several ingredients and is given in the following subsections. We start by showing how Theorem 3.1 follows from this proposition.

The overall algorithm proceeds as follows: Using Lemma 3.5, we produce a list {T1,,Ts}\{T_{1},\ldots,T_{s}\} of subsets of TT at least one of which is α/2\alpha/2-good with respect to GG.

The algorithm maintains a list of pairs {(Ti,αi)}i=1m\{(T_{i},\alpha_{i})\}_{i=1}^{m} with the guarantee that with high probability at least one of the TiT_{i} is αi\alpha_{i}-good with respect to GG. It then iteratively applies the routine from Proposition 3.6, to each pair (Ti,αi)(T_{i},\alpha_{i}) in the list, and when its output is not a vector, it replaces the pair (Ti,αi)(T_{i},\alpha_{i}) by either zero, one, or two such pairs in the list. Then it outputs the list of all vectors that Proposition 3.6 ever returned.

We will show that this process produces a list of O(1/α3)O(1/\alpha^{3}) many candidate hypotheses. We provide an efficient black-box way in which we can reduce the size of this list to O(1/α)O(1/\alpha) without losing more than a constant factor in the final error guarantee. See Proposition B.1.

The pseudo-code for the algorithm is given below:

We now proceed with the analysis. First, it is easy to see that iαi2\sum_{i}\alpha_{i}^{-2} can only decrease. Therefore, since all αi1\alpha_{i}\leq 1, we can never have more than O(α3)O(\alpha^{-{3}}) many terms in the list. Each iteration of the algorithm either increases the number of TiT_{i}’s or decreases the size of some TiT_{i} by at least 11. Therefore, the algorithm terminates after at most O(T/α3)O(|T|/\alpha^{{3}}) many calls to MainSubroutine.

We claim that with probability at least 1τ1-\tau, at every step, we either have that one of the TiT_{i}’s is αi\alpha_{i}-good, or we have already produced a good approximation to μ\mu. We can view the set of all TiT_{i}’s ever produced by the algorithm as a tree, where the children of each TiT_{i} are all subsets produced by MainSubroutine on TiT_{i}. We say that TiT_{i} belongs to the jj-th generation if the path from TT to TiT_{i} in this tree has length jj. Then, it is easy to see inductively that either at least one TiT_{i} in the jj-th generation is αi\alpha_{i}-good or a good approximation to the mean was output in an earlier generation, with probability at least jτ/Tj\tau/|T|. This is because given a TiT_{i} that is αi\alpha_{i}-good in the jthj^{th} generation, with probability at least 1τ/T1-\tau/|T| it either produces a good approximation to the mean or has at least one descendant TT^{\prime} that is α\alpha^{\prime}-good. Since there can only be T|T| generations, the list of vectors produced contains a good approximation to the mean with probability at least 1τ1-\tau. This completes the proof.

To analyze the runtime, note that there are at most T|T| generations each involving running MainSubroutine on O(α3)O(\alpha^{-3}) many TiT_{i}’s. This clearly dominates the runtime from the initial use of NaiveClustering, the final use of ListReduction and the other bookkeeping steps. ∎

In the remaining sections, we describe a number of subroutines that either verify some desired properties of low-degree polynomials or produce an output list of subsets satisfying the following condition:

We say that a list of pairs {(Ti,αi)}\{(T_{i},\alpha_{i})\}, where TiTT_{i}\subset T and αi(0,1)\alpha_{i}\in(0,1), satisfies the multifilter condition for (T,α)(T,\alpha) if the following hold:

i1/αi21/α2\sum_{i}1/\alpha_{i}^{2}\leq 1/\alpha^{2}, and

If TT is α\alpha-good (with respect to GG), then at least one of the TiT_{i}’s is αi\alpha_{i}-good (with respect to GG).

4 Basic Multifilter Routine

How do we produce the multifilters for corrupted sets TT required in Proposition 3.6? The basic idea is the following: If we find a degree-dd nn-variable polynomial pp such that the mean and variance of p(T)p(T) behave substantially differently from the mean and variance of p(G)p(G), we will be able to split TT into (overlapping) subsets — based on the values of p(x)p(x), xTx\in T — with the property that at least one of these subsets is “cleaner” than TT.

The obvious difficulty is that the mean μ\mu of GG is unknown, and therefore so is the mean and variance of p(G)p(G). Suppose for now that we have a low-degree polynomial pp whose variance Var[p(G)]\operatorname*{Var}[p(G)] is small. (As we argue in the following sections, we can guarantee that this condition is satisfied with high probability for the polynomials pp that we will use.) By concentration, it then follows that at least a (1α/2)(1-\alpha/2) fraction of the points of {p(x),xS}\{p(x),x\in S\} will lie in an interval of length log(1/α)O(d)\log(1/\alpha)^{O(d)}. If a decent number of points of {p(x),xT}\{p(x),x\in T\} fall outside an interval of this length, then we know that the distribution of p(T)p(T) is very different. In such a case, be removing points of TT beyond some threshold, we can get rid of substantially more bad points than good points.

Recall that we assumed that Var[p(G)]\operatorname*{Var}[p(G)] is small. However, the mean of p(G)p(G) is still unknown. This creates the following complication: There might be several intervals of length log(1/α)O(d)\log(1/\alpha)^{O(d)} each of them containing an α\alpha-fraction of TT. Any of these intervals would make a reasonable hypothesis and we have no way of distinguishing between them. In such a setting, we split our set TT into several overlapping subsets. We make the overlap between them large enough so that no matter where the (unknown) true mean of p(G)p(G) is, the vast majority of the points of {p(x),xS}\{p(x),x\in S\} will lie in the same subset. At the same time, because there is a substantial number of points in TT on either side of the divide, we can guarantee that not too many points are kept on either side.

Formally, we establish the following proposition:

“YES”. If the algorithm returns “YES”, then we have that:

Var[p(T)]O(d+log(1/α))d(log(2+log(1/α)))2\operatorname*{Var}[p(T)]\leq O(d+\log(1/\alpha))^{d}(\log(2+\log(1/\alpha)))^{2}, and

“NO”. If the algorithm returns “NO”, then the set TT is not α\alpha-good.

A list of pairs {(Ti,αi)}\{(T_{i},\alpha_{i})\}, TiTT_{i}\subset T, of length one or two satisfying the multifilter condition for (T,α)(T,\alpha).

The rest of this subsection is devoted to the proof of Proposition 3.8.

We start by describing the basic multifilter routine in pseudocode:

The algorithm begins by defining some appropriate constants and in particular choosing the parameter RR to be polylogarithmic in 1/α1/\alpha, so that if Var[p(G)]1\operatorname*{Var}[p(G)]\leq 1 then all but a poly(α)\operatorname{poly}(\alpha)-fraction of the points of p(G)p(G) must lie in an interval of length RR.

We begin in Step 3 by doing a basic sanity check, ensuring that the total range of values of p(T)p(T) is not too large. On the one hand, we are guaranteed by Claim 3.9 in the next section that this will be true if Var[p(G)]1\operatorname*{Var}[p(G)]\leq 1 and TT is good. This assumption will become technically useful as it will imply that if p(T)p(T) has too large a mean or variance, then this cannot be due to a negligible number of points at extreme distances from the mean, and instead must be due to a relatively significant number of errors.

We split our analysis into cases roughly corresponding to the one cluster versus two clusters cases. Step 4 corresponds to the case where there is a single interval of length not much more than RR that contains all but an α/2\alpha/2-fraction of the points of {p(x),xT}\{p(x),x\in T\}. This implies that since an α\alpha-fraction of points come from p(G)p(G), the mean of p(G)p(G) cannot be far outside of this interval. Thus, we can construct a slightly larger interval that contains all but a tiny fraction of the points of p(G)p(G). From there, it is a relatively easy to show that since p(T)p(T) has substantially different behavior from p(G)p(G), there must be some (potentially larger interval), where the fraction of points of p(T)p(T) outside of this interval is much larger than the fraction of points of p(G)p(G) outside of this interval. Thus, throwing away the points outside of the latter interval will serve to clean up our dataset.

Step 5 covers the other case where no short interval contains all but an α/2\alpha/2-fraction of the points. In this case, we try to partition TT into sets based on whether p(x)p(x) lies in (,t+R)(\infty,t+R) or (tR,)(t-R,\infty), for some appropriate threshold tt. Notice that since these intervals overlap on an interval of length 2R2R, no matter where the mean of p(G)p(G) lies, almost all of the points of p(G)p(G) will lie in one of the two intervals (if it’s in the middle it could even be the case that almost all points lie in both). We now need to pick a threshold tt so that the number of elements of p(T)p(T) not on either side is substantial. We are aided in this endeavor by the fact that there must be some gap of length Ω(Rlog(2+log(1/α)))\Omega(R\log(2+\log(1/\alpha))) with at least an Ω(α)\Omega(\alpha)-fraction of the points lying on either side of it. By trying many possible values if tt within this interval, it is not hard to show that there must be some threshold where neither side keeps too many.

4.2 First Stage of Correctness Proof

In this subsection, we prove the correctness conditions of our algorithm that do not establish goodness of the output.

We start by showing that if the algorithm returns “NO” in Step 3, then the set TT is not α\alpha-good:

Suppose that TT is α\alpha-good and Var[p(G)]1\operatorname*{Var}[p(G)]\leq 1. Then, it holds that maxx,yTp(x)p(y)O(1+n/d)d/2\max_{x,y\in T}|p(x)-p(y)|\leq O(1+n/d)^{d/2}.

Taking the product over ii from 11 to nn yields:

If u0u\geq 0, we have that the left hand side is at least

Letting u=min(1/2,d/n)u=\min(1/2,d/n), we find that if dnd\ll n, then the above sum is at most (n/d)dexp(O(d2/n)+O(dy22/n))=O(n/d)d(n/d)^{d}\exp(O(d^{2}/n)+O(d\|y\|_{2}^{2}/n))=O(n/d)^{d}. If dnd\gg n, this is O(1)d+n+y22O(1)^{d+n+\|y\|_{2}^{2}}. In either case, it is O(1+n/d)dO(1+n/d)^{d}, and thus p(y)|p(y)| is at most O(1+n/d)d/2O(1+n/d)^{d/2}. This completes the proof of Claim 3.9. ∎

The correctness of the other step in which the algorithm can output “NO”, Step 4(b)iii, is deferred to after Lemma 3.18 below.

We next show that the algorithm finds an appropriate threshold tt in Steps 4(b)i and 5a.

If the algorithm reaches Step 4(b)i, there exists a threshold t>2Rt>2R satisfying (2).

The main idea of the proof is quite simple: If there is no t>2Rt>2R satisfying (2), the random variable p(T)p(T) is well-concentrated, which contradicts the lower bound on its variance that holds in Step 4b. Suppose, for the sake of contradiction, that for all t>2Rt>2R, it holds:

By a change of variable, this is equivalent to the following statement. For all t>2R+(ba)/2t>2R+(b-a)/2, it holds:

The proof will require some basic properties of the Gamma function. We will use the following notation: Let Γ(s,x)=xexp(t)ts1dt\Gamma(s,x)=\int_{x}^{\infty}\exp(-t)t^{s-1}dt be the incomplete gamma function and Γ(s)=Γ(s,0)\Gamma(s)=\Gamma(s,0) be the Gamma function. We will need the following technical claim about the incomplete Gamma function, whose proof can be found in Appendix A:

For s1,x0s\geq 1,x\geq 0, we have that Γ(s,x)exp(x)(x+s)s1\Gamma(s,x)\leq\exp(-x)(x+s)^{s-1}.

We can now bound from below the variance of p(T)p(T) as follows:

By the construction of the algorithm, when the variance Var[p(T)]\operatorname*{Var}[p(T)] is this small, the algorithm does not reach Step 4(b)i. This gives the desired contradiction and completes the proof of Lemma 3.10. ∎

We next show that the algorithm finds an appropriate threshold in Step 5a:

If the algorithm reaches Step 5a, there exists a threshold tt satisfying (3).

We will show that if no tt satisfies (3), then at least a 1α/21-\alpha/2 fraction of points xTx\in T have p(x)p(x) in an interval of length O(log(2+log(1/α))R)O(\log(2+\log(1/\alpha))\cdot R). This would be a contradiction, as no such interval exists if the algorithm reaches Step 5. This is essentially because if there is no such tt, then if we consider the fraction of xTx\in T with p(x)tp(x)\leq t, this must grow very quickly with tt after it first exceeds α/4\alpha/4, and thus there is very little distance between the α/4\alpha/4-tails of the distribution and the median. This means that the α/4\alpha/4-tails on the left and on the right must be close to each other, implying that the interval from Step 4 must exist.

Let tm=2mR+median(p(T))t_{m}=2m\cdot R+\textrm{median}(p(T)), for integer mm. If no tt satisfies (3), then tmt_{m} is not a suitable choice of tt, and therefore

where T1,m={xT:p(x)>tmR}T_{1,m}=\{x\in T:p(x){>}t_{m}-R\} and T2,m={xT:p(x)<tm+R}T_{2,m}=\{x\in T:p(x){<}t_{m}+R\} for all mm with Ti,mαT|T_{i,m}|\geq\alpha|T|.

Let am=T1,m/Ta_{m}=|T_{1,m}|/|T|. Notice that T2,m/T=(1am+1)|T_{2,m}|/|T|=(1-a_{m+1}). If Equation (4) holds for a given mm, then we must have am2+(1am+1)2>(1α2/100)2.a_{m}^{2}+(1-a_{m+1})^{2}>(1-\alpha^{2}/100)^{2}. This means that am+1<(am2+α2/16)a_{m+1}<(a_{m}^{2}+\alpha^{2}/16). If amα/4a_{m}\geq\alpha/4, then am+12am2a_{m+1}\leq 2a_{m}^{2}. Since a1<1/2a_{1}<1/2, it is easy to see that for some m=log2log(1/α)+O(1)m=\log_{2}\log(1/\alpha)+O(1) that amα/4a_{m}\leq\alpha/4. Thus, {xT:p(x)>t+(2m+1)R}Tα/4|\{x\in T:p(x)>t+(2m+1)R\}|\leq|T|\alpha/4. Similarly, we can see that {xT:p(x)<t(2m+1)R}Tα/4|\{x\in T:p(x)<t-(2m+1)R\}|\leq|T|\alpha/4. This gives us an interval of length O(Rlog(2+log(1/α)))O(R\log(2+\log(1/\alpha))) that contains all but an α/2\alpha/2-fraction of the points of p(T)p(T).

This provides the desired contradiction and gives the proof of Lemma 3.12. ∎

We now show that if the algorithm outputs a list of length one, the parameter α\alpha improves:

If the algorithm outputs a single pair (T,α)(T^{\prime},\alpha^{\prime}), then αα\alpha^{\prime}\geq\alpha.

Note that αα>0\alpha^{\prime}-\alpha>0 when T<T|T^{\prime}|<|T|. Moreover, T<T|T^{\prime}|<|T| holds true because the algorithm is guaranteed to remove at least one point, since

is non-zero. This completes the proof of Lemma 3.13. ∎

The next lemma shows that whenever the algorithm outputs two sets T1,T2T_{1},T_{2}, the associated parameters α1,α2\alpha_{1},\alpha_{2} satisfy 1/α12+1/α221/α21/\alpha_{1}^{2}+1/\alpha_{2}^{2}\leq 1/\alpha^{2}. This condition will be crucial for bounding the runtime of the overall algorithm. We also note that this condition holds even if the set TT was not α\alpha-good.

If the algorithm outputs a list of two pairs {(T1,α1),(T2,α2)}\{(T_{1},\alpha_{1}),(T_{2},\alpha_{2})\}, then we have that 1/α12+1/α221/α21/\alpha_{1}^{2}+1/\alpha_{2}^{2}\leq 1/\alpha^{2}.

4.3 Second Stage of Correctness Proof

In this subsection, we establish the correctness conditions of the algorithm that hold under the assumption that the set TT is α\alpha-good. Specifically, we will show that if TT is α\alpha-good the following hold: (1) If the algorithm returns “YES”, the means of p(G)p(G) and p(T)p(T) are close to each other. (2) If the algorithm returns a subset TT^{\prime}, then TT^{\prime} is good. (3) If the algorithm returns two subsets T1,T2T_{1},T_{2}, then at least one of the returned sets is good.

Since TT is α\alpha-good, we have that Var[p(G)]1\operatorname*{Var}[p(G)]\leq 1. By the degree-dd Chernoff bound and the definition of RR we get

We can now apply the bounds given by the definition of TT being α\alpha-good, which give the claim. ∎

We next show that if we find an appropriate interval [a,b][a,b] in Step 4, then the mean of p(G)p(G) is contained in the interval [aR,b+R][a-R,b+R].

By construction, if the algorithm returns “YES”, then

and [a,b][a,b] is an interval of length at most O((Clog(1/α))d/2log(2+log(1/α)))O((C\log(1/\alpha))^{d/2}\log(2+\log(1/\alpha))) that contains all except an α/2\alpha/2-fraction of points of p(x)p(x), xTx\in T. We first consider the contribution of the points in [a,b][a,b] to the variance to obtain that

We now prove (2). This proof requires a careful accounting of the number of points from SS and points from TT that are removed by our algorithm.

If TT is α\alpha-good and the algorithm returns a single pair (T,α)(T^{\prime},\alpha^{\prime}), then TT^{\prime} is α\alpha^{\prime}-good.

Since TT is α\alpha-good, it follows that STS\cap T contains a 1α/6+exp(3n)1/21-\alpha/6+\exp(-3n)\geq 1/2 fraction of the points in SS, hence

But note that the probability of the above event is at least 8/α8/\alpha times bigger for XuTX\in_{u}T. Since TT^{\prime} has the filtered samples removed, we have that

and so ST/ST1(α/8)(1T/T)|S\cap T^{\prime}|/|S\cap T|\geq 1-(\alpha/8)(1-|T^{\prime}|/|T|). Now we can bound from below the fraction of good samples in TT^{\prime}:

It remains to show that ST/S1α/6exp(3n)|S\cap T^{\prime}|/|S|\geq 1-\alpha^{\prime}/6-\exp(-3n). Noting that (1α/8(1T/T))α=αT/T(1-\alpha/8(1-|T^{\prime}|/|T|))\alpha=\alpha^{\prime}|T^{\prime}|/|T|, we get that:

Note that this proof shows that if TT is α\alpha-good, then TT^{\prime} is α\alpha^{\prime}-good. If α>1\alpha^{\prime}>1, TT cannot be α\alpha^{\prime}-good, as this would require that ST>T|S\cap T^{\prime}|>|T^{\prime}|. Thus, if the algorithm outputs “NO” in Step 4(b)iii, then α>1\alpha^{\prime}>1 implies that TT cannot be α\alpha-good.

Our last lemma establishes (3), handling the case when the algorithm outputs a list of two sets:

If TT is α\alpha-good and the algorithm outputs a list {(T1,α1),(T2,α2)}\{(T_{1},\alpha_{1}),(T_{2},\alpha_{2})\}, then for some i{1,2}i\in\{1,2\} TiT_{i} is αi\alpha_{i}-good.

Note that STS(1α/6e3n)|S\cap T|\geq|S|(1-\alpha/6-e^{-3n}). We have that S(T\T)Sα2/100|S\cap(T\backslash T^{\prime})|\leq|S|\alpha^{2}/100. Therefore, as long as αi/6α/6+α2/100\alpha_{i}/6\geq\alpha/6+\alpha^{2}/100, this set TiT_{i} contains enough points of SS. These points consist of at least T(αα2/100)/Ti|T|(\alpha-\alpha^{2}/100)/|T_{i}| fraction of TiT_{i}. Thus, when

the set TiT_{i} is αi\alpha_{i}-good. It remains to show that αi=Tα(1α/100)/Tiα(1+3α/50)\alpha_{i}=|T|\alpha(1-\alpha/100)/|T_{i}|\geq\alpha(1+3\alpha/50).

Recall that Tmax{T1,T2}αT/4|T|-\max\{|T_{1}|,|T_{2}|\}\geq\alpha\cdot|T|/4. Therefore, T/Ti(1α/4)1|T|/|T_{i}|\geq(1-\alpha/4)^{-1}, and thus αiα(1α/100)/(1α/4)α(1+3α/50)\alpha_{i}\geq\alpha(1-\alpha/100)/(1-\alpha/4)\geq\alpha(1+3\alpha/50). This completes our proof. ∎

In the following subsections, we appropriately use the basic multifilter algorithm to handle increasingly broad families of degree-dd polynomials.

5 Useful Results on Polynomials and Tensors

Our algorithm and its analysis will make essential use of degree-dd multivariate polynomials. Thus, we will need to establish here some basic facts and terminology. A basic theme throughout will be that “pure” degree-dd multivariate polynomials are in 111-1 correspondence with symmetric order-dd tensors. On the other hand, we will require three different notions of what it means for a polynomial to be “pure” degree-dd: homogeneous polynomials, symmetric multilinear polynomials, and harmonic polynomials (which are useful when considering norms with respect to the Gaussian distribution).

The structure of this section is as follows: In Section 3.5.1, we give the definitions of the various types of polynomials we will require and their associated tensors. In Section 3.5.2, we establish a key lemma (Lemma 3.24) relating the first two moments of such polynomials under an identity covariance Gaussian. This lemma is essential for our algorithm and its analysis.

Any degree-dd harmonic polynomial p(x)p(x) satisfies p(x)=hA(x)p(x)=h_{A}(x), where

We call a homogeneous, harmonic or symmetric multilinear polynomial normalized if the associated tensor AA has A2=1\|A\|_{2}=1.

5.2 Useful Lemma on Polynomials

In this section, we establish a crucial lemma relating the aforementioned classes of polynomials with respect to an identity covariance Gaussian distribution.

The top level of our list-decoding algorithm (Section 3.9) proceeds by computing the top eigenvector of a matrix to find a degree-dd harmonic polynomial whose empirical variance is large. In order to construct a multifilter from such a polynomial, we need to bound its variance on the true Gaussian, which determines its concentration. In general, the variance of a polynomial of an identity covariance Gaussian depends on the mean. Lemma 3.24 expresses the expectation and variance of harmonic polynomials of such a Gaussian in terms of homogeneous and symmetric multilinear polynomials. Using this connection, we can bound the variance of a harmonic polynomial by running multifilters on symmetric multilinear polynomials, which are easier to analyze because the variance of linear polynomials of Gaussians does not depend on the mean.

where B(d)B^{(d^{\prime})} is the order-2d2d^{\prime} tensor with

Let XN(μ,I)X\sim N(\mu,I) and Y=XμY=X-\mu. Then we have

where AμddA\mu^{\otimes d-d^{\prime}} is the order-dd^{\prime} tensor with i1,,idi_{1},\dots,i_{d^{\prime}} entry id+1,,idAi1,,idj=d+1dμij\sum_{i_{d^{\prime}+1},\dots,i_{d}}A_{i_{1},\dots,i_{d}}\prod_{j=d^{\prime}+1}^{d}\mu_{i_{j}}.

We prove this by Taylor expanding hA(X)=hA(Y+μ)h_{A}(X)=h_{A}(Y+\mu) about YY. In particular, applying the one-dimensional Taylor Theorem to hA(Y+tμ)h_{A}(Y+t\mu) about t=0t=0, we find that

where Dμ(f)=i=1nμifxiD_{\mu}(f)=\sum_{i=1}^{n}\mu_{i}\frac{\partial f}{\partial x_{i}} is the directional derivative in the μ\mu-direction of ff.

We note that DμkhAD_{\mu}^{k}h_{A} is a linear combination of kthk^{th}-order partial derivatives of hAh_{A}. Since any kthk^{th}-order partial derivative of HeaHe_{a} is a harmonic polynomial of degree a1k\|a\|_{1}-k, we have that DμkhAD_{\mu}^{k}h_{A} is a harmonic polynomial of degree dkd-k, say hCkh_{C_{k}}. In order to find out which polynomial, recall that by Fact 3.21 we have

Therefore, Ck=d!(dk)!AμkC_{k}=\sqrt{\frac{d!}{(d-k)!}}A\mu^{\otimes k}. Hence, we have

Changing variables to d=dkd^{\prime}=d-k yields our result. ∎

The proof of Lemma 3.24 is now complete. ∎

6 Multifilter Routine for Degree-111 and Degree-222 Homogeneous Polynomials

If the output is “YES”, then the following holds: if TT is α\alpha-good, then

If the output is “NO”, then TT is not α\alpha-good.

If the output is a list {(Ti,αi)}\{(T_{i},\alpha_{i})\}, then the list satisfies the multifilter condition for (T,α)(T,\alpha).

For correctness, note that since the viv_{i}’s are unit vectors, we have that Var[viTG]=1\operatorname*{Var}[v_{i}^{T}G]=1, and therefore the conditions of Proposition 3.8 hold. Hence, if BasicMultifilter returns “NO”, then TT is not α\alpha-good; if it returns a list of (Ti,αi)(T_{i},\alpha_{i}), then the list satisfies the multifilter condition for (T,α)(T,\alpha). So, the algorithm is correct if any multifilter produces this output.

It remains to show correctness when the algorithm output “YES”. In this case, using the guarantees of the “YES” case in BasicMultifilter, we have that

Since p(x)=xTAx=iλi(viTx)2p(x)=x^{T}Ax=\sum_{i}\lambda_{i}(v_{i}^{T}x)^{2} and iλi=A1\sum_{i}|\lambda_{i}|=\|A\|_{\ast}\leq 1, we have

7 Multifilter Routine for Symmetric Multilinear Polynomials

Recall that a degree-dd multilinear polynomial is one that is linear as a function of each coordinate and can be expressed as A(x1,,xd)A(x_{1},\dots,x_{d}), for an order-dd tensor AA.

A major part of our algorithm is to design a multifilter for degree-dd multilinear polynomials. In particular, given such a polynomial pp, we want to either be able to verify that p(GμT,,GμT)p(G-\mu_{T},\ldots,G-\mu_{T}) has similar expectation to p(TμT,,TμT)=0p(T-\mu_{T},\ldots,T-\mu_{T})=0 (in both cases the coordinates are independent), or to find a filter that allows us to throw away some of the bad points of TT. In particular, we show:

If the algorithm returns “YES”, then the following holds: If TT is α\alpha-good, then with probability at least 1τ1-\tau, we have that

where GiG_{i} are independent copies of GG.

If the algorithm returns “NO”, then TT is not α\alpha-good.

If the algorithm returns a list of (Ti,αi)(T_{i},\alpha_{i}), then (Ti,αi)(T_{i},\alpha_{i}) satisfy the multifilter condition for (T,α)(T,\alpha).

The algorithm runs in time O((ln(1/τ)+d)/α)dpoly(T,nd)O((\ln(1/\tau)+d)/\alpha)^{d}\cdot\operatorname{poly}(|T|,n^{d}).

The basic idea of the proof is as follows: We note that if p(GμT,GμT,,GμT)p(G-\mu_{T},G-\mu_{T},\ldots,G-\mu_{T}) has mean significantly far from , then when we evaluate p(TμT,TμT,,TμT)p(T-\mu_{T},T-\mu_{T},\ldots,T-\mu_{T}) there will be approximately an αd\alpha^{d} probability that all of our samples come from GG, and thus produce a value far from the mean. One might naively try to look at dd inputs to GG at a time and perform some clustering based on the values. Unfortunately, this approach has several problems including that some dd-tuples will mix good samples with bad ones.

Instead, we consider assigning all but one of the coordinates random values from TT and considering the linear function of the last coordinate. With decent probability, the first d1d-1 coordinates will come from GG and the average over the samples from SS will be approximately p(μμT,μμT,,μμT)p(\mu-\mu_{T},\mu-\mu_{T},\ldots,\mu-\mu_{T}), while the average where the last sample comes from TT will be approximately (since the mean of TT is , any linear function of TT also has mean ). This should allow us to do clustering unless VarX[p(X1μT,X2μT,,Xd1μT,XμT)]\operatorname*{Var}_{X}[p(X_{1}-\mu_{T},X_{2}-\mu_{T},\ldots,X_{d-1}-\mu_{T},X-\mu_{T})] is too large.

However, if fixing one of the inputs to pp causes the variance to increase significantly, this should also produce a detectable discrepancy that we can use to filter. In particular, the variance of a multilinear polynomial pp after setting its first coordinate to XX is a degree-22, trace-norm 11 polynomial in XX. Therefore, using our multifilter for degree-22 polynomials, unless it is the case that setting the first variable of pp to a random XX from TT does not increase the variance by too much, we can create a filter.

In the end, this gives us a recursive algorithm. We fix the variables of pp one at a time in many different ways. At each step, we verify that most of the ways to fix the next variable does not increase the variance by too much (or set the mean to something too large in the last step). If any of these fails to hold, we will be able to get a multifilter. If however, none of these settings produce any detectable problem, then it is likely the case that many times we saw dd samples from GG as inputs, yielding a value of pp that is not too large. This will only happen if the mean of pp at dd copies of GG is not too large, which is our “YES” case.

The pseudocode of our multifilter for multilinear polynomials is given below.

Firstly, we bound the running time. Note that we make Cα1ln(2d/τ)C\alpha^{-1}\ln(2d/\tau) calls to the degree-(d1)(d-1) MultilinearMultifilter, each of which has precision τ/2\tau/2. Inductively, in the recursive calls to the degree-(di)(d-i) MultilinearMultifilter, for di2d-i\geq 2, we make Cα1ln(2i+1(di)/τ)C\alpha^{-1}\ln(2^{i+1}(d-i)/\tau) calls to degree-(di+1)(d-i+1) MultilinearMultifilter. Thus, the total number of calls to MultilinearMultifilter is O((ln(1/τ)+d)/α)dO((\ln(1/\tau)+d)/\alpha)^{d}.

Next, we show that if the algorithm returns “NO” or a list (Ti,αi)(T_{i},\alpha_{i}), then it is correct. For this, we just need to show that all the calls to the subroutines satisfy their conditions. In the d=1d=1 case, AA is a vector, hence Var[A(G)]=A2=1\operatorname*{Var}[A(G)]=\|A\|_{2}=1, and therefore we satisfy the conditions of BasicMultifilter and the algorithm returns correctly.

where Bij=i2,,idAi,i2,,idAj,i2,,idB_{ij}=\sum_{i_{2},\dots,i_{d}}A_{i,i_{2},\dots,i_{d}}A_{j,i_{2},\dots,i_{d}}. But note that BB is symmetric and

Thus, we satisfy the pre-conditions of Degree2Multifilter and the algorithm returns correctly.

Since Ax=(1/q(x))A(xμT)=(1/Ax2)A(xμT)A_{x}=(1/\sqrt{q(x)})A(x-\mu_{T})=(1/\|Ax\|_{2})A(x-\mu_{T}), we have Ax2=1\|A_{x}\|_{2}=1 for all xUx\in U. Thus, we recursively satisfy the conditions of calls to MultilinearMultifilter with degrees lower than dd. By a simple induction, they are correct if they return “NO” or a list (Ti,αi)(T_{i},\alpha_{i}).

Now we consider the case when the algorithm outputs “YES”. We will show correctness by induction on the degree. For d=1d=1, since we satisfy the conditions of BasicMultifilter, we have that

We assume inductively that for d=d1d^{\prime}=d-1 and all order dd^{\prime} tensors AA^{\prime} with A2=1\|A^{\prime}\|_{2}=1 the following holds: If we apply the algorithm to AA^{\prime} with failure probability τ\tau^{\prime} and it returns “YES”, and if TT is α\alpha-good, then

with probability at least 1τ1-\tau^{\prime}, where ff is some function to be decided later.

Since TT is α\alpha-good, it contains at least an 1α/6+exp(3n)53/1001-\alpha/{6}+\exp(-{3}n)\geq 53/100 fraction of SS. Thus, STS\cap T contains at least S/200|S|/200 samples satisfying the condition in the claim statement. Again, since TT is α\alpha good, STαT|S\cap T|\geq\alpha|T|, and so at least an α/200\alpha/200-fraction of TT satisfies this condition. The probability that UU contains such an xx is at least

By our inductive assumption and a union bound, except with probability 1τ/2d21-\tau/2d^{2}, there is such an xUx\in U satisfying the above claim, if the recursive call on that xx satisfies

Our base case was f(1,α)=O(1+log(1/α)log(2+log(1/α)))f(1,\alpha)=O(\sqrt{1+\log(1/\alpha)}{\log(2+\log(1/\alpha))}), and we have shown that

We can thus take f(d,α)=O(1+log(1/α)log(2+log(1/α)))df(d,\alpha)=O(\sqrt{1+\log(1/\alpha)}{\log(2+\log(1/\alpha))})^{d}.

8 Multifilter Routine for Harmonic Polynomials

In this subsection, we build on the results of the previous subsections to handle all degree-dd Harmonic polynomials. Specifically, we prove:

There exists an algorithm, given a degree-dd harmonic polynomial hA(x)h_{A}(x) for a symmetric order-dd tensor AA with A2=1\|A\|_{2}=1, (T,α)(T,\alpha), and a confidence probability τ\tau, it outputs “YES”, “NO”, or a list of (Ti,αi)(T_{i},\alpha_{i}), and with probability at least 1τ1-\tau satisfies:

If it returns “YES”, the following holds: If TT is α\alpha-good, then

If it returns “NO”, then TT is not α\alpha-good.

If it returns a list of (Ti,αi)(T_{i},\alpha_{i}), then (Ti,αi)(T_{i},\alpha_{i}) satisfy the multifilter condition for (T,α)(T,\alpha).

When the algorithm outputs a list (Ti,αi)(T_{i},\alpha_{i}), we always have that i1/αi21/α2\sum_{i}1/\alpha_{i}^{2}\leq 1/\alpha^{2}. The algorithm runs in time O((ln(1/τ)+d)/α)dpoly(T,nd)O((\ln(1/\tau)+d)/\alpha)^{d}\cdot\operatorname{poly}(|T|,n^{d}).

Algorithm HarmonicMultifilter Input: a degree-dd harmonic polynomial hA(x)h_{A}(x) for a symmetric tensor AA with A2=1\|A\|_{2}=1, T,α,τT,\alpha,\tau. 1. For d=0d^{\prime}=0 to dd: (a) Let B(d)B^{(d^{\prime})} by the order-2d2d^{\prime} tensor with Bi1,,id,j1,,jd(d)=kd+1,,kdAi1,,id,kd+1,kdAj1,,jd,kd+1,,kd  .B^{(d^{\prime})}_{i_{1},\dots,i_{d^{\prime}},j_{1},\dots,j_{d^{\prime}}}=\sum_{k_{d^{\prime}+1},\dots,k_{d}}A_{i_{1},\dots,i_{d^{\prime}},k_{d^{\prime}+1},\dots k_{d}}A_{j_{1},\dots,j_{d^{\prime}},k_{d^{\prime}+1},\dots,k_{d}}\;. (b) We can consider B(d)B^{(d^{\prime})} as a ndndn^{d^{\prime}}\otimes n^{d^{\prime}} symmetric matrix by grouping each of the i1,,idi_{1},\dots,i_{d^{\prime}} and j1,,jdj_{1},\dots,j_{d^{\prime}} coordinates together. By computing an eigendecomposition of this matrix, obtain a decomposition B(d)=iλiViVi  ,B^{(d^{\prime})}=\sum_{i}\lambda_{i}V_{i}\otimes V_{i}\;, where the ViV_{i} are a symmetric order dd^{\prime} tensors with Vi2=1\|V_{i}\|_{2}=1 and λi0\lambda_{i}\neq 0. (c) For each ViV_{i}, run MultilinearMultifilter(T,Vi,d,α,τ/(dnd))(T,V_{i},d^{\prime},\alpha,\tau/(dn^{d})). If any return “NO” or a list of (Ti,αi)(T_{i},\alpha_{i}), then return the same output. 2. Run BasicMultifilter on hA(xμT)/βh_{A}(x-\mu_{T})/\beta, where β=(C(1+log(1/α))log(2+log(1/α))2)d/2\beta=(C(1+\log(1/\alpha))\log(2+\log(1/\alpha))^{2})^{d/2} for a sufficiently large constant CC. If it returns “NO” or a list of (Ti,αi)(T_{i},\alpha_{i}), then return the same output. 3. Return “YES”.

We first note that, as claimed, ViV_{i} is symmetric:

The eigenvectors of the matrix corresponding to B(d)B^{(d^{\prime})} with non-zero eigenvalues correspond to symmetric tensors.

If VV is such a tensor corresponding to eigenvalue λ\lambda, then

However, since AA is symmetric under permutations of its first dd^{\prime} coordinates, B(d)B^{(d^{\prime})} is also symmetric in its first dd^{\prime} coordinates, and so λV\lambda V is symmetric. When λ0\lambda\neq 0, VV is also symmetric. ∎

Since V2=1\|V\|_{2}=1, the preconditions of MultilinearMultifilter are satisfied, and if it returns something other than “YES”, the algorithm returns correctly.

There are at most dnd{dn^{d}} calls to MultilinearMultifilter in the loop and one outside. By a union bound, all calls to MultilinearMultifilter output “YES” correctly with probability at least 1τ1-\tau. We assume this holds.

We can further use the decomposition of B(d)B^{(d^{\prime})} to obtain that when all calls to MultilinearMultifilter correctly return “YES”:

where G(1),,G(d)G_{(1)},\dots,G_{(d^{\prime})} are independent copies of GG.

Now we need to bound (iλi)(\sum_{i}|\lambda_{i}|). Let BB^{\prime} be the matrix that, as in the algorithm, is obtained from B(d)B^{(d^{\prime})} by treating the first and last dd^{\prime} coordinates of B(d)B^{(d^{\prime})} as one coordinate. Let AA^{\prime} be the matrix obtained from AA by similarly treating the first dd^{\prime} and last ddd-d^{\prime} coordinates as a single coordinate. Then, B=ATAB^{\prime}=A^{\prime T}A^{\prime}. Thus, BB^{\prime} is positive semidefinite and λi0\lambda_{i}\geq 0. Thus, we have

Since d=0d1/(dd)!e\sum_{d^{\prime}=0}^{d}1/(d^{\prime}-d)!\leq e and (dd)2d{d\choose d^{\prime}}\leq 2^{d}, we have

For CC sufficiently large, if TT is α\alpha-good, then Var[hA(GμT)]β2\operatorname*{Var}[h_{A}(G-\mu_{T})]\leq\beta^{{2}} with probability at least 1τ1-\tau. In this case, hA(xμT)/βh_{A}(x-\mu_{T})/\beta satisfies the pre-conditions of BasicMultifilter, and so if it returns “NO” or a list, the algorithm is correct. If it returns “YES”, then Var[hA(TμT)/β]=O(d+log(1/α))dlog(2+log(1/α))2\operatorname*{Var}[h_{A}(T-\mu_{T})/\beta]=O(d+\log(1/\alpha))^{d}\cdot\log(2+\log(1/\alpha))^{2}, and so

Additionally, if TT is α\alpha-good, then

9 Putting Everything Together: Proof of Proposition 3.6

The algorithm establishing the proposition is presented in pseudocode below:

For Ai=vc(i)/(dc1(i),,cn(i))A_{i}=v^{\ast}_{c(i)}/\sqrt{{d\choose c_{1}(i),\dots,c_{n}(i)}}, we have hA(x)=vTPn,d(x)h_{A}(x)=v^{\ast T}\mathbf{P}_{n,d}(x) and A2=v2=1\|A\|_{2}=\|v^{\ast}\|_{2}=1.

Now we have hA(x)=vTPn,d(x)h_{A}(x)=v^{\ast T}\mathbf{P}_{n,d}(x).

Since A2=1\|A\|_{2}=1, we satisfy the preconditions of HarmonicMultifilter. By construction, the routine HarmonicMultifilter returns correctly with probability at least 1τ1-\tau. We assume that this holds. In this case, if it returns “NO” or a list of (Ti,αi)(T_{i},\alpha_{i}), then they are the correct output.

The following lemma establishes that AA has the largest eigenvalue:

We have hB(x)=vTp(x)h_{B}(x)=v^{T}p(x), where va=i:c(i)=aBiv_{a}=\sum_{i:c(i)=a}B_{i}. Thus, we obtain

Now we consider the distribution of this polynomial under GG:

Using the Taylor series expansion (7), we obtain that

Note the k=0k=0 term is (vT(μμT))2d(v^{T}(\mu-\mu_{T}))^{2d} and that the ratio of the k+1k+1 term to the kk term, for k0k\geq 0, is

Thus, by Cantelli’s inequality we get that

Since SS is representative, we have that

Now if TT is α\alpha-good, then STS{\cap T} contains at least a 1α/6exp(3n)76/1001-\alpha/{6}-\exp(-{3}n)\geq{76}/100 fraction of the samples in SS. Thus, we have that

However, by Markov’s inequality applied to (9), we get

Note that (vT(μμT))d2max{d,vT(μμT)}d1(v^{T}(\mu-\mu_{T}))^{d}\leq\sqrt{2}\max\{d,v^{T}(\mu-\mu_{T})\}^{d-1} only when vT(μμT)2dv^{T}(\mu-\mu_{T})\leq 2d, and so we have that:

Since this holds for all unit vectors vv, we have that when the algorithm returns μT\mu_{T} and TT is α\alpha-good, then

This completes the proof or Proposition 3.6. ∎

Learning Spherical Gaussian Mixture Models

In Section 4.1, we present a simpler learning algorithm that works when the components have the same covariance matrix. The general case of unknown (potentially different) covariances is more complex and is handled in Section 4.2. Section 4.3 contains our dimension-reduction procedures. In Section 4.4, we put everything together to obtain our final learning guarantees, including Theorem 1.4.

We start by handling the important special case of this problem where each Gaussian component has identity covariance matrix. Note that our learning algorithm is robust to a small constant fraction of corrupted samples:

There is an algorithm that given a positive integer dd, constants 1/2>α>4ε01/2>\alpha>4\varepsilon\geq 0, 0<τ<10<\tau<1, and sample access to a probability distribution X=(1ε)M+εYX=(1-\varepsilon)M+\varepsilon Y, where M=wiN(μi,I)M=\sum w_{i}N(\mu_{i},I) is a mixture of identity covariance Gaussians with wiαw_{i}\geq\alpha for all ii, and so that μiμj2\|\mu_{i}-\mu_{j}\|_{2} is at least

The algorithm itself is very simple. We run our list-decoding algorithm to get a list of hypothesis means. We then associate each sample with the closest element of our list. We can then cluster points based on which means they are associated to and use this to learn the correct components. The algorithm is as follows:

Note that for each ii, XX is simultaneously a mixture of N(μi,I)N(\mu_{i},I) with weight α\alpha and some other distribution with weight (1α)(1-\alpha). Therefore, for each ii with probability at least 1τ/(10k)1-\tau/(10k), there is some hjHh_{j}\in H with

for CC is sufficiently large. By a union bound, with probability at least 1τ/101-\tau/10, this occurs for every ii. We assume this holds throughout the remainder of our analysis.

Let SiS_{i} be the set of elements of TT^{\prime} drawn from the component N(μi,I)N(\mu_{i},I).

With probability 1exp(Ω(S2))1-\exp(-\Omega(S^{2})) over the samples from TT^{\prime}, all but an exp(Ω(S2))\exp(-\Omega(S^{2}))-fraction of the elements of SiS_{i} are associated with elements hHh\in H with hμi2S/20.\|h-\mu_{i}\|_{2}\leq S/20.

The basic idea of the proof is the following: For any given hHh\in H that is far from μi\mu_{i}, there will be some hHh^{\prime}\in H that is much closer. A given sample point xx will only be closer to hh than hh^{\prime} if its projection to the line between them is more than half way there. However, this projection is distributed as a Gaussian, and therefore the probability that it is much larger than its mean is small.

It suffices to show that for each hHh\in H with hμi2>S/20\|h-\mu_{i}\|_{2}>S/20 the following holds: less than a exp(Ω(S2))\exp(-\Omega(S^{2}))-fraction of the elements of SiS_{i} are associated with hh.

Firstly, assuming that the first step was successful, we know that there is an hHh^{\prime}\in H with hμi2<S/100\|h^{\prime}-\mu_{i}\|_{2}<S/100.

Thus, by Markov’s inequality, the probability that more than a exp(Ω(S2))\exp(-\Omega(S^{2}))-fraction of the elements of SiS_{i} are associated to hh (with suitably small constant in the Ω()\Omega(\cdot)), is exp(Ω(S2))\exp(-\Omega(S^{2})). Taking a union bound over hh, does not change this asymptotic. ∎

Taking a union bound over ii, we can assume that, with probability at least 1τ/101-\tau/10, we have that all but an exp(Ω(S2))\exp(-\Omega(S^{2})) fraction of the points of SiS_{i} are associated with some hjh_{j} where hjμi2S/20\|h_{j}-\mu_{i}\|_{2}\leq S/20. In particular, this implies that every element of HH within distance S/20S/20 of some μi\mu_{i} is in HH^{\prime}. Indeed, this holds for the following reason: With high probability, Si(3α/4)T|S_{i}|\geq(3\alpha/4)\cdot|T^{\prime}| and at least 8/98/9 fraction of elements in SiS_{i} are associated with hjh_{j}’s that are within distance S/20S/20 of μi\mu_{i}. By the triangle inequality these hjh_{j}’s are within distance S/3S/3 of hh. Conversely, any element of HH not within distance S/20S/20 of some μi\mu_{i} has associated with it at most an ε/10\varepsilon/10-fraction of the elements of the union of the SiS_{i}’s. This implies that with high probability less than 1.2ε<2α/31.2\varepsilon<2\alpha/3 fraction of points in TT are associated to any point of HH not within distance S/20S/20 of some μi\mu_{i}. Therefore, all points of HH^{\prime} are within distance 3S/203S/20 of some μi\mu_{i}, which implies that the relation on HH^{\prime} is an equivalence relation. Specifically, each equivalence class consists of the points in HH^{\prime} within distance 3S/203S/20 of some particular mean μi\mu_{i}. Note in particular that this implies that there is exactly one equivalence class CC for each μi\mu_{i}.

2 Learning Spherical GMMs: The General Case

We now generalize the algorithm from the previous subsection to handle arbitrary mixtures of spherical Gaussians. When it is not the case that all of the covariance matrices are the same, things are substantially more complicated. We can recover a list HH of candidate means only after first guessing the radius of the component that we are looking for. We can produce a large list of guesses and thereby obtain a list of hypotheses of size poly(n/α)\operatorname{poly}(n/\alpha). However, clustering becomes somewhat more difficult, as we do not know the radius at which to cluster. In particular, Steps 6 and 7 become difficult not knowing at what distance to stop considering two hypotheses part of the same cluster. This difficulty can be dealt with by doing a secondary test to determine whether or not the cluster that we have found contains many points at approximately the correct distance from each other.

There is an algorithm that, given a positive integer dd, constants 1/2>α>3ε01/2>\alpha>3\varepsilon\geq 0, and sample access to a probability distribution X=(1ε)M+εYX=(1-\varepsilon)M+\varepsilon Y, in dimension nn larger than a sufficiently large multiple of log(τ/α)\log(\tau/\alpha), where M=i=1kwiN(μi,σi2I)M=\sum_{i=1}^{k}w_{i}N(\mu_{i},\sigma_{i}^{2}I) is a mixture of spherical Gaussians with wiαw_{i}\geq\alpha for all ii, and so that μiμj2/(σi+σj)\|\mu_{i}-\mu_{j}\|_{2}/(\sigma_{i}+\sigma_{j}) is at least

Throughout this proof “with high probability” will always mean with probability at least 1τ(α/n)c1-\tau(\alpha/n)^{c} for a sufficiently large constant cc.

The algorithm begins by producing a list LL of hypothesis standard deviations. We note that if xx and yy are both drawn from N(μi,σi2I)N(\mu_{i},\sigma^{2}_{i}I) then with high probability the distance between xx and yy is Θ(σin)\Theta(\sigma_{i}\sqrt{n}). Therefore, since with high probability TT contains at least (2α/3)2(2\alpha/3)^{2} such pairs, with high probability the largest power of (1+1/(10n))(1+1/(10n)) less than σi2\sigma_{i}^{2} is in LL for all ii. We also note that LL cannot be too large. In particular, of sLs\in L than an Ω(α2)\Omega(\alpha^{2})-fraction of the pairs in TT have distance Θ(sn)\Theta(s\sqrt{n}). This cannot happen for more than O(α2)O(\alpha^{-2}) values of ss that differ pairwise by sufficiently large constant multiples. From this it is easy to see that L=O(nα2)|L|=O(n\alpha^{-2}). This implies that H=O(nα5)|H|=O(n\alpha^{-5}).

Additionally, note that N(μi,σi2I)N(\mu_{i},\sigma^{2}_{i}I) is a mixture of N(μi,si2I)N(\mu_{i},s^{2}_{i}I) with some other distribution. In particular:

Given σ,σ>0\sigma,\sigma^{\prime}>0 with σ2/(1+1/10n)σ2σ2\sigma^{2}/(1+1/10n)\leq\sigma^{\prime 2}\leq\sigma^{2}, N(μ,σ2I)N(\mu,\sigma^{2}I) can be considered as a mixture 0.9N(μ,σ2I)+0.1E0.9N(\mu,\sigma^{\prime 2}I)+0.1E, for some distribution EE.

This means that XX is a mixture of wi(9/10)N(μi,si2I)w_{i}(9/10)N(\mu_{i},s_{i}^{2}I) with some other distribution, when sis_{i} is the largest power of (1+1/(10n))(1+1/(10n)) less than σi\sigma_{i}. Therefore, with high probability over TT, our list HH includes a hypothesis hh (associated with sis_{i}) so that hμi2siS/10000\|h-\mu_{i}\|_{2}\leq s_{i}S/10000. We assume that this is true in the remainder.

Let UiU_{i} be the set of samples of UU drawn from N(μi,σi2I)N(\mu_{i},\sigma_{i}^{2}I), and SiS_{i} the set of these samples from TT^{\prime}. Note that with high probability Ui(9α/10)U|U_{i}|\geq(9\alpha/10)|U| for all ii, and Si=(wi+O(ε))T|S_{i}|=(w_{i}+O(\varepsilon))|T^{\prime}| for all ii.

We would like to claim that, for xx drawn from UiU_{i}, with high probability that σi/2r(x)2σi\sigma_{i}/2\leq r(x)\leq 2\sigma_{i}. For the upper bound, note that any pair of elements from N(μi,σi2I)N(\mu_{i},\sigma_{i}^{2}I) have distance at most 2σin2\sigma_{i}\sqrt{n} with high probability. Thus, as long as this holds for at least 90%90\% of the other elements of UiU_{i}, we will have r(x)2σir(x)\leq 2\sigma_{i}. For the lower bound, consider fixing the elements of UU other than xx, and then drawing xx uniformly at random from N(μi,σi2I)N(\mu_{i},\sigma^{2}_{i}I). We note that for any other yUy\in U, except with exp(Ω(n))\exp(-\Omega(n))-probability, it holds xy2σin/2\|x-y\|_{2}\geq\sigma_{i}\sqrt{n}/2. Since nlog(1/(τα))n\gg\log(1/(\tau\alpha)), with high probability, this cannot happen for more than an α/2\alpha/2-fraction of the elements of UU. Thus, we may assume that for all xUix\in U_{i} we have σi/2r(x)2σi\sigma_{i}/2\leq r(x)\leq 2\sigma_{i}.

We next want to show that each component Gaussian has a corresponding cluster of hypotheses in HH^{\prime}. To do this, we want to show that the closest element of HH to any element of UiU_{i} is not too far from μi\mu_{i}. We show the following lemma:

With high probability, all of the elements xx of UiU_{i} are associated with elements hHh\in H with hμi2σiS/2000.\|h-\mu_{i}\|_{2}\leq\sigma_{i}S/2000.

The proof is essentially identical to that of Lemma 4.2.

It suffices to show that an element xx drawn from N(μi,σi2)N(\mu_{i},\sigma_{i}^{2}) is not associated with a specific hh with hμi2>σiS/200\|h-\mu_{i}\|_{2}>\sigma_{i}S/200 with high probability. Taking a union bound over hh and a Chernoff bound over UiU_{i} will complete the proof. We note that there is always an hh^{\prime} at distance at most σiS/10000\sigma_{i}S/10000 from μi\mu_{i}. We note that xx is closer to hh than hh^{\prime} only if its inner product with vv, the unit vector in the hhh-h^{\prime} direction is more than v(h+h)/2v\cdot(h+h^{\prime})/2. However, this is greater than the mean of vxv\cdot x by at least S/10000S/10000, so it happens with probability at most exp(Ω(S2))\exp(-\Omega(S^{2})). This completes the proof. ∎

We make two claims about HH^{\prime}. First, that all elements hiHh_{i}\in H^{\prime} are within distance σjS/1000\sigma_{j}S/1000 of μj\mu_{j} for some jj, and that 6σjsiσj/66\sigma_{j}\geq s_{i}\geq\sigma_{j}/6. Second, for each jj, that there is such an hiHh_{i}\in H^{\prime}. This will tell us that the elements of HH^{\prime} correspond to clusters around each true mean.

On the one hand, with high probability, for each jj, at least 9α/109\alpha/10-fraction of elements in UjU_{j} satisfy 2σjr(x)σj/22\sigma_{j}\geq r(x)\geq\sigma_{j}/2, and at least 4α/54\alpha/5 fraction of them are associated to elements of HH within distance σjS/2000\sigma_{j}S/2000 of μj\mu_{j}. Furthermore, HH contains an hih_{i} with σj(1+1/(10n))siσj\sigma_{j}(1+1/(10n))\leq s_{i}\leq\sigma_{j} and with hih_{i} within distance siS/1000s_{i}S/1000 of μj\mu_{j}. If this is all the case, then a 4α/54\alpha/5-fraction of the elements of UU are associated to elements of HH within distance siS/1000s_{i}S/1000 of hih_{i}, and thus hiHh_{i}\in H^{\prime}.

In the other direction, we note that if hHh\in H^{\prime} then at least a (4α/52ε)(4\alpha/5-2\varepsilon)-fraction of the points of xUx\in U are from some UiU_{i}, satisfy si/3r(x)3sis_{i}/3\leq r(x)\leq 3s_{i}, and are associated with a point of HH within distance siS/1000s_{i}S/1000 of hh. Therefore, for some jj, at least an α/5\alpha/5-fraction of the elements of UjU_{j} have this property. We claim that with high probability this cannot happen unless hh is within distance σjS/100\sigma_{j}S/100 of μj\mu_{j}. For one, if si/3r(x)3sis_{i}/3\leq r(x)\leq 3s_{i}, it must be the case that si/6σj6sis_{i}/6\leq\sigma_{j}\leq 6s_{i}. By Lemma 4.5, we must have that xx is associated to an element within distance σjS/2000\sigma_{j}S/2000 of μj\mu_{j}. This is within distance siS/10006σjS/1000s_{i}S/1000\leq 6\sigma_{j}S/1000 of hih_{i}, and therefore hiμj27σjS/1000σjS/100\|h_{i}-\mu_{j}\|_{2}\leq 7\sigma_{j}S/1000\leq\sigma_{j}S/100.

This implies that the elements of HH^{\prime} cluster into equivalence classes as desired. Two distinct points hih_{i} and hih_{i^{\prime}} close to the same σj\sigma_{j} will have hihi2σjS/1003(si+si)/100<(si+si)/10\|h_{i}-h_{i^{\prime}}\|_{2}\leq\sigma_{j}S/100\leq 3(s_{i}+s_{i^{\prime}})/100<(s_{i}+s_{i^{\prime}})/10, and thus are equivalent. On the other hand, if hih_{i} is close to μj\mu_{j} and hih_{i^{\prime}} is close to μj\mu_{j^{\prime}}, then

We claim that with high probability all but an ε\varepsilon-fraction of the elements of SiS_{i} are in TCT_{C}, where CC is the equivalence class containing elements of HH^{\prime} close to μi\mu_{i}. This is for the same reason as Lemmas 4.5 and 4.2. TCT_{C} contains elements of HH^{\prime} that are at most σiS/100\sigma_{i}S/100 far from μi\mu_{i}. Other TCT_{C^{\prime}} only contain elements of HH^{\prime} that are σjS/100\sigma_{j}S/100-far from other μj\mu_{j}, and therefore, at least 99(σi+σj)S/100>σiS/299(\sigma_{i}+\sigma_{j})S/100>\sigma_{i}S/2 far from μi\mu_{i}. A given sample from N(μi,σi2I)N(\mu_{i},\sigma_{i}^{2}I) is closer to the former except with probability exp(Ω(S2))\exp(-\Omega(S^{2})), and thus with high probability all but an ε\varepsilon-fraction of the elements of SiS_{i} are associated with TCT_{C}.

3 Dimension Reduction

In this section, we describe our dimension reduction scheme for the case of spherical mixtures. When the components have the same covariance, dimension reduction is quite simple and allows us to assume without loss of generality that the ambient dimension is k1k-1. The effect of dimension reduction for this case is that the runtime of the learning algorithm becomes somewhat better as a function of nn.

When the components have arbitrary spherical covariances, we require a more complicated procedure that allows us to reduce the dimension down to poly(k/ε)\operatorname{poly}(k/\varepsilon). In addition to improving the dependence on nn in the runtime, this has the effect of removing the Ω(log(n))\Omega(\sqrt{\log(n)}) dependence in the separation condition of Proposition 4.3.

For the case of identity covariance components, we will require the following generalization of Theorem 4.2 of [RV17] or Corollary 3 of [VW04]:

Given ε>0\varepsilon>0, suppose we take Ω(nlog(k/τ)/(ε4wmin4))\Omega(n\log(k/\tau)/(\varepsilon^{4}w_{\min}^{4})) independent samples from X=i=1kwiN(μi,σi2I)X=\sum_{i=1}^{k}w_{i}N(\mu_{i},\sigma_{i}^{2}I), where wiwminw_{i}\geq w_{\min}, and let WW be the affine subspace of dimension k1k-1 containing the empirical mean μ~{\widetilde{\mu}} and spanned by the top k1k-1 eigenvectors of the empirical covariance Σ~{\widetilde{\Sigma}}. Then, with probability at least 1τ1-\tau, for all ii, μi\mu^{\prime}_{i}, the orthogonal projection of μi\mu_{i} onto WW, satisfies μiμi2εσ\|\mu^{\prime}_{i}-\mu_{i}\|_{2}\leq\varepsilon\sigma, where σ2=iwiσi2\sigma^{2}=\sum_{i}w_{i}\sigma_{i}^{2}.

Note that unlike Corollary 3 of [VW04], we only need WW to be k1k-1 dimensional and unlike Theorem 4.2 of [RV17], we do not need the means μi\mu_{i} to be bounded.

We first use standard facts about the empirical mean and covariance matrix of a single Gaussian:

If we take Ω(nlog(1/τ)/ε2)\Omega(n\log(1/\tau)/\varepsilon^{2}) independent samples from a Gaussian N(μ,Σ)N(\mu,\Sigma), then, except with probability τ\tau, we have that the empirical covariance Σ~{\widetilde{\Sigma}} and empirical mean μ~{\widetilde{\mu}} satisfy (1ε)ΣΣ~(1+ε)Σ(1-\varepsilon)\Sigma\preceq{\widetilde{\Sigma}}\preceq(1+\varepsilon)\Sigma and (μ~μ)TΣ(μ~μ)ε2({\widetilde{\mu}}-\mu)^{T}\Sigma({\widetilde{\mu}}-\mu)\leq\varepsilon^{2}.

Next note that we can write the empirical covariance as

Since μ~{\widetilde{\mu}} is a convex combination μ~=j=1kw~jμ~j{\widetilde{\mu}}=\sum_{j=1}^{k}{\widetilde{w}}_{j}{\widetilde{\mu}}_{j} of the μ~j{\widetilde{\mu}}_{j}, the vectors μ~iμ~{\widetilde{\mu}}_{i}-{\widetilde{\mu}} span a (k1)(k-1)-dimensional subspace. For any unit vector vv in the (nk+1)(n-k+1)-dimensional subspace orthogonal to this subspace, we have

Thus, the bottom nk+1n-k+1 eigenvalues of Σ~{\widetilde{\Sigma}} are at most (1+3δ)σ2(1+3\delta)\sigma^{2}.

Now consider μ~i{\widetilde{\mu}}^{\prime}_{i}, the orthogonal projection of μ~i{\widetilde{\mu}}_{i} onto WW. Let v=(1/μ~iμ~i2)(μ~iμ~i)v=(1/\|{\widetilde{\mu}}_{i}-{\widetilde{\mu}}^{\prime}_{i}\|_{2})({\widetilde{\mu}}_{i}-{\widetilde{\mu}}^{\prime}_{i}). Since vv is orthogonal to the top-kk eigenvectors of Σ~{\widetilde{\Sigma}}, it follows that vTΣ~v(1+3δ)σ2v^{T}{\widetilde{\Sigma}}v\leq(1+3\delta)\sigma^{2}. Since vv is orthogonal to WW which contains μ~i{\widetilde{\mu}}^{\prime}_{i} and μ~{\widetilde{\mu}}, we have vT(μ~iμ~)=0v^{T}({\widetilde{\mu}}^{\prime}_{i}-{\widetilde{\mu}})=0. Thus, we have

Re-arranging, we have μ~iμ~i25δσ2/((1δ)wi)\|{\widetilde{\mu}}_{i}-{\widetilde{\mu}}^{\prime}_{i}\|_{2}\leq\sqrt{5\delta\sigma^{2}/((1-\delta)w_{i})}. Setting δ=ε2wmin/12\delta=\varepsilon^{2}w_{\min}/12 gives μ~iμ~i2εσ/2\|{\widetilde{\mu}}_{i}-{\widetilde{\mu}}^{\prime}_{i}\|_{2}\leq\varepsilon\sigma/2.

Noting that projecting onto an orthogonal space reduces Euclidean distance, we have μ~iμi2μ~iμi2\|{\widetilde{\mu}}^{\prime}_{i}-\mu^{\prime}_{i}\|_{2}\leq\|{\widetilde{\mu}}_{i}-\mu_{i}\|_{2}. The triangle inequality gives μiμi2μ~iμi2+μ~iμ~i2+μ~iμi2εσ/2+2δσiεσ/2+εwiσi/2εσ\|\mu_{i}-\mu^{\prime}_{i}\|_{2}\leq\|{\widetilde{\mu}}_{i}-\mu_{i}\|_{2}+\|{\widetilde{\mu}}_{i}-{\widetilde{\mu}}^{\prime}_{i}\|_{2}+\|{\widetilde{\mu}}^{\prime}_{i}-\mu^{\prime}_{i}\|_{2}\leq\varepsilon\sigma/2+2\delta\sigma_{i}\leq\varepsilon\sigma/2+\varepsilon\sqrt{w_{i}}\sigma_{i}/2\leq\varepsilon\sigma. This completes the proof. ∎

Remark. Note that the above proposition can be combined with our algorithm from Section 4.2 by first finding WW and then learning the projection of XX onto WW. Since we are now working in only poly(k/ε)\operatorname{poly}(k/\varepsilon) dimensions, the latter does not require a log(n)\log(n) dependence on SS.

We start with the following observation: If we knew that all of the σi\sigma_{i}’s were within a constant multiple of some known σ\sigma, we could simply scale XX down by a factor of σ\sigma and then project onto the top kk eigenvectors of the empirical covariance. This approach is used in [VW04]. The difficulty comes when the σi\sigma_{i}’s are not close to each other. Doing this in this case would only give error O(εjwjσj2)O(\varepsilon\sqrt{\sum_{j}w_{j}\sigma_{j}^{2}}), which will be larger than O(εσi)O(\varepsilon\sigma_{i}) for some ii. To deal with this issue, we notice that similar to the proof of Proposition 4.3, we can approximate the σ\sigma associated to a given sample by measuring how close it is to other samples. This will allow us to break our samples into several subsets each of which is a mixture of Gaussians with similar covariances.

We note that we can assume that npoly(N)n\gg\operatorname{poly}(N), for a sufficiently large polynomial or the algorithm trivially terminates in Step 1. We assume this throughout the rest of this proof.

In order to analyze the algorithm, we need to understand the distribution of the r(x)r(x). To begin with, we note that:

With high probability over our samples, for every xyx\neq y from UU, with xx drawn from N(μi,σi2I)N(\mu_{i},\sigma_{i}^{2}I) and yy drawn from N(μj,σj2I)N(\mu_{j},\sigma_{j}^{2}I), we have that xy22=(μiμj22+(σi2+σj2)n)(1+o(n1/3)).\|x-y\|_{2}^{2}=\left(\|\mu_{i}-\mu_{j}\|_{2}^{2}+(\sigma_{i}^{2}+\sigma_{j}^{2})n\right)(1+o(n^{-1/3})).

We note that, for any given choice of ii and jj, a random pair of xx and yy satisfy this except with exp(Ω(n))\exp(-\Omega(n)) probability. The lemma follows from a union bound over xx and yy. ∎

With high probability, for all xUx\in U drawn from N(μi,σi2I)N(\mu_{i},\sigma_{i}^{2}I), we have that

Assuming the conclusion of Lemma 4.9 holds, then the r(x)r(x) is automatically at least this big and is at most this large assume that at least one (other) sample was drawn from N(μj,σj2I)N(\mu_{j},\sigma_{j}^{2}I) for the minimizing jj. This of course happens with high probability. ∎

With high probability, for all xx drawn from N(μi,σi2I)N(\mu_{i},\sigma_{i}^{2}I) we have that σi(1o(n1/3))r(x)2σi(1+o(n1/3))\sigma_{i}(1-o(n^{-1/3}))\leq r(x)\leq\sqrt{2}\sigma_{i}(1+o(n^{-1/3})).

The lower bound is immediate. The upper bound follows from taking j=ij=i. ∎

The above Corollary also has several other consequences. All of the xx’s coming from the same component will have r(x)r(x) close to minj(μiμj22/n+(σi2+σj2))\min_{j}(\sqrt{\|\mu_{i}-\mu_{j}\|_{2}^{2}/n+(\sigma_{i}^{2}+\sigma_{j}^{2})}) and thus all lie in the same CjC_{j}. This implies that there are at most kk many classes CjC_{j}. Furthermore, since the xx’s coming from a single Gaussian component all have r(x)r(x) within a 1+o(n1/3)1+o(n^{-1/3}) multiple of each other, it means that all of the r(x)r(x), for xCjx\in C_{j}, are within a (1+n1/3)O(k)(1+n^{-1/3})^{O(k)} multiple of each other. Therefore, for each jj, all of the xCjx\in C_{j} have r(x)r(x) within a constant multiple of sjs_{j}. Thus, all of these xx’s come from Gaussians with σi\sigma_{i} within a constant multiple of sjs_{j}. Let SjS_{j} be the set of ii such that all samples from N(μi,σi)N(\mu_{i},\sigma_{i}) are in CjC_{j}. By Lemma 4.6, the orthogonal projection μi\mu^{\prime}_{i} of μi\mu_{i} for iSji\in S_{j} onto WjW_{j} satisfies μiμi2εσ\|\mu^{\prime}_{i}-\mu_{i}\|_{2}\leq\varepsilon\sigma, where σ2=(iSjwiσi2)/iSjwi\sigma^{2}=\left(\sum_{i\in S_{j}}w_{i}\sigma_{i}^{2}\right)/\sum_{i\in S_{j}}w_{i}. Since σ=Θ(sj)=Θ(σi)\sigma=\Theta(s_{j})=\Theta(\sigma_{i}) for each iSji\in S_{j}, we have that μiμi2O(εσi)\|\mu^{\prime}_{i}-\mu_{i}\|_{2}\leq O(\varepsilon\sigma_{i}). Therefore, since WW contains WjW_{j}, μi\mu_{i} is within O(εσi)O(\varepsilon\sigma_{i}) of its projection onto WW for all ii.

Finally, since each WjW_{j} has dimension at most k1k-1 and since WW is the sum of at most kk of them, we have that dim(W)k2\dim(W)\leq k^{2}. This completes the proof. ∎

4 Putting Everything Together

By combining Proposition 4.1 and Lemma 4.6, we immediately obtain the following corollary:

There is an algorithm that given a positive integer dd, constants 1/2>α>4ε01/2>\alpha>4\varepsilon\geq 0, 0<τ<10<\tau<1, and sample access to a probability distribution M=wiN(μi,I)M=\sum w_{i}N(\mu_{i},I) with wiαw_{i}\geq\alpha for all ii, and so that μiμj2\|\mu_{i}-\mu_{j}\|_{2} is at least

To prove this theorem, we will combine Proposition 4.8 with Proposition 4.3 and a few additional ingredients. In particular, running Proposition 4.8, we find a subspace WW, as required. We note that the projection of XX onto WW is still a mixture of Gaussians with appropriate separations between the means to run Proposition 4.3. We note that because the dimension is now only poly(k/δ)\operatorname{poly}(k/\delta), the log(n)\log(n) term in SS becomes a log(k/δ)\log(k/\delta), and the dependence on nn in the sample complexity disappears. We can then learn wiw_{i} to error O(δ)O(\delta) and the projection of μi\mu_{i} to WW to error σiO(δ/wi)\sigma_{i}O(\delta/w_{i}), which is within σiO(δ/wi)\sigma_{i}O(\delta/w_{i}) of the true value of μi\mu_{i}.

Minimax Error Bounds and SQ Lower Bounds

The structure of this section is as follows: First, we describe a generic (inefficient) algorithm that only requires concentration bounds and applies to all aforementioned families. As a corollary, we obtain our information-theoretic upper bounds. We then give matching information-theoretic lower bounds for the three aforementioned distribution families.

Let TT be an α\alpha-corrupted set of Ω(n/α3)\Omega(n/\alpha^{3}) samples from some DDD\in\mathcal{D} with unknown mean μ\mu. By definition, there exists a set STS\subseteq T of independent samples from DD with S=αT|S|=\alpha\cdot|T|. Furthermore, these samples should be representative of DD in the sense that for any unit vector vv:

The above statement holds with high probability by the VC-inequality and the assumption in the statement of the proposition. We henceforth condition on this event.

By the above conditioning, we have that μH\mu\in H. The following important claim motivates the algorithm:

There exists a covering of HH by O(1/α)O(1/\alpha) balls of radius 2u2u.

Given the claim, the algorithm is very simple: Our algorithm will return as its list the set of centers of the balls in such a covering. We will therefore have that μH\mu\in H is within distance 2u2u of at least one of our candidate hypotheses. The proof of the claim below completes the proof of Proposition 5.1.

Proposition 5.1 yields tight error upper bounds for a number of distribution families. We explicitly state its implications for the following families: sub-gaussian distributions, distributions with bounded covariance, and distributions whose first kk moments are appropriately bounded.

O(log(1/α))σO(\sqrt{\log(1/\alpha)})\cdot\sigma, if D\mathcal{D} is the family of sub-gaussian distributions with parameter σ\sigma.

O(1/α)σO(1/\sqrt{\alpha})\cdot\sigma, if D\mathcal{D} is the family of distributions with covariance matrix Σσ2I\Sigma\preceq\sigma^{2}\cdot I.

O((C/α)1/k)O((C/\alpha)^{1/k}), if D\mathcal{D} is the family of distributions whose kthk^{th} central moments in any direction, for some even k>0k>0, are at most CC.

All three statements follow from Proposition 5.1 by applying the appropriate concentration inequality.

Let 0<α<1/20<\alpha<1/2. Any list-decodable mean estimation algorithm for the family D\mathcal{D} must have error guarantee at least:

Ω(1/α)\Omega(1/\sqrt{\alpha}), if D\mathcal{D} is the family of distributions with bounded covariance ΣI\Sigma\preceq I, and the list size depends only on α\alpha.

Ωk(α1/k)\Omega_{k}(\alpha^{-1/k}), if D\mathcal{D} is a family of distributions having its first kk mixed central moments agree with the corresponding moments of the standard Gaussian, and the list size depends only on α\alpha.

Proof of (i). We start with our lower bound for N(μ,I)N(\mu,I). In more detail, we will show that for any d>0d>0, any list-decoding algorithm for N(μ,I)N(\mu,I), that achieves error o(log(1/α)/d)o(\sqrt{\log(1/\alpha)/d}), must return more than (1/α)d(1/\alpha)^{d} hypotheses, as long as the dimension nln(1/α)Cdn\geq\ln(1/\alpha)^{Cd}, for some universal constant C>0C>0.

Let m=(1/α)dm=\lceil(1/\alpha)^{d}\rceil. We will construct a distribution XX that can be considered as a mixture

for i[m]i\in[m], such that for any 1ijm1\leq i\neq j\leq m, we have μiμj2cln(1/α)\|\mu_{i}-\mu_{j}\|_{2}\geq c\cdot\sqrt{\ln(1/\alpha)}, for some suitable constant 1/4>c>01/4>c>0 depending on dd. Specifically, we can take c=1/(8d+3)c=1/(8\sqrt{d+3}).

for a sufficiently large universal constant CC. We therefore get that

Let dN(μ,I)dN(\mu,I) denote the pdf of N(μ,I)N(\mu,I). We will pick our distribution XX to be Y/Y1Y/\|Y\|_{1}, where YY is the pseudo-distribution with density dY=maxi[m]dN(μi,I)dY=\max_{i\in[m]}dN(\mu_{i},I). It suffices to show that Y11/α\|Y\|_{1}\leq 1/\alpha, since then the pdf of XX satisfies dXαdN(μi,I)dX\geq\alpha dN(\mu_{i},I), for each ii, and so we can write X=αN(μi,I)+(1α)EiX=\alpha N(\mu_{i},I)+(1-\alpha)E_{i}, for a suitable distribution EiE_{i}.

We note that the pdf dN(μi,I)dN(\mu_{i},I) of N(μi,I)N(\mu_{i},I) is

Comparing this to the pdf of N(0,I)N(0,I), (1/2π)exp(x2/2)(1/\sqrt{2\pi})\exp(-\|x\|^{2}/2), we see that their ratio

Using our assumptions that c1/4c\leq 1/4 and α1/2\alpha\leq 1/2, this holds unless (vix)ln(1/α)/(2c)(v_{i}\cdot x)\geq\sqrt{\ln(1/\alpha)}/(2c).

Therefore, dN(μi,I)dN(\mu_{i},I) is less than 1/(2α)dN(0,I)1/(2\alpha)dN(0,I), unless (vix)ln(1/α)/(2c)(v_{i}\cdot x)\geq\sqrt{\ln(1/\alpha)}/(2c). Let TiT_{i} be the pseudo-distribution obtained by restricting N(μi,I)N(\mu_{i},I) to the set (vix)ln(1/α)/(2c)(v_{i}\cdot x)\geq\sqrt{\ln(1/\alpha)}/(2c). Since c1/4c\leq 1/4, note that

and therefore, by standard tail bounds, the total mass of TiT_{i} is

By definition, we have that dY(2/α)dN(0,I)+idTidY\leq(2/\alpha)dN(0,I)+\sum_{i}dT_{i}. Therefore, Y11/(2α)+2mα1/(32c2)\|Y\|_{1}\leq 1/(2\alpha)+2m\alpha^{1/(32c^{2})}. Since m2(1/α)dm\leq 2(1/\alpha)^{d}, when 1/(32c2)d+31/(32c^{2})\geq d+3 we have Y11/α\|Y\|_{1}\leq 1/\alpha, which completes the proof. Note that we can take c=1/(8d+3)c=1/(8\sqrt{d+3}).

Proof of (ii). Let pp be a discrete 11-dimensional random variable supported on the points {0,±(2α)1/2}\{0,\pm(2\alpha)^{-1/2}\} such that its expectation is and its variance is 11. In particular pp is plus or minus (2α)1/2(2\alpha)^{-1/2} each with probability α\alpha, and otherwise . Let XX be an nn-dimensional distribution whose coordinates are independent copies of pp. Let DiD_{i} be the distribution XX conditioned on the ithi^{th} coordinate being positive. Notice that X=αDi+(1α)YiX=\alpha D_{i}+(1-\alpha)Y_{i} for some distribution YiY_{i}. Note also that the mean of DiD_{i} is (2α)1/2ei(2\alpha)^{-1/2}e_{i}, and its covariance is at most the identity. Suppose that D=DiD=D_{i} for some randomly chosen value of ii, and the algorithm is given sample access to XX. If the algorithm returns a list of fewer than n/2n/2 hypotheses, then for at least half of possible ii, the mean of DiD_{i} will be at least α1/2/2\alpha^{-1/2}/2-far from the closest hypothesis. Since the list of returned hypotheses was assumed to have size independent of ii, this means with probability at least 1/21/2 the algorithm fails to produce a hypothesis closer than α1/2/2\alpha^{-1/2}/2 to the mean of DD.

Proof of (iii). To generalize the proof of (ii) to the case when DD matches the first kk moments with a Gaussian, we will need the following technical lemma:

For k>0k>0, there exists a one-dimensional distribution A=αN(μ,1)+(1α)EA=\alpha N(\mu,1)+(1-\alpha)E, for some distribution EE and μ=Ω((kα)1/k)\mu=\Omega((k\alpha)^{-1/k}), so that AA’s first kk moments agree with those of N(0,1)N(0,1). Furthermore, the pdf of EE can be taken to be point-wise at most twice the pdf of a standard Gaussian.

where G(x)G^{\prime}(x) is the pdf of a Gaussian N(μ,1)N(\mu,1), G(x)G(x) is the pdf of N(0,1)N(0,1), μ=O(kα)1/k\mu=O(k\alpha)^{-1/k}, and pp is an appropriately selected degree-kk polynomial. Specifically, we select pp to be the unique degree-kk polynomial whose first kk moments on $matchthefirstmatch the firstkmomentsofmoments of\alpha(G-G^{\prime}),where, whereGisisN(0,1)$. We will need the following technical claim:

Let p(x)p(x) be the unique polynomial of degree kk such that

where μ1|\mu|\geq 1 and G(x)G(x) is the pdf of N(0,1)N(0,1). Then, maxxp(x)O(α(2kμ)k)\max_{x\in}|p(x)|\leq O(\alpha(2k\mu)^{k}).

We can write p(x)p(x) as a linear combination of Legendre polynomials Pk(x)P_{k}(x) as p(x)=i=1kaiPi(x)p(x)=\sum_{i=1}^{k}a_{i}P_{i}(x), where 11Pi(x)Pj(x)dx=(2/(2i+1))δij\int_{-1}^{1}P_{i}(x)P_{j}(x)dx=(2/(2i+1))\delta_{ij}. Now we have for 1ik1\leq i\leq k that

where switching the order of summation and integration at ()(\ast) is justified by Fubini-Tonelli, since

Noting, as in Corollary 5.4 of [DKS16b], that Pj(x)(4x)j|P_{j}(x)|\leq(4|x|)^{j} for x1|x|\geq 1, we have that

Finally, since Pk(x)1|P_{k}(x)|\leq 1 for all xx\in, for any xx\in we have

By Claim 5.6, we have that pinf1/2πe\|p\|_{\inf}\leq 1/\sqrt{2\pi e} for μ=(Ckα)1/k\mu=(Ck\alpha)^{-1/k}, when CC is sufficiently large, and therefore AA is non-negative and EE satisfies the claimed bound. We note that pp is normalized since its 0th0^{th} moment is correct. Therefore, we have constructed an appropriate probability distribution, which completes the proof of Lemma 5.5. ∎

To prove the lower bound for matching kk moments, we let AA be the distribution constructed in Lemma 5.5, and let XX be a product of copies of AA. We note that DD could be a product of copies of AA in all but one direction and a copy of N(μ,1)N(\mu,1) in the remaining direction. Thus, DD could have mean μei\mu e_{i} for any ii. Once again, any algorithm that reliably produce some element of its list within μ/2\mu/2 of the mean of DD, must return a list of length at least Ω(n)\Omega(n). This gives (iii). ∎

2 SQ Lower Bounds for List-Decodable Mean Estimation

Recall that our list-decoding algorithm from Theorem 1.3 achieves error O(α1/d)O(\alpha^{-1/d}) in time (n/α)O(d)(n/\alpha)^{O(d)}. Can this runtime be improved? In particular, is there an algorithm that achieves the optimal error of Θ(log(1/α))\Theta(\sqrt{\log(1/\alpha)}) and runs in time poly(n/α)\operatorname{poly}(n/\alpha)? We show that this is not the case, if we restrict ourselves to Statistical Query (SQ) algorithms. Specifically, we prove the following theorem:

For 1/2>c>01/2>c>0, any SQ list-decoding algorithm that returns a hypothesis within ckα1/kc_{k}\alpha^{-1/k} (for some constant ck>0c_{k}>0) of the true mean, does one of the following:

Uses queries with error tolerance at most exp(O(α2/k))O(n)k(1/4c/2)\exp(O(\alpha^{-2/k}))O(n)^{k(1/4-c/2)}.

Uses a number of queries at least exp(Ω(nc/2))\exp(\Omega(n^{c/2})).

Returns a list of more than exp(Ω(n)c)\exp(\Omega(n)^{c}) hypotheses.

We will use the terminology and techniques of our recent work on this topic [DKS16b]. We provide a sketch of the proof here. Using the generic construction of [DKS16b], for the distribution AA given in Lemma 5.5, we have that:

AA matches the first kk moments with N(0,1)N(0,1), and

Note that (i) follows immediately from Lemma 5.5 and (ii) follows from noting that

Let Pv\mathbf{P}_{v} be the nn-dimensional distribution that is distributed as AA in the vv-direction and an independent standard Gaussian in orthogonal directions. Then pvp_{v} will be an α\alpha-mixture of a Gaussian N(vμ,I)N(v|\mu|,I) with some other distribution.

References

APPENDIX

Appendix A Proof of Claim 3.11

We have the following sequence of (in-)equalities:

Appendix B Reducing the List Size to O​(1/α)𝑂1𝛼O(1/\alpha)

To apply this to a list of O(1/α3)O(1/\alpha^{3}) points including an approximation to μ\mu^{*} when TT contains a set SS representative for N(μ,I)N(\mu^{*},I), we may take δ=1/(Clog(1/α)\delta=1/(C\log(1/\alpha) and t=log(Clog(1/α))t=\sqrt{\log(C\log(1/\alpha))} for a sufficiently large CC. This tt will be smaller than β\beta in our applications.

Algorithm ListReduction Input: a list M={μ1,,μN}M=\{\mu_{1},\ldots,\mu_{N}\}, a set of samples TT, α,β,δ,t>0\alpha,\beta,\delta,t>0. 1. For all i,j[n]i,j\in[n] so that iji\neq j, let vijv_{ij} denote the unit vector in the μiμj\mu_{i}-\mu_{j} direction. 2. Let Ti=ji{XT:vijT(Xμi)<β+t}T_{i}=\bigcap_{j\neq i}\{X\in T:|v_{ij}^{T}(X-\mu_{i})|<\beta+t\}. 3. Let MM^{\prime} be the empty list 4. For each ii, if Tiα(1δN)T|T_{i}|\geq\alpha(1-\delta N)|T|, and no μjM\mu_{j}\in M^{\prime} has μiμj2<2(β+t)\|\mu_{i}-\mu_{j}\|_{2}<2(\beta+t), then add μ\mu^{\prime} to MM^{\prime} 5. Return MM^{\prime}.

It is easy to see these operations can be done in polynomial time.

We claim the output of this algorithm satisfies the desired guarantees. We first show that MM^{\prime} contains a μ\mu^{\prime} with μμ23(β+t)\|\mu^{\prime}-\mu^{*}\|_{2}\leq 3(\beta+t). By assumption, if μiμ2β\|\mu_{i}-\mu^{*}\|_{2}\leq\beta, then μi\mu_{i} satisfies this property. By the triangle inequality, we have that for every jij\neq i,

and hence by a union bound Tiα(1δN)T|T_{i}|\geq\alpha(1-\delta N)|T|. Since such a μi\mu_{i} is contained in MM by assumption, this implies that MM^{\prime} is non-empty and either μiM\mu_{i}\in M^{\prime}, when or there is some other μM\mu^{\prime}\in M^{\prime} so that μμi22(β+t)\|\mu^{\prime}-\mu_{i}\|_{2}\leq 2(\beta+t), and therefore μμ23(β+t)\|\mu^{\prime}-\mu^{*}\|_{2}\leq 3(\beta+t) by the triangle inequality.

It remains to bound the size of MM^{\prime}. Observe that if μiμj22(β+t)\|\mu_{i}-\mu_{j}\|_{2}\geq 2(\beta+t), then TiTj=T_{i}\cap T_{j}=\varnothing. Therefore we have

but since Tiα(1δN)T|T_{i}|\geq\alpha(1-\delta N)|T| by assumption, this implies that Mα(1δN)TT  ,|M^{\prime}|\alpha(1-\delta N)|T|\leq|T|\;, or M1α(1+O(δN))|M^{\prime}|\leq\frac{1}{\alpha}(1+O(\delta N)), as desired. ∎