Mixture Models, Robustness, and Sum of Squares Proofs

Samuel B. Hopkins, Jerry Li

Introduction

We propose and analyze a family of new algorithms for some fundamental high-dimensional statistical estimation problems. In particular, we give new algorithms for the following problems.

Mixture models, and especially Gaussian mixture models (where D1,,Dk\mathcal{D}_{1},\ldots,\mathcal{D}_{k} are Gaussian distributions) have been studied since Pearson in 1894 [Pea94]. Work in theoretical computer science dates at least to the pioneering algorithm of Dasgupta in 1999 [Das99], which has been followed by numerous other algorithms and lower bounds [Wu83, DS07, AK05, VW02, KK10, AM05, FSO06, KMV10, BS10, MV10, HK13, ABG+14, BCMV14, DK14, SOAJ14, HP15, XHM16, GHK15, LS17, RV17, DTZ17].

Robust estimation in the form we study here is a more recent transplant to theoretical computer science [DKK+16, LRV16, CSV16, DKS16, CJN17, DKK+17b, DKK+17a, SCV17], but statisticians have long sought outlier-robust estimators. Formal study of arbitrarily-bad/adversarially-chosen outliers originates in the 1960s with the advent of “breakdown points” in statistics [Hub64, Tuk75b, HRRS86, JP78, Ber06].

For mixture models, a special case of our main result yields the first progress in more than 15 years on efficiently clustering mixtures of separated spherical Gaussians. The question here is: if D1,,Dk\mathcal{D}_{1},\ldots,\mathcal{D}_{k} are all Gaussian with covariance identity and k=poly(d)k=\operatorname{poly}(d), what is the minimum cluster separation Δ\Delta which allows for a polynomial-time algorithm to estimate μ1,,μk\mu_{1},\ldots,\mu_{k} from poly(k,d)\operatorname{poly}(k,d) samples from the mixture model? The guarantees of the previous best algorithms for this problem, which require ΔO(k1/4)\Delta\geq O(k^{1/4}), are captured by a simple greedy clustering algorithm, sometimes called single-linkage clustering: when ΔO(k1/4)\Delta\geq O(k^{1/4}), with high probability every pair of samples from the same cluster is closer in Euclidean distance than every pair of samples from differing clusters. We break this single-linkage clustering barrier: for every γ>0\gamma>0 we give a poly(k,d)\operatorname{poly}(k,d)-time algorithm for this problem when Δ>kγ\Delta>k^{\gamma}.

To do so we make novel algorithmic use of higher moments (in fact, O(1/γ)O(1/\gamma) moments) of the underlying distributions Di\mathcal{D}_{i}. Our main technical contribution is a new algorithmic technique for finding either a structured subset of data points or the empirical mean of such a subset when the subset consists of independent samples from a distribution D\mathcal{D} which has bounded higher-order moments and there is a simple certificate of this boundedness. This technique leverages the Sum of Squares (SoS) hierarchy of semidefinite programs (SDPs), and in particular a powerful approach for designing SoS-based algorithms in machine learning settings, developed and used in [BKS14, BKS15, GM15, BM16, HSS15, MSS16, PS17]. We suspect use of higher moments is necessary in light of second-moment indistinguishability results for mixtures with small separation [AM05].

This SoS approach to unsupervised learning rests on a notion of simple identifiability proofs: the main step in designing an algorithm using SoS to recover some parameters θ\theta from samples x1,,xnp(xθ)x_{1},\ldots,x_{n}\sim p(x\,|\,\theta) is to prove in a restricted proof system that θ\theta is likely to be uniquely identifiable from x1,,xnx_{1},\ldots,x_{n}. We develop this thoroughly later on, but roughly speaking one may think of this as requiring the identifiability proof to use only simple inequalities, such as Cauchy-Schwarz and Hölder’s inequality, applied to low-degree polynomials. The simple identifiability proofs we construct for both the mixture models and robust estimation settings are heavily inspired by the robust estimation algorithms of Diakonikolas et al. [DKK+16].

Both of the problems we study have a long history; for now we just note some highlights and state our main results.

The problem of learning mixture models dates to Pearson in 1894, who invented the method of moments in order to separate a mixture of two Gaussians [Pea94]. Mixture models have since become ubiquitous in data analysis across many disciplines [TSM85, MP04]. In recent years, computer scientists have devised many ingenious algorithms for learning mixture models as it became clear that classical statistical methods (e.g. maximum likelihood estimation) often suffer from computational intractability, especially when there are many mixture components or the components are high dimensional.

Separation Δ=k1/4\Delta=k^{1/4} represents a natural algorithmic barrier: when Δk1/4\Delta\geq k^{1/4}, every pair of samples from the same cluster are closer to each other in Euclidean distance than are every pair of samples from distinct clusters (with high probability), while this is no longer true if Δ<k1/4\Delta<k^{1/4}. Thus, when Δk1/4\Delta\geq k^{1/4}, a simple greedy algorithm correctly clusters the samples into their components (this algorithm is sometimes called single-linkage clustering). On the other hand, standard information-theoretic arguments show that the means remain approximately identifiable from poly(k,d)\operatorname{poly}(k,d) samples when Δ\Delta is as small as O(logk)O(\sqrt{\log k}), but these methods yield only exponential-time algorithms.Recent and sophisticated arguments show that the means are identifiable (albeit inefficiently) with error depending only on the number of samples and not on the separation Δ\Delta even when Δ=O(logk)\Delta=O(\sqrt{\log k}) [RV17]. Nonetheless, despite substantial attention, this Δ=k1/4\Delta=k^{1/4} barrier representing the breakdown of single-linkage clustering has stood for nearly 20 years.

We prove the following main theorem, breaking the single-linkage clustering barrier.

We pause here to make several remarks about this theorem. Our algorithm makes novel use of higher order moments of Gaussian (and sub-Gaussian) distributions. Previous work for efficiently learning well-separated mixtures of Gaussians used only second order moment information, whereas we use O(1/γ)O(1/\gamma) moments.

For mixtures of distributions with sufficiently-many bounded moments (such as Gaussians), our guarantees go even further. We show that using dO(logk)2d^{O(\log k)^{2}} time and dO(logk)d^{O(\log k)} samples, we can recover the means to error 1/poly(k)1/\operatorname{poly}(k) even if the separation is only ClogkC\sqrt{\log k} for some universal constant CC. Strikingly, [RV17] show that any algorithm that can learn the means nontrivially given separation o(logk)o(\sqrt{\log k}) must require super-polynomial samples and time. Our results show that just above this threshold, it is possible to learn with just quasipolynomially many samples and time.

Robust mean estimation

Estimators which are robust to outlying or corrupted samples have been studied in statistics at least since the 1960s [Hub64, Tuk75a]. The model we consider in this paper is a slight generalization of Hüber’s contamination model [Hub64]. We are given X1,,XnX_{1},\ldots,X_{n}, originally drawn iid from some unknown distribution D\mathcal{D}, but an adversary has changed an ε\varepsilon fraction of these points adversarially. We call such a set of points ε\varepsilon-corrupted.Hüber’s contamination model essentially only allows the adversary to add corrupted points, but not remove uncorrupted points. The goal of robust statistics is to recover statistics of D\mathcal{D} such as mean and covariance, given ε\varepsilon-corrupted samples from D\mathcal{D}.

In classical robust statistics, the robust mean estimation problem is known as robust estimation of location, and robust covariance estimation is known as robust estimation of scale. Classical works consider a measure known as breakdown point, which is (informally) the fraction of samples that an adversary must corrupt before the estimator has no provable guarantees. They often design robust estimators for mean and covariance that achieve optimal error in many fundamental settings. For instance, given samples from a symmetric sub-Gaussian distribution in kk dimensions such that an ε\varepsilon-fraction are arbitrarily corrupted, an estimator known as the Tukey median [Tuk75a] achieves error O(ε)O(\varepsilon), which is information theoretically optimal. However, these estimators are all NPNP-hard to compute [JP78, Ber06] and the best known algorithms require exp(d)\exp(d) time in general.

For a long time, all known computationally efficient robust statistics for the mean or covariance of a dd-dimensional Gaussian had error degrading polynomially with the dimension.We remark that this was the state of affairs even for the Hüber contamination model. In recent work, [DKK+16, LRV16] gave efficient and robust estimators for these statistics which achieve substantially better error. In particular, [DKK+16] achieve error O(εlog1/ε)O(\varepsilon\sqrt{\log 1/\varepsilon}) for estimating the mean of a Gaussian with identity covariance, and error O(εlog3/21/ε)O(\varepsilon\log^{3/2}1/\varepsilon) for robustly estimating the mean of a Gaussian with unknown variance ΣI\Sigma\preceq I.

Informally stated, our algorithms will work under the condition that this moment bound can be certified by a low degree SoS proof, for all sts\leq t. We call such distributions tt-explicitly bounded (we are ignoring some parameters, see Definition 3.1 for a formal definition). This class captures many natural sub-Gaussian distributions, such as Gaussians, product distributions of sub-Gaussians, and rotations thereof (see Appendix A.1). For such distributions, we show:

As with mixture models, we can push our statistical rates further, if we are willing to tolerate quasipolynomial runtime and sample complexity. In particular, we can obtain error O(εlog1/ε)O(\varepsilon\sqrt{\log 1/\varepsilon}) error with dO(log1/ε)d^{O(\log 1/\varepsilon)} samples and dO(log1/ε)2d^{O(\log 1/\varepsilon)^{2}} time.

2 Related work

Other works have relaxed the requirement that the underlying distributions be Gaussian [KK10, AM05]; we also address non-Gaussian distributions, relaxing the Gaussian-ness assumption to explicit moment boundedness. One recent work in this spirit uses SDPs to cluster mixture models under separation assumptions [MVW17]; the authors show that a standard SDP relaxation of kk-means achieves guarantees comparable to previously-known specially-tailored mixture model algorithms; however, this algorithm suffers from the same k1/4k^{1/4} barrier as other previous works.

Fixed number of Gaussians in many dimensions: Other works address parameter estimation for mixtures of kdk\ll d Gaussians (generally k=O(1)k=O(1) and dd grows) under weak identifiability assumptions [KMV10, BS10, MV10, HP15]. In these works the only assumptions are that the component Gaussians are statistically distinguishable; the goal is to recover their parameters of the underlying Gaussians. It was shown in [HP15] that algorithms in this setting provably require exp(k)\exp(k) samples and running time. The question addressed in our paper is whether this lower bound is avoidable under stronger identifiability assumptions. A related line of work addresses proper learning of mixtures of Gaussians [FSO06, DK14, SOAJ14, LS17], where the goal is to output a mixture of Gaussians which is close to the unknown mixture in total-variation distance, avoiding the exp(k)\exp(k) parameter-learning sample-complexity lower bound. These algorithms achieve poly(k,d)\operatorname{poly}(k,d) sample complexity, but they all require exp(k)\exp(k) running time, and moreover, do not provide any guarantee that the parameters of the distributions output are close to those for the true mixture.

In contrast, we do not assume any non-degeneracy conditions and use moment information only about the individual components rather than the full mixture, which always hold under separation conditions. Moreover, our algorithms do not need to know the exact structure of the 3rd or 4th moments. In general, clustering-based algorithms like ours seem more robust to modelling errors than algebraic or tensor-decomposition methods.

Expectation-maximization (EM): EM is the most popular algorithm for Gaussian mixtures in practice, but it is notoriously difficult to analyze theoretically. The works [DS07, DTZ17, XHM16] offer some theoretical guarantees for EM, but non-convergence results are a barrier to strong theoretical guarantees [Wu83].

Robust statistics

The literature on robust estimation is too large to do justice to here. There has been a long line of work on making algorithms tolerant to error in supervised settings [Val85, KL93], especially for learning halfspaces [Ser03, KLS09, ABL14, DKS17b], and for problems such as PCA [Bru09, CLMW11, LMTZ12, ZL14]. See [DKK+16] for a more detailed discussion on the relationship between these questions (and others) and the model we consider here.

SoS algorithms for unsupervised learning

SoS algorithms for unsupervised learning obtain the best known polynomial-time guarantees for many problems, including dictionary learning, tensor completion, and others [BKS14, BKS15, GM15, HSS15, MSS16, BM16, PS17]. While the running times of such algorithms are often large polynomials, due to the need to solve large SDPs, insights from the SoS algorithms have often been used in later works obtaining fast polynomial running times [HSSS16, SS17, HKP+17]. This lends hope that in light of our results there is a practical algorithm to learn mixture models under separation k1/4εk^{1/4-\varepsilon} for some ε>0\varepsilon>0.

Concurrent work

Finally, we note that concurrent and independent works by several groups [KS17a, KS17b, DKS17a] have either obtained results or developed techniques similar to ours.

3 Organization

In Section 2 we discuss at a high level the ideas in our algorithms and SoS proofs. In Section 3 we give standard background on SoS proofs. Section 4 discusses the important properties of the family of polynomial inequalities we use in both algorithms. Section 5 and Section 6 state our algorithms formally and analyze them. Finally, Section 7 describes the polynomial inequalities our algorithms employ in more detail.

Techniques

The Sum of Squares (SoS) hierarchy is a powerful tool in optimization, originally designed to approximately solve systems of polynomial equations via a hierarchy of increasingly strong but increasingly large semidefinite programming (SDP) relaxations (see [BS14] and the references therein). There has been much recent interest in using the SoS method to solve unsupervised learning problems in generative models [BKS14, BKS15, GM15, HSS15, MSS16, PS17]. .

Remarkably, many useful polynomial inequalities have such certificates. For example, the usual proof of the Cauchy-Schwarz inequality y,z2y2z2\langle y,z\rangle^{2}\leq\|y\|^{2}\|z\|^{2}, where y,zy,z are mm-dimensional vectors, actually shows that the polynomial y2z2y,z2\|y\|^{2}\|z\|^{2}-\langle y,z\rangle^{2} is a sum-of-squares in yy and zz. The simplicity of the certificate is measured by the degree of the polynomials rr and ss; when these polynomials have small (usually constant) degree there is hope of transforming SoS proofs into polynomial-time algorithms. This transformation is possible because (under mild assumptions on pp and qq) the set of low-degree SoS proofs is in fact captured by a polynomial-size semidefinite program.

Returning to unsupervised learning, the concentration/union-bound style of identifiability proofs described above are almost never captured by low-degree SoS proofs. Instead, the goal is to design

A system of constant-degree polynomial equations and inequalties A={p1(θ)=0,,pm(θ)=0,q1(θ)0,,qm(θ)0}\mathcal{A}=\{p_{1}(\theta)=0,\ldots,p_{m}(\theta)=0,q_{1}(\theta)\geq 0,\ldots,q_{m}(\theta)\geq 0\}, where the polynomials pp and qq depend on the samples x1,,xnx_{1},\ldots,x_{n}, such that with high probability θ\theta^{*} satisfies all the equations and inequalities.

A low-degree SoS proof that A\mathcal{A} implies θθδ\|\theta-\theta^{*}\|\leq\delta for some small δ\delta and appropriate norm \|\cdot\|.

Clearly these imply that any solution θ\theta of A\mathcal{A} also solves the unsupervised learning problem. It is in general NP-hard to find a solution to a system of low-degree polynomial equations and inequalities.

However, the SoS proof (2) means that such a search can be avoided. Instead, we will relax the set of solutions θ\theta to A\mathcal{A} to a simple(er) convex set: the set of pseudodistributions satisfying A\mathcal{A}. We define pseudodistributions formally later, for now saying only that they are the convex duals of SoS proofs which use the axioms A\mathcal{A}. By this duality, the SoS proof (2) implies not only that any solution θ\theta to A\mathcal{A} is a good choice of parameters but also that a good choice of parameters can be extracted any pseudodistribution satisfying A\mathcal{A}. (We are glossing over for now that this last step requires some SDP rounding algorithm, since we use only standard rounding algorithms in this paper.)

Thus, the final SoS algorithms from this method take the form: solve an SDP to find a pseudodistribution which satisfies A\mathcal{A} and round it to obtain a estimate θ^\hat{\theta} of θ\theta^{*}. To analyze the algorithm, use the SoS proof (2) to prove that θ^θδ\|\hat{\theta}-\theta^{*}\|\leq\delta.

2 Hölder’s inequality and identifiability from higher moments

Now we discuss the core ideas in our simple SoS identifiability proofs. We have not yet formally defined SoS proofs, so our goal will just be to construct identifiability proofs which are (a) phrased in terms of inequalities of low-degree polynomials and (b) provable using only simple inequalities, like Cauchy-Schwarz and Hölder’s, leaving the formalities for later.

This inequality says that every one-dimensional projection uu of the samples in SS, centered around their empirical mean, has a sub-Gaussian empirical tt-th moment. (The factor 22 accounts for deviations in the tt-th moments of the samples.) By standard concentration of measure, if αndt\alpha n\gg d^{t} the inequality holds for S=TS=T. It turns out that this property can be enforced by polynomials of degree tt. (Actually our final construction of A\mathcal{A} will need to use inequalities of matrix-valued polynomials but this can be safely ignored here.)

Intuitively, we would like to show that any SS which satisfies A\mathcal{A} has empirical mean close to μ\mu^{*} using a low-degree SoS proof,. This is in fact true when α=1ε\alpha=1-\varepsilon for small ε\varepsilon, which is at the core of our robust estimation algorithm. However, in the mixture model setting, when α=1/(# of components)\alpha=1/(\text{\# of components}), for each component jj there is a subset Tj[n]T_{j}\subseteq[n] of samples from component jj which provides a valid solution S=TjS=T_{j} to A\mathcal{A}. The empirical mean of TjT_{j} is close to μj\mu_{j} and hence not close to μi\mu_{i} for any iji\neq j.

We will prove something slightly weaker, which still demonstrates the main idea in our identifiability proof.

With high probability, for every S[n]S\subseteq[n], if μ=1SiSXi\mu=\tfrac{1}{|S|}\sum_{i\in S}X_{i} is the empirical mean of samples in SS, then μμ4t1/2(T/ST)1/t\|\mu-\mu^{*}\|\leq 4t^{1/2}\cdot(|T|/|S\cap T|)^{1/t}.

Notice that a random S[n]S\subseteq[n] of size αn\alpha n will have STα2n|S\cap T|\approx\alpha^{2}n. In this case the lemma would yield the bound μμ4t1/2α1/t\|\mu-\mu^{*}\|\leq\tfrac{4t^{1/2}}{\alpha^{1/t}}. Thinking of α1/t\alpha\ll 1/t, this bound improves exponentially as tt grows. In the dd-dimensional kk-component mixture model setting, one has 1/α=poly(k)1/\alpha=\operatorname{poly}(k), and thus the bound becomes μμ4t1/2kO(1/t)\|\mu-\mu^{*}\|\leq 4t^{1/2}\cdot k^{O(1/t)}. In a mixture model where components are separated by kεk^{\varepsilon}, such an estimate is nontrivial when μμkε\|\mu-\mu^{*}\|\ll k^{\varepsilon}, which requires t=O(1/ε)t=O(1/\varepsilon). This is the origin of the quantitative bounds in our mixture model algorithm.

We turn to the proof of Lemma 2.1. As we have already emphasized, the crucial point is that this proof will be accomplished using only simple inequalities, avoiding any union bound over all possible subsets SS.

Let wiw_{i} be the 0/10/1 indicator of iSi\in S. To start the argument, we expand in terms of samples:

The key term to bound is the first one; the second amounts to a deviation term. By Hölder’s inequality and for even tt,

The second line follows by adding the samples from [n]T[n]\setminus T to the sum; since tt is even this only increases its value. The third line uses the moment inequality (1). The last line just uses the definition of ww.

For the second, deviation term, we use Hölder’s inequality again:

with high probability and hence, using the quadratic form at (μμ)t/2(\mu^{*}-\mu)^{\otimes t/2},

Putting these together and simplifying constants, we have obtained that with high probability,

3 From identifiability to algorithms

We now discuss how to use the ideas described above algorithmically for learning well-separated mixture models. The high level idea for robust estimation is similar. Given Lemma 2.1, a naive algorithm for learning mixture models would be the following: find a set of points TT of size roughly n/kn/k that satisfy the moment bounds described, and simply output their empirical mean. Since by a simple counting argument this set must have nontrivial overlap with the points from some mixture component, Lemma 2.1 guarantees that the empirical mean is close to mean of this component.

However, in general finding such a set of points is algorithmically difficult. In fact, it would suffice to find a distribution over such sets of points (since then one could simply sample from this distribution), however, this is just as computationally difficult. The critical insight is that because of the proof of Lemma 2.1 only uses facts about low degree polynomials, it suffices to find an object which is indistinguishable from such a distribution, considered as a functional on low-degree polynomials.

The natural object in this setting is a pseudo-distribution. Pseudo-distributions form a convex set, and for a set of low-degree polynomial equations and inequalities A\mathcal{A}, it is possible to find a pseudo-distribution which is indistinguishable from a distribution over solutions to A\mathcal{A} (as such a functional) in polynomial time via semidefinite programming (under mild assumptions on A\mathcal{A}). More specifically, the set of SoS proofs using axioms A\mathcal{A} is a semidefinite program (SDP), and the above pseudodistributions form the dual SDP. (We will make these ideas more precise in the next two sections.)

Our algorithm then proceeds via the following general framework: find an appropriate pseudodistribution via convex optimization, then leverage our low-degree sum of squares proofs to show that information about the true clusters can be extracted from this object by a standard SDP rounding procedure.

Preliminaries

We now formally define the class of distributions we will consider throughout this paper. At a high level, we will consider distributions which have bounded moments, for which there exists a low degree SoS proof of this moment bound. Formally:

Equivalently, the polynomial p(u)=(σs)s/2usEYDk(Yμ),usp(u)=(\sigma s)^{s/2}\|u\|^{s}-E_{Y\sim\mathcal{D}_{k}}\langle\left(Y-\mu\right),u\rangle^{s} should be a sum-of-squares. In our typical use case, σ=1\sigma=1, we will omit it and call the distribution tt-explicitly bounded.

We remark that our results also hold for somewhat more general settings. It is not particularly important that the ss-th moment bound has a degree ss proof; our techniques can tolerate degree O(s)O(s) proofs. Our techniques also generally apply for weaker moment bounds. For instance, our techniques naturally extend to explicitly bounded sub-exponential type distributions in the obvious way. We omit these details for simplicity.

As we show in Appendix A.1, this class still captures many interesting types of nice distributions, including Gaussians, product distributions with sub-Gaussian components, and rotations therof. With this definition in mind, we can now formally state the problems we consider in this paper:

We first define the class of mixture models for which our algorithm works:

Robust mean estimation

We consider the same basic model of corruption introduced in [DKK+16].

We say a set of samples X1,,XnX_{1},\ldots,X_{n} is ε\varepsilon-corrupted from a distribution D\mathcal{D} if they are generated via the following process. First, nn independent samples are drawn from D\mathcal{D}. Then, an adversary changes εn\varepsilon n of these points arbitrarily, and the altered set of points is then returned to us in an arbitrary order.

The problem we consider in this setting is the following:

1 The SoS proof system

We refer the reader to [OZ13, BS14] and the references therein for a thorough exposition of the SoS algorithm and proof system; here we only define what we need.Our definition of SoS proofs differs slightly from O’Donnell and Zhou’s in that we allow proofs to use products of axioms.

Let x1,,xnx_{1},\ldots,x_{n} be indeterminates and A\mathcal{A} be the set of polynomial equations and inequalities {p1(x)0,,pm(x)0,q1(x)=0,,qm(x)=0}\{p_{1}(x)\geq 0,\ldots,p_{m}(x)\geq 0,q_{1}(x)=0,\ldots,q_{m}(x)=0\}. We say that the statement p(x)0p(x)\geq 0 has an SoS proof if there are polynomials {rα}α[m]\{r_{\alpha}\}_{\alpha\subseteq[m]} (where α\alpha may be a multiset) and {si}i[m]\{s_{i}\}_{i\in[m]} such that

and each polynomial rα(x)r_{\alpha}(x) is a sum of squares.

If the polynomials rα(x)iαpi(x)r_{\alpha}(x)\cdot\prod_{i\in\alpha}p_{i}(x) and si(x)qi(x)s_{i}(x)q_{i}(x) have degree at most dd, we say the proof has degree at most dd, and we write

SoS proofs compose well, and we frequently use the following without comment.

If Adp(x)0\mathcal{A}\vdash_{d}\,p(x)\geq 0 and Adq(x)0\mathcal{A}\vdash_{d^{\prime}}\,q(x)\geq 0, then ABmax(d,d)p(x)+q(x)0\mathcal{A}\cup\mathcal{B}\vdash_{\max(d,d^{\prime})}\,p(x)+q(x)\geq 0 and ABddp(x)q(x)0\mathcal{A}\cup\mathcal{B}\vdash_{dd^{\prime}}\,p(x)q(x)\geq 0.

The main fact relating pseudoexpectations and SoS proofs is:

In Section A we state and prove many basic SoS inequalities that we will require throughout the paper.

In Section A we show that product distributions (and rotations thereof) with bounded tt-th moments are explicitly bounded.

Then D\mathcal{D} is tt-explicitly bounded (with variance proxy 11).

(The factors of 12\frac{1}{2} can be removed for many distributions, including Gaussians.)

Capturing empirical moments with polynomials

α\alpha\in be a number (the intention is S=αn|S|=\alpha n).

τ>0\tau>0 be some error magnitude accounting for fluctuations in the sizes of clusters (which may be safely ignored at first reading).

Let A\mathcal{A} be the following system of equations and inequalities, depending on all the parameters above.

wi2=wiw_{i}^{2}=w_{i} for all i[n]i\in[n] (enforcing that ww is a 0/10/1 vector, which we interpret as the indicator vector of the set SS).

(1τ)αni[n]wi(1+τ)αn(1-\tau)\alpha n\leq\sum_{i\in[n]}w_{i}\leq(1+\tau)\alpha n, enforcing that Sαn|S|\approx\alpha n (we will always choose τ=o(1)\tau=o(1)).

μi[n]wi=i[n]wiXi\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}, enforcing that μ\mu is the empirical mean of the samples in SS

i[n]wiXiμ,μμjt2tt/2i[n]wiμμjt\sum_{i\in[n]}w_{i}\langle X_{i}-\mu,\mu-\mu_{j}\rangle^{t}\leq 2\cdot t^{t/2}\sum_{i\in[n]}w_{i}\|\mu-\mu_{j}\|^{t} for every μj\mu_{j} among μ1,,μm\mu_{1},\ldots,\mu_{m}. This enforces that the tt-th empirical moment of the samples in SS is bounded in the direction μμj\mu-\mu_{j}.

Notice that since we will eventually take μj\mu_{j}’s to be unknown parameters we are trying to estimate, the algorithm cannot make use of A\mathcal{A} directly, since the last family of inequalities involve the μj\mu_{j}’s. Later in this paper we exhibit a system of inequalities which requires the empirical tt-th moments to obey a sub-Gaussian type bound in every direction, hence implying the inequalities here without requiring knowledge of the μj\mu_{j}’s to write down. Formally, we will show:

There is a family A^\widehat{\mathcal{A}} of polynomial equations and inequalities of degree O(t)O(t) on variables w=(w1,,wn),μ=(μ1,,μk)w=(w_{1},\ldots,w_{n}),\mu=(\mu_{1},\ldots,\mu_{k}) and at most nO(t)n^{O(t)} other variables, whose coefficients depend on α,t,τ,X1,,Xn\alpha,t,\tau,X_{1},\ldots,X_{n}, such that

(Booleanness) A^\widehat{\mathcal{A}} includes the equations wi2=wiw_{i}^{2}=w_{i} for all i[n]i\in[n].

(Size) A^\widehat{\mathcal{A}} includes the inequalities (1τ)αnwi(1+τ)αn(1-\tau)\alpha n\leq\sum w_{i}\leq(1+\tau)\alpha n.

(Empirical mean) A^\widehat{\mathcal{A}} includes the equation μi[n]wi=i[n]wiXi\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}.

In particular this implies that A^O(t)A\widehat{\mathcal{A}}\vdash_{O(t)}\,\mathcal{A}.

The proof of Lemma 4.1 can be found in Section 7.

Aside from mentioning at a couple key points why our SoS proofs have bounded coefficients, we henceforth ignore all numerical issues. For further discussion of numerical accuracy and well-conditioned-ness issues in SoS, see [O’D17, BS17, RW17]

Mixture models: algorithm and analysis

In this section we formally describe and analyze our algorithm for mixture models. We prove the following theorem.

On the other hand, if Δ=Clogk\Delta=C^{\prime}\sqrt{\log k} (for some universal CC^{\prime}) then taking t=O(logk)t=O(\log k) gives error

which, for large-enough CC^{\prime} and tt, can be made 1/poly(k)1/\operatorname{poly}(k). Thus for Δ=Clogk\Delta=C^{\prime}\sqrt{\log k} and any O(logk)O(\log k)-explicitly bounded distrituion we obtain error 1/poly(k)1/\operatorname{poly}(k) with dO(logk)d^{O(\log k)} samples and dO(logk)2d^{O(\log k)^{2}} time.

In this section we describe and analyze our algorithm. To avoid some technical work we analyze the uniform mixtures setting, with λi=1/m\lambda_{i}=1/m. In Section D we describe how to adapt the algorithm to the nonuniform mixture setting.

We formally describe our mixture model algorithm now. We use the following lemma, which we prove in Section 5.6. The lemma says that given a matrix which is very close, in Frobenious norm, to the 0/10/1 indicator matrix of a partition of [n][n] it is possible to approximately recover the partition. (The proof is standard.)

As described, LearnMixtureMeans has two phases: a clustering phase and a mean-estimation phase. The clustering phase is the heart of the algorithm; we will show that after running RoundSecondMoments the algorithm has obtained clusters C1,,CkC_{1},\ldots,C_{k} which err from the ground-truth clustering on only a 2O(t)tt/2poly(k)Δt\tfrac{2^{O(t)}t^{t/2}\operatorname{poly}(k)}{\Delta^{t}}-fraction of points. To obtain estimates μ^i\hat{\mu}_{i} of the underlying means from such a clustering, one simple option is to output the empirical mean of the clusters. However, without additional pruning this risks introducing error in the mean estimates which grows with the ambient dimension dd. By using the robust mean estimation algorithm instead to obtain mean estimates from the clusters we obtain errors in the mean estimates which depend only on the number of clusters kk, the between-cluster separation Δ\Delta, and the number tt of bounded moments.

We observe that LearnMixtureMeans can be implemented in time nO(t)n^{O(t)}. The main theorem requires nkO(1)dO(t)n\geq k^{O(1)}d^{O(t)}, which means that the final running time of the algorithm is (kdt)O(t)(kd^{t})^{O(t)}.As discussed in Section 4, correctness of our algorithm at the level of numerical accuracy requires that the coefficients of every polynomial in the SoS program A^\widehat{\mathcal{A}} (and every polynomial in the SoS proofs we use to analyze A^\widehat{\mathcal{A}}) are polynomially bounded. This may not be the case if some vectors μ1,,μm\mu_{1},\ldots,\mu_{m} have norms μidω(1)\|\mu_{i}\|\geq d^{\omega(1)}. This can be fixed by naively clustering the samples X1,,XnX_{1},\ldots,X_{n} via single-linkage clustering, then running LearnMixtureMeans on each cluster. It is routine to show that the diameter of each cluster output by a naive clustering algorithm is at most poly(d,k)\operatorname{poly}(d,k) under our assumptions, and that with high probability single-linkage clustering produces a clustering respecting the distributions Di\mathcal{D}_{i}. Hence, by centering each cluster before running LearnMixtureMeans we can assume that μipoly(d,k)\|\mu_{i}\|\leq\operatorname{poly}(d,k) for every idi\leq d.

2 Proof of main theorem

In this section we prove our main theorem using the key lemmata; in the following sections we prove the lemmata.

(Empirical moments) For every cluster S_{j}=\{X_{i}\,:\,X_{i}\text{ is from\mathcal{D}_{j}}\}, the system A^\widehat{\mathcal{A}} from Lemma 4.1 with α=1/m\alpha=1/m and τ=Δt\tau=\Delta^{-t} has a solution which takes w{0,1}nw\in\{0,1\}^{n} to be the 0/10/1 indicator vector of SjS_{j}.

(Empirical means) Let μj\overline{\mu}_{j} be the empirical mean of cluster SjS_{j}. The μj\overline{\mu}_{j}’s satisfy μiμiΔt\|\overline{\mu}_{i}-\mu_{i}\|\leq\Delta^{-t}.

We note a few useful consequences of these conditions, especially (D1). First of all, it implies all clusters have almost the same size: (1Δt)nkSj(1+Δt)nk(1-\Delta^{-t})\cdot\tfrac{n}{k}\leq|S_{j}|\leq(1+\Delta^{-t})\cdot\tfrac{n}{k}. Second, it implies that all clusters have explicitly bounded moments: for every SjS_{j},

Lemmas

The following key lemma captures our SoS identifiability proof for mixture models.

The other main lemma shows that conditions (D1) and (D2) occur with high probability.

With notation as above, conditions (D1) and (D2) simultaneously occur with probability at least 11/d151-1/d^{15} over samples X1,,XnX_{1},\ldots,X_{n}, so long as ndO(t)kO(1)n\geq d^{O(t)}k^{O(1)}, for Δ1\Delta\geq 1.

Lemma 5.4 follows from Lemma 4.1, for (D1), and standard concentration arguments for (D2). Now we can prove the main theorem.

which can be rearranged to give MA2n2O(t)tt/2k4Δt\|M-A\|^{2}\leq n\cdot\tfrac{2^{O(t)}t^{t/2}k^{4}}{\Delta^{t}}. ∎

3 Identifiability

In this section we prove Lemma 5.3. We use the following helpful lemmas. The first is in spirit an SoS version of Lemma 2.1.

Let μ1,,μk,D1,,Dk,t\mu_{1},\ldots,\mu_{k},\mathcal{D}_{1},\ldots,\mathcal{D}_{k},t be as in Theorem 5.1. Let μi\overline{\mu}_{i} be as in (D1). Suppose (D1) occurs for samples X1,,XnX_{1},\ldots,X_{n}. Let A\mathcal{A} be the system from Lemma 4.1, with α=1/k\alpha=1/k and any τ\tau. Then

The second lemma is an SoS triangle inequality, capturing the consequences of separation of the means. The proof is standard given Fact A.2.

The last lemma helps put the previous two together. Although we have phrased this lemma to concorde with the mixture model setting, we note that the proof uses nothing about mixture models and consists only of generic manipulations of pseudodistributions.

Now we have the tools to prove Lemma 5.3.

4 Proof of Lemma 5.5

In this subsection we prove Lemma 5.5. We use the following helpful lemmata. The first bounds error from samples selected from the wrong cluster using the moment inequality.

Let j,A,X1,,Xn,μj,μjj,\mathcal{A},X_{1},\ldots,X_{n},\mu_{j},\overline{\mu}_{j} be as in Lemma 5.5. Then

The proof goes by Hölder’s inequality followed by the moment inequality in A\mathcal{A}. Carrying this out, by Fact A.6 and evenness of tt,

Then, using the main inequality in A\mathcal{A},

The second lemma bounds error from deviations in the empirical tt-th moments of the samples from the jj-th cluster.

Let μ1,,μk,D1,,Dk\mu_{1},\ldots,\mu_{k},\mathcal{D}_{1},\ldots,\mathcal{D}_{k} be as in Theorem 5.1. Suppose condition (D1) holds for samples X1,,XnX_{1},\ldots,X_{n}. Let w1,,wnw_{1},\ldots,w_{n} be indeterminates. Let u=u1,,udu=u_{1},\ldots,u_{d} be an indeterminate. Then for every j[k]j\in[k],

The first step is Hölder’s inequality again:

We can prove Lemma 5.5 by putting together Lemma 5.8 and Lemma 5.9.

Let j[k]j\in[k] be a cluster and recall aj{0,1}na_{j}\in\{0,1\}^{n} is the 0/10/1 indicator for the samples in cluster jj. Let SjS_{j} be the samples in the jj-th cluster, with empirical mean μj\overline{\mu}_{j}. We begin by writing aj,wμμj2\langle a_{j},w\rangle\|\mu-\overline{\mu}_{j}\|^{2} in terms of samples X1,,XnX_{1},\ldots,X_{n}.

Hence, using (a+b)t2t(at+bt)(a+b)^{t}\leq 2^{t}(a^{t}+b^{t}), we obtain

5 Proof of Lemma 5.7

We prove Lemma 5.7. The proof only uses standard SoS and pseudodistribution tools. The main inequality we will use is the following version of Hölder’s inequality.

We first establish the following inequality.

Raising both sides to the t/2t/2 power gives

6 Rounding

In this section we state and analyze our second-moment round algorithm. As have discussed already, our SoS proofs in the mixture model setting are quite strong, meaning that the rounding algorithm is relatively naive.

To get started analyzing the algorithm, we need a definition.

We proceed inductively. We first prove the base case. WLOG assume that the algorithm picks v1v_{1}, and that v1v_{1} is good, and is from component jj. Then, for all iIg,ji\in I_{g,j}, by the triangle inequality we have viv122n1/2B\|v_{i}-v_{1}\|_{2}\leq 2\frac{n^{1/2}}{B}, and so Ig,jC1I_{g,j}\subseteq C_{1}. Moreover, if iIg,ji\in I_{g,j^{\prime}} for some jjj^{\prime}\neq j, we have

and so in this case i∉C1i\not\in C_{1}. Hence C1=Ig,jB1C_{1}=I_{g,j}\cup B_{1} for some B1IbB_{1}\subseteq I_{b}.

There are at most ε2En\varepsilon^{2}En indices which are EE-bad; i.e. Ibε2En|I_{b}|\leq\varepsilon^{2}En.

from which the claim follows by simplifying. ∎

With probability at least 1ε2m31-\varepsilon^{2}m^{3}, the algorithm RoundSecondMoments chooses good indices in all kk iterations.

By Lemma 5.13, in the first iteration the probability that a bad vector is chosen is at most ε2E\varepsilon^{2}E. Conditioned on the event that in iterations 1,,a1,\ldots,a the algorithm has chosen good vectors, then by Lemma 5.12, there is at least one jaj_{a} so that no points in Ig,jaI_{g,j_{a}} have been removed. Thus at least (1δ)n/m(1-\delta)n/m vectors remain, and in total there are at most ε2En\varepsilon^{2}En bad vectors, by Lemma 5.13. So, the probability of choosing a bad vector is at most ε2Em\varepsilon^{2}Em. Therefore, by the chain rule of conditional expectation and our assumption , the probability we never choose a bad vector is at least

Choosing E=mE=m this is (1ε2m2)m1ε2m3(1-\varepsilon^{2}m^{2})^{m}\geq 1-\varepsilon^{2}m^{3}. as claimed. ∎

Now Theorem 5.11 follows from putting together the lemmas.

Robust estimation: algorithm and analysis

Our main contribution in this section is a formal description of an algorithm EstimateMean which makes these ideas rigorous, and the proof of the following theorem about its correctness:

As a remark, observe that if we set t=2log1/εt=2\log 1/\varepsilon, then the error becomes O(εlog1/ε)O(\varepsilon\sqrt{\log 1/\varepsilon}). Thus, with n=O(dO(log1/ε)/ε2)n=O(d^{O(\log 1/\varepsilon)}/\varepsilon^{2}) samples and nO(log1/ε)=dO(log1/ε)2n^{O(\log 1/\varepsilon)}=d^{O(\log 1/\varepsilon)^{2}} runtime, we achieve the same error bounds for general explicitly bounded distributions as the best known polynomial time algorithms achieve for Gaussian mean estimation.

Throughout this section, let [n]=SgSb[n]=S_{g}\cup S_{b}, where SgS_{g} is the indices of the uncorrupted points, and SbS_{b} is the indices of the corrupted points, so that Sb=εn|S_{b}|=\varepsilon n by assumption. Moreover, let Y1,,YnY_{1},\ldots,Y_{n} be iid from D\mathcal{D} so that Yi=XiY_{i}=X_{i} for all iSgi\in S_{g}.

We now state some additional tools we will require in our algorithm.

We will require the following elementary pruning algorithm, which removes all points which are very far away from the mean. We require this only to avoid some bit-complexity issues in semidefinite programming; in particular we just need to ensure that the vectors X1,,XnX_{1},\ldots,X_{n} used to form the SDP have polynomially-bounded norms. Formally:

Let ε,t,μ\varepsilon,t,\mu^{*}, and X1,,XnX_{1},\ldots,X_{n} be as in Theorem 6.1. There is an algorithm NaivePrune, which given ε,t\varepsilon,t and X1,,XnX_{1},\ldots,X_{n}, runs in time O(εdn2)O(\varepsilon dn^{2}), and outputs a subset S[n]S\subseteq[n] so that with probability 11/d101-1/d^{10}, the following holds:

No uncorrupted points are removed, that is SgSS_{g}\subseteq S, and

For all iSi\in S, we have XiμO(d)\|X_{i}-\mu^{*}\|\leq O(d).

In this case, we say that NaivePrune succeeds.

This algorithm goes by straightforward outlier-removal. It is very similar the procedure described in Fact 4.18 of [DKK+16] (using bounded tt-th moments instead of sub-Gaussianity), so we omit it.

Satisfiability

In our algorithm, we will use the same set of polynomial equations A^\widehat{\mathcal{A}} as in Lemma 4.1. However, the data we feed in does not exactly fit the assumptions in the Lemma. Specifically, because the adversary is allowed to remove an ε\varepsilon-fraction of good points, the resulting uncorrupted points are no longer iid from D\mathcal{D}. Despite this, we are able to specialize Lemma 4.1 to this setting:

We sketch the proof of Lemma 6.3 in Section 7.4.

2 Formal Algorithm Specification

With these tools in place, we can now formally state the algorithm. The formal specification of this algorithm is given in Algorithm 3.

The first two lines of Algorithm 3 are only necessary for bit complexity reasons, since we cannot solve SDPs exactly. However, since we can solve them to doubly-exponential accuracy in polynomial time, it suffices that all the quantities are at most polynomially bounded (indeed, exponentially bounded suffices) in norm, which these two lines easily achieve. For the rest of this section, for simplicity of exposition, we will ignore these issues.

3 Deterministic conditions

With these tools in place, we may now state the deterministic conditions under which our algorithm will succeed. Throughout this section, we will condition on the following events holding simultaneously:

We have the following concentration of the uncorrupted points:

We have the following concentration of the empirical tt-th moment tensor:

The following lemma says that with high probability, these conditions hold simultaneously:

We defer the proof of this lemma to the Appendix.

For simplicity of notation, throughout the rest of the section, we will assume that NaivePrune does not remove any points whatsoever. Because we are conditioning on the event that it removes no uncorrupted points, it is not hard to see that this is without loss of generality.

4 Identifiability

Our main identifiability lemma is the following.

The third type of error is similar in spirit: it is the contribution of the original uncorrupted points that the adversary removed. Formally:

Finally, the fourth type of error comes from the εn\varepsilon n adversarially-chosen vectors. We prove this lemma by using the bounded-moments inequality in A\mathcal{A}.

With these lemmas in place, we now have the tools to prove Lemma 6.5.

where (a) follows from the mean axioms, (b) follows from splitting up the uncorrupted and the corrupted samples, (c) follows by adding and subtracting 11 to each term in SgS_{g}, and (d) follows from the assumption that Yi=XiY_{i}=X_{i} for all i[n]i\in[n]. We will rearrange the last term by adding and subtracting μ\mu. Note the following polynomial identity:

and put it together with the above to get

Now we use t(x+y+z+w)texp(t)(xt+yt+zt+wt)\vdash_{t}\,(x+y+z+w)^{t}\leq\exp(t)\cdot(x^{t}+y^{t}+z^{t}+w^{t}) for any even tt, and Lemma 6.6, Lemma 6.7, and Lemma 6.9 and simplify to conclude

Lastly, since A2iTwi(12ε)n\mathcal{A}\vdash_{2}\,\sum_{i\in T}w_{i}\geq(1-2\varepsilon)n, we get

5 Rounding

6 Proofs of Lemmata 6.6–6.9

We first prove Lemma 6.6, which is a relatively straightforward application of SoS Cauchy Schwarz.

where the last inequality follows from (E3). This completes the proof. ∎

Before we prove Lemmata 6.7–6.9, we prove the following lemma which we will use repeatedly:

where (a) follows from (E4) and (b) follows from 10t10t-explicitly boundedness. ∎

We now return to the proof of the remaining Lemmata.

We start by applying Hölder’s inequality, Fact A.6, (implicitly using that wi2=wi2(1wi)2=1wiw_{i}^{2}=w_{i}\vdash_{2}\,(1-w_{i})^{2}=1-w_{i}), to get

We apply Hölder’s inequality to obtain that

where (a) follows from the assumption on the size of SbS_{b} and since the additional terms in the sum are SoS, and (b) follows follows from Lemma 6.10. This completes the proof. ∎

The proof is very similar to the proof of the two previous lemmas, except that we use the moment bound inequality in A\mathcal{A}. Getting started, by Hölder’s:

Combining this with the moment bound in A\mathcal{A},

Finally, clearly A2iTwiεn\mathcal{A}\vdash_{2}\,\sum_{i\notin T}w_{i}\leq\varepsilon n, which finishes the proof. ∎

Encoding structured subset recovery with polynomials

The goal in this section is to prove Lemma 4.1. The eventual system A^\widehat{\mathcal{A}} of polynomial inequalities we describe will involve inequalities among matrix-valued polynomials. We start by justifying the use of such inequalities in the SoS proof system.

Let x=(x1,,xn)x=(x_{1},\ldots,x_{n}) be indeterminates. We describe a proof system which can reason about inequalities of the form M(x)0M(x)\succeq 0, where M(x)M(x) is a symmetric matrix whose entries are polynomials in xx.

if there are vector-valued polynomials {rSj}jN,S[m]\{r_{S}^{j}\}_{j\leq N,S\subseteq[m]} (where the SS’s are multisets), a matrix BB, and a matrix QQ whose entries are polynomials in the ideal generated by q1,,qmq_{1},\ldots,q_{m}, such that

and furthermore that deg(j(rSj(x))(rSj(x)))[iSMi(x)]d\deg\left(\sum_{j}(r_{S}^{j}(x))(r_{S}^{j}(x))^{\top}\right)\otimes\left[\otimes_{i\in S}M_{i}(x)\right]\leq d for every S[m]S\subseteq[m], and degQd\deg Q\leq d. Observe that in the case M1,,Mm,MM_{1},\ldots,M_{m},M are actually 1×11\times 1 matrices, this reduces to the usual notion of scalar-valued sum of squares proofs.

For completeness, we prove the following lemmas in the appendix.

2 Warmup: Gaussian moment matrix-polynomials

Size: (1τ)αni[n]wi(1+τ)αn(1-\tau)\alpha n\leq\sum_{i\in[n]}w_{i}\leq(1+\tau)\alpha n.

Empirical mean: μ=1i[n]wii[n]wiXi\mu=\tfrac{1}{\sum_{i\in[n]}w_{i}}\sum_{i\in[n]}w_{i}X_{i}.

tt-th Moments: the tt-th empirical moments of the vectors selected by the vector ww, centered about μ\mu, are subgaussian. That is,

The second property is already phrased as two polynomial inequalities, and the third can be rearranged to a polynomial equation. For the first, we use polynomial equations wi2=wiw_{i}^{2}=w_{i} for every i[n]i\in[n]. The moment constraint will be the most difficult to encode. We give two versions of this encoding: a simple one which will work when the distribution of the structured subset of samples to be recovered is Gaussian, and a more complex version which allows for any explicitly bounded distribution. For now we describe only the Gaussian version. We state some key lemmas and prove them for the Gaussian case. We carry out the general case in the following section.

To encode the bounded moment constraint, for this section we let M(w,μ)M(w,\mu) be the following matrix-valued polynomial

For parameters α\alpha\in (for the size of the subset), tt (for which empirical moment to control), and τ>0\tau>0 (to account for some empirical deviations), the structured subset axioms are the following matrix-polynomial inequalities on variables w=(w1,,wn),μ=(μ1,,μd)w=(w_{1},\ldots,w_{n}),\mu=(\mu_{1},\ldots,\mu_{d}).

booleanness: wi2=wiw_{i}^{2}=w_{i} for all i[n]i\in[n]

size: (1τ)αni[n]wi(1+τ)αn(1-\tau)\alpha n\leq\sum_{i\in[n]}w_{i}\leq(1+\tau)\alpha n

μ\mu is the empirical mean: μi[n]wi=i[n]wiXi\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}.

Notice that in light of the last constraint, values for the variables μ\mu are always determined by values for the variables ww, so strictly speaking μ\mu could be removed from the program. However, we find it notationally convenient to use μ\mu. We note also that the final constraint, that μ\mu is the empirical mean, will be used only for the robust statistics setting but seems unnecessary in the mixture model setting.

Next, we state and prove some key lemmas for this Gaussian setting, as warmups for the general setting.

Suppose SS has size exactly (1τ)αn(1-\tau)\alpha n; otherwise replace SS with a random subset of SS of size exactly (ατ)n(\alpha-\tau)n. As a solution to the polynomials, we will take ww to be the indicator vector of SS and μ=1Si[n]wiXi\mu=\frac{1}{|S|}\sum_{i\in[n]}w_{i}X_{i}. The booleanness and size axioms are trivially satisfied. The spectral inequality

follows from concentration of the empirical mean to the true mean μ\mu^{*} and standard matrix concentration (see e.g. [Tro12]). ∎

The next lemma is actually a corollary of Lemma 7.2.

3 Moment polynomials for general distributions

We start by defining polynomial equations A^\widehat{\mathcal{A}}, for which we introduce some extra variables For every pair of multi-indices γ,ρ\gamma,\rho over [k][k] with degree at most t/2t/2, we introduce a variable Mγ,ρM_{\gamma,\rho}. The idea is that M=[Mγ,ρ]γ,ρM=[M_{\gamma,\rho}]_{\gamma,\rho} forms an nt/2×nt/2n^{t/2}\times n^{t/2} matrix. By imposing equations of the form Mγ,ρ=fγ,ρ(w,μ)M_{\gamma,\rho}=f_{\gamma,\rho}(w,\mu) for some explicit polynomials fγ,ρf_{\gamma,\rho} of degree O(t)O(t), we can ensure that

(This equation should be interpreted as an equality of polynomials in indeterminates uu.) Let L\mathcal{L} be such a family of polynomial equations. Our final system A^(α,t,τ)\widehat{\mathcal{A}}(\alpha,t,\tau) of polynomial equations and inequalities follows. The important parameters are α\alpha, controlling the size of the set of samples to be selected, and tt, how many moments to control. The parameter τ\tau is present to account for random fluctuations in the sizes of the cluster one wants to recover.

Let A^(α,t,τ)\widehat{\mathcal{A}}(\alpha,t,\tau) be the set of (matrix)-polynomial equations and inequalities on variables w,μ,Mγ,ρw,\mu,M_{\gamma,\rho} containing the following.

Booleanness: wi2=wiw_{i}^{2}=w_{i} for all i[n]i\in[n]

Size: (1τ)αnwi(1+τ)αn(1-\tau)\alpha n\leq\sum w_{i}\leq(1+\tau)\alpha n.

Empirical mean: μi[n]wi=i[n]wiXi\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}.

The equations L\mathcal{L} on MM described above.

In the remainder of this section we prove the satisfiability and moment bounds parts of Lemma 4.1. To prove the lemma we will need a couple of simple facts about SoS proofs.

and similarly for 1mi[n][Yi,utXi,ut]\tfrac{1}{m}\sum_{i\in[n]}\left[\langle Y_{i},u\rangle^{t}-\langle X_{i},u\rangle^{t}\right].

Expanding Yi,ut\langle Y_{i},u\rangle^{t}, we get

For each term, by Cauchy-Schwarz, tXi,usv,utsXisvtsut\vdash_{t}\,\langle X_{i},u\rangle^{s}\langle v,u\rangle^{t-s}\leq\|X_{i}\|^{s}\|v\|^{t-s}\cdot\|u\|^{t}. Putting these together with the hypothesis on 1nXis\tfrac{1}{n}\|X_{i}\|^{s} and counting terms finishes the proof. ∎

By taking a random subset SS if necessary, we assume S=(1τ)αn=m|S|=(1-\tau)\alpha n=m. We describe a solution to the system A^\widehat{\mathcal{A}}. Let ww be the 0/10/1 indicator vector for SS. Let μ=1miSwiXi\mu=\tfrac{1}{m}\sum_{i\in S}w_{i}X_{i}. This satisfies the Boolean-ness, size, and empirical mean axioms.

Describing the assignment to the variables {Mγ,ρ}\{M_{\gamma,\rho}\} takes a little more work. Re-indexing and centering, let Y1=Xi1μ,,Ym=XimμY_{1}=X_{i_{1}}-\mu,\ldots,Y_{m}=X_{i_{m}}-\mu be centered versions of the samples in SS, where S={i1,,im}S=\{i_{1},\ldots,i_{m}\} and μ\mu remains the empirical mean 1miSXi\tfrac{1}{m}\sum_{i\in S}X_{i}. First suppose that the following SoS proof exists:

Just substituting definitions, we also obtain

Let Mγ,ρ=Pγ,ρM_{\gamma,\rho}=P_{\gamma,\rho}. Then clearly M0M\succeq 0 and M,w,μM,w,\mu together satisfy L\mathcal{L}.

It remains to show that the first SoS proof exists with high probability for large enough mm. Since tt is even and 0.1>τ>00.1>\tau>0, it is enough to show that

Let Zi=XiμZ_{i}=X_{i}-\mu^{*}, where μ\mu^{*} is the true mean of D\mathcal{D}. Let

(bounded norms) for every sts\leq t it holds that 1mi[m]Ziss100sds/2\tfrac{1}{m}\sum_{i\in[m]}\|Z_{i}\|^{s}\leq s^{100s}d^{s/2}.

(concentration of empirical mean) μμd5t\|\mu-\mu^{*}\|\leq d^{-5t}.

Starting with a(u)a(u), using Fact 7.5, it is enough that 2tCt1v142^{t}C^{t-1}\|v\|\leq\tfrac{1}{4}, where v=μμv=\mu-\mu^{*} and CC is such that 1mi[m]ZisCs\tfrac{1}{m}\sum_{i\in[m]}\|Z_{i}\|^{s}\leq C^{s}. By 1 and 2, we can assume vd5t\|v\|\leq d^{-5t} and C=t100d1/2C=t^{100}d^{1/2}. Then the conclusion follows for t3t\geq 3.

We turn to b(u)b(u). A typical coefficient of b(u)b(u) in the monomial basis—say, the coefficient of uθu^{\theta} for some multiindiex θ\theta of degree θ=t|\theta|=t, looks like

By assumption this is at most d10td^{-10t} in magnitude, so the sum of squared coefficients of b(u)b(u) is at most d18td^{-18t}. The bound on b(u)b(u) for d2d\geq 2. ∎

Using this in conjunction with the linear equations L\mathcal{L},

(bounded norms) for every sts\leq t it holds that 1mi[m]Ziss100sds/2\tfrac{1}{m}\sum_{i\in[m]}\|Z_{i}\|^{s}\leq s^{100s}d^{s/2}.

(concentration of empirical mean) 1mi[m]Zid5t\left\lVert\tfrac{1}{m}\sum_{i\in[m]}Z_{i}\right\rVert\leq d^{-5t}.

The proofs are standard applications of central limit theorems, in particular the Berry-Esseen central limit theorem [Ber41], since all the quantities in question are sums of iid random variables with bounded moments. We will prove only the first statement; the others are similar.

Finally we remark on the polynomial-time algorithm to find a pseudoexpectation satisfying A^\widehat{\mathcal{A}}. As per [BS17], it is just necessary to ensure that if x=(w,μ)x=(w,\mu), the polynomials in A^\widehat{\mathcal{A}} include x2M\|x\|^{2}\leq M for some large number MM. In our case the equation x2(nkm)O(1)\|x\|^{2}\leq(nkm)^{O(1)} can be added without changing any arguments.

4 Modifications for robust estimation

We briefly sketch how the proof of Lemma 4.1 may be modified to prove Lemma 6.3. The main issue is that A^\widehat{\mathcal{A}} of Lemma 4.1 is satisfiable when there exists an SoS proof

where μ\mu is the empirical mean of XiX_{i} such that wi=1w_{i}=1. In the proof of Lemma 4.1 we argued that this holds when ww is the indicator for a set of iid samples from a 10t10t-explicitly bounded distribution D\mathcal{D}. However, in the robust setting, ww should be taken to be the indicator of the (1ε)n(1-\varepsilon)n good samples remaining from such a set of iid samples after εn\varepsilon n samples are removed by the adversary. If Y1,,YnY_{1},\ldots,Y_{n} are the original samples, with empirical mean μ\mu^{*}, the proof of Lemma 4.1 (with minor modifications in constants) says that with high probability,

For small-enough ε\varepsilon, this also means that

This almost implies that A^\widehat{\mathcal{A}} is satisfiable given the ε\varepsilon-corrupted vectors X1,,XnX_{1},\ldots,X_{n} and parameter α=(1ε)n\alpha=(1-\varepsilon)n, except for that μ=1ni[n]Yi\mu^{*}=\tfrac{1}{n}\sum_{i\in[n]}Y_{i} and we would like to replace it with μ=1(1ε)ni goodXi\mu=\tfrac{1}{(1-\varepsilon)n}\sum_{i\text{ good}}X_{i}. This can be accomplished by noting that, as argued in Section 6, with high probability μμO(tε11/t)\|\mu-\mu^{*}\|\leq O(t\cdot\varepsilon^{1-1/t}).

Acknowledgements

The authors would like to thank David Steurer, Daniel Freund, Guatam Kamath, Pablo Parillo, and especially Aravindan Vijayaraghavan for some helpful conversations.

References

Appendix A Toolkit for sum of squares proofs

Let x1,,xn,y1,,ynx_{1},\ldots,x_{n},y_{1},\ldots,y_{n} be indeterminates. Then

Let x,yx,y be nn-length vectors of indeterminates. Then

The sum of squares proof of Cauchy-Schwarz implies that x2+y22x,y\|x\|^{2}+\|y\|^{2}-2\langle x,y\rangle is a sum of squares. Now we just expand

Let MM be a matrix whose rows and columns are indexed by multisets S[n]S\subseteq[n] of sizes aa and bb. Thus MM has four blocks: an (a,a)(a,a) block, an (a,b)(a,b) block, a (b,a)(b,a) block, and a (b,b)(b,b) block. In the (a,b)(a,b) and (b,a)(b,a) blocks, put matrices Mab,MbaM_{ab},M_{ba} such that xa,Mabxb=12.P(x)\langle x^{\otimes a},M_{ab}x^{\otimes b}\rangle=\tfrac{1}{2}\,.P(x). In the (a,a)(a,a) and (b,b)(b,b) blocks, put CI\sqrt{C}\cdot I. Then, letting z=(xa,xb)z=(x^{\otimes a},x^{\otimes b}), we get z,Mz=C(x2a+x2b)P(x)\langle z,Mz\rangle=\sqrt{C}(\|x\|^{2a}+\|x\|^{2b})-P(x). Note that MabC\|M_{ab}\|\leq\sqrt{C} by hypothesis, so M0M\succeq 0, which completes the proof. ∎

Let u=(u1,,uk)u=(u_{1},\ldots,u_{k}) be a vector of indeterminantes. Let DD be sub-Gaussian with variancy proxy 11. Let t0t\geq 0 be an integer. Then we have

Expand the polynomial in question. We have

Let x1,,xn,y1,,ynx_{1},\ldots,x_{n},y_{1},\ldots,y_{n} be indeterminates. Then

We will only prove the first inequality. The second inequality follows since wi2=wi2wixi=wi(wixi)w_{i}^{2}=w_{i}\vdash_{2}\,w_{i}x_{i}=w_{i}\cdot(w_{i}x_{i}), applying the first inequality, and observing that wi2=wiqwiq=wiw_{i}^{2}=w_{i}\vdash_{q}\,w_{i}^{q}=w_{i}.

Applying Cauchy-Schwarz (Fact A.1) and the axioms, we obtain to start that for any even number tt,

Applying Fact A.1 one more time to get (inwixiq/2)(inwi2)(inxiq)\left(\sum_{i\leq n}w_{i}x_{i}^{q/2}\right)\leq\left(\sum_{i\leq n}w_{i}^{2}\right)\left(\sum_{i\leq n}x_{i}^{q}\right) and then the axioms wi2=wiw_{i}^{2}=w_{i} completes the proof. ∎

In this section, we show that many natural high dimensional distributions are explicitly bounded. Recall that if a univariate distribution XX sub-Gaussian (with variancy proxy σ\sigma) with mean μ\mu then we have the following bound on its even centered moments for t4t\geq 4:

More generally, we will say a univariate distribution is tt-bounded with mean μ\mu and variance proxy σ\sigma if the following general condition holds for all even 4st4\leq s\leq t:

The factor of 1/21/2 in this expression is not important and can be ignored upon first reading.

Our main result in this section is that any rotation of products of independent tt-bounded distributions with variance proxy 1/21/2 is tt-explicitly bounded with variance proxy 11:

Since the definition of explicitly bounded is clearly rotation invariant, it suffices to show that D\mathcal{D}^{\prime} is tt-explicitly bounded. For any vector of indeterminants uu, and for any 4st4\leq s\leq t even, we have

where XX^{\prime} is an independent copy of XX, and the last line follows from SoS Cauchy-Schwarz. We then expand the resulting polynomial in the monomial basis:

since all α\alpha with odd monomials disappear since XXX-X^{\prime} is a symmetric product distribution. By tt-boundedness, all remaining coefficients are at most scss^{cs}, from which we deduce

which proves that D\mathcal{D}^{\prime} is tt-explicitly bounded, as desired. ∎

As a corollary observe this trivially implies that all Guassians N(μ,Σ)\operatorname{\mathcal{N}}(\mu,\Sigma) with ΣI\Sigma\preceq I are tt-explicitly bounded for all tt.

We note that our results are tolerant to constant changes in the variancy proxy (just by scaling down). In particular, this implies that our results immediately apply for all rotations of products of tt-bounded distributions with a loss of at most 22.

Appendix B Sum of squares proofs for matrix positivity – omitted proofs

By hypothesis, there are rSjr_{S}^{j} and BB such that

Now, let T[m]T\subseteq[m] and pp be a polynomial. Let M=iTMiM^{\prime}=\otimes_{i\in T}M_{i}. Suppose that deg(p2MM)2d\deg\left(p^{2}\cdot M\otimes M^{\prime}\right)\leq 2d. Using the hypothesis on MM, we obtain

Appendix C Omitted Proofs from Section 6

We will show that each event (E1)–(E4) holds with probability at least 1d81-d^{-8}. Clearly for dd sufficiently large this implies the desired guarantee. That (E1) and (E2) occur with probability 1d81-d^{-8} follow from Lemmas 6.2 and 6.3, respectively. It now suffices to show (E3) and (E4) holds with high probability. Indeed, that (E4) holds with probability 1d81-d^{-8} follows trivially from the same proof of Lemma 4.1 (it is in fact a simpler version of this fact).

By basic concentration arguments (see e.g. [Ver10]), we know that by our choice of nn, with probability 1d81-d^{-8} we have that

Condition on the event that this and (E4) simultaneously hold. Recall that YiY_{i} for i=1,,ni=1,\ldots,n are defined so that YiY_{i} are iid and Yi=XiY_{i}=X_{i} for iSgi\in S_{g}. By the triangle inequality, we have

where (a) follows from (E4), and (b) follows since δtt\delta\ll t^{t}. Hence

Taking the tt-th root on both sides and combining it with (7) yields

Appendix D Mixture models with nonuniform weights

In this section we describe at a high level how to adapt the algorithm given in Section 5 to handle non-uniform weights. We assume the mixture components now have mixture weights ηλ1λk1\eta\leq\lambda_{1}\leq\ldots\leq\lambda_{k}\leq 1 where λi=1\sum\lambda_{i}=1, where η>0\eta>0 is some fixed constant. We still assume that all pairs of means satisfy μiμjkγ\|\mu_{i}-\mu_{j}\|\geq k^{\gamma} for all iji\neq j. In this section we describe an algorithm LearnNonUniformMixtureModel, and we sketch a proof of the following theorem concerning its correctness:

However, the rounding algorithm we require here is somewhat more involved than the naive rounding algorithm used previously for learning mixture models. In particular, we no longer know exactly the Frobenius norm of the optimal solution: we cannot give tight upper and lower bounds. This is because components with mixing weights which are just below the threshold α\alpha^{\prime} may or may not contribute to the optimal solution that the SDP finds. Instead, we develop a more involved rounding algorithm RoundSecondMomentsNonuniform, which we describe below.

Our invariant is that every time we have a feasible solution to the SDP, we remove at least one cluster (we make this more formal below). Repeatedly run the SDP with this α\alpha^{\prime} until we no longer get a feasible solution, and then repeat with a slightly smaller α\alpha^{\prime}. After the loop terminates, output the empirical mean of every cluster. The formal specification of this algorithm is given in Algorithm 4.

It is not hard to show, via arguments exactly as in Section 6 and 7 that the remaining fraction of points from these components which we have not removed as well as the small fraction of points we have removed from good components do not affect the calculations, and so we will assume for simplicity in the rest of this discussion that we have removed all samples from components jj with λjα+O(ξ)\lambda_{j}\geq\alpha^{\prime}+O(\xi).

Here we outline the proof of correctness of Algorithm 4. The proof follows very similar ideas as the proof of correctness of Algorithm 1, and so for conciseness we omit many of the details. As before, for simplicity assume that the naive clustering returns only one cluster, as otherwise we can work on each cluster separately, so that for all ii, we have μiO(poly(d,k))\|\mu_{i}\|\leq O(\operatorname{poly}(d,k)) after centering.

We now show why this invariant holds. Clearly this holds at the beginning of the algorithm. We show that if it holds at any step, it must also hold at the next time at which the SDP is feasible. Fix such an α\alpha^{\prime}. By assumption, we have removed almost all points from components jj with λjα+k8\lambda_{j}\geq\alpha^{\prime}+k^{-8}, and have only removed a very small fraction of points not from these components.

By basic concentration, we have λjnSjo(n)\left|\lambda_{j}n-|S_{j}|\right|\leq o(n) for all jj except with negligble probability, and so for the rest of the section, for simplicity, we will slightly cheat and assume that λjn=Sj\lambda_{j}n=|S_{j}|. It is not hard to show that this also does not effect any calculations.

The main observation is that for any choice of α\alpha^{\prime}, by essentially same logic as in Section 5, we still have the following bound for all iji\neq j for an α\alpha^{\prime} well-behaved run:

for A^\widehat{\mathcal{A}} instantiated with α=α\alpha=\alpha^{\prime}, where the last line follows by our choice of tt sufficiently large.

With parameters as above, for any α\alpha^{\prime} well-behaved run, we have A^O(t)ai,wO(ξ2)αn\widehat{\mathcal{A}}\vdash_{O(t)}\,\langle a_{i},w\rangle\leq O(\xi^{2})\cdot\alpha^{\prime}n for any jj so that λjn(αO(ξ4))n\lambda_{j}n\leq(\alpha^{\prime}-O(\xi^{4}))n.

from which we deduce A^O(t)ai,wO(ξ2)αn\widehat{\mathcal{A}}\vdash_{O(t)}\,\langle a_{i},w\rangle\leq O(\xi^{2})\cdot\alpha^{\prime}n. ∎

We now show that under these conditions, there is an algorithm to remove a cluster:

D.2 Rounding Well-behaved runs

We now turn to correctness of this algorithm. Define TT to be the set of clusters jj with λjαO(ξ4)|\lambda_{j}-\alpha^{\prime}|\leq O(\xi^{4}). We first show:

The first term is at most O(d2ηξ2)(α)2n2O(d^{2}\eta\xi^{2})(\alpha^{\prime})^{2}n^{2} by (8) and the second term is at most dO(ξ2)αndO(\xi^{2})\alpha^{\prime}n by Lemma D.2, so overall we have that

Observe that since vi21\|v_{i}\|^{2}\leq 1 and vi2=αn\sum\|v_{i}\|^{2}=\alpha^{\prime}n there are at least (11/100)αn(1-1/100)\alpha^{\prime}n vectors with vi2α/100\|v_{i}\|^{2}\geq\alpha^{\prime}/100. We have