Mixture Models, Robustness, and Sum of Squares Proofs

Samuel B. Hopkins, Jerry Li

Introduction

We propose and analyze a family of new algorithms for some fundamental high-dimensional statistical estimation problems. In particular, we give new algorithms for the following problems.

Mixture models, and especially Gaussian mixture models (where $\mathcal{D}_{1},\ldots,\mathcal{D}_{k}$ are Gaussian distributions) have been studied since Pearson in 1894 [Pea94]. Work in theoretical computer science dates at least to the pioneering algorithm of Dasgupta in 1999 [Das99], which has been followed by numerous other algorithms and lower bounds [Wu83, DS07, AK05, VW02, KK10, AM05, FSO06, KMV10, BS10, MV10, HK13, ABG+14, BCMV14, DK14, SOAJ14, HP15, XHM16, GHK15, LS17, RV17, DTZ17].

Robust estimation in the form we study here is a more recent transplant to theoretical computer science [DKK+16, LRV16, CSV16, DKS16, CJN17, DKK+17b, DKK+17a, SCV17], but statisticians have long sought outlier-robust estimators. Formal study of arbitrarily-bad/adversarially-chosen outliers originates in the 1960s with the advent of “breakdown points” in statistics [Hub64, Tuk75b, HRRS86, JP78, Ber06].

For mixture models, a special case of our main result yields the first progress in more than 15 years on efficiently clustering mixtures of separated spherical Gaussians. The question here is: if $\mathcal{D}_{1},\ldots,\mathcal{D}_{k}$ are all Gaussian with covariance identity and $k=\operatorname{poly}(d)$ , what is the minimum cluster separation $\Delta$ which allows for a polynomial-time algorithm to estimate $\mu_{1},\ldots,\mu_{k}$ from $\operatorname{poly}(k,d)$ samples from the mixture model? The guarantees of the previous best algorithms for this problem, which require $\Delta\geq O(k^{1/4})$ , are captured by a simple greedy clustering algorithm, sometimes called single-linkage clustering: when $\Delta\geq O(k^{1/4})$ , with high probability every pair of samples from the same cluster is closer in Euclidean distance than every pair of samples from differing clusters. We break this single-linkage clustering barrier: for every $\gamma>0$ we give a $\operatorname{poly}(k,d)$ -time algorithm for this problem when $\Delta>k^{\gamma}$ .

To do so we make novel algorithmic use of higher moments (in fact, $O(1/\gamma)$ moments) of the underlying distributions $\mathcal{D}_{i}$ . Our main technical contribution is a new algorithmic technique for finding either a structured subset of data points or the empirical mean of such a subset when the subset consists of independent samples from a distribution $\mathcal{D}$ which has bounded higher-order moments and there is a simple certificate of this boundedness. This technique leverages the Sum of Squares (SoS) hierarchy of semidefinite programs (SDPs), and in particular a powerful approach for designing SoS-based algorithms in machine learning settings, developed and used in [BKS14, BKS15, GM15, BM16, HSS15, MSS16, PS17]. We suspect use of higher moments is necessary in light of second-moment indistinguishability results for mixtures with small separation [AM05].

This SoS approach to unsupervised learning rests on a notion of simple identifiability proofs: the main step in designing an algorithm using SoS to recover some parameters $\theta$ from samples $x_{1},\ldots,x_{n}\sim p(x\,|\,\theta)$ is to prove in a restricted proof system that $\theta$ is likely to be uniquely identifiable from $x_{1},\ldots,x_{n}$ . We develop this thoroughly later on, but roughly speaking one may think of this as requiring the identifiability proof to use only simple inequalities, such as Cauchy-Schwarz and Hölder’s inequality, applied to low-degree polynomials. The simple identifiability proofs we construct for both the mixture models and robust estimation settings are heavily inspired by the robust estimation algorithms of Diakonikolas et al. [DKK+16].

Both of the problems we study have a long history; for now we just note some highlights and state our main results.

The problem of learning mixture models dates to Pearson in 1894, who invented the method of moments in order to separate a mixture of two Gaussians [Pea94]. Mixture models have since become ubiquitous in data analysis across many disciplines [TSM85, MP04]. In recent years, computer scientists have devised many ingenious algorithms for learning mixture models as it became clear that classical statistical methods (e.g. maximum likelihood estimation) often suffer from computational intractability, especially when there are many mixture components or the components are high dimensional.

Separation $\Delta=k^{1/4}$ represents a natural algorithmic barrier: when $\Delta\geq k^{1/4}$ , every pair of samples from the same cluster are closer to each other in Euclidean distance than are every pair of samples from distinct clusters (with high probability), while this is no longer true if $\Delta<k^{1/4}$ . Thus, when $\Delta\geq k^{1/4}$ , a simple greedy algorithm correctly clusters the samples into their components (this algorithm is sometimes called single-linkage clustering). On the other hand, standard information-theoretic arguments show that the means remain approximately identifiable from $\operatorname{poly}(k,d)$ samples when $\Delta$ is as small as $O(\sqrt{\log k})$ , but these methods yield only exponential-time algorithms.Recent and sophisticated arguments show that the means are identifiable (albeit inefficiently) with error depending only on the number of samples and not on the separation $\Delta$ even when $\Delta=O(\sqrt{\log k})$ [RV17]. Nonetheless, despite substantial attention, this $\Delta=k^{1/4}$ barrier representing the breakdown of single-linkage clustering has stood for nearly 20 years.

We prove the following main theorem, breaking the single-linkage clustering barrier.

We pause here to make several remarks about this theorem. Our algorithm makes novel use of higher order moments of Gaussian (and sub-Gaussian) distributions. Previous work for efficiently learning well-separated mixtures of Gaussians used only second order moment information, whereas we use $O(1/\gamma)$ moments.

For mixtures of distributions with sufficiently-many bounded moments (such as Gaussians), our guarantees go even further. We show that using $d^{O(\log k)^{2}}$ time and $d^{O(\log k)}$ samples, we can recover the means to error $1/\operatorname{poly}(k)$ even if the separation is only $C\sqrt{\log k}$ for some universal constant $C$ . Strikingly, [RV17] show that any algorithm that can learn the means nontrivially given separation $o(\sqrt{\log k})$ must require super-polynomial samples and time. Our results show that just above this threshold, it is possible to learn with just quasipolynomially many samples and time.

Robust mean estimation

Estimators which are robust to outlying or corrupted samples have been studied in statistics at least since the 1960s [Hub64, Tuk75a]. The model we consider in this paper is a slight generalization of Hüber’s contamination model [Hub64]. We are given $X_{1},\ldots,X_{n}$ , originally drawn iid from some unknown distribution $\mathcal{D}$ , but an adversary has changed an $\varepsilon$ fraction of these points adversarially. We call such a set of points $\varepsilon$ -corrupted.Hüber’s contamination model essentially only allows the adversary to add corrupted points, but not remove uncorrupted points. The goal of robust statistics is to recover statistics of $\mathcal{D}$ such as mean and covariance, given $\varepsilon$ -corrupted samples from $\mathcal{D}$ .

In classical robust statistics, the robust mean estimation problem is known as robust estimation of location, and robust covariance estimation is known as robust estimation of scale. Classical works consider a measure known as breakdown point, which is (informally) the fraction of samples that an adversary must corrupt before the estimator has no provable guarantees. They often design robust estimators for mean and covariance that achieve optimal error in many fundamental settings. For instance, given samples from a symmetric sub-Gaussian distribution in $k$ dimensions such that an $\varepsilon$ -fraction are arbitrarily corrupted, an estimator known as the Tukey median [Tuk75a] achieves error $O(\varepsilon)$ , which is information theoretically optimal. However, these estimators are all $NP$ -hard to compute [JP78, Ber06] and the best known algorithms require $\exp(d)$ time in general.

For a long time, all known computationally efficient robust statistics for the mean or covariance of a $d$ -dimensional Gaussian had error degrading polynomially with the dimension.We remark that this was the state of affairs even for the Hüber contamination model. In recent work, [DKK+16, LRV16] gave efficient and robust estimators for these statistics which achieve substantially better error. In particular, [DKK+16] achieve error $O(\varepsilon\sqrt{\log 1/\varepsilon})$ for estimating the mean of a Gaussian with identity covariance, and error $O(\varepsilon\log^{3/2}1/\varepsilon)$ for robustly estimating the mean of a Gaussian with unknown variance $\Sigma\preceq I$ .

Informally stated, our algorithms will work under the condition that this moment bound can be certified by a low degree SoS proof, for all $s\leq t$ . We call such distributions $t$ -explicitly bounded (we are ignoring some parameters, see Definition 3.1 for a formal definition). This class captures many natural sub-Gaussian distributions, such as Gaussians, product distributions of sub-Gaussians, and rotations thereof (see Appendix A.1). For such distributions, we show:

As with mixture models, we can push our statistical rates further, if we are willing to tolerate quasipolynomial runtime and sample complexity. In particular, we can obtain error $O(\varepsilon\sqrt{\log 1/\varepsilon})$ error with $d^{O(\log 1/\varepsilon)}$ samples and $d^{O(\log 1/\varepsilon)^{2}}$ time.

2 Related work

Other works have relaxed the requirement that the underlying distributions be Gaussian [KK10, AM05]; we also address non-Gaussian distributions, relaxing the Gaussian-ness assumption to explicit moment boundedness. One recent work in this spirit uses SDPs to cluster mixture models under separation assumptions [MVW17]; the authors show that a standard SDP relaxation of $k$ -means achieves guarantees comparable to previously-known specially-tailored mixture model algorithms; however, this algorithm suffers from the same $k^{1/4}$ barrier as other previous works.

Fixed number of Gaussians in many dimensions: Other works address parameter estimation for mixtures of $k\ll d$ Gaussians (generally $k=O(1)$ and $d$ grows) under weak identifiability assumptions [KMV10, BS10, MV10, HP15]. In these works the only assumptions are that the component Gaussians are statistically distinguishable; the goal is to recover their parameters of the underlying Gaussians. It was shown in [HP15] that algorithms in this setting provably require $\exp(k)$ samples and running time. The question addressed in our paper is whether this lower bound is avoidable under stronger identifiability assumptions. A related line of work addresses proper learning of mixtures of Gaussians [FSO06, DK14, SOAJ14, LS17], where the goal is to output a mixture of Gaussians which is close to the unknown mixture in total-variation distance, avoiding the $\exp(k)$ parameter-learning sample-complexity lower bound. These algorithms achieve $\operatorname{poly}(k,d)$ sample complexity, but they all require $\exp(k)$ running time, and moreover, do not provide any guarantee that the parameters of the distributions output are close to those for the true mixture.

In contrast, we do not assume any non-degeneracy conditions and use moment information only about the individual components rather than the full mixture, which always hold under separation conditions. Moreover, our algorithms do not need to know the exact structure of the 3rd or 4th moments. In general, clustering-based algorithms like ours seem more robust to modelling errors than algebraic or tensor-decomposition methods.

Expectation-maximization (EM): EM is the most popular algorithm for Gaussian mixtures in practice, but it is notoriously difficult to analyze theoretically. The works [DS07, DTZ17, XHM16] offer some theoretical guarantees for EM, but non-convergence results are a barrier to strong theoretical guarantees [Wu83].

Robust statistics

The literature on robust estimation is too large to do justice to here. There has been a long line of work on making algorithms tolerant to error in supervised settings [Val85, KL93], especially for learning halfspaces [Ser03, KLS09, ABL14, DKS17b], and for problems such as PCA [Bru09, CLMW11, LMTZ12, ZL14]. See [DKK+16] for a more detailed discussion on the relationship between these questions (and others) and the model we consider here.

SoS algorithms for unsupervised learning

SoS algorithms for unsupervised learning obtain the best known polynomial-time guarantees for many problems, including dictionary learning, tensor completion, and others [BKS14, BKS15, GM15, HSS15, MSS16, BM16, PS17]. While the running times of such algorithms are often large polynomials, due to the need to solve large SDPs, insights from the SoS algorithms have often been used in later works obtaining fast polynomial running times [HSSS16, SS17, HKP+17]. This lends hope that in light of our results there is a practical algorithm to learn mixture models under separation $k^{1/4-\varepsilon}$ for some $\varepsilon>0$ .

Concurrent work

Finally, we note that concurrent and independent works by several groups [KS17a, KS17b, DKS17a] have either obtained results or developed techniques similar to ours.

3 Organization

In Section 2 we discuss at a high level the ideas in our algorithms and SoS proofs. In Section 3 we give standard background on SoS proofs. Section 4 discusses the important properties of the family of polynomial inequalities we use in both algorithms. Section 5 and Section 6 state our algorithms formally and analyze them. Finally, Section 7 describes the polynomial inequalities our algorithms employ in more detail.

Techniques

The Sum of Squares (SoS) hierarchy is a powerful tool in optimization, originally designed to approximately solve systems of polynomial equations via a hierarchy of increasingly strong but increasingly large semidefinite programming (SDP) relaxations (see [BS14] and the references therein). There has been much recent interest in using the SoS method to solve unsupervised learning problems in generative models [BKS14, BKS15, GM15, HSS15, MSS16, PS17]. .

Remarkably, many useful polynomial inequalities have such certificates. For example, the usual proof of the Cauchy-Schwarz inequality $\langle y,z\rangle^{2}\leq\|y\|^{2}\|z\|^{2}$ , where $y,z$ are $m$ -dimensional vectors, actually shows that the polynomial $\|y\|^{2}\|z\|^{2}-\langle y,z\rangle^{2}$ is a sum-of-squares in $y$ and $z$ . The simplicity of the certificate is measured by the degree of the polynomials $r$ and $s$ ; when these polynomials have small (usually constant) degree there is hope of transforming SoS proofs into polynomial-time algorithms. This transformation is possible because (under mild assumptions on $p$ and $q$ ) the set of low-degree SoS proofs is in fact captured by a polynomial-size semidefinite program.

Returning to unsupervised learning, the concentration/union-bound style of identifiability proofs described above are almost never captured by low-degree SoS proofs. Instead, the goal is to design

A system of constant-degree polynomial equations and inequalties $\mathcal{A}=\{p_{1}(\theta)=0,\ldots,p_{m}(\theta)=0,q_{1}(\theta)\geq 0,\ldots,q_{m}(\theta)\geq 0\}$ , where the polynomials $p$ and $q$ depend on the samples $x_{1},\ldots,x_{n}$ , such that with high probability $\theta^{*}$ satisfies all the equations and inequalities.

A low-degree SoS proof that $\mathcal{A}$ implies $\|\theta-\theta^{*}\|\leq\delta$ for some small $\delta$ and appropriate norm $\|\cdot\|$ .

Clearly these imply that any solution $\theta$ of $\mathcal{A}$ also solves the unsupervised learning problem. It is in general NP-hard to find a solution to a system of low-degree polynomial equations and inequalities.

However, the SoS proof (2) means that such a search can be avoided. Instead, we will relax the set of solutions $\theta$ to $\mathcal{A}$ to a simple(er) convex set: the set of pseudodistributions satisfying $\mathcal{A}$ . We define pseudodistributions formally later, for now saying only that they are the convex duals of SoS proofs which use the axioms $\mathcal{A}$ . By this duality, the SoS proof (2) implies not only that any solution $\theta$ to $\mathcal{A}$ is a good choice of parameters but also that a good choice of parameters can be extracted any pseudodistribution satisfying $\mathcal{A}$ . (We are glossing over for now that this last step requires some SDP rounding algorithm, since we use only standard rounding algorithms in this paper.)

Thus, the final SoS algorithms from this method take the form: solve an SDP to find a pseudodistribution which satisfies $\mathcal{A}$ and round it to obtain a estimate $\hat{\theta}$ of $\theta^{*}$ . To analyze the algorithm, use the SoS proof (2) to prove that $\|\hat{\theta}-\theta^{*}\|\leq\delta$ .

2 Hölder’s inequality and identifiability from higher moments

Now we discuss the core ideas in our simple SoS identifiability proofs. We have not yet formally defined SoS proofs, so our goal will just be to construct identifiability proofs which are (a) phrased in terms of inequalities of low-degree polynomials and (b) provable using only simple inequalities, like Cauchy-Schwarz and Hölder’s, leaving the formalities for later.

This inequality says that every one-dimensional projection $u$ of the samples in $S$ , centered around their empirical mean, has a sub-Gaussian empirical $t$ -th moment. (The factor $2$ accounts for deviations in the $t$ -th moments of the samples.) By standard concentration of measure, if $\alpha n\gg d^{t}$ the inequality holds for $S=T$ . It turns out that this property can be enforced by polynomials of degree $t$ . (Actually our final construction of $\mathcal{A}$ will need to use inequalities of matrix-valued polynomials but this can be safely ignored here.)

Intuitively, we would like to show that any $S$ which satisfies $\mathcal{A}$ has empirical mean close to $\mu^{*}$ using a low-degree SoS proof,. This is in fact true when $\alpha=1-\varepsilon$ for small $\varepsilon$ , which is at the core of our robust estimation algorithm. However, in the mixture model setting, when $\alpha=1/(\text{\# of components})$ , for each component $j$ there is a subset $T_{j}\subseteq[n]$ of samples from component $j$ which provides a valid solution $S=T_{j}$ to $\mathcal{A}$ . The empirical mean of $T_{j}$ is close to $\mu_{j}$ and hence not close to $\mu_{i}$ for any $i\neq j$ .

We will prove something slightly weaker, which still demonstrates the main idea in our identifiability proof.

With high probability, for every $S\subseteq[n]$ , if $\mu=\tfrac{1}{|S|}\sum_{i\in S}X_{i}$ is the empirical mean of samples in $S$ , then $\|\mu-\mu^{*}\|\leq 4t^{1/2}\cdot(|T|/|S\cap T|)^{1/t}$ .

Notice that a random $S\subseteq[n]$ of size $\alpha n$ will have $|S\cap T|\approx\alpha^{2}n$ . In this case the lemma would yield the bound $\|\mu-\mu^{*}\|\leq\tfrac{4t^{1/2}}{\alpha^{1/t}}$ . Thinking of $\alpha\ll 1/t$ , this bound improves exponentially as $t$ grows. In the $d$ -dimensional $k$ -component mixture model setting, one has $1/\alpha=\operatorname{poly}(k)$ , and thus the bound becomes $\|\mu-\mu^{*}\|\leq 4t^{1/2}\cdot k^{O(1/t)}$ . In a mixture model where components are separated by $k^{\varepsilon}$ , such an estimate is nontrivial when $\|\mu-\mu^{*}\|\ll k^{\varepsilon}$ , which requires $t=O(1/\varepsilon)$ . This is the origin of the quantitative bounds in our mixture model algorithm.

We turn to the proof of Lemma 2.1. As we have already emphasized, the crucial point is that this proof will be accomplished using only simple inequalities, avoiding any union bound over all possible subsets $S$ .

Let $w_{i}$ be the $0/1$ indicator of $i\in S$ . To start the argument, we expand in terms of samples:

The key term to bound is the first one; the second amounts to a deviation term. By Hölder’s inequality and for even $t$ ,

The second line follows by adding the samples from $[n]\setminus T$ to the sum; since $t$ is even this only increases its value. The third line uses the moment inequality (1). The last line just uses the definition of $w$ .

For the second, deviation term, we use Hölder’s inequality again:

with high probability and hence, using the quadratic form at $(\mu^{*}-\mu)^{\otimes t/2}$ ,

Putting these together and simplifying constants, we have obtained that with high probability,

3 From identifiability to algorithms

We now discuss how to use the ideas described above algorithmically for learning well-separated mixture models. The high level idea for robust estimation is similar. Given Lemma 2.1, a naive algorithm for learning mixture models would be the following: find a set of points $T$ of size roughly $n/k$ that satisfy the moment bounds described, and simply output their empirical mean. Since by a simple counting argument this set must have nontrivial overlap with the points from some mixture component, Lemma 2.1 guarantees that the empirical mean is close to mean of this component.

However, in general finding such a set of points is algorithmically difficult. In fact, it would suffice to find a distribution over such sets of points (since then one could simply sample from this distribution), however, this is just as computationally difficult. The critical insight is that because of the proof of Lemma 2.1 only uses facts about low degree polynomials, it suffices to find an object which is indistinguishable from such a distribution, considered as a functional on low-degree polynomials.

The natural object in this setting is a pseudo-distribution. Pseudo-distributions form a convex set, and for a set of low-degree polynomial equations and inequalities $\mathcal{A}$ , it is possible to find a pseudo-distribution which is indistinguishable from a distribution over solutions to $\mathcal{A}$ (as such a functional) in polynomial time via semidefinite programming (under mild assumptions on $\mathcal{A}$ ). More specifically, the set of SoS proofs using axioms $\mathcal{A}$ is a semidefinite program (SDP), and the above pseudodistributions form the dual SDP. (We will make these ideas more precise in the next two sections.)

Our algorithm then proceeds via the following general framework: find an appropriate pseudodistribution via convex optimization, then leverage our low-degree sum of squares proofs to show that information about the true clusters can be extracted from this object by a standard SDP rounding procedure.

Preliminaries

We now formally define the class of distributions we will consider throughout this paper. At a high level, we will consider distributions which have bounded moments, for which there exists a low degree SoS proof of this moment bound. Formally:

Equivalently, the polynomial $p(u)=(\sigma s)^{s/2}\|u\|^{s}-E_{Y\sim\mathcal{D}_{k}}\langle\left(Y-\mu\right),u\rangle^{s}$ should be a sum-of-squares. In our typical use case, $\sigma=1$ , we will omit it and call the distribution $t$ -explicitly bounded.

We remark that our results also hold for somewhat more general settings. It is not particularly important that the $s$ -th moment bound has a degree $s$ proof; our techniques can tolerate degree $O(s)$ proofs. Our techniques also generally apply for weaker moment bounds. For instance, our techniques naturally extend to explicitly bounded sub-exponential type distributions in the obvious way. We omit these details for simplicity.

As we show in Appendix A.1, this class still captures many interesting types of nice distributions, including Gaussians, product distributions with sub-Gaussian components, and rotations therof. With this definition in mind, we can now formally state the problems we consider in this paper:

We first define the class of mixture models for which our algorithm works:

Robust mean estimation

We consider the same basic model of corruption introduced in [DKK+16].

We say a set of samples $X_{1},\ldots,X_{n}$ is $\varepsilon$ -corrupted from a distribution $\mathcal{D}$ if they are generated via the following process. First, $n$ independent samples are drawn from $\mathcal{D}$ . Then, an adversary changes $\varepsilon n$ of these points arbitrarily, and the altered set of points is then returned to us in an arbitrary order.

The problem we consider in this setting is the following:

1 The SoS proof system

We refer the reader to [OZ13, BS14] and the references therein for a thorough exposition of the SoS algorithm and proof system; here we only define what we need.Our definition of SoS proofs differs slightly from O’Donnell and Zhou’s in that we allow proofs to use products of axioms.

Let $x_{1},\ldots,x_{n}$ be indeterminates and $\mathcal{A}$ be the set of polynomial equations and inequalities $\{p_{1}(x)\geq 0,\ldots,p_{m}(x)\geq 0,q_{1}(x)=0,\ldots,q_{m}(x)=0\}$ . We say that the statement $p(x)\geq 0$ has an SoS proof if there are polynomials $\{r_{\alpha}\}_{\alpha\subseteq[m]}$ (where $\alpha$ may be a multiset) and $\{s_{i}\}_{i\in[m]}$ such that

and each polynomial $r_{\alpha}(x)$ is a sum of squares.

If the polynomials $r_{\alpha}(x)\cdot\prod_{i\in\alpha}p_{i}(x)$ and $s_{i}(x)q_{i}(x)$ have degree at most $d$ , we say the proof has degree at most $d$ , and we write

SoS proofs compose well, and we frequently use the following without comment.

If $\mathcal{A}\vdash_{d}\,p(x)\geq 0$ and $\mathcal{A}\vdash_{d^{\prime}}\,q(x)\geq 0$ , then $\mathcal{A}\cup\mathcal{B}\vdash_{\max(d,d^{\prime})}\,p(x)+q(x)\geq 0$ and $\mathcal{A}\cup\mathcal{B}\vdash_{dd^{\prime}}\,p(x)q(x)\geq 0$ .

The main fact relating pseudoexpectations and SoS proofs is:

In Section A we state and prove many basic SoS inequalities that we will require throughout the paper.

In Section A we show that product distributions (and rotations thereof) with bounded $t$ -th moments are explicitly bounded.

Then $\mathcal{D}$ is $t$ -explicitly bounded (with variance proxy $1$ ).

(The factors of $\frac{1}{2}$ can be removed for many distributions, including Gaussians.)

Capturing empirical moments with polynomials

$\alpha\in$ be a number (the intention is $|S|=\alpha n$ ).

$\tau>0$ be some error magnitude accounting for fluctuations in the sizes of clusters (which may be safely ignored at first reading).

Let $\mathcal{A}$ be the following system of equations and inequalities, depending on all the parameters above.

$w_{i}^{2}=w_{i}$ for all $i\in[n]$ (enforcing that $w$ is a $0/1$ vector, which we interpret as the indicator vector of the set $S$ ).

$(1-\tau)\alpha n\leq\sum_{i\in[n]}w_{i}\leq(1+\tau)\alpha n$ , enforcing that $|S|\approx\alpha n$ (we will always choose $\tau=o(1)$ ).

$\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}$ , enforcing that $\mu$ is the empirical mean of the samples in $S$

$\sum_{i\in[n]}w_{i}\langle X_{i}-\mu,\mu-\mu_{j}\rangle^{t}\leq 2\cdot t^{t/2}\sum_{i\in[n]}w_{i}\|\mu-\mu_{j}\|^{t}$ for every $\mu_{j}$ among $\mu_{1},\ldots,\mu_{m}$ . This enforces that the $t$ -th empirical moment of the samples in $S$ is bounded in the direction $\mu-\mu_{j}$ .

Notice that since we will eventually take $\mu_{j}$ ’s to be unknown parameters we are trying to estimate, the algorithm cannot make use of $\mathcal{A}$ directly, since the last family of inequalities involve the $\mu_{j}$ ’s. Later in this paper we exhibit a system of inequalities which requires the empirical $t$ -th moments to obey a sub-Gaussian type bound in every direction, hence implying the inequalities here without requiring knowledge of the $\mu_{j}$ ’s to write down. Formally, we will show:

There is a family $\widehat{\mathcal{A}}$ of polynomial equations and inequalities of degree $O(t)$ on variables $w=(w_{1},\ldots,w_{n}),\mu=(\mu_{1},\ldots,\mu_{k})$ and at most $n^{O(t)}$ other variables, whose coefficients depend on $\alpha,t,\tau,X_{1},\ldots,X_{n}$ , such that

(Booleanness) $\widehat{\mathcal{A}}$ includes the equations $w_{i}^{2}=w_{i}$ for all $i\in[n]$ .

(Size) $\widehat{\mathcal{A}}$ includes the inequalities $(1-\tau)\alpha n\leq\sum w_{i}\leq(1+\tau)\alpha n$ .

(Empirical mean) $\widehat{\mathcal{A}}$ includes the equation $\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}$ .

In particular this implies that $\widehat{\mathcal{A}}\vdash_{O(t)}\,\mathcal{A}$ .

The proof of Lemma 4.1 can be found in Section 7.

Aside from mentioning at a couple key points why our SoS proofs have bounded coefficients, we henceforth ignore all numerical issues. For further discussion of numerical accuracy and well-conditioned-ness issues in SoS, see [O’D17, BS17, RW17]

Mixture models: algorithm and analysis

In this section we formally describe and analyze our algorithm for mixture models. We prove the following theorem.

On the other hand, if $\Delta=C^{\prime}\sqrt{\log k}$ (for some universal $C^{\prime}$ ) then taking $t=O(\log k)$ gives error

which, for large-enough $C^{\prime}$ and $t$ , can be made $1/\operatorname{poly}(k)$ . Thus for $\Delta=C^{\prime}\sqrt{\log k}$ and any $O(\log k)$ -explicitly bounded distrituion we obtain error $1/\operatorname{poly}(k)$ with $d^{O(\log k)}$ samples and $d^{O(\log k)^{2}}$ time.

In this section we describe and analyze our algorithm. To avoid some technical work we analyze the uniform mixtures setting, with $\lambda_{i}=1/m$ . In Section D we describe how to adapt the algorithm to the nonuniform mixture setting.

We formally describe our mixture model algorithm now. We use the following lemma, which we prove in Section 5.6. The lemma says that given a matrix which is very close, in Frobenious norm, to the $0/1$ indicator matrix of a partition of $[n]$ it is possible to approximately recover the partition. (The proof is standard.)

As described, LearnMixtureMeans has two phases: a clustering phase and a mean-estimation phase. The clustering phase is the heart of the algorithm; we will show that after running RoundSecondMoments the algorithm has obtained clusters $C_{1},\ldots,C_{k}$ which err from the ground-truth clustering on only a $\tfrac{2^{O(t)}t^{t/2}\operatorname{poly}(k)}{\Delta^{t}}$ -fraction of points. To obtain estimates $\hat{\mu}_{i}$ of the underlying means from such a clustering, one simple option is to output the empirical mean of the clusters. However, without additional pruning this risks introducing error in the mean estimates which grows with the ambient dimension $d$ . By using the robust mean estimation algorithm instead to obtain mean estimates from the clusters we obtain errors in the mean estimates which depend only on the number of clusters $k$ , the between-cluster separation $\Delta$ , and the number $t$ of bounded moments.

We observe that LearnMixtureMeans can be implemented in time $n^{O(t)}$ . The main theorem requires $n\geq k^{O(1)}d^{O(t)}$ , which means that the final running time of the algorithm is $(kd^{t})^{O(t)}$ .As discussed in Section 4, correctness of our algorithm at the level of numerical accuracy requires that the coefficients of every polynomial in the SoS program $\widehat{\mathcal{A}}$ (and every polynomial in the SoS proofs we use to analyze $\widehat{\mathcal{A}}$ ) are polynomially bounded. This may not be the case if some vectors $\mu_{1},\ldots,\mu_{m}$ have norms $\|\mu_{i}\|\geq d^{\omega(1)}$ . This can be fixed by naively clustering the samples $X_{1},\ldots,X_{n}$ via single-linkage clustering, then running LearnMixtureMeans on each cluster. It is routine to show that the diameter of each cluster output by a naive clustering algorithm is at most $\operatorname{poly}(d,k)$ under our assumptions, and that with high probability single-linkage clustering produces a clustering respecting the distributions $\mathcal{D}_{i}$ . Hence, by centering each cluster before running LearnMixtureMeans we can assume that $\|\mu_{i}\|\leq\operatorname{poly}(d,k)$ for every $i\leq d$ .

2 Proof of main theorem

In this section we prove our main theorem using the key lemmata; in the following sections we prove the lemmata.

(Empirical moments) For every cluster $S_{j}=\{X_{i}\,:\,X_{i}\text{ is from$ \mathcal{D}_{j} $}\}$ , the system $\widehat{\mathcal{A}}$ from Lemma 4.1 with $\alpha=1/m$ and $\tau=\Delta^{-t}$ has a solution which takes $w\in\{0,1\}^{n}$ to be the $0/1$ indicator vector of $S_{j}$ .

(Empirical means) Let $\overline{\mu}_{j}$ be the empirical mean of cluster $S_{j}$ . The $\overline{\mu}_{j}$ ’s satisfy $\|\overline{\mu}_{i}-\mu_{i}\|\leq\Delta^{-t}$ .

We note a few useful consequences of these conditions, especially (D1). First of all, it implies all clusters have almost the same size: $(1-\Delta^{-t})\cdot\tfrac{n}{k}\leq|S_{j}|\leq(1+\Delta^{-t})\cdot\tfrac{n}{k}$ . Second, it implies that all clusters have explicitly bounded moments: for every $S_{j}$ ,

Lemmas

The following key lemma captures our SoS identifiability proof for mixture models.

The other main lemma shows that conditions (D1) and (D2) occur with high probability.

With notation as above, conditions (D1) and (D2) simultaneously occur with probability at least $1-1/d^{15}$ over samples $X_{1},\ldots,X_{n}$ , so long as $n\geq d^{O(t)}k^{O(1)}$ , for $\Delta\geq 1$ .

Lemma 5.4 follows from Lemma 4.1, for (D1), and standard concentration arguments for (D2). Now we can prove the main theorem.

which can be rearranged to give $\|M-A\|^{2}\leq n\cdot\tfrac{2^{O(t)}t^{t/2}k^{4}}{\Delta^{t}}$ . ∎

3 Identifiability

In this section we prove Lemma 5.3. We use the following helpful lemmas. The first is in spirit an SoS version of Lemma 2.1.

Let $\mu_{1},\ldots,\mu_{k},\mathcal{D}_{1},\ldots,\mathcal{D}_{k},t$ be as in Theorem 5.1. Let $\overline{\mu}_{i}$ be as in (D1). Suppose (D1) occurs for samples $X_{1},\ldots,X_{n}$ . Let $\mathcal{A}$ be the system from Lemma 4.1, with $\alpha=1/k$ and any $\tau$ . Then

The second lemma is an SoS triangle inequality, capturing the consequences of separation of the means. The proof is standard given Fact A.2.

The last lemma helps put the previous two together. Although we have phrased this lemma to concorde with the mixture model setting, we note that the proof uses nothing about mixture models and consists only of generic manipulations of pseudodistributions.

Now we have the tools to prove Lemma 5.3.

4 Proof of Lemma 5.5

In this subsection we prove Lemma 5.5. We use the following helpful lemmata. The first bounds error from samples selected from the wrong cluster using the moment inequality.

Let $j,\mathcal{A},X_{1},\ldots,X_{n},\mu_{j},\overline{\mu}_{j}$ be as in Lemma 5.5. Then

The proof goes by Hölder’s inequality followed by the moment inequality in $\mathcal{A}$ . Carrying this out, by Fact A.6 and evenness of $t$ ,

Then, using the main inequality in $\mathcal{A}$ ,

The second lemma bounds error from deviations in the empirical $t$ -th moments of the samples from the $j$ -th cluster.

Let $\mu_{1},\ldots,\mu_{k},\mathcal{D}_{1},\ldots,\mathcal{D}_{k}$ be as in Theorem 5.1. Suppose condition (D1) holds for samples $X_{1},\ldots,X_{n}$ . Let $w_{1},\ldots,w_{n}$ be indeterminates. Let $u=u_{1},\ldots,u_{d}$ be an indeterminate. Then for every $j\in[k]$ ,

The first step is Hölder’s inequality again:

We can prove Lemma 5.5 by putting together Lemma 5.8 and Lemma 5.9.

Let $j\in[k]$ be a cluster and recall $a_{j}\in\{0,1\}^{n}$ is the $0/1$ indicator for the samples in cluster $j$ . Let $S_{j}$ be the samples in the $j$ -th cluster, with empirical mean $\overline{\mu}_{j}$ . We begin by writing $\langle a_{j},w\rangle\|\mu-\overline{\mu}_{j}\|^{2}$ in terms of samples $X_{1},\ldots,X_{n}$ .

Hence, using $(a+b)^{t}\leq 2^{t}(a^{t}+b^{t})$ , we obtain

5 Proof of Lemma 5.7

We prove Lemma 5.7. The proof only uses standard SoS and pseudodistribution tools. The main inequality we will use is the following version of Hölder’s inequality.

We first establish the following inequality.

Raising both sides to the $t/2$ power gives

6 Rounding

In this section we state and analyze our second-moment round algorithm. As have discussed already, our SoS proofs in the mixture model setting are quite strong, meaning that the rounding algorithm is relatively naive.

To get started analyzing the algorithm, we need a definition.

We proceed inductively. We first prove the base case. WLOG assume that the algorithm picks $v_{1}$ , and that $v_{1}$ is good, and is from component $j$ . Then, for all $i\in I_{g,j}$ , by the triangle inequality we have $\|v_{i}-v_{1}\|_{2}\leq 2\frac{n^{1/2}}{B}$ , and so $I_{g,j}\subseteq C_{1}$ . Moreover, if $i\in I_{g,j^{\prime}}$ for some $j^{\prime}\neq j$ , we have

and so in this case $i\not\in C_{1}$ . Hence $C_{1}=I_{g,j}\cup B_{1}$ for some $B_{1}\subseteq I_{b}$ .

There are at most $\varepsilon^{2}En$ indices which are $E$ -bad; i.e. $|I_{b}|\leq\varepsilon^{2}En$ .

from which the claim follows by simplifying. ∎

With probability at least $1-\varepsilon^{2}m^{3}$ , the algorithm RoundSecondMoments chooses good indices in all $k$ iterations.

By Lemma 5.13, in the first iteration the probability that a bad vector is chosen is at most $\varepsilon^{2}E$ . Conditioned on the event that in iterations $1,\ldots,a$ the algorithm has chosen good vectors, then by Lemma 5.12, there is at least one $j_{a}$ so that no points in $I_{g,j_{a}}$ have been removed. Thus at least $(1-\delta)n/m$ vectors remain, and in total there are at most $\varepsilon^{2}En$ bad vectors, by Lemma 5.13. So, the probability of choosing a bad vector is at most $\varepsilon^{2}Em$ . Therefore, by the chain rule of conditional expectation and our assumption , the probability we never choose a bad vector is at least

Choosing $E=m$ this is $(1-\varepsilon^{2}m^{2})^{m}\geq 1-\varepsilon^{2}m^{3}$ . as claimed. ∎

Now Theorem 5.11 follows from putting together the lemmas.

Robust estimation: algorithm and analysis

Our main contribution in this section is a formal description of an algorithm EstimateMean which makes these ideas rigorous, and the proof of the following theorem about its correctness:

As a remark, observe that if we set $t=2\log 1/\varepsilon$ , then the error becomes $O(\varepsilon\sqrt{\log 1/\varepsilon})$ . Thus, with $n=O(d^{O(\log 1/\varepsilon)}/\varepsilon^{2})$ samples and $n^{O(\log 1/\varepsilon)}=d^{O(\log 1/\varepsilon)^{2}}$ runtime, we achieve the same error bounds for general explicitly bounded distributions as the best known polynomial time algorithms achieve for Gaussian mean estimation.

Throughout this section, let $[n]=S_{g}\cup S_{b}$ , where $S_{g}$ is the indices of the uncorrupted points, and $S_{b}$ is the indices of the corrupted points, so that $|S_{b}|=\varepsilon n$ by assumption. Moreover, let $Y_{1},\ldots,Y_{n}$ be iid from $\mathcal{D}$ so that $Y_{i}=X_{i}$ for all $i\in S_{g}$ .

We now state some additional tools we will require in our algorithm.

We will require the following elementary pruning algorithm, which removes all points which are very far away from the mean. We require this only to avoid some bit-complexity issues in semidefinite programming; in particular we just need to ensure that the vectors $X_{1},\ldots,X_{n}$ used to form the SDP have polynomially-bounded norms. Formally:

Let $\varepsilon,t,\mu^{*}$ , and $X_{1},\ldots,X_{n}$ be as in Theorem 6.1. There is an algorithm NaivePrune, which given $\varepsilon,t$ and $X_{1},\ldots,X_{n}$ , runs in time $O(\varepsilon dn^{2})$ , and outputs a subset $S\subseteq[n]$ so that with probability $1-1/d^{10}$ , the following holds:

No uncorrupted points are removed, that is $S_{g}\subseteq S$ , and

For all $i\in S$ , we have $\|X_{i}-\mu^{*}\|\leq O(d)$ .

In this case, we say that NaivePrune succeeds.

This algorithm goes by straightforward outlier-removal. It is very similar the procedure described in Fact 4.18 of [DKK+16] (using bounded $t$ -th moments instead of sub-Gaussianity), so we omit it.

Satisfiability

In our algorithm, we will use the same set of polynomial equations $\widehat{\mathcal{A}}$ as in Lemma 4.1. However, the data we feed in does not exactly fit the assumptions in the Lemma. Specifically, because the adversary is allowed to remove an $\varepsilon$ -fraction of good points, the resulting uncorrupted points are no longer iid from $\mathcal{D}$ . Despite this, we are able to specialize Lemma 4.1 to this setting:

We sketch the proof of Lemma 6.3 in Section 7.4.

2 Formal Algorithm Specification

With these tools in place, we can now formally state the algorithm. The formal specification of this algorithm is given in Algorithm 3.

The first two lines of Algorithm 3 are only necessary for bit complexity reasons, since we cannot solve SDPs exactly. However, since we can solve them to doubly-exponential accuracy in polynomial time, it suffices that all the quantities are at most polynomially bounded (indeed, exponentially bounded suffices) in norm, which these two lines easily achieve. For the rest of this section, for simplicity of exposition, we will ignore these issues.

3 Deterministic conditions

With these tools in place, we may now state the deterministic conditions under which our algorithm will succeed. Throughout this section, we will condition on the following events holding simultaneously:

We have the following concentration of the uncorrupted points:

We have the following concentration of the empirical $t$ -th moment tensor:

The following lemma says that with high probability, these conditions hold simultaneously:

We defer the proof of this lemma to the Appendix.

For simplicity of notation, throughout the rest of the section, we will assume that NaivePrune does not remove any points whatsoever. Because we are conditioning on the event that it removes no uncorrupted points, it is not hard to see that this is without loss of generality.

4 Identifiability

Our main identifiability lemma is the following.

The third type of error is similar in spirit: it is the contribution of the original uncorrupted points that the adversary removed. Formally:

Finally, the fourth type of error comes from the $\varepsilon n$ adversarially-chosen vectors. We prove this lemma by using the bounded-moments inequality in $\mathcal{A}$ .

With these lemmas in place, we now have the tools to prove Lemma 6.5.

where (a) follows from the mean axioms, (b) follows from splitting up the uncorrupted and the corrupted samples, (c) follows by adding and subtracting $1$ to each term in $S_{g}$ , and (d) follows from the assumption that $Y_{i}=X_{i}$ for all $i\in[n]$ . We will rearrange the last term by adding and subtracting $\mu$ . Note the following polynomial identity:

and put it together with the above to get

Now we use $\vdash_{t}\,(x+y+z+w)^{t}\leq\exp(t)\cdot(x^{t}+y^{t}+z^{t}+w^{t})$ for any even $t$ , and Lemma 6.6, Lemma 6.7, and Lemma 6.9 and simplify to conclude

Lastly, since $\mathcal{A}\vdash_{2}\,\sum_{i\in T}w_{i}\geq(1-2\varepsilon)n$ , we get

5 Rounding

6 Proofs of Lemmata 6.6–6.9

We first prove Lemma 6.6, which is a relatively straightforward application of SoS Cauchy Schwarz.

where the last inequality follows from (E3). This completes the proof. ∎

Before we prove Lemmata 6.7–6.9, we prove the following lemma which we will use repeatedly:

where (a) follows from (E4) and (b) follows from $10t$ -explicitly boundedness. ∎

We now return to the proof of the remaining Lemmata.

We start by applying Hölder’s inequality, Fact A.6, (implicitly using that $w_{i}^{2}=w_{i}\vdash_{2}\,(1-w_{i})^{2}=1-w_{i}$ ), to get

We apply Hölder’s inequality to obtain that

where (a) follows from the assumption on the size of $S_{b}$ and since the additional terms in the sum are SoS, and (b) follows follows from Lemma 6.10. This completes the proof. ∎

The proof is very similar to the proof of the two previous lemmas, except that we use the moment bound inequality in $\mathcal{A}$ . Getting started, by Hölder’s:

Combining this with the moment bound in $\mathcal{A}$ ,

Finally, clearly $\mathcal{A}\vdash_{2}\,\sum_{i\notin T}w_{i}\leq\varepsilon n$ , which finishes the proof. ∎

Encoding structured subset recovery with polynomials

The goal in this section is to prove Lemma 4.1. The eventual system $\widehat{\mathcal{A}}$ of polynomial inequalities we describe will involve inequalities among matrix-valued polynomials. We start by justifying the use of such inequalities in the SoS proof system.

Let $x=(x_{1},\ldots,x_{n})$ be indeterminates. We describe a proof system which can reason about inequalities of the form $M(x)\succeq 0$ , where $M(x)$ is a symmetric matrix whose entries are polynomials in $x$ .

if there are vector-valued polynomials $\{r_{S}^{j}\}_{j\leq N,S\subseteq[m]}$ (where the $S$ ’s are multisets), a matrix $B$ , and a matrix $Q$ whose entries are polynomials in the ideal generated by $q_{1},\ldots,q_{m}$ , such that

and furthermore that $\deg\left(\sum_{j}(r_{S}^{j}(x))(r_{S}^{j}(x))^{\top}\right)\otimes\left[\otimes_{i\in S}M_{i}(x)\right]\leq d$ for every $S\subseteq[m]$ , and $\deg Q\leq d$ . Observe that in the case $M_{1},\ldots,M_{m},M$ are actually $1\times 1$ matrices, this reduces to the usual notion of scalar-valued sum of squares proofs.

For completeness, we prove the following lemmas in the appendix.

2 Warmup: Gaussian moment matrix-polynomials

Size: $(1-\tau)\alpha n\leq\sum_{i\in[n]}w_{i}\leq(1+\tau)\alpha n$ .

Empirical mean: $\mu=\tfrac{1}{\sum_{i\in[n]}w_{i}}\sum_{i\in[n]}w_{i}X_{i}$ .

$t$ -th Moments: the $t$ -th empirical moments of the vectors selected by the vector $w$ , centered about $\mu$ , are subgaussian. That is,

The second property is already phrased as two polynomial inequalities, and the third can be rearranged to a polynomial equation. For the first, we use polynomial equations $w_{i}^{2}=w_{i}$ for every $i\in[n]$ . The moment constraint will be the most difficult to encode. We give two versions of this encoding: a simple one which will work when the distribution of the structured subset of samples to be recovered is Gaussian, and a more complex version which allows for any explicitly bounded distribution. For now we describe only the Gaussian version. We state some key lemmas and prove them for the Gaussian case. We carry out the general case in the following section.

To encode the bounded moment constraint, for this section we let $M(w,\mu)$ be the following matrix-valued polynomial

For parameters $\alpha\in$ (for the size of the subset), $t$ (for which empirical moment to control), and $\tau>0$ (to account for some empirical deviations), the structured subset axioms are the following matrix-polynomial inequalities on variables $w=(w_{1},\ldots,w_{n}),\mu=(\mu_{1},\ldots,\mu_{d})$ .

booleanness: $w_{i}^{2}=w_{i}$ for all $i\in[n]$

size: $(1-\tau)\alpha n\leq\sum_{i\in[n]}w_{i}\leq(1+\tau)\alpha n$

$\mu$ is the empirical mean: $\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}$ .

Notice that in light of the last constraint, values for the variables $\mu$ are always determined by values for the variables $w$ , so strictly speaking $\mu$ could be removed from the program. However, we find it notationally convenient to use $\mu$ . We note also that the final constraint, that $\mu$ is the empirical mean, will be used only for the robust statistics setting but seems unnecessary in the mixture model setting.

Next, we state and prove some key lemmas for this Gaussian setting, as warmups for the general setting.

Suppose $S$ has size exactly $(1-\tau)\alpha n$ ; otherwise replace $S$ with a random subset of $S$ of size exactly $(\alpha-\tau)n$ . As a solution to the polynomials, we will take $w$ to be the indicator vector of $S$ and $\mu=\frac{1}{|S|}\sum_{i\in[n]}w_{i}X_{i}$ . The booleanness and size axioms are trivially satisfied. The spectral inequality

follows from concentration of the empirical mean to the true mean $\mu^{*}$ and standard matrix concentration (see e.g. [Tro12]). ∎

The next lemma is actually a corollary of Lemma 7.2.

3 Moment polynomials for general distributions

We start by defining polynomial equations $\widehat{\mathcal{A}}$ , for which we introduce some extra variables For every pair of multi-indices $\gamma,\rho$ over $[k]$ with degree at most $t/2$ , we introduce a variable $M_{\gamma,\rho}$ . The idea is that $M=[M_{\gamma,\rho}]_{\gamma,\rho}$ forms an $n^{t/2}\times n^{t/2}$ matrix. By imposing equations of the form $M_{\gamma,\rho}=f_{\gamma,\rho}(w,\mu)$ for some explicit polynomials $f_{\gamma,\rho}$ of degree $O(t)$ , we can ensure that

(This equation should be interpreted as an equality of polynomials in indeterminates $u$ .) Let $\mathcal{L}$ be such a family of polynomial equations. Our final system $\widehat{\mathcal{A}}(\alpha,t,\tau)$ of polynomial equations and inequalities follows. The important parameters are $\alpha$ , controlling the size of the set of samples to be selected, and $t$ , how many moments to control. The parameter $\tau$ is present to account for random fluctuations in the sizes of the cluster one wants to recover.

Let $\widehat{\mathcal{A}}(\alpha,t,\tau)$ be the set of (matrix)-polynomial equations and inequalities on variables $w,\mu,M_{\gamma,\rho}$ containing the following.

Booleanness: $w_{i}^{2}=w_{i}$ for all $i\in[n]$

Size: $(1-\tau)\alpha n\leq\sum w_{i}\leq(1+\tau)\alpha n$ .

Empirical mean: $\mu\cdot\sum_{i\in[n]}w_{i}=\sum_{i\in[n]}w_{i}X_{i}$ .

The equations $\mathcal{L}$ on $M$ described above.

In the remainder of this section we prove the satisfiability and moment bounds parts of Lemma 4.1. To prove the lemma we will need a couple of simple facts about SoS proofs.

and similarly for $\tfrac{1}{m}\sum_{i\in[n]}\left[\langle Y_{i},u\rangle^{t}-\langle X_{i},u\rangle^{t}\right]$ .

Expanding $\langle Y_{i},u\rangle^{t}$ , we get

For each term, by Cauchy-Schwarz, $\vdash_{t}\,\langle X_{i},u\rangle^{s}\langle v,u\rangle^{t-s}\leq\|X_{i}\|^{s}\|v\|^{t-s}\cdot\|u\|^{t}$ . Putting these together with the hypothesis on $\tfrac{1}{n}\|X_{i}\|^{s}$ and counting terms finishes the proof. ∎

By taking a random subset $S$ if necessary, we assume $|S|=(1-\tau)\alpha n=m$ . We describe a solution to the system $\widehat{\mathcal{A}}$ . Let $w$ be the $0/1$ indicator vector for $S$ . Let $\mu=\tfrac{1}{m}\sum_{i\in S}w_{i}X_{i}$ . This satisfies the Boolean-ness, size, and empirical mean axioms.

Describing the assignment to the variables $\{M_{\gamma,\rho}\}$ takes a little more work. Re-indexing and centering, let $Y_{1}=X_{i_{1}}-\mu,\ldots,Y_{m}=X_{i_{m}}-\mu$ be centered versions of the samples in $S$ , where $S=\{i_{1},\ldots,i_{m}\}$ and $\mu$ remains the empirical mean $\tfrac{1}{m}\sum_{i\in S}X_{i}$ . First suppose that the following SoS proof exists:

Just substituting definitions, we also obtain

Let $M_{\gamma,\rho}=P_{\gamma,\rho}$ . Then clearly $M\succeq 0$ and $M,w,\mu$ together satisfy $\mathcal{L}$ .

It remains to show that the first SoS proof exists with high probability for large enough $m$ . Since $t$ is even and $0.1>\tau>0$ , it is enough to show that

Let $Z_{i}=X_{i}-\mu^{*}$ , where $\mu^{*}$ is the true mean of $\mathcal{D}$ . Let

(bounded norms) for every $s\leq t$ it holds that $\tfrac{1}{m}\sum_{i\in[m]}\|Z_{i}\|^{s}\leq s^{100s}d^{s/2}$ .

(concentration of empirical mean) $\|\mu-\mu^{*}\|\leq d^{-5t}$ .

Starting with $a(u)$ , using Fact 7.5, it is enough that $2^{t}C^{t-1}\|v\|\leq\tfrac{1}{4}$ , where $v=\mu-\mu^{*}$ and $C$ is such that $\tfrac{1}{m}\sum_{i\in[m]}\|Z_{i}\|^{s}\leq C^{s}$ . By 1 and 2, we can assume $\|v\|\leq d^{-5t}$ and $C=t^{100}d^{1/2}$ . Then the conclusion follows for $t\geq 3$ .

We turn to $b(u)$ . A typical coefficient of $b(u)$ in the monomial basis—say, the coefficient of $u^{\theta}$ for some multiindiex $\theta$ of degree $|\theta|=t$ , looks like

By assumption this is at most $d^{-10t}$ in magnitude, so the sum of squared coefficients of $b(u)$ is at most $d^{-18t}$ . The bound on $b(u)$ for $d\geq 2$ . ∎

Using this in conjunction with the linear equations $\mathcal{L}$ ,

(bounded norms) for every $s\leq t$ it holds that $\tfrac{1}{m}\sum_{i\in[m]}\|Z_{i}\|^{s}\leq s^{100s}d^{s/2}$ .

(concentration of empirical mean) $\left\lVert\tfrac{1}{m}\sum_{i\in[m]}Z_{i}\right\rVert\leq d^{-5t}$ .

The proofs are standard applications of central limit theorems, in particular the Berry-Esseen central limit theorem [Ber41], since all the quantities in question are sums of iid random variables with bounded moments. We will prove only the first statement; the others are similar.

Finally we remark on the polynomial-time algorithm to find a pseudoexpectation satisfying $\widehat{\mathcal{A}}$ . As per [BS17], it is just necessary to ensure that if $x=(w,\mu)$ , the polynomials in $\widehat{\mathcal{A}}$ include $\|x\|^{2}\leq M$ for some large number $M$ . In our case the equation $\|x\|^{2}\leq(nkm)^{O(1)}$ can be added without changing any arguments.

4 Modifications for robust estimation

We briefly sketch how the proof of Lemma 4.1 may be modified to prove Lemma 6.3. The main issue is that $\widehat{\mathcal{A}}$ of Lemma 4.1 is satisfiable when there exists an SoS proof

where $\mu$ is the empirical mean of $X_{i}$ such that $w_{i}=1$ . In the proof of Lemma 4.1 we argued that this holds when $w$ is the indicator for a set of iid samples from a $10t$ -explicitly bounded distribution $\mathcal{D}$ . However, in the robust setting, $w$ should be taken to be the indicator of the $(1-\varepsilon)n$ good samples remaining from such a set of iid samples after $\varepsilon n$ samples are removed by the adversary. If $Y_{1},\ldots,Y_{n}$ are the original samples, with empirical mean $\mu^{*}$ , the proof of Lemma 4.1 (with minor modifications in constants) says that with high probability,

For small-enough $\varepsilon$ , this also means that

This almost implies that $\widehat{\mathcal{A}}$ is satisfiable given the $\varepsilon$ -corrupted vectors $X_{1},\ldots,X_{n}$ and parameter $\alpha=(1-\varepsilon)n$ , except for that $\mu^{*}=\tfrac{1}{n}\sum_{i\in[n]}Y_{i}$ and we would like to replace it with $\mu=\tfrac{1}{(1-\varepsilon)n}\sum_{i\text{ good}}X_{i}$ . This can be accomplished by noting that, as argued in Section 6, with high probability $\|\mu-\mu^{*}\|\leq O(t\cdot\varepsilon^{1-1/t})$ .

Acknowledgements

The authors would like to thank David Steurer, Daniel Freund, Guatam Kamath, Pablo Parillo, and especially Aravindan Vijayaraghavan for some helpful conversations.

References

Appendix A Toolkit for sum of squares proofs

Let $x_{1},\ldots,x_{n},y_{1},\ldots,y_{n}$ be indeterminates. Then

Let $x,y$ be $n$ -length vectors of indeterminates. Then

The sum of squares proof of Cauchy-Schwarz implies that $\|x\|^{2}+\|y\|^{2}-2\langle x,y\rangle$ is a sum of squares. Now we just expand

Let $M$ be a matrix whose rows and columns are indexed by multisets $S\subseteq[n]$ of sizes $a$ and $b$ . Thus $M$ has four blocks: an $(a,a)$ block, an $(a,b)$ block, a $(b,a)$ block, and a $(b,b)$ block. In the $(a,b)$ and $(b,a)$ blocks, put matrices $M_{ab},M_{ba}$ such that $\langle x^{\otimes a},M_{ab}x^{\otimes b}\rangle=\tfrac{1}{2}\,.P(x)$ . In the $(a,a)$ and $(b,b)$ blocks, put $\sqrt{C}\cdot I$ . Then, letting $z=(x^{\otimes a},x^{\otimes b})$ , we get $\langle z,Mz\rangle=\sqrt{C}(\|x\|^{2a}+\|x\|^{2b})-P(x)$ . Note that $\|M_{ab}\|\leq\sqrt{C}$ by hypothesis, so $M\succeq 0$ , which completes the proof. ∎

Let $u=(u_{1},\ldots,u_{k})$ be a vector of indeterminantes. Let $D$ be sub-Gaussian with variancy proxy $1$ . Let $t\geq 0$ be an integer. Then we have

Expand the polynomial in question. We have

Let $x_{1},\ldots,x_{n},y_{1},\ldots,y_{n}$ be indeterminates. Then

We will only prove the first inequality. The second inequality follows since $w_{i}^{2}=w_{i}\vdash_{2}\,w_{i}x_{i}=w_{i}\cdot(w_{i}x_{i})$ , applying the first inequality, and observing that $w_{i}^{2}=w_{i}\vdash_{q}\,w_{i}^{q}=w_{i}$ .

Applying Cauchy-Schwarz (Fact A.1) and the axioms, we obtain to start that for any even number $t$ ,

Applying Fact A.1 one more time to get $\left(\sum_{i\leq n}w_{i}x_{i}^{q/2}\right)\leq\left(\sum_{i\leq n}w_{i}^{2}\right)\left(\sum_{i\leq n}x_{i}^{q}\right)$ and then the axioms $w_{i}^{2}=w_{i}$ completes the proof. ∎

In this section, we show that many natural high dimensional distributions are explicitly bounded. Recall that if a univariate distribution $X$ sub-Gaussian (with variancy proxy $\sigma$ ) with mean $\mu$ then we have the following bound on its even centered moments for $t\geq 4$ :

More generally, we will say a univariate distribution is $t$ -bounded with mean $\mu$ and variance proxy $\sigma$ if the following general condition holds for all even $4\leq s\leq t$ :

The factor of $1/2$ in this expression is not important and can be ignored upon first reading.

Our main result in this section is that any rotation of products of independent $t$ -bounded distributions with variance proxy $1/2$ is $t$ -explicitly bounded with variance proxy $1$ :

Since the definition of explicitly bounded is clearly rotation invariant, it suffices to show that $\mathcal{D}^{\prime}$ is $t$ -explicitly bounded. For any vector of indeterminants $u$ , and for any $4\leq s\leq t$ even, we have

where $X^{\prime}$ is an independent copy of $X$ , and the last line follows from SoS Cauchy-Schwarz. We then expand the resulting polynomial in the monomial basis:

since all $\alpha$ with odd monomials disappear since $X-X^{\prime}$ is a symmetric product distribution. By $t$ -boundedness, all remaining coefficients are at most $s^{cs}$ , from which we deduce

which proves that $\mathcal{D}^{\prime}$ is $t$ -explicitly bounded, as desired. ∎

As a corollary observe this trivially implies that all Guassians $\operatorname{\mathcal{N}}(\mu,\Sigma)$ with $\Sigma\preceq I$ are $t$ -explicitly bounded for all $t$ .

We note that our results are tolerant to constant changes in the variancy proxy (just by scaling down). In particular, this implies that our results immediately apply for all rotations of products of $t$ -bounded distributions with a loss of at most $2$ .

Appendix B Sum of squares proofs for matrix positivity – omitted proofs

By hypothesis, there are $r_{S}^{j}$ and $B$ such that

Now, let $T\subseteq[m]$ and $p$ be a polynomial. Let $M^{\prime}=\otimes_{i\in T}M_{i}$ . Suppose that $\deg\left(p^{2}\cdot M\otimes M^{\prime}\right)\leq 2d$ . Using the hypothesis on $M$ , we obtain

Appendix C Omitted Proofs from Section 6

We will show that each event (E1)–(E4) holds with probability at least $1-d^{-8}$ . Clearly for $d$ sufficiently large this implies the desired guarantee. That (E1) and (E2) occur with probability $1-d^{-8}$ follow from Lemmas 6.2 and 6.3, respectively. It now suffices to show (E3) and (E4) holds with high probability. Indeed, that (E4) holds with probability $1-d^{-8}$ follows trivially from the same proof of Lemma 4.1 (it is in fact a simpler version of this fact).

By basic concentration arguments (see e.g. [Ver10]), we know that by our choice of $n$ , with probability $1-d^{-8}$ we have that

Condition on the event that this and (E4) simultaneously hold. Recall that $Y_{i}$ for $i=1,\ldots,n$ are defined so that $Y_{i}$ are iid and $Y_{i}=X_{i}$ for $i\in S_{g}$ . By the triangle inequality, we have

where (a) follows from (E4), and (b) follows since $\delta\ll t^{t}$ . Hence

Taking the $t$ -th root on both sides and combining it with (7) yields

Appendix D Mixture models with nonuniform weights

In this section we describe at a high level how to adapt the algorithm given in Section 5 to handle non-uniform weights. We assume the mixture components now have mixture weights $\eta\leq\lambda_{1}\leq\ldots\leq\lambda_{k}\leq 1$ where $\sum\lambda_{i}=1$ , where $\eta>0$ is some fixed constant. We still assume that all pairs of means satisfy $\|\mu_{i}-\mu_{j}\|\geq k^{\gamma}$ for all $i\neq j$ . In this section we describe an algorithm LearnNonUniformMixtureModel, and we sketch a proof of the following theorem concerning its correctness:

However, the rounding algorithm we require here is somewhat more involved than the naive rounding algorithm used previously for learning mixture models. In particular, we no longer know exactly the Frobenius norm of the optimal solution: we cannot give tight upper and lower bounds. This is because components with mixing weights which are just below the threshold $\alpha^{\prime}$ may or may not contribute to the optimal solution that the SDP finds. Instead, we develop a more involved rounding algorithm RoundSecondMomentsNonuniform, which we describe below.

Our invariant is that every time we have a feasible solution to the SDP, we remove at least one cluster (we make this more formal below). Repeatedly run the SDP with this $\alpha^{\prime}$ until we no longer get a feasible solution, and then repeat with a slightly smaller $\alpha^{\prime}$ . After the loop terminates, output the empirical mean of every cluster. The formal specification of this algorithm is given in Algorithm 4.

It is not hard to show, via arguments exactly as in Section 6 and 7 that the remaining fraction of points from these components which we have not removed as well as the small fraction of points we have removed from good components do not affect the calculations, and so we will assume for simplicity in the rest of this discussion that we have removed all samples from components $j$ with $\lambda_{j}\geq\alpha^{\prime}+O(\xi)$ .

Here we outline the proof of correctness of Algorithm 4. The proof follows very similar ideas as the proof of correctness of Algorithm 1, and so for conciseness we omit many of the details. As before, for simplicity assume that the naive clustering returns only one cluster, as otherwise we can work on each cluster separately, so that for all $i$ , we have $\|\mu_{i}\|\leq O(\operatorname{poly}(d,k))$ after centering.

We now show why this invariant holds. Clearly this holds at the beginning of the algorithm. We show that if it holds at any step, it must also hold at the next time at which the SDP is feasible. Fix such an $\alpha^{\prime}$ . By assumption, we have removed almost all points from components $j$ with $\lambda_{j}\geq\alpha^{\prime}+k^{-8}$ , and have only removed a very small fraction of points not from these components.

By basic concentration, we have $\left|\lambda_{j}n-|S_{j}|\right|\leq o(n)$ for all $j$ except with negligble probability, and so for the rest of the section, for simplicity, we will slightly cheat and assume that $\lambda_{j}n=|S_{j}|$ . It is not hard to show that this also does not effect any calculations.

The main observation is that for any choice of $\alpha^{\prime}$ , by essentially same logic as in Section 5, we still have the following bound for all $i\neq j$ for an $\alpha^{\prime}$ well-behaved run:

for $\widehat{\mathcal{A}}$ instantiated with $\alpha=\alpha^{\prime}$ , where the last line follows by our choice of $t$ sufficiently large.

With parameters as above, for any $\alpha^{\prime}$ well-behaved run, we have $\widehat{\mathcal{A}}\vdash_{O(t)}\,\langle a_{i},w\rangle\leq O(\xi^{2})\cdot\alpha^{\prime}n$ for any $j$ so that $\lambda_{j}n\leq(\alpha^{\prime}-O(\xi^{4}))n$ .

from which we deduce $\widehat{\mathcal{A}}\vdash_{O(t)}\,\langle a_{i},w\rangle\leq O(\xi^{2})\cdot\alpha^{\prime}n$ . ∎

We now show that under these conditions, there is an algorithm to remove a cluster:

D.2 Rounding Well-behaved runs

We now turn to correctness of this algorithm. Define $T$ to be the set of clusters $j$ with $|\lambda_{j}-\alpha^{\prime}|\leq O(\xi^{4})$ . We first show:

The first term is at most $O(d^{2}\eta\xi^{2})(\alpha^{\prime})^{2}n^{2}$ by (8) and the second term is at most $dO(\xi^{2})\alpha^{\prime}n$ by Lemma D.2, so overall we have that

Observe that since $\|v_{i}\|^{2}\leq 1$ and $\sum\|v_{i}\|^{2}=\alpha^{\prime}n$ there are at least $(1-1/100)\alpha^{\prime}n$ vectors with $\|v_{i}\|^{2}\geq\alpha^{\prime}/100$ . We have