Robust Estimators in High Dimensions without the Computational Intractability

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, Alistair Stewart

Introduction

A central goal of machine learning is to design efficient algorithms for fitting a model to a collection of observations. In recent years, there has been considerable progress on a variety of problems in this domain, including algorithms with provable guarantees for learning mixture models [FOS08, KMV10, MV10, BS10, HK13], phylogenetic trees [CGG02, MR05], HMMs [AHK12], topic models [AGM12, AGHK13], and independent component analysis [AGMS15]. These algorithms crucially rely on the assumption that the observations were actually generated by a model in the family. However, this simplifying assumption is not meant to be exactly true, and it is an important direction to explore what happens when it holds only in an approximate sense. In this work, we study the following family of questions:

This is a natural formalization of the problem of designing robust and efficient algorithms for distribution estimation. We refer to it as (proper) agnostic distribution learning and we refer to the samples as being ε\varepsilon-corrupted. This family of problems has its roots in many fields, including statistics, machine learning, and theoretical computer science. Within computational learning theory, it is related to the agnostic learning model of Haussler [Hau92] and Kearns, Schapire, and Sellie [KSS94], where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible, among all functions in some given class. In the even more challenging malicious noise model [Val85, KL93], an adversary is allowed to corrupt both the labels and the samples. A major difference with our setting is that these models apply to supervised learning problems, while here we will work in an unsupervised setting.

Within statistics and machine learning, inference problems like Question 1.1 are often termed “estimation under model misspecification.” The usual prescription is to use the maximum likelihood estimator [Hub67, Whi82], which is unfortunately hard to compute in general. Even ignoring computational considerations, the maximum likelihood estimator is only guaranteed to converge to the distribution PP^{\prime} in D\mathcal{D} that is closest (in Kullback-Leibler divergence) to the distribution from which the observations are generated. This is problematic because such a distribution is not necessarily close to PP at all.

A branch of statistics – called robust statistics [HR09, HRRS86] – aims to tackle questions like the one above. The usual formalization is in terms of breakdown point, which (informally) is the fraction of observations that an adversary would need to control to be able to completely corrupt an estimator. In low-dimensions, this leads to the prescription that one should use the empirical median instead of the empirical mean to robustly estimate the mean of a distribution, and interquartile range for robust estimates of the variance. In high-dimensions, the Tukey median [Tuk75] is a high-dimensional analogue of the median that, although provably robust, is hard to compute [JP78]. Similar hardness results have been shown [Ber06, HM13] for essentially all known estimators in robust statistics.

Is high-dimensional agnostic distribution learning even possible, algorithmically? The difficulty is that corruptions are often hard to detect in high dimensions, and could bias the natural estimator by dimension-dependent factors. In this work, we study agnostic distribution learning for a number of fundamental classes of distributions: (1)(1) a single Gaussian, (2)(2) a product distribution on the hypercube {0,1}d\{0,1\}^{d}, (3)(3) mixtures of two product distributions (under a natural balancedness condition), and (4)(4) mixtures of kk Gaussians with spherical covariances. Prior to our work, all known efficient algorithms (e.g., [LT15, BD15]) for these classes required the error guarantee, f(ε,d)f(\varepsilon,d), to depend polynomially in the dimension dd. Hence, previous efficient estimators could only tolerate at most a 1/poly(d)1/\operatorname{poly}(d) fraction of errors. In this work, we obtain the first efficient algorithms for the aforementioned problems, where f(ε,d)f(\varepsilon,d) is completely independent of dd and depends polynomially (often, nearly linearly) in the fraction ε\varepsilon of corrupted samples. Our work is just a first step in this direction, and there are many exciting questions left to explore.

2 Our Techniques

All of our algorithms are based on a common recipe. The first question to address is the following: Even if we were given a candidate hypothesis PP^{\prime}, how could we test if it is ε\varepsilon-close in total variation distance to PP? The usual way to certify closeness is to exhibit a coupling between PP and PP^{\prime} that marginally samples from both distributions, where the samples produced from each agree with probability 1ε1-\varepsilon. However, we have no control over the process by which samples are generated from PP, in order to produce such a coupling. And even then, the way that an adversary decides to corrupt samples can introduce complex statistical dependencies.

Unfortunately, in our agnostic setting, we cannot afford for (1) to depend on the dimension dd at all. Any such dependence would appear in the error guarantee of our algorithm. Instead, the starting point of our algorithms is a notion of parameter distance that satisfies

Given our notion of parameter distance satisfying (2), our main ingredient is an efficient method for robustly estimating the parameters. We provide two algorithmic approaches which are based on similar principles. Our first approach is faster, requiring only approximate eigenvalue computations. Our second approach relies on convex programming and achieves slightly better sample complexity, in some cases matching the information-theoretic limit. Notably, either approach can be used to give all of our concrete learning applications with nearly identical sample complexity and error guarantees. In what follows, we specialize to the problem of robustly learning the mean μ\mu of a Gaussian whose covariance is promised to be the identity, which we will use to illustrate how both approaches operate. We emphasize that what is needed to learn the parameters in more general settings requires many additional ideas.

Our second algorithmic approach relies on convex programming. Here, instead of rejecting corrupted samples, we compute appropriate weights wiw_{i} for the samples XiX_{i}, such that the weighted empirical average μ^w=i=1NwiXi\widehat{\mu}_{w}=\mathop{\textstyle\sum}_{i=1}^{N}w_{i}X_{i} is close to μ\mu. We work with the convex set:

We prove that any set of weights in Cδ\mathcal{C}_{\delta} yields a good estimate μ^w=i=1NwiXi\widehat{\mu}_{w}=\mathop{\textstyle\sum}_{i=1}^{N}w_{i}X_{i} in the obvious way. The catch is that the set Cδ\mathcal{C}_{\delta} is defined based on μ\mu, which is unknown. Nevertheless, it turns out that we can use the same type of spectral arguments that underlie the filtering approach to design an approximate separation oracle for Cδ\mathcal{C}_{\delta}. Combined with standard results in convex optimization, this yields an algorithm for robustly estimating μ\mu.

The third and final ingredient is some new concentration bounds. In both of the approaches above, at best we are hoping that we can remove all of the corrupted points and be left with only the uncorrupted ones, and then use standard estimators (e.g., the empirical average) on them. However, an adversary could have removed an ε\varepsilon-fraction of the samples in a way that biases the empirical average of the remaining uncorrupted samples. What we need are concentration bounds that show for sufficiently large NN, for samples X1,X2,,XNX_{1},X_{2},\ldots,X_{N} from a Gaussian with mean μ\mu and identity covariance, that every set of (1ε)N(1-\varepsilon)N samples produces a good estimate for μ\mu. In some cases, we can derive such concentration bounds by appealing to known concentration inequalities and taking a union bound. However, in other cases (e.g., concentration bounds for degree-two polynomials of Gaussian random variables) the existing concentration bounds are not strong enough, and we need other arguments to prove that every set of (1ε)N(1-\varepsilon)N samples produces a good estimate.

3 Our Results

We give the first efficient algorithms for agnostically learning several important distribution classes with dimension-independent error guarantees. Our first main result is for a single arbitrary Gaussian with mean μ\mu and covariance Σ\Sigma, which we denote by N(μ,Σ)\operatorname{\mathcal{N}}(\mu,\Sigma). In the previous subsection, we described our convex programming approach for learning the mean vector when the covariance is promised to be the identity. A technically more involved version of the technique can handle the case of zero mean and unknown covariance. More specifically, consider the following convex set, where Σ\Sigma is the unknown covariance matrix and F\left\|\cdot\right\|_{F} is the Frobenius norm:

We design an approximate separation oracle for this unknown convex set, by analyzing the spectral properties of the fourth moment tensor of a Gaussian. Combining these two intermediate results, we obtain our first main result (below). Throughout this paper, we will abuse notation and write NΩ~(f(d,ε,τ))N\geq\widetilde{\Omega}(f(d,\varepsilon,\tau)) when referring to our sample complexity, to signify that our algorithm works if NCf(d,ε,τ)\mboxpolylog(f(d,ε,τ))N\geq Cf(d,\varepsilon,\tau)\mbox{polylog}(f(d,\varepsilon,\tau)) for a large enough universal constant CC.

We can alternatively establish Theorem 1.2 via our filtering technique. See Section 5. In the first version of our paper, our analysis required Nd3log2(1/τ)/ε2N\gtrsim d^{3}\log^{2}(1/\tau)/\varepsilon^{2} samples. In [DKK+17], we showed that a simple adaptation of our algorithm and analysis achieves the improved sample complexity above, which is information-theoretically optimal up to logarithmic factors. We have incorporated this modification (along with the analysis) into this version of the paper, for the sake of completeness.

For the sake of simplicity in the presentation, we did not make an effort to optimize the sample complexity of our robust estimators in the above setting. We note that methods similar to the analysis of the Gaussian setting can lead to near-optimal sample complexity in this setting as well. We also remark that for the case of balanced binary product distributions, our algorithm achieves an error of O(εlog(1/ε)).O(\varepsilon\sqrt{\log(1/\varepsilon)}).

Interestingly enough, the above two distribution classes are trivial to learn in the noiseless case, but in the agnostic setting the learning problem turns out to be surprisingly challenging. Using additional ideas, we are able to generalize our agnostic learning algorithms to mixtures of the above classes under some natural conditions. We note that even in the noiseless case, learning mixtures of the above families is non-trivial. First, we study 22-mixtures of cc-balanced products, which stipulates that the coordinates of the mean vector of each component are in the range (c,1c)(c,1-c). We prove:

This generalizes the algorithm of Freund and Mansour [FM99] to the agnostic setting. An interesting open question is to improve the ε\varepsilon-dependence in the above bound to (nearly) linear, or to remove the assumption of balancedness and obtain an agnostic algorithm for mixtures of two arbitrary product distributions.

Finally, we give an agnostic learning algorithm for mixtures of spherical Gaussians.

In total, these results give new robust and computationally efficient estimators for several well-studied distribution learning problems that can tolerate a constant fraction of errors independent of the dimension. This points to an interesting new direction of making robust statistics algorithmic. The general recipe we have developed here gives us reason to be optimistic about many other problems in this domain.

4 Discussion and Related Work

Our results fit in the framework of density estimation and parameter learning which are both classical problems in statistics with a rich history (see e.g., [BBBB72, DG85, Sil86, Sco92, DL01]). While these problems have been studied for several decades by different communities, the computational complexity of learning is still not well understood, even for some surprisingly simple distribution families. Most textbook estimators are hard to compute in general, especially in high-dimensional settings. In the past few decades, a rich body of work within theoretical computer science has focused on designing computationally efficient distribution learning algorithms. In a seminal work, Kearns, Mansour, Ron, Rubinfeld, Schapire, and Sellie [KMR+94] initiated a systematic investigation of the computational complexity of distribution learning. Since then, efficient learning algorithms have been developed for a wide range of distributions in both low and high-dimensions [Das99, FM99, AK01, VW02, CGG02, MR05, BV08, KMV10, MV10, BS10, DDS12, CDSS13, DDO+13, CDSS14a, CDSS14b, HP15, ADLS17, DDS15b, DDKT16, DKS16b, DKS16a].

We will be particularly interested in efficient learning algorithms for mixtures of high-dimensional Gaussians and mixtures of product distributions, as this is the focus of our algorithmic results in the agnostic setting. In a pioneering work, Dasgupta [Das99] introduced the problem of parameter estimation of a Gaussian mixture to theoretical computer science, and gave the first provably efficient algorithms under the assumption that the components are suitably well-separated. Subsequently, a number of works improved these separation conditions [AK01, VW02, BV08] and ultimately removing them entirely [KMV10, MV10, BS10]. In another line of work, Freund and Mansour [FM99] gave the first polynomial time algorithm for properly learning mixtures of two binary product distributions. This algorithm was substantially generalized to phylogenetic trees [CGG02] and to mixtures of any constant number of discrete product distributions [FOS08]. Given the vast body of work on high-dimensional distribution learning, there is a plethora of problems where one could hope to reconcile robustness and computational efficiency. Thus far, the only setting where robust and efficient algorithms are known is on one-dimensional distribution families, where brute-force search or some form of polynomial regression often works. In contrast, essentially nothing is known about efficient agnostic distribution learning in the high-dimensional setting that we study here.

Question 1.1 also resembles learning in the presence of malicious errors [Val85, KL93]. There, an algorithm is given samples from a distribution along with their labels according to an unknown target function. The adversary is allowed to corrupt an ε\varepsilon-fraction of both the samples and their labels. A sequence of works studied the problem of learning a homogeneous halfspace with malicious noise in the setting where the underlying distribution is a Gaussian [Ser01, Ser03, KLS09], culminating in the work of Awasthi, Balcan, and Long [ABL17], who gave an efficient algorithm that finds a halfspace with agreement O(ε)O(\varepsilon). There is no direct connection between their problem and ours, especially since one is a supervised learning problem and the other is unsupervised. We note however that there is an interesting technical parallel in that the work [KLS09] also uses spectral methods to detect outliers. Both their work and our algorithm for agnostically learning the mean are based on the intuition that an adversary can only substantially bias the empirical mean if the corruptions are correlated along some direction. More specifically, [KLS09] produce a “hard” filter which leads to errors that scale logarithmically with the dimension, even in a weaker corruption model than ours. Our algorithms need to handle many significant conceptual and technical complications that arise when working with higher moments or other distribution families.

Another connection is to the work on robust principal component analysis (PCA). PCA is a transformation that (among other things) is often justified as being able to find the affine transformation Y=Σ1/2(Xμ)Y=\Sigma^{-1/2}(X-\mu) that would place a collection of Gaussian random variables in isotropic position. One can think of our results on agnostically learning a Gaussian as a type of robust PCA that tolerates gross corruptions, where entire samples are corrupted. This is different than other variants of the problem where random sets of coordinates of the points are corrupted [CLMW11], or where the uncorrupted points were assumed to lie in a low-dimensional subspace to begin with [ZL14, LMTZ15]. Finally, Brubaker [Bru09] studied the problem of clustering samples from a well-separated mixture of Gaussians in the presence of adversarial noise. The goal of [Bru09] was to separate the Gaussian components from each other, while the adversarial points are allowed to end up in any of clusters. Our work is orthogonal to [Bru09], since even if such a clustering is given, the problem still remains to estimate the parameters of each component.

5 Concurrent and Subsequent Work

After the initial publication of our results [DKK+16], there has been a flurry of recent work on robust high-dimensional estimation. Diakonikolas, Kane, and Stewart [DKS16c] studied the problem of learning the parameters of a graphical model in the presence of noise, when given its graph theoretic structure. Charikar, Steinhardt, and Valiant [CSV17] developed algorithms that can tolerate a fraction of corruptions greater than a half, under the weaker goal of outputting a small list of candidate hypotheses that contains a parameter set close to the true values. Balakrishnan, Du, Li, and Singh [Li17, DBS17, BDLS17] studied sparse mean and covariance estimation in the presence of noise obtaining computationally efficient robust algorithms with sample complexity sublinear in the dimension. Diakonikolas, Kane, and Stewart [DKS17] proved statistical query lower bounds providing evidence that the error guarantees of our robust mean and covariance estimation algorithms are best possible, within constant factors, for efficient algorithms. In a subsequent paper [DKK+17], we obtained improved bounds on the sample complexity of our algorithms, which are optimal up to polylogarithmic factors. For the sake of completeness, we include these improved sample bounds in the present version of this paper. In the same work [DKK+17], we showed that our algorithmic approach easily extends to obtain dimension-independent robustness guarantees under much weaker distributional assumptions, and gave a practical demonstration of the efficacy of our robust algorithms on both real and synthetic data.

Since the initial submission of the journal version of this paper, there has been a substantial amount of work on robust high-dimensional estimation in a variety of settings. Diakonikolas, Kane, and Stewart [DKS18a] studied PAC learning of geometric concept classes (including low-degree polynomial threshold functions and intersections of halfspaces) in the same corruption model as ours, obtaining the first dimension-independent error guarantees for these classes. Steinhardt, Charikar, and Valiant [SCV18] focused on deterministic conditions of a dataset which allow robust estimation to be possible. In our initial publication, we gave explicit deterministic conditions in various settings; by focusing directly on this goal, [SCV18] somewhat relaxed some of these assumptions. Meister and Valiant [MV17] studied learning in a crowdsourcing model, where the fraction of honest workers may be very small (similar to [CSV17]). Qiao and Valiant [QV18] considered robust estimation of discrete distributions in a setting where we have several sources (a fraction of which are adversarial) who each provide a batch of samples. A number of simultaneous works [KSS18, HL18, DKS18b] investigated robust mean estimation in even more general settings, and apply their techniques to learning mixtures of spherical Gaussians under minimal separation conditions. Finally, several concurrent results study robustness in supervised learning tasks [PSBR18, KKM18, DKK+18], including regression and SVM problems. Despite all of this rapid progress, there are still many interesting theoretical and practical questions left to explore.

6 Organization

The structure of this paper is as follows: In Section 2, we introduce basic notation and a number of useful facts that will be required throughout the paper, as well as the formal definition of our adversary model. In Section 3, we discuss several natural approaches to high-dimensional agnostic learning, all of which lose polynomial factors that depend on the dimension, in terms of their error guarantee.

The main body of the paper is in Sections 4–8. Sections 4 and 6 illustrate our convex programming framework, while Sections 5, 7, and 8 illustrate our filter framework. More specifically, in Sections 4 and 5, we analyze the setting of a single Gaussian with unknown mean and unknown covariance, using our convex programming and filter frameworks, respectively. In Section 6, we generalize the convex programming method to obtain an agnostic algorithm for mixtures of spherical Gaussians with unknown means. In Section 7, we apply our filter techniques to a binary product distribution, and generalize these in Section 8 to obtain an agnostic learning algorithm for a mixture of two balanced binary product distributions.

We note that for some of the more advanced applications of our frameworks, the technical details can get in the way of the fundamental ideas. For the reader who is interested in seeing the details of our most basic application of the convex programming framework, we recommend reading the case a Gaussian with unknown mean, in Section 4.3. Similarly, for the filter framework, we suggest either the Gaussian with unknown mean in Section 5.1 or the balanced product distribution in Section 7.1.

Preliminaries

Throughout this paper, if vv is a vector, we will let v2\|v\|_{2} denote its Euclidean norm. If MM is a matrix, we will let M2\|M\|_{2} denote its spectral norm, and MF\|M\|_{F} denote its Frobenius norm. We will also let \preceq and \succeq denote the PSD ordering on matrices. For a discrete distribution PP, we will denote by P(x)P(x) the probability mass at point xx. For a continuous distribution, let it denote the probability density function at xx. Let SS be a multiset over {0,1}d\{0,1\}^{d} . We will write XuSX\in_{u}S to denote that XX is drawn from the empirical distribution defined by SS. Throughout the paper, we let \otimes denote the Kronecker product of matrices.

As a measure of distance between distributions, we will use the notion of total variation distance:

2 Types of Adversaries

In this paper, we will consider a powerful model for agnostic distribution learning that generalizes many other existing models. The standard setup involves an oblivious adversary who chooses a distribution that is close in total variation distance to an unknown distribution in some class D\mathcal{D}.

Given ε>0\varepsilon>0, and a class of distributions D\mathcal{D}, the full adversary operates as follows: The algorithm specifies some number of samples mm. The adversary generates mm samples X1,X2,,XmX_{1},X_{2},\ldots,X_{m} from some (unknown) distribution DDD\in\mathcal{D}. It then draws mm^{\prime} from an appropriate distribution. This distribution is allowed to depend on X1,X2,,XmX_{1},X_{2},\ldots,X_{m}, but when marginalized over the mm samples satisfies m\mboxBin(m,ε)m^{\prime}\sim\mbox{Bin}(m,\varepsilon). The adversary is allowed to inspect the samples, removes mm^{\prime} of them, and replaces them with arbitrary points. The set of mm points is given (in any order) to the algorithm.

We rely on the following well-known fact:

Now we can describe how the full adversary can corrupt samples from DD to get samples distributed according to PP.

The full adversary can simulate any oblivious adversary.

We draw mm samples X1,X2,,XmX_{1},X_{2},\ldots,X_{m} from DD. We delete each sample XiX_{i} independently with probability ε2\varepsilon_{2} and replace it with an independent sample from N2N_{2}. This gives a set of samples Y1,Y2,,YmY_{1},Y_{2},\ldots,Y_{m} that are independently sampled from (1ε2)D+ε2N2(1-\varepsilon_{2})D+\varepsilon_{2}N_{2}. Since the distributions (1ε1)P+ε1N1(1-\varepsilon_{1})P+\varepsilon_{1}N_{1} and (1ε2)D+ε2N2(1-\varepsilon_{2})D+\varepsilon_{2}N_{2} are identical, we can couple them to independent samples Z1,Z2,,ZmZ_{1},Z_{2},\ldots,Z_{m} from (1ε1)P+ε1N1(1-\varepsilon_{1})P+\varepsilon_{1}N_{1}. Now each sample ZiZ_{i} that came from N1N_{1}, we can delete and replace with an independent sample from PP. The result is a set of samples that are independently sampled from PP where we have made mm^{\prime} edits and marginally m\mboxBin(m,ε1+ε2)m^{\prime}\sim\mbox{Bin}(m,\varepsilon_{1}+\varepsilon_{2}), although mm^{\prime} has and needs to have some dependence on the original samples from DD. ∎

The challenge in working with the full adversary is that even the samples that came from DD can have biases. The adversary can now choose how to remove uncorrupted points in a careful way so as to compensate for certain other biases that he introduces using the corrupted points.

Throughout this paper, we will make use of the following notation and terminology:

We say a set of samples X1,X2,,XmX_{1},X_{2},\ldots,X_{m} is an ε\varepsilon-corrupted set of samples generated by the oblivious (resp. full) adversary if it is generated by the process described above in the definition of the oblivious (resp. full) adversary. If it was generated by the full adversary, we let G[m]G\subseteq[m] denote the indices of the uncorrupted samples, and we let E[m]E\subseteq[m] denote the indices of the corrupted samples.

In this paper, we will give a number of algorithms for agnostic distribution learning that work in the full adversary model. In our analysis, we will identify a set of events that ensure the algorithm succeeds and will bound the probability that any of these events does not occur when mm is suitably large. We will often explicitly invoke the assumption that E2εm|E|\leq 2\varepsilon m. We can do this even though the number of points that are corrupted is itself a random variable, because by the Chernoff bound, as long as mO(log1/τε)m\geq O\left(\frac{\log 1/\tau}{\varepsilon}\right), we know that E2εm|E|\leq 2\varepsilon m holds with probability at least 1O(τ)1-O(\tau). Thus, making the assumption that E2εm|E|\leq 2\varepsilon m costs us an additional additive O(τ)O(\tau) term in our union bound, when bounding the failure probability of our algorithms.

3 Distributions of Interest

One object of study in this paper is the Gaussian (or Normal) distribution.

A Gaussian distribution N(μ,Σ)\mathcal{N}(\mu,\Sigma) with mean μ\mu and covariance Σ\Sigma is the distribution with probability density function

We will also be interested in binary product distributions.

A (binary) product distribution is a probability distribution over {0,1}d\{0,1\}^{d} whose coordinate random variables are independent. Note that a binary product distribution is completely determined by its mean vector.

We will also be interested in mixtures of such distributions.

A mixture PP of distributions P1,,PkP_{1},\dots,P_{k} with mixing weights α1,,αk\alpha_{1},\dots,\alpha_{k} is the distribution defined by

where αj0\alpha_{j}\geq 0 for all jj and j[k]αj=1\sum_{j\in[k]}\alpha_{j}=1.

4 Bounds on TV Distance

The Kullback-Liebler divergence (also known as relative entropy, information gain, or information divergence) is a well-known measure of distance between two distributions.

The primary interest we have in this quantity is the fact that (1) the KL divergence between two Gaussians has a closed form expression, and (2) it can be related (often with little loss) to the total variation distance between the Gaussians. The first statement is expressed in the fact below:

Let N(μ1,Σ1)\operatorname{\mathcal{N}}(\mu_{1},\Sigma_{1}) and N(μ2,Σ2)\operatorname{\mathcal{N}}(\mu_{2},\Sigma_{2}) be two Gaussians such that det(Σ1),det(Σ2)0\det(\Sigma_{1}),\det(\Sigma_{2})\neq 0. Then

The second statement is encapsulated in the well-known Pinsker’s inequality:

With this we can show the following two useful lemmas, which allow us to relate parameter distance between two Gaussians to their total variation distance. The first bounds the total variation distance between two Gaussians with identity covariance in terms of the Euclidean distance between the means:

In the case where Σ1=Σ2=I\Sigma_{1}=\Sigma_{2}=I, (3) simplifies to

Pinsker’s inequality (Theorem 2.12) then implies that

The second bounds the total variation distance between two mean zero Gaussians in terms of the Frobenius norm of the difference between their covariance matrices:

Let δ>0\delta>0 be sufficiently small. Let Σ1,Σ2\Sigma_{1},\Sigma_{2} such that IΣ21/2Σ1Σ21/2F=δ\|I-\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}\|_{F}=\delta. Then,

Let M=Σ21/2Σ1Σ21/2M=\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}. Then (3) simplifies to

Since both terms in the last line are rotationally invariant, we may assume without loss of generality that MM is diagonal. Let M=\mboxdiag(1+λ1,,1+λd)M=\mbox{diag}(1+\lambda_{1},\ldots,1+\lambda_{d}). Thus, the KL divergence between the two distributions is given exactly by 12i=1d(λilog(1+λi)),\frac{1}{2}\sum_{i=1}^{d}\left(\lambda_{i}-\log(1+\lambda_{i})\right), where we are guaranteed that (i=1dλi2)1/2=δ(\sum_{i=1}^{d}\lambda_{i}^{2})^{1/2}=\delta. By the second order Taylor approximation to ln(1+x)\ln(1+x), for xx small, we have that for δ\delta sufficiently small,

Our algorithm for agnostically learning an arbitrary Gaussian will be based on solving two intermediate problems: (1) We are given samples from N(μ,I)\operatorname{\mathcal{N}}(\mu,I) and our goal is to learn μ\mu, and (2) We are given samples from N(0,Σ)\operatorname{\mathcal{N}}(0,\Sigma) and our goal is to learn Σ\Sigma. The above bounds on total variation distance will allow us to conclude that our estimate is close in total variation distance to the unknown Gaussian distribution in each of the two settings.

We note the following folklore sample complexity bounds for learning a Gaussian in the non-agnostic setting.

N=Θ(d+log(1/τ)ε2)N=\Theta\left(\frac{d+\log(1/\tau)}{\varepsilon^{2}}\right) samples are both necessary and sufficient to learn a dd-dimensional Gaussian with unknown mean and known covariance to total variation distance ε\varepsilon with probability 1τ1-\tau.

N=Θ(d2+log(1/τ)ε2)N=\Theta\left(\frac{d^{2}+\log(1/\tau)}{\varepsilon^{2}}\right) samples are both necessary and sufficient to learn a dd-dimensional Gaussian with unknown mean and covariance to total variation distance ε\varepsilon with probability 1τ1-\tau.

We will also need the following lemma bounding the total variation distance between two product distributions:

Let P,QP,Q be binary product distributions with mean vectors p,q(0,1)d.p,q\in(0,1)^{d}. We have that

The elementary inequality 2ab=a+b(ab)2,2\sqrt{ab}=a+b-(\sqrt{a}-\sqrt{b})^{2}, a,b>0,a,b>0, gives that

where the last inequality follows from the union bound. ∎

5 Additional Concentration Lemmata

In this section, we list a number of standard concentration inequalities for nice random variables which we will frequently use throughout this paper. The proofs of these results are standard and omitted, see e.g., [Ver10] for a more thorough treatment of these results.

The first is a Chernoff bound for bounded random variables.

Let Z1,,ZdZ_{1},\ldots,Z_{d} be independent random variables with ZiZ_{i} supported on [ai,bi].[a_{i},b_{i}]. Let Z=i=1dZiZ=\sum_{i=1}^{d}Z_{i}. Then for any T>0T>0,

We will also require the following tail bounds for Gaussians and quadratic forms of Gaussians:

By standard union bound arguments (see e.g. [Ver10]), we obtain the following concentration results for the empirical mean and covariance of a set of Gaussian vectors:

Let nn be a positive integer. Let DD be a sub-gaussian distribution with mean and covariance II. Let YiDY_{i}\sim D be independent, for i=1,,ni=1,\ldots,n. Then, there exist universal constants A,B>0A,B>0 so that for all t>0t>0, we have

With the same setup as in Lemma 2.21, there exist universal constants A,B>0A,B>0 so that for all t>0t>0, we have

6 Agnostic Hypothesis Selection

Several of our algorithms will return a polynomial-sized list of hypotheses at least one of which is guaranteed to be close to the target distribution. Usually (e.g., in a non-agnostic setting), one could use a polynomial number of additional samples to run a tournament to identify the candidate hypothesis that is (roughly) the closest to the target distribution. In the discussion that follows, we will refer to these additional samples as test samples. Such hypothesis selection algorithms have been extensively studied [Yat85, DL96, DL97, DL01, DK14, AJOS14, SOAJ14, DDS15a, DDS15b]. Unfortunately, against a strong adversary we run into a serious technical complication: the training samples and test samples are not necessarily independent. Moreover even if we randomly partition our samples in training and test, a priori there are an unbounded set of possible hypotheses that the training phase could output, and when we analyze the tournament we cannot condition on the list of hypotheses and assume that the test samples are sampled anew. Our approach is to require our original algorithm to return only hypotheses from some finite set of possibilities, and as we will see this mitigates the problem.

As a simple corollary of the agnostic tournament, observe that this allows us to do agnostic learning without knowing the precise error rate ε\varepsilon. Throughout the paper, we assume the algorithm knows ε\varepsilon, and guarantees that the output will have error which is at most O(f(ε))O(f(\varepsilon)). However, if the algorithm is not given this information, and instead given an η\eta and asked to return something with error at most O(f(ε+η))O(f(\varepsilon+\eta)), we may simply grid over {η,(1+γ)η,(1+γ)2η,,1}\{\eta,(1+\gamma)\eta,(1+\gamma)^{2}\eta,\ldots,1\} (here γ\gamma is some arbitrary constant that governs a tradeoff between runtime and accuracy), run our algorithm with ε\varepsilon set to each element in this set, and perform hypothesis selection via Tournament. Then it is not hard to see that we are guaranteed to output something which has error at most O(f(ε+(1+γ)η))O(f(\varepsilon+(1+\gamma)\eta)).

Firstly, we randomly choose a subset of NN of our samples and a disjoint subset of C(log(M)+log(1/τ))/ε2C(\log(|\mathcal{M}|)+\log(1/\tau))/\varepsilon^{2} of our samples for some sufficiently large CC. Note that with high probability over our randomization, at most a 2ε2\varepsilon-fraction of samples from each subset are corrupted. Thus, we may instead consider the stronger adversary who sees a set S1S_{1} of NN independent samples from Π\Pi and another set, S2S_{2}, of C(log(M)+log(1/τ))/ε2C(\log(|\mathcal{M}|)+\log(1/\tau))/\varepsilon^{2} samples from Π\Pi and can arbitrary corrupt a 2ε2\varepsilon-fraction of each, giving sets S1S^{\prime}_{1},S2S^{\prime}_{2}.

With probability at least 1τ/21-\tau/2 over S1S_{1}, the original algorithm run on S1S^{\prime}_{1} returns a set L\mathcal{L} satisfying the desired properties.

We claim that with probability at least 1τ/21-\tau/2 over the choice of S2S_{2}, we have for any P,QMP,Q\in\mathcal{M}:

This follows by Chernoff bounds and a union bound over the M2|\mathcal{M}|^{2} possibilities for PP and QQ. Since the total variation distance between the uniform distributions over S2S_{2} and S2S^{\prime}_{2} is at most 2ε2\varepsilon, we also have for S2S^{\prime}_{2} that

Therefore, if we throw away any QQ in our list for which there is a PP in our list such that

Some Natural Approaches, and Why They Fail

Many of the agnostic distribution learning problems that we study are so natural that one would immediately wonder why simpler approaches do not work. Here we detail some other plausible approaches, and what causes them to lose dimension-dependent factors (if they have any guarantees at all!). For the discussion that follows, we note that by Fact 2.13 in order to achieve an estimate that is O(ε)O(\varepsilon)-close in total variation distance (for a Gaussian when μ\mu is unknown and Σ=I\Sigma=I) it is necessary and sufficient that μ^μ2=O(ε)\|\hat{\mu}-\mu\|_{2}=O(\varepsilon).

One plausible approach for robust mean estimation in high dimensions is to agnostically learn along each coordinate separately. For instance, if our goal is to agnostically learn the mean of a Gaussian with known covariance II, we could try to learn each coordinate of the mean separately. But since an ε\varepsilon-fraction of the samples are corrupted, our estimate can be off by ε\varepsilon in each coordinate and would be off by εd\varepsilon\sqrt{d} in high dimensions.

Maximum Likelihood

Given a set of samples X1,,XNX_{1},\ldots,X_{N}, and a class of distributions D\mathcal{D}, the maximum likelihood estimator (MLE) is the distribution FDF\in\mathcal{D} that maximizes i=1NF(Xi)\prod_{i=1}^{N}F(X_{i}). Equivalently, FF minimizes the negative log likelihood (NLL), which is given by

and so the μ\mu which minimizes NLL(N(μ,I),X1,,XN)\textrm{NLL}(\operatorname{\mathcal{N}}(\mu,I),X_{1},\ldots,X_{N}) is the mean of the samples XiX_{i}, since for any set of vectors v1,,vNv_{1},\ldots,v_{N}, the average of the viv_{i}’s is the minimizer of the function h(x)=i=1Nvix22h(x)=\sum_{i=1}^{N}\|v_{i}-x\|_{2}^{2}. Hence, if an adversary places an ε\varepsilon-fraction of the points at some very large distance, then the estimate for the mean would need to move considerably in that direction. By placing the corruptions further and further away, the MLE can be an arbitrarily bad estimate. That is, even though it is well known [Hub67, Whi82] that the MLE converges to the distribution FDF\in\mathcal{D} that is closest in KL-divergence to the distribution from which our samples were generated (i.e., after the adversary has added corruptions), FF is not necessarily close to the uncorrupted distribution.

Geometric Median

In one dimension, it is well-known that the median provides a provably robust estimate for the mean in a number of settings. The mean of a set of points a1,,aNa_{1},\ldots,a_{N} is the minimizer of the function f(x)=i=1N(aix)2f(x)=\sum_{i=1}^{N}(a_{i}-x)^{2}, and in contrast the median is the minimizer of the function f(x)=i=1Naixf(x)=\sum_{i=1}^{N}|a_{i}-x|. In higher dimensions, there are many natural definitions for the median that generalize the one-dimensional case. The Tukey median is one such notion, but as we discussed it is hard to compute [JP78] and the best known algorithms run in time exponential in dd. Motivated by this, the geometric median is another high-dimensional notion of a median. It often achieves better robustness than the mean, and can be computed quickly [CLM+16]. The formal definition is:

Unfortunately, this notion of median still incurs an error containing a factor of O(d)O(\sqrt{d}):

Given a set SS of N=Ω(d+log(1/τ)ε2)N=\Omega\left(\frac{d+\log(1/\tau)}{\varepsilon^{2}}\right) samples from N(0,I)\operatorname{\mathcal{N}}(0,I), then with probability at least 1τ1-\tau, there exists a corruption SS^{\prime} of SS, such that:

Agnostically Learning a Gaussian, via Convex Programming

In this section we give a polynomial time algorithm to agnostically learn a single Gaussian up to error O~(ε)\widetilde{O}(\varepsilon). Our approach is based on the following ingredients: First, in Section 4.1, we define the set SN,εS_{N,\varepsilon}, which will be a key algorithmic object in our framework. In Section 4.2 we give key, new concentration bounds on certain statistics of Gaussians. We will make crucial use of these concentration bounds throughout this section. In Section 4.3 we give an algorithm to agnostically learn a Gaussian with unknown mean and whose covariance is promised to be the identity via convex programming. This will be an important subroutine in our overall algorithm, and it also helps to illustrate our algorithmic approach without many of the additional complications that arise in our later applications. In Section 4.4 we show how to robustly learn a Gaussian with mean zero and unknown covariance again via convex programming. Finally, in Section 4.5 we show how to combine these two intermediate results to get our overall algorithm.

An important algorithmic object for us will be the following set:

For any 12>ε>0\frac{1}{2}>\varepsilon>0 and any integer NN, let

and so we see that this set is designed to capture the notion of selecting a set of (12ε)N(1-2\varepsilon)N samples from NN samples.

Given wSN,εw\in S_{N,\varepsilon} we will use the following notation

to denote the total weight on good and bad points respectively. The following facts are immediate from E2εN|E|\leq 2\varepsilon N and the properties of SN,εS_{N,\varepsilon}.

If wSN,εw\in S_{N,\varepsilon} and E2εN|E|\leq 2\varepsilon N, then wb2ε12εw_{b}\leq\frac{2\varepsilon}{1-2\varepsilon}. Moreover, the renormalized weights ww^{\prime} on good points given by wi=wiwgw^{\prime}_{i}=\frac{w_{i}}{w_{g}} for all iGi\in G, and wi=0w^{\prime}_{i}=0 otherwise, satisfy wSN,4εw^{\prime}\in S_{N,4\varepsilon}.

2 Concentration Inequalities

Throughout this section and in Section 6 we will make use of various concentration bounds on low moments of Gaussian random variables. Some are well-known, and others are new but follow from known bounds and appropriate union bound arguments.

We will also be interested in how well various statistics of Gaussians concentrate around their expectation, when we take the worst-case set of weights in SN,εS_{N,\varepsilon}. This is more subtle than standard settings such as Lemma 2.21 or Lemma 2.22 because as we take more samples, any fixed statistic (e.g. taking the uniform distribution over the samples) concentrates better but the size of SN,εS_{N,\varepsilon} (e.g. the number of sets of (12ε)N(1-2\varepsilon)N samples) grows too. We defer the proofs to Appendix A. The first concerns the behavior of the empirical covariance:

Fix ε1/2\varepsilon\leq 1/2 and τ1\tau\leq 1. There is a δ1=O(εlog1/ε)\delta_{1}=O(\varepsilon\log 1/\varepsilon) such that if Y1,,YNY_{1},\ldots,Y_{N} are independent samples from N(0,I)\operatorname{\mathcal{N}}(0,I) and N=Ω(d+log(1/τ)δ12)N=\Omega\left(\frac{d+\log(1/\tau)}{\delta_{1}^{2}}\right), then

A nearly identical argument (Using Hoeffding instead of Bernstein in the proof of Theorem 5.50 in [Ver10]) yields:

Fix ε\varepsilon and τ\tau as above. There is a δ2=O(εlog1/ε)\delta_{2}=O(\varepsilon\sqrt{\log 1/\varepsilon}) such that if Y1,,YNY_{1},\ldots,Y_{N} are independent samples from N(0,I)\operatorname{\mathcal{N}}(0,I) and N=Ω(d+log(1/τ)δ22)N=\Omega\left(\frac{d+\log(1/\tau)}{\delta_{2}^{2}}\right), then

Note that by Cauchy-Schwarz, this implies:

Fix ε\varepsilon and τ\tau as above. There is a δ2=O(εlog1/ε)\delta_{2}=O(\varepsilon\sqrt{\log 1/\varepsilon}) such that if Y1,,YNY_{1},\ldots,Y_{N} are independent samples from N(0,I)\operatorname{\mathcal{N}}(0,I) and N=Ω(d+log(1/τ)δ22)N=\Omega\left(\frac{d+\log(1/\tau)}{\delta_{2}^{2}}\right), then

Fix τ>0\tau>0. Let X1,,XNN(0,I)X_{1},\ldots,X_{N}\sim\operatorname{\mathcal{N}}(0,I). Then, with probability 1τ1-\tau, we have that Xi2O(dlog(N/τ))\|X_{i}\|_{2}\leq O\left(\sqrt{d\log(N/\tau)}\right) for all i=1,,Ni=1,\ldots,N.

2.2 Estimation Error in the Frobenius Norm

Let X1,...,XNX_{1},...,X_{N} be NN i.i.d. samples from N(0,I)\operatorname{\mathcal{N}}(0,I). In this section we demonstrate a tight bound on how many samples are necessary such that the sample covariance is close to II in Frobenius norm. Let Σ^\widehat{\Sigma} denote the empirical covariance, defined to be

By self-duality of the Frobenius norm, we know that

Since there is a 1/41/4-net over all PSD matrices with Frobenius norm 11 of size 9d29^{d^{2}} (see e.g. Lemma 1.18 in [RH17]), the Vershynin-type union bound argument combined with Lemma 2.20 immediately gives us the following:

There exist universal constants A,B>0A,B>0 so that for all t>0t>0, we have

By the argument as used in the proof of Lemma 4.3, we obtain:

Fix ε,τ>0\varepsilon,\tau>0. There is a δ1=O(εlog1/ε)\delta_{1}=O(\varepsilon\log 1/\varepsilon) such that if X1,,XNX_{1},\ldots,X_{N} are independent samples from N(0,I)\operatorname{\mathcal{N}}(0,I), with

Since the proof is essentially identical to the proof of Lemma 4.3, we omit the proof. However, we note that in fact, the proof technique there can be used to show something slightly stronger, which we will require later. The technique actually shows that if we take any set of size at most εN\varepsilon N, and take the uniform weights over that set, then the empirical covariance is not too far away from the truth. More formally:

Fix ε,τ>0\varepsilon,\tau>0. There is a δ2=O(εlog1/ε)\delta_{2}=O(\varepsilon\log 1/\varepsilon) such that if X1,,XNX_{1},\ldots,X_{N} are independent samples from N(0,I)\operatorname{\mathcal{N}}(0,I), with

2.3 Understanding the Fourth Moment Tensor

Our algorithms will be based on understanding the behavior of the fourth moment tensor of a Gaussian when restricted to various subspaces. Let \otimes denote the Kronecker product on matrices. We will make crucial use of the following definition:

We will also require the following definitions:

and let ΠS\Pi_{S} and ΠS\Pi_{\operatorname{{\mathcal{S}}}^{\perp}} denote the projection operators onto S\operatorname{{\mathcal{S}}} and S\operatorname{{\mathcal{S}}}^{\perp} respectively. Finally let

In fact, the projection of v=Mv=M^{\flat} onto S\operatorname{{\mathcal{S}}} where MM is symmetric can be written out explicitly. Namely, it is given by

The key result in this section is the following:

It is important to note that the two terms above are not the same; the first term is high rank, but the second term is rank one. The proof of this theorem will require Isserlis’ theorem, and is deferred to Appendix A.

2.4 Concentration of the Fourth Moment Tensor

We also need to show that the fourth moment tensor concentrates:

Fix ε,τ>0\varepsilon,\tau>0. Let YiN(0,I)Y_{i}\sim\operatorname{\mathcal{N}}(0,I) be independent, for i=1,,Ni=1,\ldots,N, where we set

To do so will require somewhat more sophisticated techniques than the ones used so far to bound spectral deviations. At a high level, this is because fourth moments of Gaussians have a sufficiently larger variance that the union bound techniques used so far are insufficient. However, we will show that the tails of degree four polynomials of Gaussians still sufficiently concentrate such that removing points cannot change the mean by too much. The proof requires slightly fancy machinery and appears in Appendix B.

3 Finding the Mean, Using a Separation Oracle

In this section, we consider the problem of approximating μ\mu given NN samples from N(μ,I)\operatorname{\mathcal{N}}(\mu,I) in the full adversary model. Our algorithm will be based on working with the following convex set:

It is not hard to show that Cδ\mathcal{C}_{\delta} is non-empty for reasonable values of δ\delta (and we will show this later). Moreover we will show that for any set of weights ww in Cδ\mathcal{C}_{\delta}, the empirical average

will be a good estimate for μ\mu. The challenge is that since μ\mu itself is unknown, there is not an obvious way to design a separation oracle for Cδ\mathcal{C}_{\delta} even though it is convex. Our algorithm will run in two basic steps. First, it will run a very naive outlier detection to remove any points which are more than O(d)O(\sqrt{d}) away from the good points. These points are sufficiently far away that a very basic test can detect them. Then, with the remaining points, it will use the approximate separation oracle given below to approximately optimize with respect to CδC_{\delta}. It will then take the outputted set of weights and output the empirical mean with these weights. We will explain these steps in detail below.

Our results will hold under the following deterministic conditions:

The concentration bounds we gave earlier were exactly bounds on the failure probability of either of these conditions, albeit for SN,εS_{N,\varepsilon} instead of SN,4εS_{N,4\varepsilon}.

The first step of our algorithm will be to remove points which have distance which is much larger than O(d)O(\sqrt{d}) from the mean. Our algorithm is very naive: it computes all pairwise distances between points, and throws away all points which have distance more than O(d)O(\sqrt{d}) from more than a 2ε2\varepsilon-fraction of the remaining points.

Suppose that (7) holds. Then NaivePrune removes no uncorrupted points, and moreover, if XiX_{i} is not removed by NaivePrune, we have Xiμ2O(dlog(N/τ))\|X_{i}-\mu\|_{2}\leq O\left(\sqrt{d\log(N/\tau)}\right).

That no uncorrupted point is removed follows directly from (7) and the fact that there can be at most 2εN2\varepsilon N corrupted points. Similarly, if XiX_{i} is not removed by NaivePrune, that means there must be an uncorrupted XjX_{j} such that XiXj2O(dlog(N/τ))\|X_{i}-X_{j}\|_{2}\leq O(\sqrt{d\log(N/\tau)}). Then the desired property follows from (7) and a triangle inequality. ∎

Henceforth, for simplicity we shall assume that no point was removed by NaivePrune, and that for all i=1,,Ni=1,\ldots,N, we have Xiμ2<O(dlog(N/τ))\|X_{i}-\mu\|_{2}<O(\sqrt{d\log(N/\tau)}). Otherwise, we can simply work with the pruned set, and it is evident that nothing changes.

3.2 The Separation Oracle

Our main result in this section is an approximate separation oracle for Cδ\mathcal{C}_{\delta}. Throughout this section, let wSN,εw\in S_{N,\varepsilon} and set μ^=i=1NwiXi\widehat{\mu}=\sum_{i=1}^{N}w_{i}X_{i}. Moreover, let Δ=μμ^\Delta=\mu-\widehat{\mu}. Our first step is to show that any set of weights that does not yield a good estimate for μ\mu cannot be in the set Cδ\mathcal{C}_{\delta}:

Suppose that (8)-(9) holds. Suppose that Δ2=Ω(εδ1)=Ω(εlog1/ε)\|\Delta\|_{2}=\Omega(\sqrt{\varepsilon\delta_{1}})=\Omega(\varepsilon\log 1/\varepsilon). Then

By Fact 4.2 and (9) we have iGwiwgXiμ2δ2\|\sum_{i\in G}\frac{w_{i}}{w_{g}}X_{i}-\mu\|_{2}\leq\delta_{2}. Now by the triangle inequality we have

Using the fact that the variance is nonnegative we have

where in the last inequality we have used Fact 4.2 and (8). Hence altogether this implies that

As a corollary, we find that any set of weights in Cδ\mathcal{C}_{\delta} immediately yields a good estimate for μ\mu:

Suppose that (8) and (9) hold. Let wCδw\in\mathcal{C}_{\delta} for δ=O(εlog1/ε)\delta=O(\varepsilon\log 1/\varepsilon). Then

Our main result in this section is an approximate separation oracle for Cδ\mathcal{C}_{\delta} with δ=O(εlog1/ε)\delta=O(\varepsilon\log 1/\varepsilon).

Fix ε>0\varepsilon>0, and let δ=O(εlog1/ε).\delta=O(\varepsilon\log 1/\varepsilon). Suppose that (8) and (9) hold. Let ww^{\ast} denote the weights which are uniform on the uncorrupted points. Then there is a constant cc and an algorithm such that:

(Completeness) If w=ww=w^{\ast}, then it outputs “YES”.

We remark that these two facts imply that for any τ>0\tau>0, the ellipsoid method with this separation oracle will output a ww^{\prime} such that ww<ε/(Ndlog(N/τ))\|w-w^{\prime}\|_{\infty}<\varepsilon/(N\sqrt{d\log(N/\tau)}), for some wCcδw\in\mathcal{C}_{c\delta} in poly(d,1/ε,log1/τ)\operatorname{poly}(d,1/\varepsilon,\log 1/\tau) steps.

The conditions that the separation oracle given here satisfy are slightly weaker than the traditional guarantees, given, for instance, in [GLS88]. However, the correctness of the ellipsoid algorithm with this separation oracle follows because outside CcδC_{c\delta}, the separation oracle acts exactly as a separation oracle for ww^{*}. Thus, as long as the algorithm continues to query points outside of CcδC_{c\delta}, the action of the algorithm is equivalent to one with a separation oracle for ww^{*}. Moreover, the behavior of the algorithm is such that it will never exclude ww^{*}, even if queries are made within CcδC_{c\delta}. From these two conditions, it is clear from the classical theory presented in [GLS88] that the ellipsoid method satisfies the guarantees given above.

The separation oracle is given in Algorithm 2. Next, we prove correctness for our approximate separation oracle:

Again, let Δ=μμ^\Delta=\mu-\widehat{\mu}, and let M=i=1NwiYiYiTIM=\sum_{i=1}^{N}w_{i}Y_{i}Y_{i}^{T}-I. By expanding out the formula for MM, we get:

Suppose w=ww=w^{\ast}. Then M2<c2δ\|M\|_{2}<\frac{c}{2}\delta.

Recall that ww^{\ast} are the weights that are uniform on the uncorrupted points. Because E2εN|E|\leq 2\varepsilon N we have that wSN,εw^{\ast}\in S_{N,\varepsilon}. We can now use (8) to conclude that wCδ1w^{\ast}\in\mathcal{C}_{\delta_{1}}. Now by Corollary 4.16 we have that Δ2O(εlog1/ε)\|\Delta\|_{2}\leq O(\varepsilon\sqrt{\log 1/\varepsilon}). Thus

Suppose that w∉Ccδw\not\in C_{c\delta}. Then λ>c2δ|\lambda|>\frac{c}{2}\delta.

Let us now split into two cases. If Δ2cδ/10\|\Delta\|_{2}\leq\sqrt{c\delta/10}, then the first term above is at least cδc\delta by definition and we can conclude that λ>cδ/2|\lambda|>c\delta/2. On the other hand, if Δ2cδ/10\|\Delta\|_{2}\geq\sqrt{c\delta/10}, by Lemma 4.15, we have that

which for sufficiently small ε\varepsilon also yields λ>cδ/2|\lambda|>c\delta/2. ∎

Suppose Δ2cδ/10\|\Delta\|_{2}\leq\sqrt{c\delta/10}. By (11) we immediately have:

since λ>cδ/2\lambda>c\delta/2. On the other hand, if Δ2cδ/10\|\Delta\|_{2}\geq\sqrt{c\delta/10} then by (10) we have λ=Ω(Δ22ε)\lambda=\Omega\left(\frac{\|\Delta\|_{2}^{2}}{\varepsilon}\right). Putting it all together we have:

where in the last line we used the fact that λ>Ω(Δ22ε)\lambda>\Omega\left(\frac{\|\Delta\|_{2}^{2}}{\varepsilon}\right), and Δ22Ω(ε2log1/ε)\|\Delta\|_{2}^{2}\geq\Omega(\varepsilon^{2}\log 1/\varepsilon). This now completes the proof. ∎

3.3 The Full Algorithm

This separation oracle, along with the classical theory of convex optimization [GLS88], implies that we have shown the following:

Fix ε,τ>0\varepsilon,\tau>0, and let δ=O(εlog1/ε)\delta=O(\varepsilon\sqrt{\log 1/\varepsilon}). Let X1,,XNX_{1},\ldots,X_{N} be an ε\varepsilon-corrupted set of points satisfying (8)-(9), for δ1δ\delta_{1}\leq\delta and δ2δlog1/ε\delta_{2}\leq\delta\sqrt{\log 1/\varepsilon}. Let cc be a sufficiently large constant. Then, there is an algorithm \textscLearnApproxMean(ε,τ,X1,,XN)\textsc{LearnApproxMean}(\varepsilon,\tau,X_{1},\ldots,X_{N}) which runs in time poly(N,d,1/ε,log1/τ)\operatorname{poly}(N,d,1/\varepsilon,\log 1/\tau), and outputs a set of weights wSN,εw^{\prime}\in S_{N,\varepsilon} such that there is a wCcδw\in C_{c\delta} such that wwε/(Ndlog(N/τ))\|w-w^{\prime}\|_{\infty}\leq\varepsilon/(N\sqrt{d\log(N/\tau)}).

This algorithm, while an extremely powerful primitive, is technically not sufficient. However, given this, the full algorithm is not too difficult to state: simply run NaivePrune, then optimize over CcδC_{c\delta} using this separation oracle, and get some ww which is approximately in CcδC_{c\delta}. Then, output i=1NwiXi\sum_{i=1}^{N}w_{i}X_{i}. For completeness, the pseudocode for the algorithm is given below. In the pseudocode, we assume that \textscEllipsoid(\textscSeparationOracleUnknownMean,ε)\textsc{Ellipsoid}(\textsc{SeparationOracleUnknownMean},\varepsilon^{\prime}) is a convex optimization routine, which given the SeparationOracleUnknownMean separation oracle and a target error ε\varepsilon^{\prime}, outputs a ww^{\prime} such that wwε\|w-w^{\prime}\|_{\infty}\leq\varepsilon^{\prime}. From the classical theory of optimization, we know such a routine exists and runs in polynomial time.

Fix ε,τ>0\varepsilon,\tau>0, and let δ=O(εlog1/ε)\delta=O(\varepsilon\sqrt{\log 1/\varepsilon}). Let X1,,XNX_{1},\ldots,X_{N} be an ε\varepsilon-corrupted set of samples, where

Let μ^\widehat{\mu} be the output of \textscLearnMean(ε,τ,X1,,XN)\textsc{LearnMean}(\varepsilon,\tau,X_{1},\ldots,X_{N}). Then with probability 1τ1-\tau, we have μ^μ2δ\|\widehat{\mu}-\mu\|_{2}\leq\delta.

By Fact 4.6, Lemma 4.3, and Lemma 4.4, we know that (7)-(9) hold with probability 1τ1-\tau, with δ1,δ2δ\delta_{1},\delta_{2}\leq\delta. Condition on the event that this event holds. After NaivePrune, by Fact 4.14 we may assume that no uncorrupted points are removed, and all points satisfy Xiμ2O(dlog(N/τ))\|X_{i}-\mu\|_{2}\leq O(\sqrt{d\log(N/\tau)}). Let ww^{\prime} be the output of the algorithm, and let wCcδw\in C_{c\delta} be such that ww<ε/(Ndlog(N/τ))\|w-w^{\prime}\|_{\infty}<\varepsilon/(N\sqrt{d\log(N/\tau)}). By Corollary 4.16, we know that i=1NwiXiμ2O(δ)\|\sum_{i=1}^{N}w_{i}X_{i}-\mu\|_{2}\leq O(\delta). Hence, we have

so the entire error is at most O(δ)O(\delta), as claimed. ∎

4 Finding the Covariance, Using a Separation Oracle

In this section, we consider the problem of approximating Σ\Sigma given NN samples from N(0,Σ)\operatorname{\mathcal{N}}(0,\Sigma) in the full adversary model. Let Ui=Σ1/2XiU_{i}=\Sigma^{-1/2}X_{i} such that if XiN(0,Σ)X_{i}\sim\operatorname{\mathcal{N}}(0,\Sigma) then UiN(0,I)U_{i}\sim\operatorname{\mathcal{N}}(0,I). Moreover let Zi=Ui2Z_{i}=U_{i}^{\otimes 2}. Our approach will parallel the one given earlier in Section 4.3. Again, we will work with a convex set

and our goal is to design an approximate separation oracle. Our results in this section will rely on the following deterministic conditions:

for all wSN,εw\in S_{N,\varepsilon}, and all sets TGT\subseteq G of size T2εN|T|\leq 2\varepsilon N. As before, by Fact 4.2, the renormalized weights over the uncorrupted points are in SN,4εS_{N,4\varepsilon}. Hence, we can appeal to Fact 4.6, Corollary 4.8, Corollary 4.9, and Theorem 4.13 with SN,4εS_{N,4\varepsilon} instead of SN,εS_{N,\varepsilon} to bound the probability that this event does not hold. Let ww^{\ast} be the set of weights which are uniform over the uncorrupted points; by (13) for δΩ(εlog1/ε)\delta\geq\Omega(\varepsilon\sqrt{\log 1/\varepsilon}) we have that wCδw^{\ast}\in C_{\delta}.

Let δ=O(εlog1/ε)\delta=O(\varepsilon\log 1/\varepsilon). Suppose that (13), (14), and 15 hold for δ1,δ2O(δ)\delta_{1},\delta_{2}\leq O(\delta) and δ3O(δlog1/ε)\delta_{3}\leq O(\delta\log 1/\varepsilon). Then, there is a constant cc and an algorithm such that, given any input wSN,εw\in S_{N,\varepsilon} we have:

(Completeness) If w=ww=w^{\ast}, the algorithm outputs “YES”.

As in the case of learning an unknown mean, by the classical theory of convex optimization this implies that we will find a point ww such that wwεpoly(N)\|w-w^{\prime}\|_{\infty}\leq\frac{\varepsilon}{\operatorname{poly}(N)} for some wCcδw^{\prime}\in C_{c\delta}, using polynomially many calls to this oracle. We make this more precise in the following subsubsection.

The pseudocode for the (approximate) separation oracle is given in Algorithm 4. Observe briefly that this algorithm does indeed run in polynomial time. Lines 2-7 require only taking top eigenvalues and eigenvectors, and so can be done in polynomial time. For any ξ{1,+1}\xi\in\{-1,+1\}, line 8 can be run by sorting the samples by wi(Yi2dd)w_{i}\left(\frac{\|Y_{i}\|^{2}}{\sqrt{d}}-\sqrt{d}\right) and seeing if there is a subset of the top 2εN2\varepsilon N samples satisfying the desired condition, and line 9 can be executed similarly.

We now turn our attention to proving the correctness of this separation oracle. We require the following technical lemmata.

Fix δ<1\delta<1 and suppose that MM is symmetric. If MIFδ\|M-I\|_{F}\geq\delta then M1IFδ2\|M^{-1}-I\|_{F}\geq\frac{\delta}{2}.

We will prove this lemma in the contrapositive, by showing that if M1IF<δ2\|M^{-1}-I\|_{F}<\frac{\delta}{2} then MIF<δ\|M-I\|_{F}<\delta. Since the Frobenius norm is rotationally invariant, we may assume that M1=\mboxdiag(1+ν1,,1+νd)M^{-1}=\mbox{diag}(1+\nu_{1},\ldots,1+\nu_{d}), where by assumption νi2<δ2/4\sum\nu_{i}^{2}<\delta^{2}/4. By our assumption that δ<1\delta<1, we have νi1/2|\nu_{i}|\leq 1/2 for all ii. Thus

where we have used the inequality 111+x2x|1-\frac{1}{1+x}|\leq|2x| which holds for all x1/2|x|\leq 1/2. This completes the proof. ∎

Let N1,,NdN_{1},\ldots,N_{d} be the columns of NN. Then

so the desired result follows by taking square roots of both sides. ∎

By the definition of S\|\cdot\|_{\operatorname{{\mathcal{S}}}}, we have

By self duality of the Frobenius norm, we know that

since ISI^{\flat}\in\operatorname{{\mathcal{S}}}^{\perp}. The result now follows. ∎

Let us first prove completeness. Observe that by Theorem 4.12, we know that restricted to S\operatorname{{\mathcal{S}}}, we have that M4=2IM_{4}=2I. Therefore, by (15) we will not output a hyperplane in line 7. Moreover, by (14), we will not output a hyperplane in line 8. This proves completeness.

Thus it suffices to show soundness. Suppose that w∉Ccδw\not\in\mathcal{C}_{c\delta}. We will make use of the following elementary fact:

Let A=Σ1/2Σ^Σ1/2A=\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2} and B=Σ^1/2ΣΣ^1/2B=\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2}. Then

In particular A1=Σ1/2Σ^1Σ1/2A^{-1}=\Sigma^{1/2}\widehat{\Sigma}^{-1}\Sigma^{1/2}. Using this expression and the fact that all the matrices involved are symmetric, we can write

where in the third line we have used the fact that the trace of a product of matrices is preserved under cyclic shifts. ∎

Assume (13) holds with δ1O(δ)\delta_{1}\leq O(\delta) and assume furthermore that AIFcδ\|A-I\|_{F}\geq c\delta. Then, if we let δ=(1ε)c2δ=Θ(δ)\delta^{\prime}=\frac{(1-\varepsilon)c}{2}\delta=\Theta(\delta), we have

Let A,BA,B be as in Fact 4.28. Combining Lemma 4.25 and Fact 4.28 we have

We can rewrite (13) as the expression iGwiXiXiT=wgΣ1/2(I+R)Σ1/2\sum_{i\in G}w_{i}X_{i}X_{i}^{T}=w_{g}\Sigma^{1/2}(I+R)\Sigma^{1/2} where RR is symmetric and satisfies RFδ1\|R\|_{F}\leq\delta_{1}. By the definition of Σ^\widehat{\Sigma} we have that i=1NwiYiYiT=I\sum_{i=1}^{N}w_{i}Y_{i}Y_{i}^{T}=I, and so

by applying Lemma 4.26. And putting it all together we have

It is easily verified that for c>10c>10, we have that for all δ\delta, if Σ^1/2ΣΣ^1/2IFcδ\|\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2}-I\|_{F}\geq c\delta, then

where δ=c(1ε)2δ=Θ(δ)\delta^{\prime}=\frac{c(1-\varepsilon)}{2}\delta=\Theta(\delta). The desired result then follows from the Pythagorean theorem. ∎

Claim 4.29 tells us that if w∉Ccδ,w\not\in C_{c\delta}, we know that one of the terms in (17) must be at least 12δ\frac{1}{2}\delta^{\prime}. We first show that if the first term is large, then the algorithm outputs a separating hyperplane:

Assume that (13)-(15) hold with δ1,δ2O(δ)\delta_{1},\delta_{2}\leq O(\delta) and δ3O(δlog1/ε)\delta_{3}\leq O(\delta\log 1/\varepsilon). Moreover, suppose that

Then the algorithm outputs a hyperplane in line 7, and moreover, it is a separating hyperplane.

Let us first show that given these conditions, then the algorithm indeed outputs a hyperplane in line 7. Since ISI^{\flat}\in S^{\perp}, the first term is just equal to iEwiZiS\left\|\sum_{i\in E}w_{i}Z_{i}\right\|_{S}. But this implies that there is some MSM^{\flat}\in S such that M2=MF=1\|M^{\flat}\|_{2}=\|M\|_{F}=1 and such that

The wi/wbw_{i}/w_{b} are a set of weights satisfying the conditions of Claim 4.24 and so this implies that

Let Σ~=Σ^1Σ\widetilde{\Sigma}=\widehat{\Sigma}^{-1}\Sigma. By Theorem 4.12 and (15), we have that

since it is easily verified that δΣ~2O(Σ~IF)\delta\|\widetilde{\Sigma}\|^{2}\leq O(\|\widetilde{\Sigma}-I\|_{F}) as long as Σ~IFΩ(δ)\|\widetilde{\Sigma}-I\|_{F}\geq\Omega(\delta), which it is by (17).

Equations 18 and 19 then together imply that

Now let us assume that the first term on the LHS is less than 12δ\frac{1}{2}\delta^{\prime}, such that the algorithm does not necessarily output a hyperplane in line 7. Thus, the second term on the LHS of Equation 16 is at least 12δ\frac{1}{2}\delta^{\prime}. We now show that this implies that this implies that the algorithm will output a separating hyperplane in line 9.

Assume that (13)-(15) hold. Moreover, suppose that

Then the algorithm outputs a hyperplane in line 9, and moreover, it is a separating hyperplane.

By the definition of S\operatorname{{\mathcal{S}}}^{\perp}, the assumption implies that

which is equivalent to the condition that

for some ξ{1,1}\xi\in\{-1,1\}. In particular, the algorithm will output a hyperplane

in Step 9, where SS is some set of size at most εN\varepsilon N, and λ=O(δ)\lambda=O(\delta^{\prime}). Since it will not affect anything, for without loss of generality let us assume that ξ=1\xi=1. The other case is symmetrical.

where AF=O(δNT)\|A\|_{F}=O\left(\delta\frac{N}{|T|}\right). Hence,

as long as δO(δ)\delta^{\prime}\geq O(\delta). By self-duality of the Frobenius norm, using the test matrix 1dI\frac{1}{\sqrt{d}}I, this implies that

These two claims in conjunction directly imply the correctness of the theorem. ∎

As before, this separation oracle and the classical theory of convex optimization [GLS88] shows that we have demonstrated an algorithm FindApproxCovariance with the following properties:

Fix ε,τ>0\varepsilon,\tau>0, and let δ=O(εlog1/ε)\delta=O(\varepsilon\log 1/\varepsilon). Let c>0c>0 be a universal constant which is sufficiently large. Let X1,,XNX_{1},\ldots,X_{N} be an ε\varepsilon-corrupted set of points satisfying (13-(15), for δ1,δ2O(δ)\delta_{1},\delta_{2}\leq O(\delta) and δ3O(δlog1/ε)\delta_{3}\leq O(\delta\log 1/\varepsilon). Then \textscFindApproxCovariance(ε,τ,X1,,XN)\textsc{FindApproxCovariance}(\varepsilon,\tau,X_{1},\ldots,X_{N}) runs in time poly(N,d,1/ε,log1/τ)\operatorname{poly}(N,d,1/\varepsilon,\log 1/\tau), and outputs a uu such that there is some wCcδw\in C_{c\delta} such that wuε/(Ndlog(N/τ))\|w-u\|_{\infty}\leq\varepsilon/(Nd\log(N/\tau)).

As before, this is not quite sufficient to actually recover the covariance robustly. Naively, we would just like to output i=1NuiXiXiT\sum_{i=1}^{N}u_{i}X_{i}X_{i}^{T}. However, this can run into issues if there are points XiX_{i} such that Σ1/2Xi2\|\Sigma^{-1/2}X_{i}\|_{2} is extremely large. We show here that we can postprocess the uu such that we can weed out these points. First, observe that we have the following lemma:

Assume X1,,XNX_{1},\ldots,X_{N} satisfy (13). Let wSN,εw\in S_{N,\varepsilon}. Then

This follows since by (13), we have that iGwiXiXiTwg(1δ1)Σ(1O(δ1))Σ\sum_{i\in G}w_{i}X_{i}X_{i}^{T}\succeq w_{g}(1-\delta_{1})\Sigma\succeq(1-O(\delta_{1}))\Sigma. The lemma then follows since iEwiXiXiT0\sum_{i\in E}w_{i}X_{i}X_{i}^{T}\succeq 0 always. ∎

Let uu be such that there is wCcδw\in C_{c\delta} such that uwε/(Ndlog(N/τ))\|u-w\|_{\infty}\leq\varepsilon/(Nd\log(N/\tau)). Then, i=1NuiXiXiT(1+O(δ))Σ\sum_{i=1}^{N}u^{-}_{i}X_{i}X_{i}^{T}\preceq(1+O(\delta))\Sigma.

By the definition of CcδC_{c\delta}, we must have that i=1NwiXiXiT(1+cδ)Σ\sum_{i=1}^{N}w_{i}X_{i}X_{i}^{T}\preceq(1+c\delta)\Sigma. Moreover, we must have u~iwi\widetilde{u}^{-}_{i}\leq w_{i} for every index i[N]i\in[N]. Thus we have that i=1Nu~iwiXiXiT(1+cδ)Σ\sum_{i=1}^{N}\widetilde{u}^{-}_{i}w_{i}X_{i}X_{i}^{T}\preceq(1+c\delta)\Sigma, and hence i=1NuiwiXiXiT(1+cδ)Σ\sum_{i=1}^{N}u^{-}_{i}w_{i}X_{i}X_{i}^{T}\preceq(1+c\delta)\Sigma, since i=1NuiwiXiXiT(1+O(ε))i=1Nu~iwiXiXiT\sum_{i=1}^{N}u^{-}_{i}w_{i}X_{i}X_{i}^{T}\preceq(1+O(\varepsilon))\sum_{i=1}^{N}\widetilde{u}^{-}_{i}w_{i}X_{i}X_{i}^{T}. ∎

We now give the full algorithm. The algorithm proceeds as follows: first run FindApproxCovariance to get some set of weights uu which is close to some element of CcδC_{c\delta}. We then compute the empirical covariance Σ1=i=1NuiXiXiT\Sigma_{1}=\sum_{i=1}^{N}u_{i}X_{i}X_{i}^{T} with the weights uu, and remove any points which have Σ11/2Xi22\|\Sigma_{1}^{-1/2}X_{i}\|_{2}^{2} which are too large. We shall show that this removes no good points, and removes all corrupted points which have Σ1/2Xi22\|\Sigma^{-1/2}X_{i}\|_{2}^{2} which are absurdly large. We then rerun FindApproxCovariance with this pruned set of points, and output the empirical covariance with the output of this second run. Formally, we give the pseudocode for the algorithm in Algorithm 5.

We now show that this algorithm is correct.

Let 1/2ε>01/2\geq\varepsilon>0, and τ>0\tau>0. Let δ=O(εlog1/ε)\delta=O(\varepsilon\log 1/\varepsilon). Let X1,,XNX_{1},\ldots,X_{N} be a ε\varepsilon-corrupted set of samples from N(0,Σ)\operatorname{\mathcal{N}}(0,\Sigma) where

Let Σ^\widehat{\Sigma} be the output of \textscLearnCovariance(ε,τ,X1,,XN)\textsc{LearnCovariance}(\varepsilon,\tau,X_{1},\ldots,X_{N}). Then with probability 1τ1-\tau, Σ1/2Σ^Σ1/2IFO(δ)\|\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2}-I\|_{F}\leq O(\delta).

We first condition on the event that we satisfy (12)-(15) with δ1,δ2O(δ)\delta_{1},\delta_{2}\leq O(\delta) and δ3O(δlog1/ε)\delta_{3}\leq O(\delta\log 1/\varepsilon). By our choice of NN, Fact 4.6, Corollary 4.7, Corollary 4.9, and Theorem 4.13, and a union bound, we know that this event happens with probability 1τ1-\tau.

By Theorem 4.32, Lemma 4.33, and Lemma 4.34, we have that since ε\varepsilon is sufficiently small,

In particular, this implies that for every vector XiX_{i}, we have

Therefore, by (12), we know that in line 6, we never throw out any uncorrupted points, and moreover, if XiX_{i} is corrupted with Σ1/2Xi22Ω(dlogN/τ)\|\Sigma^{-1/2}X_{i}\|_{2}^{2}\geq\Omega(d\log N/\tau), then it is thrown out. Thus, let SS^{\prime} be the set of pruned points. Because no uncorrupted point is thrown out, we have that S(12ε)N|S^{\prime}|\geq(1-2\varepsilon)N, and moreover, this set of points still satisfies (13)-(15)Technically, the samples satisfy a slightly different set of conditions since we may have thrown out some corrupted points, and so in particular the number of samples may have changed, but the meaning should be clear. and moreover, for ever iSi\in S^{\prime}, we have Σ1/2Xi22O(dlogN/τ)\|\Sigma^{-1/2}X_{i}\|_{2}^{2}\leq O(d\log N/\tau). Therefore, by Theorem 4.32, we have that there is some uCcIu^{\prime\prime}\in C_{c|I|} such that uu<ε/(Ndlog(N/τ))\|u^{\prime}-u^{\prime\prime}\|_{\infty}<\varepsilon/(Nd\log(N/\tau)). But now if Σ^=iIuiXiXiT\widehat{\Sigma}=\sum_{i\in|I|}u^{\prime}_{i}X_{i}X_{i}^{T}, we have

5 Learning an Arbitrary Gaussian Agnostically

We have shown how to agnostically learn the mean of a Gaussian with known covariance, and we have shown how to agnostically learn the covariance of a mean zero Gaussian. In this section, we show how to use these two in conjunction to agnostically learn an arbitrary Gaussian. Throughout, let X1,,XNX_{1},\ldots,X_{N} be an ε\varepsilon-corrupted set of samples from N(μ,Σ)\operatorname{\mathcal{N}}(\mu,\Sigma), where both μ\mu and Σ\Sigma are unknown. We will set

We first show a simple trick which, at the price of doubling the amount of error, allows us to assume that the mean is zero, without changing the covariance. We do so as follows: for each i=1,,N/2i=1,\ldots,N/2, let Xi=(XiXN/2+i)/2X_{i}^{\prime}=(X_{i}-X_{N/2+i})/\sqrt{2}. Observe that if both XiX_{i} and XN/2+iX_{N/2+i} are uncorrupted, then XiN(0,Σ)X_{i}^{\prime}\sim\operatorname{\mathcal{N}}(0,\Sigma). Moreover, observe that XiX_{i}^{\prime} is corrupted only if either XiX_{i} or XN/2+iX_{N/2+i} is corrupted. Then we see that if X1,,XNX_{1},\ldots,X_{N} is ε\varepsilon-corrupted, then the X1,,XN/2X^{\prime}_{1},\ldots,X^{\prime}_{N/2} is a N/2N/2-sized set of samples which is 2ε2\varepsilon-corrupted. Thus, by using the results from Section 4.4, with probability 1τ1-\tau, we can recover a Σ^\widehat{\Sigma} such that

which in particular by Corollary 2.14, implies that

5.2 From Unknown Mean, Approximate Covariance, to Approximate Recovery

For each XiX_{i}, let Xi=Σ^1/2XiX^{\prime\prime}_{i}=\widehat{\Sigma}^{-1/2}X_{i}. Then, for XiX_{i} which is not corrupted, we have that XiN(Σ^1/2μ,Σ1)X_{i}^{\prime\prime}\sim\operatorname{\mathcal{N}}(\widehat{\Sigma}^{-1/2}\mu,\Sigma_{1}), where Σ1=Σ^1/2ΣΣ^1/2\Sigma_{1}=\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2}. By Corollary 2.14 and Lemma 4.25, if (20) holds, then we have

By Claim 2.5, this means that if (20) holds, the uncorrupted set of XiX_{i}^{\prime\prime} can be treated as an O(εlog1/ε)O(\varepsilon\log 1/\varepsilon)-corrupted set of samples from N(Σ^1/2μ,I)\operatorname{\mathcal{N}}(\widehat{\Sigma}^{-1/2}\mu,I). Thus, if (20) holds, the entire set of samples X1,,XmX^{\prime\prime}_{1},\ldots,X^{\prime\prime}_{m} is a O(εlog1/ε)O(\varepsilon\log 1/\varepsilon)-corrupted set of samples from N(Σ^1/2μ,I)\operatorname{\mathcal{N}}(\widehat{\Sigma}^{-1/2}\mu,I). Then, by using results from Section 4.3, with probability 1τ1-\tau, assuming that 20 holds, we can recover a μ^\widehat{\mu} such that μ^Σ^1/2μ2O(εlog3/2(1/ε))\|\widehat{\mu}-\widehat{\Sigma}^{-1/2}\mu\|_{2}\leq O(\varepsilon\log^{3/2}(1/\varepsilon)). Thus, by Corollary 2.13, this implies that

which in conjunction with (21), implies that

and thus by following this procedure, whose formal pseudocode is given in Algorithm 6, we have shown the following:

Fix ε,τ>0\varepsilon,\tau>0. Let X1,,XNX_{1},\ldots,X_{N} be an ε\varepsilon-corrupted set of samples from N(μ,Σ)\operatorname{\mathcal{N}}(\mu,\Sigma), where μ,Σ\mu,\Sigma are both unknown, and

There is a polynomial-time algorithm \textscRecoverRobustGaussian(ε,τ,X1,,XN)\textsc{RecoverRobustGaussian}(\varepsilon,\tau,X_{1},\ldots,X_{N}) which with probability 1τ1-\tau, outputs a Σ^,μ^\widehat{\Sigma},\widehat{\mu} such that

Agnostically Learning a Gaussian, via Filters

In this section, we use our filter technique to give an agnostic learning algorithm for an unknown mean Gaussian with known covariance matrix. More specifically, we prove:

Notation. We will denote μS=1SXSX\mu^{S}=\frac{1}{|S|}\sum_{X\in S}X and MS=1SXS(XμG)(XμG)TM_{S}=\frac{1}{|S|}\sum_{X\in S}(X-\mu^{G})(X-\mu^{G})^{T} for the sample mean and modified sample covariance matrix of the set SS.

We start by defining our notion of good sample, i.e, a set of conditions on the uncorrupted set of samples under which our algorithm will succeed.

For all xSx\in S we have xμG2O(dlog(S/τ))\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log(|S|/\tau)}).

We have that μSμG2ε.\|\mu^{S}-\mu^{G}\|_{2}\leq\varepsilon.

We have that MSI2ε.\left\|M_{S}-I\right\|_{2}\leq{\varepsilon}.

We show in Appendix B that a sufficiently large set of independent samples from GG is (ε,τ)(\varepsilon,\tau)-good (with respect to GG) with high probability. Specifically, we prove:

Let GG be a Gaussian distribution with identity covariance, and ε,τ>0.\varepsilon,\tau>0. If the multiset SS is obtained by taking Ω((d/ε2)polylog(d/ετ))\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau)) independent samples from G,G, it is (ε,τ)(\varepsilon,\tau)-good with respect to GG with probability at least 1τ.1-\tau.

We require the following definition that quantifies the extent to which a multiset has been corrupted:

Given finite multisets SS and SS^{\prime} we let Δ(S,S)\Delta(S,S^{\prime}) be the size of the symmetric difference of SS and SS^{\prime} divided by the cardinality of S.S.

As in the convex program case, we will first use NaivePrune to remove points which are far from the mean. Then, we iterate the algorithm whose performance guarantee is given by the following:

A mean vector μ^\widehat{\mu} such that μ^μG2=O(εlog(1/ε)).\|\widehat{\mu}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).

We start by showing how Theorem 5.1 follows easily from Proposition 5.5.

By the definition of Δ(S,S),\Delta(S,S^{\prime}), since SS^{\prime} has been obtained from SS by corrupting an ε\varepsilon-fraction of the points in S,S, we have that Δ(S,S)2ε.\Delta(S,S^{\prime})\leq 2\varepsilon. By Lemma 5.3, the set SS of uncorrupted samples is (ε,τ)(\varepsilon,\tau)-good with respect to GG with probability at least 1τ.1-\tau. We henceforth condition on this event.

Since SS is (ε,τ)(\varepsilon,\tau)-good, all xSx\in S have xμG2O(dlogS/τ)\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log|S|/\tau}). Thus, the NaivePrune procedure does not remove from SS^{\prime} any member of SS. Hence, its output, SS^{\prime\prime}, has Δ(S,S)Δ(S,S)\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime}) and for any xSx\in S^{\prime\prime}, there is a ySy\in S with xy2O(dlogS/τ)\|x-y\|_{2}\leq O(\sqrt{d\log|S|/\tau}). By the triangle inequality, for any x,zSx,z\in S^{\prime\prime}, xz2O(dlogS/τ)=O(dlog(d/ετ))\|x-z\|_{2}\leq O(\sqrt{d\log|S|/\tau})=O(\sqrt{d\log(d/\varepsilon\tau})).

Then, we iteratively apply the Filter-Gaussian-Unknown-Mean procedure of Proposition 5.5 until it terminates returning a mean vector μ\mu with μ^μG2=O(εlog(1/ε)).\|\widehat{\mu}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}). We claim that we need at most O(α)O(\alpha) iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets Si,S_{i}^{\prime}, such that Δ(S,Si)Δ(S,S)iε/α.\Delta(S,S_{i}^{\prime})\leq\Delta(S,S^{\prime})-i\cdot\varepsilon/{\alpha}. Thus, if we do not output the empirical mean in the first 2α2\alpha iterations, in the next iteration there are no outliers left. Hence in the next iteration it is impossible for the algorithm to output a subset satisfying Condition (ii) of Proposition 5.5, so it must output a mean vector satisfying (i), as desired. ∎

In this subsection, we describe the efficient algorithm establishing Proposition 5.5 and prove its correctness. Our algorithm calculates the empirical mean vector μS\mu^{S^{\prime}} and empirical covariance matrix Σ\Sigma. If the matrix Σ\Sigma has no large eigenvalues, it returns μS.\mu^{S^{\prime}}. Otherwise, it uses the eigenvector vv^{\ast} corresponding to the maximum magnitude eigenvalue of Σ\Sigma and the mean vector μS\mu^{S^{\prime}} to define a filter. Our efficient filtering procedure is presented in detailed pseudocode below.

1.2 Proof of Correctness of Filter-Gaussian-Unknown-Mean

We define μG,μS,μS,μL,\mu^{G},\mu^{S},\mu^{S^{\prime}},\mu^{L}, and μE\mu^{E} to be the means of G,S,S,L,G,S,S^{\prime},L, and EE, respectively.

Our analysis will make essential use of the following matrices:

Our analysis will hinge on proving the important claim that ΣI\Sigma-I is approximately (E/S)ME(|E|/|S^{\prime}|)M_{E}. This means two things for us. First, it means that if the positive errors align in some direction (causing MEM_{E} to have a large eigenvalue), there will be a large eigenvalue in ΣI\Sigma-I. Second, it says that any large eigenvalue of ΣI\Sigma-I will correspond to an eigenvalue of MEM_{E}, which will give an explicit direction in which many error points are far from the empirical mean.

Useful Structural Lemmas. We will use the following simple fact about the concentration of Gaussian random variables:

We begin by noting that we have concentration bounds on GG and therefore, on SS due to its goodness.

The first line is Fact 5.6, and the former follows from it using the goodness of SS. ∎

By using the above fact, we obtain the following simple claim:

This follows from Fact 5.7 upon noting that w(XμS)>T+μSμG2|w\cdot(X-\mu^{S^{\prime}})|>T+\|\mu^{S^{\prime}}-\mu^{G}\|_{2} only if w(XμG)>T|w\cdot(X-\mu^{G})|>T. ∎

We can use the above facts to prove concentration bounds for LL. In particular, we have the following lemma:

We have that ML2=O(log(S/L)+εS/L).\|M_{L}\|_{2}={O\left(\log(|S|/|L|)+\varepsilon|S|/|L|\right)}.

For unit vectors vv, the RHS is bounded from above as follows:

where the second line follows from the fact that v2=1\|v\|_{2}=1, LSL\subset S, and SS satisfies condition (i) of Definition 5.2; the third line follows from (22); and the fourth line follows from Fact 5.7. ∎

As a corollary, we can relate the matrices MSM_{S^{\prime}} and MEM_{E}, in spectral norm:

We have that MSI=(E/S)ME+O(εlog(1/ε))M_{S^{\prime}}-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon)), where the O(εlog(1/ε))O(\varepsilon\log(1/\varepsilon)) term denotes a matrix of spectral norm O(εlog(1/ε))O(\varepsilon\log(1/\varepsilon)).

By definition, we have that SMS=SMSLML+EME|S^{\prime}|M_{S^{\prime}}=|S|M_{S}-|L|M_{L}+|E|M_{E}. Thus, we can write

where the second line uses the fact that 12εS/S1+2ε1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon, the goodness of SS (condition (iv) in Definition 5.2), and Lemma 5.9. Specifically, Lemma 5.9 implies that (L/S)ML2=O(εlog(1/ε))(|L|/|S^{\prime}|)\|M_{L}\|_{2}=O(\varepsilon\log(1/\varepsilon)). Therefore, we have that

We now establish a similarly useful bound on the difference between the mean vectors:

Since SS is a good set, by condition (iii) of Definition 5.2, we have μSμG2=O(ε)\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon). Since 12εS/S1+2ε1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon, it follows that (S/S)μSμG2=O(ε)(|S|/|S^{\prime}|)\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon). Using the valid inequality ML2μLμG22\|M_{L}\|_{2}\geq\|\mu^{L}-\mu^{G}\|_{2}^{2} and Lemma 5.9, we obtain that μLμG2O(log(S/L)+εS/L)\|\mu^{L}-\mu^{G}\|_{2}\leq O\left(\sqrt{\log(|S|/|L|)}+\sqrt{\varepsilon|S|/|L|}\right). Therefore,

as desired. This completes the proof of the lemma. ∎

By combining the above, we can conclude that ΣI\Sigma-I is approximately proportional to MEM_{E}. More formally, we obtain the following corollary:

We have ΣI=(E/S)ME+O(εlog(1/ε))+O(E/S)2ME2\Sigma-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon))+O(|E|/|S^{\prime}|)^{2}\|M_{E}\|_{2}, where the additive terms denote matrices of appropriately bounded spectral norm.

By definition, we can write ΣI=MSI(μSμG)(μSμG)T.\Sigma-I=M_{S^{\prime}}-I-(\mu^{S^{\prime}}-\mu^{G})(\mu^{S^{\prime}}-\mu^{G})^{T}. Using Corollary 5.10 and Lemma 5.11, we obtain:

where the second line follows from the valid inequality ME2μEμG22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}. This completes the proof. ∎

On the other hand, since ME2μEμG22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}, Lemma 5.11 gives that

Case of Large Spectral Norm. We next show the correctness of the algorithm when it returns a filter in Step 7.

Moreover, using the inequality ME2μEμG22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2} and Lemma 5.11 as above, we get that

Suppose for the sake of contradiction that for all T>0T>0 we have that

Using (24), we obtain that for all T>0T>0 we have that

for some universal constant CC^{\prime\prime}.

We now have the following sequence of inequalities:

Combined with (23), we obtain λ=O(εlog(1/ε))\lambda^{\ast}=O(\varepsilon\log(1/\varepsilon)), which is a contradiction if CC is sufficiently large. Therefore, it must be the case that for some value of TT the condition in Step 7 is satisfied.

Recall that S=(SL)E,S^{\prime}=(S\setminus L)\cup E, with EE and LL disjoint multisets such that LS.L\subset S. We can similarly write S=(SL)E,S^{\prime\prime}=(S\setminus L^{\prime})\cup E^{\prime}, with LLL^{\prime}\supseteq L and EE.E^{\prime}\subset E. Since

it suffices to show that EELL+εS/α.|E\setminus E^{\prime}|\geq|L^{\prime}\setminus L|+\varepsilon|S|/{\alpha}. Note that LL|L^{\prime}\setminus L| is the number of points rejected by the filter that lie in SS.S\cap S^{\prime}. Note that the fraction of elements of SS that are removed to produce SS^{\prime\prime} (i.e., satisfy v(xμS)>T+δ|v^{\ast}\cdot(x-\mu^{S^{\prime}})|>T+\delta) is at most 2exp(T2/2)+ε/α2\exp(-T^{2}/2)+\varepsilon/{\alpha}. This follows from Claim 5.8 and the fact that T=O(dlog(d/ετ))T=O(\sqrt{d\log(d/\varepsilon\tau)}).

Hence, it holds that LL(2exp(T2/2)+ε/α)S.|L^{\prime}\setminus L|\leq(2\exp(-T^{2}/2)+\varepsilon/{\alpha})|S|. On the other hand, Step 7 of the algorithm ensures that the fraction of elements of SS^{\prime} that are rejected by the filter is at least 8exp(T2/2)+8ε/α)8\exp(-T^{2}/2)+8\varepsilon/{\alpha}). Note that EE|E\setminus E^{\prime}| is the number of points rejected by the filter that lie in SS.S^{\prime}\setminus S. Therefore, we can write:

where the second line uses the fact that SS/2|S^{\prime}|\geq|S|/2 and the last line uses the fact that LL/S2exp(T2/2)+ε/α.|L^{\prime}\setminus L|/|S|\leq 2\exp(-T^{2}/2)+\varepsilon/{\alpha}. Noting that log(d/ετ)1\log(d/\varepsilon\tau)\geq 1, this completes the proof of the claim. ∎

2 Learning a Gaussian With Unknown Covariance

In this subsection, we use our filter technique to agnostically learn a Gaussian with zero mean vector and unknown covariance. By combining the algorithms of the current and the previous subsections, as in our convex programming approach (Section 4.5), we obtain a filter-based algorithm to agnostically learn an arbitrary unknown Gaussian.

The main result of this subsection is the following theorem:

Let GN(0,Σ)G\sim\operatorname{\mathcal{N}}(0,\Sigma) be a Gaussian in dd dimensions with mean and unknown covariance, and let ε,τ>0\varepsilon,\tau>0. Let SS be an ε\varepsilon-corrupted set of samples from GG of size Ω((d2/ε2)polylog(d/ετ))\Omega((d^{2}/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon{\tau})). There exists an efficient algorithm that, given SS and ε\varepsilon, returns the parameters of a Gaussian distribution GN(0,Σ^)G^{\prime}\sim\operatorname{\mathcal{N}}(0,\widehat{\Sigma}) such that with probability at least 1τ{1-\tau}, it holds IΣ1/2Σ^Σ1/2F=O(εlog(1/ε)).\|I-\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2}\|_{F}=O(\varepsilon\log(1/\varepsilon)).

As in the previous subsection, we will need a condition on SS under which our algorithm will succeed.

For all xSx\in S, xTΣ1x<O(dlog(S/τ))x^{T}\Sigma^{-1}x<O(d\log(|S|/\tau)).

We have that Σ1/2\mboxCov(S)Σ1/2IF=O(ε)\|\Sigma^{-1/2}\mbox{Cov}(S)\Sigma^{-1/2}-I\|_{F}=O(\varepsilon).

For all even degree-22 polynomials pp, we have that Var(p(S))=Var(p(G))(1+O(ε))\operatorname*{Var}(p(S))=\operatorname*{Var}(p(G))(1+O(\varepsilon)).

Let us first note some basic properties of such polynomials on a normal distribution. The proof of this lemma is deferred to Section B.

Var[p(X)]=2P2F2\operatorname*{Var}[p(X)]=2\|P_{2}\|_{F}^{2} and

For all δ>0\delta>0, Pr(p(X)δ2)O(δ).\Pr(|p(X)|\leq\delta^{2})\leq O(\delta).

We note that, if SS is obtained by taking random samples from GG, then SS is good with high probability. The proof of this lemma is also deferred to Section B.

Let GG be a dd-dimensional Gaussian with mean and let ε,τ>0\varepsilon,\tau>0. Let NN be a sufficiently large constant multiple of d2log5(d/ετ)/ε2d^{2}\log^{5}(d/\varepsilon\tau)/\varepsilon^{2}. Then a set SS of NN independent samples from GG is (ε,τ)(\varepsilon,\tau)-good with respect to GG with probability at least 1τ1-\tau.

As in Definition 5.4, Δ(S,S)\Delta(S,S^{\prime}) is the size of the symmetric difference of SS and SS^{\prime} divided by S|S|.

The basic thrust of our algorithm is as follows: By Lemma 5.17, with high probability we have that SS is (ε,τ)(\varepsilon,\tau)-good with respect to GG. The algorithm is then handed a new set SS^{\prime} such that Δ(S,S)2εS\Delta(S,S^{\prime})\leq 2\varepsilon|S|. The algorithm will run in stages. In each stage, the algorithm will either output GG^{\prime} or will return a new set SS^{\prime\prime} such that Δ(S,S)<Δ(S,S)\Delta(S,S^{\prime\prime})<\Delta(S,S^{\prime}). In the latter case, the algorithm will recurse on SS^{\prime\prime}. We formalize this idea below:

Given Proposition 5.18, the proof of Theorem 5.14 is straightforward. By Lemma 5.17 the original set SS is (ε,τ)(\varepsilon,\tau)-good with respect to GG with probability at least 1τ1-\tau. Then, SS^{\prime} satisfies the hypotheses of Proposition 5.18. We then repeatedly iterate the algorithm from Proposition 5.18 until it outputs a distribution GG^{\prime} close to GG. This must eventually happen because at every step the distance between SS and the set returned by the algorithm decreases by at least 11.

The function Find-max-poly uses similar notation to SeparationOracleUnknownCovariance, such that Filter-Gaussian-Unknown-Covariance and SeparationOracleUnknownCovariance can be more easily compared.

Let us first show that Find-max-poly is correct.

Algorithm Find-max-poly is correct and Filter-Gaussian-Unknown-Covariance runs time poly(dlogτ/ε).\operatorname{poly}(d\log\tau/\varepsilon).

Note that since the covariance matrix of SS^{\prime} is Σ  ,\Sigma^{\prime}\;, we have

We let TT^{\prime} be the multiset of y=Σ1/2xy=\Sigma^{-1/2}x for xSx\in S^{\prime} and UU^{\prime} the multiset of z=y2z=y^{\otimes 2} for yy in TT^{\prime}. Recall that P2=2vP_{2}^{\flat}=\sqrt{2}v. We thus have

Thus, the p(x)p(x) that maximizes QS(p)Q_{S^{\prime}}(p) is given by the unit vector vv that maximizes vTTSvv^{T}T_{S^{\prime}}v subject to vv^{\sharp} being symmetric.

Let v=vTv^{\prime}=v^{\sharp T\flat}. Note that vTTSv=vTTSvv^{T}T_{S^{\prime}}v=v^{\prime T}T_{S^{\prime}}v^{\prime} by symmetries of TST_{S^{\prime}}. Thus, by linearity, v=v/2+v/2v^{\prime\prime}=v/2+v^{\prime}/2 also has vTTSv=vTTSvv^{\prime\prime T}T_{S^{\prime}}v^{\prime\prime}=v^{T}T_{S^{\prime}}v. However, if vv^{\sharp} is not symmetric, vv^{\prime\prime} has v2<1\|v^{\prime\prime}\|_{2}<1. Thus, the unit vector v/v2v^{\prime\prime}/\|v^{\prime\prime}\|_{2} achieves a higher value of the bilinear form. Consequently, vv^{\ast\sharp} is symmetric.

To prove Lemma 5.20, we will require the following:

This holds essentially because the distribution of p(X)p(X) for XSX\in S is close to that for p(G)p(G), which has rapidly decaying tails. Therefore, throwing away an ε\varepsilon-fraction of the mass cannot change the value of the variance by very much. In particular, we have that

Note that, since the matrix inner product is an inner product,

Let p(x)p(x) denote the quadratic polynomial

As a corollary of this we note that Σ\Sigma^{\prime} cannot be too much smaller than Σ\Sigma.

The first step in verifying correctness is to note that if our algorithm returns on Step 6 that it does so correctly.

If our algorithm returns on Step 6, then Δ(S,S)<Δ(S,S).\Delta(S,S^{\prime\prime})<\Delta(S,S^{\prime}).

This is clearly true if we can show that all xx removed have x∉Sx\not\in S. However, this follows because (Σ)12Σ1(\Sigma^{\prime})^{-1}\leq 2\Sigma^{-1}, and therefore, by (ε,τ)(\varepsilon,\tau)-goodness, all xSx\in S satisfy

By Corollary 2.14, it suffices to show that

Therefore, we will have an appropriate bound unless IΣ1/2ΣEΣ1/2F=Ω(log(1/ε)).\|I-\Sigma^{-1/2}\Sigma_{E}\Sigma^{-1/2}\|_{F}=\Omega(\log(1/\varepsilon)).

Next, note that there is a matrix MM with MF=1\|M\|_{F}=1 such that

Indeed we can take M=(IΣ1/2ΣEΣ1/2)/IΣ1/2ΣEΣ1/2FM=(I-\Sigma^{-1/2}\Sigma_{E}\Sigma^{-1/2})/\|I-\Sigma^{-1/2}\Sigma_{E}\Sigma^{-1/2}\|_{F}. Thus, there is a symmetric MM such that this holds.

This shows that if the algorithm returns in this step, it does so correctly. ∎

Next, we need to show that if the algorithm reaches Step 13 that such a TT exists.

If the algorithm reaches Step 13, then there exists a T>1T>1 such that

Before we begin, we will need the following critical Lemma:

We note that since VarXG(p(G))=QG(p)=1\operatorname*{Var}_{X\sim G^{\prime}}(p(G^{\prime}))=Q_{G^{\prime}}(p)=1, we just need to show that the variance with respect to GG instead of GG^{\prime} is not too much larger. This will essentially be because the covariance matrix of GG cannot be much bigger than the covariance matrix of GG^{\prime} by Corollary 5.22.

We claim that if A,BA,B are matrices, then ABFA2BF\|AB\|_{F}\leq\|A\|_{2}\|B\|_{F}. If BjB_{j} are the columns of BB, then we have ABF2=jABj22A22jBj22=(A2BF)2\|AB\|_{F}^{2}=\sum_{j}\|AB_{j}\|_{2}^{2}\leq\|A\|_{2}^{2}\sum_{j}\|B_{j}\|_{2}^{2}=(\|A\|_{2}\|B\|_{F})^{2}. Similarly for rows, we have ABFAFB2\|AB\|_{F}\leq\|A\|_{F}\|B\|_{2}.

Next, we need to consider μ\mu. In particular, we note that by the similarity of SS and SS^{\prime}, μ\mu must be between the 4040 and 6060 percentiles of values of p(X)p(X) for XSX\in S. However, since SS is (ε,τ)(\varepsilon,\tau)-good, this must be between the 3030 and 7070 percentiles of p(G)p(G). Therefore, by Cantelli’ s inequality,

In particular, we have that VarXuS(p(X))=QS(p)1+Cεln2(1/ε)\operatorname*{Var}_{X\in_{u}S^{\prime}}(p(X))=Q_{S^{\prime}}(p)\geq 1+C\varepsilon\ln^{2}(1/\varepsilon), which means that

Hence, for CC sufficiently large, it must be the case that

For a sufficiently large CC, this contradicts Equation (28). ∎

Finally, we need to verify that if our algorithm returns output in Step 14, that it is correct.

If the algorithm returns during Step 14, then Δ(S,S)Δ(S,S)ε/(dlog(N/τ))2.\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-\varepsilon/(d\log(N/\tau))^{2}.

We note that it is sufficient to show that ES>(SL)S|E\setminus S^{\prime\prime}|>|(S\setminus L)\setminus S^{\prime\prime}|. In particular, it suffices to show that

On the other hand, using (27) and the ε\varepsilon-goodness of SS, we have that

Agnostically Learning a Mixture of Spherical Gaussians, via Convex Programming

In this section, we give an algorithm to agnostically learn a mixture of kk Gaussians with identical spherical covariance matrices up to error O~(poly(k)ε)\widetilde{O}(\operatorname{poly}(k)\cdot\sqrt{\varepsilon}). Let M=j[k]αjN(μj,σ2I)\mathcal{M}=\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\sigma^{2}I) be the unknown kk-GMM each of whose components are spherical. For XMX\sim\mathcal{M}, we write XjMX\sim_{j}\mathcal{M} if XX was drawn from the jjth component of M\mathcal{M}.

Our main result of this section is the following theorem:

There is an algorithm which with probability 1τ1-\tau, outputs a distribution M\mathcal{M}^{\prime} such that

The running time of the algorithm is poly(d,1/ε,log(1/τ))k2\operatorname{poly}(d,1/\varepsilon,\log(1/\tau))^{k^{2}}.

Our overall approach will be a combination of our method for agnostically learning a single Gaussian and recent work on properly learning mixtures of multivariate spherical Gaussians [SOAJ14, LS17]. At a high level, this recent work relies upon the empirical covariance matrix giving an accurate estimate of the overall covariance matrix in order to locate the subspace in which the component mean lie. However, as we have observed already, the empirical moments do not necessarily give good approximations of the true moments in the agnostic setting. Therefore, we will use our separation oracle framework to approximate the covariance matrix, and the rest of the arguments follow similarly as previous methods.

The organization of this section will be as follows. We define some of the notation we will be using and the Schatten top-kk norm in Section 6.1. Section 6.2 states the various concentration inequalities we require. In Section 6.3, we go over our overall algorithm in more detail. Section 6.4 describes a first naive clustering step, which deals with components which are very well separated. Section 6.5 contains details on our separation oracle approach, allowing us to approximate the true covariance. Section 6.6 describes our spectral clustering approach to cluster components with means separated more than Ωk(log1/ε)\Omega_{k}(\log 1/\varepsilon). In Section 6.7, we describe how to exhaustively search over a particular subspace to obtain a good estimate for the component means. In Section 6.8, we go over how to limit the set of hypotheses in order to satisfy the conditions of Lemma 2.23. For clarity of exposition, all of the above describe the algorithm assuming all σj2\sigma_{j}^{2} are equal. In Section 6.9, we discuss the changes to algorithm which are required to handle unequal variances.

For conciseness, many of the proofs are deferred to Section C.

Recall the definition of SN,εS_{N,\varepsilon} from Section 4.1, which we will use extensively in this section. We will use the notation μ=j[k]αjμj\mu=\sum_{j\in[k]}\alpha_{j}\mu_{j} to denote the mean of the unknown GMM. Also, we define parameters γj=αjμjμ22\gamma_{j}=\alpha_{j}\|\mu_{j}-\mu\|_{2}^{2} and let γ=maxjγj\gamma=\max_{j}\gamma_{j}. And for ease of notation, let

to denote the covariance of the unknown GMM. Our algorithm for learning spherical kk-GMMs will rely heavily on the following, non-standard norm:

i.e., it is the sum of the top-kk singular values of MM.

It is easily verified that Tk\|\cdot\|_{T_{k}} has a dual characterization

where the maxima is taken over all XX with orthonormal columns. From this, it is easy to see that the Schatten top-kk norm is indeed a norm, as its name suggests:

MTk\|M\|_{T_{k}} is a norm on symmetric matrices.

2 Concentration Inequalities

In this section, we will establish some concentration inequalities that we will need for our algorithm for agnostically learning mixtures of spherical Gaussians. Recall the notation as described in Section 6.1. The following two concentration lemmata follow from the same proofs as for Lemmata 42 and 44 in [LS17].

Fix ε,δ>0\varepsilon,\delta>0. If Y1,,YNY_{1},\dots,Y_{N} are independent samples from the GMM with PDF j[k]αjN(μj,Σj)\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\Sigma_{j}) where αjΩ(ε)\alpha_{j}\geq\Omega(\varepsilon) for all jj, and N=Ω(d+log(k/δ)ε2)N=\Omega\left(\frac{d+\log{(k/\delta)}}{\varepsilon^{2}}\right) then with probability at least 1O(δ)1-O(\delta),

where QQ is defined as in equation (29).

Fix ε,δ>0\varepsilon,\delta>0. If Y1,,YNY_{1},\dots,Y_{N} are independent samples from the GMM with PDF j[k]αjN(μj,Σj)\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\Sigma_{j}) where αjΩ(ε)\alpha_{j}\geq\Omega(\varepsilon) for all jj, and N=Ω(d+log(k/δ)ε2)N=\Omega\left(\frac{d+\log{(k/\delta)}}{\varepsilon^{2}}\right) then with probability at least 1O(δ)1-O(\delta),

From the same techniques as before, we get the same sort of union bounds as usual over the weight vectors:

Fix ε1/2\varepsilon\leq 1/2 and τ1\tau\leq 1. There is a δ=O(εlog1/ε)\delta=O(\varepsilon\sqrt{\log 1/\varepsilon}) such that if Y1,,YNY_{1},\dots,Y_{N} are independent samples from the GMM with PDF j[k]αjN(μj,Σj)\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\Sigma_{j}) where αjΩ(ε)\alpha_{j}\geq\Omega(\varepsilon) for all jj, and N=Ω(d+log(k/τ)δ12)N=\Omega\left(\frac{d+\log{(k/\tau)}}{\delta_{1}^{2}}\right), then

where QQ is defined as in equation (29).

Fix ε1/2\varepsilon\leq 1/2 and τ1\tau\leq 1. There is a δ=O(εlog1/ε)\delta=O(\varepsilon\sqrt{\log 1/\varepsilon}) such that if Y1,,YNY_{1},\dots,Y_{N} are independent samples from the GMM with PDF j[k]αjN(μj,Σj)\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\Sigma_{j}) where αjΩ(ε)\alpha_{j}\geq\Omega(\varepsilon) for all jj, and N=Ω(d+log(k/τ)δ22)N=\Omega\left(\frac{d+\log{(k/\tau)}}{\delta_{2}^{2}}\right), then

3 Algorithm

Our algorithm is based on the following deterministic conditions:

(32) follows from basic Gaussian concentration, and (33) and (34) follow from the results in Section 6.2 for NN sufficiently large. Note that these trivially imply similar conditions for the Schatten top-kk norm, at the cost of an additional factor of kk on the right-hand side of the inequalities. For the rest of this section, let δ=max(δ1,δ2)\delta=\max(\delta_{1},\delta_{2}).

At this point, we are ready to apply our separation oracle framework. In particular, we will find a weight vector ww over the points such that

for some choice of η\eta. The set of such weights is convex, and concentration implies that the true weight vector will have this property. Furthermore, we can describe a separation oracle given any weight vector not contained in this set (as long as η\eta is not too small). At this point, we use classical convex programming methods to find a vector which satisfies these conditions. Further details are provided in Section 6.5.

After this procedure, Lemma 6.13 shows that the weighted empirical covariance is spectrally close to the true covariance matrix. We are now in the same regime as [SOAJ14], which obtains their results as a consequence of the empirical covariance concentrating about the true covariance matrix. Thus, we will appeal to their analysis, highlighting the differences between our approach and theirs. We note that [LS17] also follows a similar approach and the interested reader may also adapt their arguments instead.

First, if γ\gamma is sufficiently large (i.e., Ωk(log(1/ε))\Omega_{k}(\log(1/\varepsilon))), this implies a separation condition between some component mean and the mixture’s mean. This allows us to cluster the points further, using a spectral method. We take the top eigenvector of the weighted empirical covariance matrix and project in this direction, using the sign of the result as a classifier. In contrast to previous work, which requires that no points are misclassified, we can tolerate poly(ε/k)\operatorname{poly}(\varepsilon/k) misclassifications, since our algorithms are agnostic. This crucially allows us to avoid a dependence on dd in our overall agnostic learning guarantee. Further details are provided in Section 6.6.

Finally, if γ\gamma is sufficiently small, we may perform an exhaustive search. The span of the means is in the span of the top k1k-1 eigenvectors of the true covariance matrix, which we can approximate with our weighted empirical covariance matrix. Since γ\gamma is small, by trying all points within a sufficiently tight mesh, we can guess a set of candidate means which are sufficiently close to the true means. Combining the approximations to the means with Corollary 2.13 and the triangle inequality, we can guarantee that at least one of our guesses is sufficiently close to the true distribution. Additional details are provided in Section 6.7.

To conclude our algorithm, we can apply Lemma 2.23. We note that this hypothesis selection problem has been studied before (see, e.g., [DL01, DK14]), but we must adapt it for our agnostic setting. This allows us to select a hypothesis which is sufficiently close to the true distribution, thus concluding the proof. We note that the statement of Lemma 2.23 requires the hypotheses to come from some fixed finite set, while there are an infinite number of Gaussian mixture models. In Section 6.8, we discuss how to limit the number of hypotheses based on the set of uncorrupted samples in order to satisfy the conditions of Lemma 2.23.

4 Naive Clustering

We give a very naive clustering algorithm, the generalization of NaivePrune, which recursively allows us to cluster components if they are extremely far away. The algorithm is very simple: for each XiX_{i}, add all points within distance O(dlog(k/ε))O(d\log(k/\varepsilon)) to a cluster SiS_{i}. Let C\mathcal{C} be the set of clusters which contain at least 4εN4\varepsilon N points, and let the final clustering be C1,,CkC_{1},\ldots,C_{k^{\prime}} be formed by merging clusters in C\mathcal{C} if they overlap, and stopping if no clusters overlap. We give the pseudocode in Algorithm 10.

We prove here that this process (which may throw away points) throws away only at most a ε\varepsilon fraction of good points, and moreover, the resulting clustering only misclassifies at most an O(ε)O(\varepsilon)-fraction of the good points, assuming (32).

Thus, we have that by applying this algorithm, given an ε\varepsilon-corrupted set of samples from M\mathcal{M}, we may cluster them in a way which misclassifies at most an ε/k\varepsilon/k fraction of the samples from any component, and such that within each cluster, the means of the associated components differ by at most dklogk/εdk\log k/\varepsilon. Thus, each separate cluster is simply a ε\varepsilon-corrupted set of samples from the mixture restricted to the components within that cluster; moreover, the number of components in each cluster must be strictly smaller than kk. Therefore, we may simply recursively apply our algorithm on these clusters to agnostically learn the mixture for each cluster, since if k=1k=1, this is a single Gaussian, which we know how to learn agnostically.

Thus, for the remainder of this section, let us assume that for all j,jj,j^{\prime}, we have μjμj22O(dklog1/ε)\|\mu_{j}-\mu_{j}^{\prime}\|_{2}^{2}\leq O(dk\log 1/\varepsilon). Moreover, we may assume that there are no points j,jj,j^{\prime} (corrupted or uncorrupted), such that XjXj22Ω(dklog1/ε)\|X_{j}-X_{j^{\prime}}\|_{2}^{2}\geq\Omega(dk\log 1/\varepsilon).

5 Estimating the Covariance Using Convex Programming

In this section, we will apply our separation oracle framework to estimate the covariance matrix. While in the non-agnostic case, the empirical covariance will approximate the actual covariance, this is not necessarily true in our case. As such, we will focus on determining a weight vector over the samples such that the weighted empirical covariance is a good estimate for the true covariance.

We first define the convex set for which we want an interior point:

In Section 6.5.1, we prove lemmata indicating important properties of this set. In Section 6.5.2, we give a separation oracle for this convex set. We conclude with Lemma 6.13, which shows that we have obtained an accurate estimate of the true covariance.

We start by proving the following lemma, which states that for any weight vector which is not in our set, the weighted empirical covariance matrix is noticeably larger than it should be (in Schatten top-kk norm).

Suppose that (33) holds, and w∉Cckh(k,γ,δ)w\not\in\mathcal{C}_{ckh(k,\gamma,\delta)}. Then

We also require the following lemma, which shows that if a set of weights poorly approximates μ\mu, then it is not in our convex set.

Suppose that (33) and (34) hold. Let wSm,εw\in S_{m,\varepsilon} and set μ^=i=1mwiXi\widehat{\mu}=\sum_{i=1}^{m}w_{i}X_{i} and Δ=μμ^\Delta=\mu-\widehat{\mu}. Furthermore, suppose that Δ2Ω(h(k,γ,δ))\|\Delta\|_{2}\geq\Omega(h(k,\gamma,\delta)). Then

By contraposition, if a set of weights is in our set, then it provides a good approximation for μ\mu:

Suppose that (33) and (34) hold. Let wCh(k,γ,δ)w\in\mathcal{C}_{h(k,\gamma,\delta)} for δ=Ω(εlog1/ε)\delta=\Omega(\varepsilon\log 1/\varepsilon). Then

5.2 Separation Oracle

In this section, we provide a separation oracle for Cη\mathcal{C}_{\eta}. In particular, we have the following theorem:

Fix ε>0\varepsilon>0, and let δ=Ω(εlog1/ε).\delta=\Omega(\varepsilon\log 1/\varepsilon). Suppose that (33) and (34) hold. Let ww^{\ast} denote the weights which are uniform on the uncorrupted points. Then there is a constant cc and an algorithm such that:

(Completeness) If w=ww=w^{\ast}, then it outputs “YES”.

These two facts imply that the ellipsoid method with this separation oracle will terminate in poly(d,1/ε)\operatorname{poly}(d,1/\varepsilon) steps, and moreover, will with high probability output a ww^{\prime} such that wwε/(Ndklog1/ε)\|w-w^{\prime}\|_{\infty}\leq\varepsilon/(Ndk\log 1/\varepsilon) for some wCckh(k,γ,δ)w\in\mathcal{C}_{ckh(k,\gamma,\delta)}. Moreover, it will do so in polynomially many iterations.

After running this procedure, we technically do not have a set of weights in Cckh(k,γ,δ)\mathcal{C}_{ckh(k,\gamma,\delta)}. But by the same argument as in Section 4.3, because the maximum distance between two points within any cluster is bounded, and we have the guarantee that XiXj2O(dklog1/ε)\|X_{i}-X_{j}\|^{2}\leq O(dk\log 1/\varepsilon) for all i,ji,j, we may assume we have a set of weights satisfying

We require the following lemma, describing the accuracy of the empirical covariance matrix with the obtained weights.

Let μ^=i=1NwiXi\widehat{\mu}=\sum_{i=1}^{N}w_{i}X_{i}. After running the algorithm above, we have a vector ww such that

By triangle inequality and Corollary 6.11,

6 Spectral Clustering

Now that we have a good estimate of the true covariance matrix, we will perform spectral clustering while γ\gamma is sufficiently large. We will adapt Lemma 6 from [SOAJ14], giving the following lemma:

Given a weight vector ww as output by Algorithm 11, if γΩ(poly(k)log1/ε)\gamma\geq\Omega(\operatorname{poly}(k)\cdot\log 1/\varepsilon), there exists an algorithm which produces a unit vector vv with the following guarantees:

There exists a non-trival partition of [k][k] into S0S_{0} and S1S_{1} such that vTμj>0v^{T}\mu_{j}>0 for all jS0j\in S_{0} and vTμj<0v^{T}\mu_{j}<0 for all jS1j\in S_{1};

The probability of a sample being misclassified is at most O(poly(ε/k))O(\operatorname{poly}(\varepsilon/k)), where a misclassification is defined as a sample XX generated from a component in S0S_{0} having vTX<0v^{T}X<0, or a sample generated from a component in S1S_{1} having vTX>0v^{T}X>0.

The algorithm will be as follows. Let vv be the top eigenvector of

For a sample XX, cluster it based on the sign of vTXv^{T}X. After performing this clustering, recursively perform our algorithm from the start on the two clusters.

The proof is very similar to that of Lemma 6 in [SOAJ14]. Their main concentration lemma is Lemma 30, which states that they obtain a good estimate of the true covariance matrix, akin to our Lemma 6.13. Lemma 31 argues that the largest eigenvector of this estimate is highly correlated with the top eigenvector of the true covariance matrix. Since γ\gamma is large, this implies there is a large margin between the mean and the hyperplane. However, by standard Gaussian tail bounds, the probability of a sample landing on the opposite side of this hyperplane is small.

We highlight the main difference between our approach and theirs. For their clustering step, they require that no sample is misclustered with high probability. As such, they may perform spectral clustering while γ=Ω(poly(k)log(d/ε))\gamma=\Omega\left(\operatorname{poly}(k)\cdot\log(d/\varepsilon)\right). We note that, in the next step of our algorithm, we will perform an exhaustive search. This will result in an approximation which depends on the value of γ\gamma at the start of the step, and as such, using the same approach as them would result in an overall approximation which depends logarithmically on the dimension.

We may avoid paying this cost by noting that our algorithm is agnostic. They require that no sample is misclustered with high probability, while our algorithm tolerates that a poly(ε/k)\operatorname{poly}(\varepsilon/k)-fraction of points are misclustered. As such, we can continue spectral clustering until γ=O(poly(k)log(1/ε))\gamma=O\left(\operatorname{poly}(k)\cdot\log(1/\varepsilon)\right).

7 Exhaustive Search

The final stage of the algorithm is when we know that all γi\gamma_{i}’s are sufficiently small. We can directly apply the following lemma:

Given a weight vector ww as output by Algorithm 11, then the projection of μjμμjμ2\frac{\mu_{j}-\mu}{\|\mu_{j}-\mu\|_{2}} onto the space orthogonal to the span of the top k1k-1 eigenvectors of

By taking a kk-wise Cartesian product of this set, we are guaranteed to obtain a vector which has this guarantee simultaneously for all μj\mu_{j}.

8 Applying the Tournament Lemma

In this section, we discuss details about how to apply our hypothesis selection algorithm. First, in Section 6.8.1, we describe how to guess the mixing weights and the variance of the components. Then in Section 6.8.2, we discuss how to ensure our hypotheses come from some fixed finite set, in order to deal with technicalities which arise when performing hypothesis selection with our adversary model.

The majority of our algorithm is focused on generating guesses for the means of the Gaussians. In this section, we guess the remaining parameters: the mixing weights and the variance. While most of these guessing arguments are standard, we emphasize that we reap an additional benefit because our algorithm is agnostic. In particular, most algorithms must deal with error incurred due to misspecification of the parameters. Since our algorithm is agnostic, we can pretend the misspecified parameter is the true one, at the cost of increasing the value of the agnostic parameter ε\varepsilon. If our misspecified parameters are accurate enough, the agnostic learning guarantee remains unchanged.

Guessing the mixing weights is fairly straightforward. For some ν=poly(ε/k)\nu=\operatorname{poly}(\varepsilon/k) sufficiently small, our algorithm generates a set of at most (1/ν)k=poly(k/ε)k(1/\nu)^{k}=\operatorname{poly}(k/\varepsilon)^{k} possible mixing weights by guessing the values {0,ε,ε+ν,ε+2ν,,1ν,1}\{0,\varepsilon,\varepsilon+\nu,\varepsilon+2\nu,\dots,1-\nu,1\} for each αj\alpha_{j}. Note that we may assume each weight is at least ε\varepsilon, since components with weights less than this can be specified arbitrarily at a total cost of O(kε)O(k\varepsilon) in total variation distance.

Next, we need to guess the variance σ2\sigma^{2} of the components. To accomplish this, we will take k+1k+1 samples (hoping to find only uncorrupted ones) and compute the minimum distance between any pair of them. Since we assume k1/εk\ll 1/\varepsilon, we can repeatedly draw k+1k+1 samples until we have the guarantee that at least one set is uncorrupted. If none of the k+1k+1 samples are corrupted, then at least two of them came from the same component, and in our high-dimensional setting the distance between any pair of samples from the same component concentrates around 2dσ\sqrt{2d}\sigma. After rescaling this distance, we can then multiplicatively enumerate around this value with granularity poly(ε/dk)\operatorname{poly}(\varepsilon/dk) to get an estimate for σ2\sigma^{2} that is sufficiently good for our purposes. Applying Corollary 2.14 bounds the cost of this misspecification by O(ε)O(\varepsilon). By rescaling the points, we may assume that σ2=1\sigma^{2}=1.

8.2 Pruning Our Hypotheses

In this section, we describe how to prune our set of hypotheses in order to apply Lemma 2.23. Recall that this lemma requires our hypotheses to come from some fixed finite set, rather than the potentially infinite set of GMM hypotheses. We describe how to prune and discretize the set of hypotheses obtained during the rest of the algorithm to satisfy this condition. For the purposes of this section, a hypothesis will be a kk-tuple of dd-dimensional points, corresponding only to the means of the components. While the candidate mixing weights already come from a fixed finite set (so no further work is needed), the unknown variance must be handled similarly to the means. The details for handling the variance are similar to (and simpler than) those for handling the means, and are omitted.

More precisely, this section will describe a procedure to generate a set of hypotheses M\mathcal{M}, which is exponentially large in kk and dd, efficiently searchable, and comes from a finite set of hypotheses which are fixed with respect to the true distribution. Then, given our set of hypotheses generated by the main algorithm (which is exponentially large in kk but polynomial in dd), we iterate over this set, either replacing each hypothesis with a “close” hypothesis from M\mathcal{M} (i.e., one which is within O(ε)O(\varepsilon) total variation distance), or discarding the hypothesis if none exists. Finally, we run the tournament procedure of Lemma 2.23 on the resulting set of hypotheses.

At a high level, the approach will be as follows. We will take a small set of samples, and remove any samples from this set which are clear outliers (due to having too few nearby neighbors). This will give us a set of points, each of which are within a reasonable distance from some component mean. Taking a union of balls around these samples will give us a space that is a subset of a union of (larger) balls centered at the component centers. We take a discrete mesh over this space to obtain a fixed finite set of possible means, and round each hypothesis such that its means are within this set.

We start by taking N=O(klog(1/τ)/ε2)N=O(k\log(1/\tau)/\varepsilon^{2}) samples, which is sufficient to ensure that the number of (uncorrupted) samples from component jj will be (wj±Θ(ε))N(w_{j}\pm\Theta(\varepsilon))N for all j[k]j\in[k] with probability 1O(τ)1-O(\tau). Recall that we are assuming that wj=Ω(ε)w_{j}=\Omega(\varepsilon) for all jj, as all other components may be defined arbitrarily at the cost of O(kε)O(k\varepsilon) in total variation distance. This implies that even after corruption, each component has generated at least εN\varepsilon N uncorrupted samples.

By standard Gaussian concentration bounds, we know that if NN samples are taken from a Gaussian, the maximum distance between a sample and the Gaussian’s mean will be at most ζ=O(dlog(N/τ))\zeta=O(\sqrt{d\log(N/\tau)}) with probability 1τ1-\tau. Assume this condition holds, and thus each component’s mean will have at least εN\varepsilon N points within distance ζ\zeta. We prune our set of samples by removing any point with fewer than εN\varepsilon N other points at distance less than 2ζ2\zeta. This will not remove any uncorrupted points, by the above assumption, and triangle inequality. However, this will remove any corrupted points at distance at least 3ζ3\zeta from all component means, due to the fact that the adversary may only move an ε\varepsilon-fraction of the points, and reverse triangle inequality.

Now, we consider the union of the balls of radius 3ζ3\zeta centered at each of the remaining points. This set contains all of the component means, and is also a subset of the union of the balls of radius 6ζ6\zeta centered at the component means. We discretize this set by taking its intersection with a lattice of side-length εkd\frac{\varepsilon}{k\sqrt{d}}. We note that any two points in this discretization are at distance at most ε/k\varepsilon/k. By a volume argument, the number of points in the intersection is at most k(12ζkdε)dk\left(\frac{12\zeta k\sqrt{d}}{\varepsilon}\right)^{d}. Each hypothesis will be described by the kk-wise Cartesian product of these points, giving us a set M\mathcal{M} of at most kk(12ζkdε)kdk^{k}\left(\frac{12\zeta k\sqrt{d}}{\varepsilon}\right)^{kd} hypotheses.

Given a set of hypotheses H\mathcal{H} from the main algorithm, we prune it using M\mathcal{M} as a reference. For each hHh\in\mathcal{H}, we see if there exists some hMh^{\prime}\in\mathcal{M} such that the means in hh are at distance at most ε/k\varepsilon/k from the corresponding means in hh^{\prime}.We observe that the complexity of this step is polynomial in dd and kk, not exponential, if one searches for the nearest lattice point in the sphere surrounding each unpruned sample, rather than performing a naive linear scan over the entire list. If such an hh^{\prime} exists, we replace hh with hh^{\prime} – otherwise, hh is simply removed. By Corollary 2.13 and the triangle inequality, this replacement incurs a cost of O(ε)O(\varepsilon) in total variation distance. At this point, the conditions of Lemma 2.23 are satisfied and we may run this procedure to select a sufficiently accurate hypothesis.

9 Handling Unequal Variances

In this section, we describe the changes required to allow the algorithm to handle different variances for the Gaussians. The main idea is to find the minimum variance of any component and perform clustering so we only have uncorrupted samples from Gaussians with variances within some known, polynomially-wide interval. This allows us to grid within this interval in order to guess the variances, and the rest of the algorithm proceeds with minor changes.

The first step is to locate the minimum variance of any component. Again using standard Gaussian concentration, in sufficiently high dimensions, if NN samples are taken from a Gaussian with variance σ2I\sigma^{2}I, the distance between any two samples will be concentrated around σ(2dΘ(d1/4))\sigma(\sqrt{2d}-\Theta(d^{1/4})). With this in hand, we use the following procedure to estimate the minimum variance. For each sample ii, record the distance to the (εN+1)(\varepsilon N+1)st closest sample. We take the (εN+1)(\varepsilon N+1)st smallest of these values, rescale it by 1/2d1/\sqrt{2d}, and similar to before, guess around it using a multiplicative (1+poly(ε/kd))(1+\operatorname{poly}(\varepsilon/kd)) grid, which will give us an estimate σ^min2\hat{\sigma}_{min}^{2} for the smallest variance. We note that discarding the smallest εN\varepsilon N fraction of the points prevents this statistic from being grossly corrupted by the adversary. For the remainder of this section, assume that σmin2\sigma_{min}^{2} is known exactly.

At this point, we partition the points into those that come from components with small variance, and those with large variance. We will rely upon the following concentration inequality from [SOAJ14], which gives us the distance between samples from different components:

Given NN samples from a collection of Gaussian distributions, with probability 1O(τ)1-O(\tau), the following holds for every pair of samples X,YX,Y:

where XN(μ1,σ12I)X\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2}I) and YN(μ2,σ22I)Y\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2}I).

This procedure works due to the following argument. When we compute H1H_{1}, we are guaranteed that it will contain all samples from components with variance σmin2\sigma_{\min}^{2}, by the upper bound in Lemma 6.16. However, it may also contain samples from other components – in particular, those with variance at most γσmin2\gamma\sigma_{\min}^{2}, for

After this clustering step, the algorithm follows similarly to before. The main difference is in the convex programming steps and concentration bounds. For instance, before, we considered the set

Now, to reflect the different expression for the covariance of the GMM, we replace σ2I\sigma^{2}I with j[k]αjσj2I\sum_{j\in[k]}\alpha_{j}\sigma_{j}^{2}I; for example:

We note that since all variances in each cluster are off by a factor of at most ee, this will only affect our concentration and agnostic guarantees by a constant factor.

Agnostically Learning Binary Product Distributions, via Filters

where Σ\Sigma is the modified empirical covariance. Ultimately, we show that this yields an estimate that is O~(ε)\widetilde{O}(\sqrt{\varepsilon}) close in total variation distance. See Section 7.2 for further details.

The main result of this section is the following theorem:

Let PP be a binary product distribution in dd dimensions and ε,τ>0.\varepsilon,\tau>0. Let SS be a multiset of Θ(d4log(1/τ)/ε2)\Theta(d^{4}\log(1/\tau)/\varepsilon^{2}) independent samples from PP and SS^{\prime} be a multiset obtained by arbitrarily changing an ε\varepsilon-fraction of the points in S.S. There exists a polynomial time algorithm that returns a product distribution PP^{\prime} such that, with probability at least 1τ1-\tau, we have pp2=O(εlog(1/ε)),\|p-p^{\prime}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}), where pp and pp^{\prime} are the mean vectors of PP and PP^{\prime} respectively.

For 0<c<1/2,0<c<1/2, we say that a binary product distribution is cc-balanced if the expectation of each coordinate is in [c,1c].[c,1-c].

For cc-balanced binary product distributions, we have the following corollary of Lemma 2.17:

We start by defining a condition on the uncorrupted set of samples S,S, under which our algorithm will succeed.

The following simple lemma shows that a sufficiently large set of independent samples from PP is ε\varepsilon-good (with respect to PP) with high probability.

Let PP be an arbitrary distribution on {0,1}d\{0,1\}^{d} and ε,τ>0.\varepsilon,\tau>0. If the multiset SS is obtained by taking Ω((d4+d2log(1/τ))/ε2)\Omega((d^{4}+d^{2}\log(1/\tau))/\varepsilon^{2}) independent samples from P,P, it is ε\varepsilon-good with respect to PP with probability at least 1τ.1-\tau.

Recall (see Definition 5.4) that Δ(S,S)\Delta(S,S^{\prime}) is the size of the symmetric difference of SS and SS^{\prime} divided by the cardinality of S.S.

Our agnostic learning algorithm establishing Theorem 7.1 is obtained by repeated application of the efficient procedure whose performance guarantee is given in the following proposition:

Let PP be a binary product distribution with mean vector pp and ε>0\varepsilon>0 be sufficiently small. Let SS be ε\varepsilon-good with respect to PP, and SS^{\prime} be any multiset with Δ(S,S)2ε.\Delta(S,S^{\prime})\leq 2\varepsilon. There exists a polynomial time algorithm Filter-Balanced-Product that, given SS^{\prime} and ε>0,\varepsilon>0, returns one of the following:

A mean vector pp^{\prime} such that pp2=O(εlog(1/ε)).\|p-p^{\prime}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).

A multiset SSS^{\prime\prime}\subseteq S^{\prime} such that Δ(S,S)Δ(S,S)2ε/d.\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.

We start by showing how Theorem 7.1 follows easily from Proposition 7.7.

The proof of Theorem 7.1 is very similar to that of Theorem 5.1, however, we include it here for completeness. By the definition of Δ(S,S),\Delta(S,S^{\prime}), since SS^{\prime} has been obtained from SS by corrupting an ε\varepsilon-fraction of the points in S,S, we have that Δ(S,S)2ε.\Delta(S,S^{\prime})\leq 2\varepsilon. By Lemma 7.6, the set SS of uncorrupted samples is ε\varepsilon-good with respect to PP with probability at least 1τ.1-\tau. We henceforth condition on this event.

Our algorithm iteratively applies the Filter-Balanced-Product procedure of Proposition 7.7 until it terminates returning a mean vector pp^{\prime} with pp2=O(εlog(1/ε)).\|p-p^{\prime}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}). We claim that we need at most d+1d+1 iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets S0=S,S1,,S_{0}=S^{\prime},S_{1}^{\prime},\ldots, such that Δ(S,Si)Δ(S,S)i(2ε/d).\Delta(S,S_{i}^{\prime})\leq\Delta(S,S^{\prime})-i\cdot(2\varepsilon/d). Thus, if the algorithm does not terminate in the first dd iterations, we have Sd=S,S^{\prime}_{d}=S, and in the next iteration we output the sample mean of SS. ∎

In this section, we describe the efficient procedure establishing Proposition 7.7 followed by its proof of correctness. Our algorithm Filter-Balanced-Product is very simple: We consider the empirical distribution defined by the (corrupted) sample multiset S.S^{\prime}. We calculate its mean vector μS\mu^{S^{\prime}} and covariance matrix MM. If the matrix MM has no large eigenvalues, we return μS.\mu^{S^{\prime}}. Otherwise, we use the eigenvector vv^{\ast} corresponding to the maximum magnitude eigenvalue λ\lambda^{\ast} of MM and the mean vector μS\mu^{S^{\prime}} to define a filter. We zero out the diagonal elements of the covariance matrix for the following reason: The diagonal elements could contribute up to Ω(1)\Omega(1) to the spectral norm, even without noise. This would prevent us from obtaining the desired error of O~(ε).\widetilde{O}(\varepsilon). Our efficient filtering procedure is presented in detailed pseudocode below.

Tightness of our Analysis. We remark that the analysis of our filter-based algorithm is tight, and more generally our bound of O(εlog(1/ε))O(\varepsilon\sqrt{\log(1/\varepsilon)}) is a bottleneck for filter-based approaches.

More specifically, we note that our algorithm will never successfully add points back to SS after they have been removed by the adversary. Therefore, if an ε\varepsilon-fraction of the points in SS are changed, our algorithm may be able to remove these outliers from S,S^{\prime}, but will not be able to replace them with their original values. These changed values can alter the sample mean by as much as Ω(εlog(1/ε)).\Omega(\varepsilon\sqrt{\log(1/\varepsilon)}).

The rest of this section is devoted to the proof of correctness of algorithm Filter-Balanced-Product.

1.2 Setup and Basic Structural Lemmas

By definition, there exist disjoint multisets L,E,L,E, of points in {0,1}d\{0,1\}^{d}, where LS,L\subset S, such that S=(SL)E.S^{\prime}=(S\setminus L)\cup E. With this notation, we can write Δ(S,S)=L+ES.\Delta(S,S^{\prime})=\frac{|L|+|E|}{|S|}. Our assumption Δ(S,S)2ε\Delta(S,S^{\prime})\leq 2\varepsilon is equivalent to L+E2εS,|L|+|E|\leq 2\varepsilon\cdot|S|, and the definition of SS^{\prime} directly implies that (12ε)SS(1+2ε)S.(1-2\varepsilon)|S|\leq|S^{\prime}|\leq(1+2\varepsilon)|S|. Throughout the proof, we assume that ε\varepsilon is a sufficiently small constant. Our analysis will make essential use of the following matrices:

Our first claim follows from the Chernoff bound and the definition of a good set:

The following sequence of lemmata bound from above the spectral norms of the associated matrices. Our first simple lemma says that the (diagonally reduced) empirical covariance matrix MSM_{S}, where SS is the set of uncorrupted samples drawn from the binary product distribution P,P, is a good approximation to the matrix MP,M_{P}, in spectral norm.

If SS is ε\varepsilon-good, MPMS2O(ε).\|M_{P}-M_{S}\|_{2}\leq O(\varepsilon).

It suffices to show that (MP)i,j(MS)i,jO(ε/d)|(M_{P})_{i,j}-(M_{S})_{i,j}|\leq O(\varepsilon/d) for all iji\neq j. Then, we have that

A similar expression holds for MSM_{S} except with probabilities for XuS.X\in_{u}S. Since SS is ε\varepsilon-good with respect to PP, we have (MP)i,j(MS)i,jε/d+μiSε/d+μjSε/d3ε/d|(M_{P})_{i,j}-(M_{S})_{i,j}|\leq\varepsilon/d+\mu^{S^{\prime}}_{i}\varepsilon/d+\mu^{S^{\prime}}_{j}\varepsilon/d\leq 3\varepsilon/d. This completes the proof. ∎

As a simple consequence of the above lemma, we obtain the following:

If SS is ε\varepsilon-good, M(1/S)(SMP+EMELML)2=O(ε).\|M-(1/|S^{\prime}|)(|S|M_{P}+|E|M_{E}-|L|M_{L})\|_{2}=O(\varepsilon).

First note that we can write SM=SMS+EME0LML0,|S^{\prime}|M=|S|M_{S}+|E|M_{E}^{0}-|L|M_{L}^{0}, where ME0M_{E}^{0} and ML0M_{L}^{0} are obtained from MEM_{E} and MLM_{L} by zeroing out the diagonal. Observe that E+L=O(ε)S.|E|+|L|=O(\varepsilon)|S^{\prime}|. This follows from the assumption that Δ(S,S)2ε\Delta(S,S^{\prime})\leq 2\varepsilon and the definition of S.S^{\prime}. Now note that the matrices MEME0M_{E}-M_{E}^{0} and MLML0M_{L}-M_{L}^{0} are diagonal with entries at most 1,1, and thus have spectral norm at most 1.1. The claim now follows from Lemma 7.9. ∎

We have that MP2μSp22.\|M_{P}\|_{2}\leq\|\mu^{S^{\prime}}-p\|_{2}^{2}.

Note that (MP)i,j=(μiSpi)(μjSpj)(M_{P})_{i,j}=(\mu^{S^{\prime}}_{i}-p_{i})(\mu^{S^{\prime}}_{j}-p_{j}) for iji\neq j and otherwise. Therefore, MPM_{P} is the difference of (μSp)(μSp)T(\mu^{S^{\prime}}-p)(\mu^{S^{\prime}}-p)^{T} and the diagonal matrix with entries (μiSpi)2.(\mu^{S^{\prime}}_{i}-p_{i})^{2}. This in turn implies that

Note that both bounding matrices have spectral norm at most μSp22\|\mu^{S^{\prime}}-p\|_{2}^{2}, hence so does MP.M_{P}.

The following lemma, bounding from above the spectral norm of ML,M_{L}, is the main structural result of this section. This is the core result needed to establish that the subtractive error cannot change the sample mean by much:

We have that ML2=O(log(S/L)+μSp22+εS/L),\|M_{L}\|_{2}=O(\log(|S|/|L|)+\|\mu^{S^{\prime}}-p\|_{2}^{2}+\varepsilon\cdot|S|/|L|), hence

Since LSL\subseteq S, for any x{0,1}dx\in\{0,1\}^{d}, we have that

The RHS is in turn bounded from above as follows:

where the second line follows from (35) and the third line follows from Claim 7.8. This establishes the first part of the lemma.

The bound (L/S)ML2=O(εlog(1/ε)+εμSp22)(|L|/|S|)\|M_{L}\|_{2}=O(\varepsilon\log(1/\varepsilon)+\varepsilon\|\mu^{S^{\prime}}-p\|_{2}^{2}) follows from the previously established bound using the monotonicity of the function xlog(1/x),x\log(1/x), and the fact that L/S2ε.|L|/|S|\leq 2\varepsilon. The observation S/S1+2ε2|S|/|S^{\prime}|\leq 1+2\varepsilon\leq 2 completes the proof of the second part of the lemma. ∎

Claim 7.10 combined with Lemmas 7.11 and 7.12 and the triangle inequality yield the following:

We have that M(E/S)ME2=O(εlog(1/ε)+μSp22).\|M-(|E|/|S^{\prime}|)M_{E}\|_{2}=O(\varepsilon\log(1/\varepsilon)+\|\mu^{S^{\prime}}-p\|_{2}^{2}).

We are now ready to analyze the two cases of the algorithm Filter-Balanced-Product.

1.3 The Case of Small Spectral Norm

We start by analyzing the case where the mean vector μS\mu^{S^{\prime}} is returned. This corresponds to the case that the spectral norm of MM is appropriately small, namely M2O(εlog(1/ε)).\|M\|_{2}\leq O(\varepsilon\log(1/\varepsilon)). We start with the following simple claim:

Let μE,μL\mu^{E},\mu^{L} be the mean vectors of EE and LL respectively. Then, μEμS22ME2\|\mu^{E}-\mu^{S^{\prime}}\|_{2}^{2}\leq\|M_{E}\|_{2} and μLμS22ML2.\|\mu^{L}-\mu^{S^{\prime}}\|_{2}^{2}\leq\|M_{L}\|_{2}.

We prove the first inequality, the proof of the second being identical. Note that MEM_{E} is a symmetric matrix, so ME2=maxv2=1vTMEv.\|M_{E}\|_{2}=\max_{\|v\|_{2}=1}|v^{T}M_{E}v|. Moreover, for any vector vv we have that

Let w=μEμSw=\mu^{E}-\mu^{S^{\prime}} and take v=w/w2.v=w/\|w\|_{2}. We conclude that ME2w22,\|M_{E}\|_{2}\geq\|w\|_{2}^{2}, as desired. ∎

The following crucial lemma, bounding from above the distance μSp2\|\mu^{S^{\prime}}-p\|_{2} as a function of ε\varepsilon and M2,\|M\|_{2}, will be important for both this and the following subsections.

We have that μSp22εM2+O(εlog(1/ε)).\|\mu^{S^{\prime}}-p\|_{2}\leq 2\sqrt{\varepsilon\|M\|_{2}}+O(\varepsilon\sqrt{\log(1/\varepsilon)}).

First we observe that the mean vector μS\mu^{S} of the uncorrupted sample set SS is close to p.p. Since SS is ε\varepsilon-good, this follows from the fact that for any i[d],i\in[d], we have

Therefore, we get that μSp2ε/d.\|\mu^{S}-p\|_{2}\leq\varepsilon/\sqrt{d}.

Consider μE\mu^{E} and μL\mu^{L}, the mean vectors of EE and LL, respectively. By definition, we have that

and thus by the triangle inequality we obtain

Therefore, we have the following sequence of inequalities:

where the first line follows from the triangle inequality, the second uses Claim 7.14, while the third uses Lemma 7.12 and Corollary 7.13. Finally, the last couple of lines assume that ε\varepsilon is sufficiently small. The proof of Lemma 7.15 is now complete. ∎

We can now deduce the correctness of Step 7 of the algorithm Filter-Balanced-Product, since for M2O(εlog(1/ε)),\|M\|_{2}\leq O(\varepsilon\log(1/\varepsilon)), Lemma 7.15 directly implies that μSp2=O(εlog(1/ε)).\|\mu^{S^{\prime}}-p\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).

1.4 The Case of Large Spectral Norm

We next show the correctness of the algorithm Filter-Balanced-Product if it returns a filter (rejecting an appropriate subset of SS^{\prime}) in Step 8. This corresponds to the case that M2Cεlog(1/ε),\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon), for a sufficiently large universal constant C>0.C>0. We will show that the multiset SSS^{\prime\prime}\subset S^{\prime} computed in Step 8 satisfies Δ(S,S)Δ(S,S)2ε/d.\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.

We start by noting that, as a consequence of Lemma 7.15, we have the following:

We have that μSp2δ:=3εM2.\|\mu^{S^{\prime}}-p\|_{2}\leq\delta:=3\sqrt{\varepsilon\|M\|_{2}}.

By Lemma 7.15, we have that μSp22δ/3+O(εlog(1/ε)).\|\mu^{S^{\prime}}-p\|_{2}\leq 2\delta/3+O(\varepsilon\sqrt{\log(1/\varepsilon)}). Recalling that M2Cεlog(1/ε),\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon), if C>0C>0 is sufficiently large, the term O(εlog(1/ε))O(\varepsilon\sqrt{\log(1/\varepsilon)}) is at most δ/3.\delta/3.

By construction, vv^{\ast} is the unit eigenvector corresponding to the maximum magnitude eigenvalue of M.M. Thus, we have (v)TMv=M2=δ2/(9ε).(v^{\ast})^{T}Mv^{\ast}=\|M\|_{2}=\delta^{2}/(9\varepsilon). We thus obtain that

where the equality holds by definition, and the inequality follows from Corollary 7.13 and Claim 7.16 using the fact that ε\varepsilon is sufficiently small and the constant CC is sufficiently large (noting that the constant in the RHS of Corollary 7.13 does not depend on CC).

We show that (36) implies the existence of a T>0T>0 with the properties specified in Step 8 of the algorithm Filter-Balanced-Product. More specifically, we have the following crucial lemma:

If M2Cεlog(1/ε),\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon), for a sufficiently large constant C>0,C>0, there exists a T>0T>0 satisfying the property in Step 8 of the algorithm Filter-Balanced-Product, i.e., such that

Assume for the sake of contradiction that this is not the case, i.e., that for all T>0T>0 we have that

Since ES,E\subseteq S^{\prime}, for all x{0,1}dx\in\{0,1\}^{d}, we have that SPrXuS[X=x]EPrYuE[Y=x].|S^{\prime}|\Pr_{X\in_{u}S^{\prime}}[X=x]\geq|E|\Pr_{Y\in_{u}E}[Y=x]. This fact combined with (37) implies that for all T>0T>0

Using (36) and (38), we have the following sequence of inequalities:

This yields the desired contradiction recalling that the assumption M2Cεlog(1/ε)\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon) and the definition of δ\delta imply that δCεlog(1/ε)\delta\geq C^{\prime}\varepsilon\sqrt{\log(1/\varepsilon)} for an appropriately large C>0.C^{\prime}>0.

The following simple claim completes the proof of Proposition 7.7:

We have that Δ(S,S)Δ(S,S)2ε/d  .\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d\;.

Recall that S=(SL)E,S^{\prime}=(S\setminus L)\cup E, with EE and LL disjoint multisets such that LS.L\subset S. We can similarly write S=(SL)E,S^{\prime\prime}=(S\setminus L^{\prime})\cup E^{\prime}, with LLL^{\prime}\supseteq L and EE.E^{\prime}\subset E. Since

it suffices to show that EELL+2εS/d.|E\setminus E^{\prime}|\geq|L^{\prime}\setminus L|+2\varepsilon|S|/d. Note that LL|L^{\prime}\setminus L| is the number of points rejected by the filter that lie in SS.S\cap S^{\prime}. By Claim 7.8 and Claim 7.16, it follows that the fraction of elements xSx\in S that are removed to produce SS^{\prime\prime} (i.e., satisfy v(xμS)>T+δ|v^{\ast}\cdot(x-\mu_{S^{\prime}})|>T+\delta) is at most 2exp(T2/2)+ε/d.2\exp(-T^{2}/2)+\varepsilon/d. Hence, it holds that LL(2exp(T2/2)+ε/d)S.|L^{\prime}\setminus L|\leq(2\exp(-T^{2}/2)+\varepsilon/d)|S|. On the other hand, Step 8 of the algorithm ensures that the fraction of elements of SS^{\prime} that are rejected by the filter is at least 8exp(T2/2)+8ε/d.8\exp(-T^{2}/2)+8\varepsilon/d. Note that EE|E\setminus E^{\prime}| is the number of points rejected by the filter that lie in SS.S^{\prime}\setminus S. Therefore, we can write:

where the second line uses the fact that SS/2|S^{\prime}|\geq|S|/2 and the last line uses the fact that LL/S(2exp(T2/2)+ε/d).|L^{\prime}\setminus L|/|S|\leq(2\exp(-T^{2}/2)+\varepsilon/d). This completes the proof of the claim. ∎

2 Agnostically Learning Arbitrary Binary Product Distributions

In this subsection, we build on the approach of the previous subsection to show the following:

Similarly to the case of balanced product distributions, we will require a notion of a “good” set for our distribution. For technical reasons, the definition in this setting turns out to be more complicated. Roughly speaking, this is to allow us to ignore coordinates for which the small fraction of errors is sufficient to drastically change the sample mean.

We note that a sufficiently large set of samples from PP will satisfy the above properties with high probability:

If SS is obtained by taking Ω(d6log(1/τ)/ε3)\Omega(d^{6}\log(1/\tau)/\varepsilon^{3}) independent samples from P,P, it is (ε,1/5)(\varepsilon,1/5)-good with respect to PP with probability at least 9/10.9/10.

The proof of this lemma is deferred to Section D.

We will also require a notion of the number of coordinates on which SS non-trivially depends:

Similarly to the balanced case, our algorithm is obtained by repeated application of an efficient filter procedure, whose precise guarantee is described below.

Let PP be a binary product distribution in dd dimensions and ε>0.\varepsilon>0. Suppose that SS is an (ε,η)(\varepsilon,\eta)-good multiset with respect to PP with η>10ε\eta>10\varepsilon and SS^{\prime} be any multiset with Δ(S,S)20ε.\Delta(S,S^{\prime})\leq 20\varepsilon. There exists a polynomial time algorithm which, given ε\varepsilon and SS^{\prime}, returns one of the following:

A multiset SSS^{\prime\prime}\subset S^{\prime} of elements of {0,1}d\{0,1\}^{d} such that there exists a product distribution P~\widetilde{P} with mean p~{\widetilde{p}} and a multiset S~\widetilde{S} that is (ε,ηpp~1)(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})-good with respect to P~\widetilde{P} such that

Our agnostic learning algorithm is then obtained by iterating this procedure. We can prove Theorem 7.19 given Proposition 7.23.

When τ1/5\tau\leq 1/5, we will need to draw fresh ε\varepsilon-corrupted samples and repeat this procedure O(log(1/τ))O(\log(1/\tau)) times, and then one of the resulting output distributions is within total variation distance O(εlog(1/ε))O(\sqrt{\varepsilon\log(1/\varepsilon)}) with probability at least 1τ/21-\tau/2. Then we use the agnostic hypothesis selection procedure of Lemma 2.23. ∎

In this section, we describe and analyze the efficient routine establishing Proposition 7.23. Our efficient filtering procedure is presented in detailed pseudocode below.

This completes the description of the algorithm. We now proceed to prove correctness.

2.2 Chi-Squared Distance and Basic Reductions

As previously mentioned, our algorithm will use the χ2\chi^{2}-distance between the mean vectors as a proxy for the total variation distance between two binary product distributions. Since the mean vector of the target distribution is not known to us, we will not be able to use the symmetric definition of the χ2\chi^{2}-distance used in Lemma 2.17 We will instead require the following asymmetric version of the χ2\chi^{2}-distance:

The following fact follows directly from Lemma 2.17.

There are two problems with using the χ2\chi^{2} distance between the mean vectors as a proxy for the total variation distance. The first is that the χ2\chi^{2}-distance between the means is a very loose approximation of the total variational distance when the means are close to or 11 in some coordinate. To circumvent this obstacle, we remove such coordinates via an appropriate pre-processing in Step 8. The second is that the above asymmetric notion of the χ2\chi^{2}-distance may be quite far from the symmetric definition. To overcome this issue, it suffices to have that qi=O(pi)q_{i}=O(p_{i}) and 1qi=O(1pi).1-q_{i}=O(1-p_{i}). To ensure this condition is satisfied, we appropriately modify the target product distribution (that we aim to be close to). Next, we will show how we deal with these problems in detail.

The next reduction will be slightly more complicated and depends on the following idea: Suppose that there is a new product distribution P~{\widetilde{P}} with mean p~{\widetilde{p}} and an (ε,ηpp~1)(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})-good multiset S~{\widetilde{S}} for P~{\widetilde{P}} such that

Then, it suffices to show that our algorithm works for P~{\widetilde{P}} and S~{\widetilde{S}} instead of PP and SS (note that the input to the algorithm, SS^{\prime} and ε\varepsilon in the same in either case). This is because the conditions imposed by the output in this case would be strictly stronger. In particular, we may assume that μiSpi/3\mu^{S^{\prime}}_{i}\geq p_{i}/3 for all ii:

There is a product distribution P~{\widetilde{P}} whose mean vector p~{\widetilde{p}} satisfies μiSp~i/3\mu^{S^{\prime}}_{i}\geq{\widetilde{p}}_{i}/3 and 1μiS(1p~i/3)1-\mu^{S^{\prime}}_{i}\geq(1-{\widetilde{p}}_{i}/3) for all ii, and a set S~S{\widetilde{S}}\subseteq S that is (ε,ηpp~1)(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})-good for P~{\widetilde{P}} and satisfies

If all coordinates ii have μiSpi/3\mu^{S^{\prime}}_{i}\geq p_{i}/3 and 1μiS(1pi/3),1-\mu^{S^{\prime}}_{i}\geq(1-p_{i}/3), then we can take P~=P{\widetilde{P}}=P and S~=S.{\widetilde{S}}=S.

Suppose that the ithi^{th} coordinate has μiS<pi/3.\mu^{S^{\prime}}_{i}<p_{i}/3. Let P~{\widetilde{P}} be the product whose mean vector p~{\widetilde{p}} has p~i=0{\widetilde{p}}_{i}=0 and p~j=pj{\widetilde{p}}_{j}=p_{j} for ji.j\neq i. Let S~{\widetilde{S}} be obtained by removing from SS all of the entries with 11 in the ithi^{th}-coordinate. Then, we claim that S~{\widetilde{S}} is (ε,ηpi)(\varepsilon,\eta-p_{i})-good for P~{\widetilde{P}} and has Δ(S~,S)+pi/5Δ(S,S).\Delta({\widetilde{S}},S^{\prime})+p_{i}/5\leq\Delta(S,S^{\prime}). Note that here we have pp~1=pi.\|p-{\widetilde{p}}\|_{1}=p_{i}.

First, we show that S~{\widetilde{S}} is (ε,ηpi)(\varepsilon,\eta-p_{i})-good for P~.{\widetilde{P}}. For any affine function L(x)L(x) and set T[d]T\subseteq[d] with jTp~j(1p~j)ηpi,\sum_{j\in T}{\widetilde{p}}_{j}(1-{\widetilde{p}}_{j})\leq\eta-p_{i}, we need to show that

Let T~=T{i}.{\widetilde{T}}=T\cup\{i\}. We may or may not have iTi\in T but, from the definition of p~{\widetilde{p}},

Moreover, note that ST~=S~TS_{{\widetilde{T}}}={\widetilde{S}}_{T} and PT~=P~T.P_{{\widetilde{T}}}={\widetilde{P}}_{T}. Thus, S~{\widetilde{S}} is (ε,ηpi)(\varepsilon,\eta-p_{i})-good for P~.{\widetilde{P}}.

Next, we show that Δ(S~,S)+pi/5Δ(S,S).\Delta({\widetilde{S}},S^{\prime})+p_{i}/5\leq\Delta(S,S^{\prime}). We write S=S~L~E~.S={\widetilde{S}}\setminus{\widetilde{L}}\cup{\widetilde{E}}. We write S1,L1,S1S_{1},L_{1},S^{\prime}_{1} for the subset of S,L,SS,L,S^{\prime} respectively, where the ithi^{th} coordinate is 1.1. Since SS is (ε,η)(\varepsilon,\eta)-good for P,P, we have that μiSpiε3/2/d2.|\mu^{S}_{i}-p_{i}|\leq\varepsilon^{3/2}/d^{2}. Recall that we are already assuming that p~iε/d.{\widetilde{p}}_{i}\geq\varepsilon/d. Thus, μiS29pi/30.\mu^{S}_{i}\geq 29p_{i}/30. Therefore, we have that S129piS/30.|S_{1}|\geq 29p_{i}|S|/30. On the other hand, we have that S1piS/311piS/30.|S^{\prime}_{1}|\leq p_{i}|S^{\prime}|/3\leq 11p_{i}|S|/30. Thus, L1=S1S118piS/30.|L_{1}|=|S_{1}\setminus S^{\prime}_{1}|\geq 18p_{i}|S|/30. This means that pi=O(Δ(S~,S))=O(ε).p_{i}=O(\Delta({\widetilde{S}},S^{\prime}))=O(\varepsilon). However, E~=ES1{\widetilde{E}}=E\cup S^{\prime}_{1} and L~=LL1{\widetilde{L}}=L\setminus L_{1}. This gives

Similarly, suppose that instead the ithi^{th}-coordinate has 1μiS<(1pi)/3.1-\mu^{S^{\prime}}_{i}<(1-p_{i})/3. Let P~{\widetilde{P}} be the product whose mean vector p~{\widetilde{p}} has p~i=1{\widetilde{p}}_{i}=1 and p~j=pj{\widetilde{p}}_{j}=p_{j} for ji.j\neq i. Let S~{\widetilde{S}} be obtained by removing from SS all of the entries with in the ithi^{th}-coordinate. Then, by a similar proof we have that S~{\widetilde{S}} is (ε,η(1pi))(\varepsilon,\eta-(1-p_{i}))-good for P~{\widetilde{P}} and has Δ(S~,S)+(1pi)/5Δ(S,S).\Delta({\widetilde{S}},S^{\prime})+(1-p_{i})/5\leq\Delta(S,S^{\prime}). Note that here we have pp~1=1pi.\|p-{\widetilde{p}}\|_{1}=1-p_{i}.

By an easy induction, we can set all coordinates ii with μiSp~i/3\mu^{S^{\prime}}_{i}\geq{\widetilde{p}}_{i}/3 and 1μiS(1p~i/3)1-\mu^{S^{\prime}}_{i}\geq(1-{\widetilde{p}}_{i}/3) to or 11 respectively, giving an S~{\widetilde{S}} and P~{\widetilde{P}} such that S~{\widetilde{S}} is (ε,ηpp~1)(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})-good for P~{\widetilde{P}} and

In conclusion, throughout the rest of the proof we may and will assume that for all i,i,

ε/dμiS1ε/d.\varepsilon/d\leq\mu^{S^{\prime}}_{i}\leq 1-\varepsilon/d.

μiSpi/3\mu^{S^{\prime}}_{i}\geq p_{i}/3 and 1μiS(1pi)/3.1-\mu^{S^{\prime}}_{i}\geq(1-p_{i})/3.

2.3 Setup and Basic Structural Lemmas

As in the balanced case, we can write S=(SL)ES^{\prime}=(S\setminus L)\cup E for disjoint multisets LL and E.E. Similarly, we define the following matrices:

Note that we no longer zero-out the diagonals of MPM_{P} and MSM_{S}. This will turn out to allow us to more naturally relate spectral properties of these matrices to the χ2\chi^{2}-distance between the means. We start with the following simple claim:

VarXP[vX]9\operatorname*{Var}_{X\sim P}[v\cdot X]\leq 9 and v(pμS)χ2(μS,p),|v\cdot(p-\mu^{S^{\prime}})|\leq\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}, and

PrXP(vXμST+χ2(μS,p))9/T2  .\Pr_{X\sim P}\left(|v\cdot X-\mu^{S^{\prime}}|\geq T+\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\right)\leq 9/T^{2}\;.

where we used that pi3μiS,p_{i}\leq 3\mu^{S^{\prime}}_{i}, (1pi)3(1μiS)(1-p_{i})\leq 3(1-\mu^{S^{\prime}}_{i}) and the assumption in the claim statement. For the second part of (i), note that

where the first inequality is Cauchy-Schwarz, and the second follows from the assumption in the claim statement that i=1dvi2μiS(1μiS)1.\sum_{i=1}^{d}v_{i}^{2}\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})\leq 1. This proves (i).

To prove (ii), we note that Chebyshev’s inequality gives

where the second inequality follows from (i). To complete the proof note the inequality

where we used the triangle inequality and the second part of (i). ∎

Let \mboxCov[S]\mbox{Cov}[S] denote the sample covariance matrix with respect to S,S, and \mboxCov[P]\mbox{Cov}[P] denote the covariance matrix of P.P. We will need the following lemma:

χ2(μS,μS)χ2(μS,p)O(ε/d),\left|\sqrt{\chi^{2}(\mu^{S^{\prime}},\mu^{S})}-\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\right|\leq O(\varepsilon/d), and

D(\mboxCov[S]\mboxCov[P])D2O(ε)  .\|D\left(\mbox{Cov}[S]-\mbox{Cov}[P]\right)D\|_{2}\leq O(\sqrt{\varepsilon})\;.

For (i): Since SS is good, for any i[d],i\in[d], we have

Therefore, by the triangle inequality we get

where the second inequality uses the fact that μiS(1μiS)ε/(2d).\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})\geq\varepsilon/(2d).

For (ii): Since SS is good, for any i,j[d],i,j\in[d], we have

Combined with the bound μiSpiε3/2/d2|\mu^{S}_{i}-p_{i}|\leq\varepsilon^{3/2}/d^{2} above, this gives

Note that D2=maxi(1/μiS(1μiS))2d/ε.\|D\|_{2}=\max_{i}\left(1/\sqrt{\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})}\right)\leq\sqrt{2d/\varepsilon}. Therefore,

Combining Claim 7.27 and Lemma 7.28 we obtain:

VarXuS[vX]10\operatorname*{Var}_{X\in_{u}S}[v\cdot X]\leq 10 and v(μSμS)χ2(μS,p)+O(ε/d),|v\cdot(\mu^{S}-\mu^{S^{\prime}})|\leq\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}+O(\varepsilon/d), and

PrXuS(vXμST+χ2(μS,p))9/T2+ε3/2/d2  .\Pr_{X\in_{u}S}\left(|v\cdot X-\mu^{S^{\prime}}|\geq T+\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\right)\leq 9/T^{2}+\varepsilon^{3/2}/d^{2}\;.

where the second line uses Lemma 7.28 (ii), and the assumption D1v22=i=1dvi2μiS(1μiS)1,\|D^{-1}v\|_{2}^{2}=\sum_{i=1}^{d}v_{i}^{2}\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})\leq 1, and the third line holds for small enough ε.\varepsilon. Thus, using Claim 7.27 (i), we get that

By the Cauchy-Schwarz inequality and Lemma 7.28, we get

Part (ii) follows directly from Claim 7.27 (ii) and the assumption that SS is good for P.P.

We have that D(MSMP)D2O(ε)\|D(M_{S}-M_{P})D\|_{2}\leq O(\sqrt{\varepsilon}).

We can show that (MS)i,j(MP)i,jO(ε3/2/d2)|(M_{S})_{i,j}-(M_{P})_{i,j}|\leq O(\varepsilon^{3/2}/d^{2}) for all i,j[d]i,j\in[d], by expanding the LHS in terms of the differences of linear threshold functions on SS and PP in the same way as the proof of Lemma 7.28. Thus,

Finally note that D2=maxi(1/μiS(1μiS))2d/ε\|D\|_{2}=\max_{i}\left(1/\sqrt{\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})}\right)\leq\sqrt{2d/\varepsilon}, and so

We have that D(SMSMPEME+LML)D2=O(Sε)  .\left\|D(|S^{\prime}|M-|S|M_{P}-|E|M_{E}+|L|M_{L})D\right\|_{2}=O(|S^{\prime}|\cdot\sqrt{\varepsilon})\;.

This follows from Lemma 7.30 combined with the fact that SM=SMS+EMELML|S^{\prime}|M=|S|M_{S}+|E|M_{E}-|L|M_{L} and the observation SS/(12ε)2S.|S|\leq|S^{\prime}|/(1-2\varepsilon)\leq 2|S^{\prime}|.

We have that DMPD29+χ2(μS,p).\|DM_{P}D\|_{2}\leq 9+\chi^{2}(\mu^{S^{\prime}},p).

The following crucial lemma bounds from above the contribution to the error from LL:

The spectral norm DMLD2=O(S/L+χ2(μS,p)).\|DM_{L}D\|_{2}=O(|S^{\prime}|/|L|+\chi^{2}(\mu^{S^{\prime}},p)).

where the first line uses the triangle inequality, the second line uses Claim 7.27 (i), the third line follows from the fact that LSL\subseteq S, the fourth line uses Corollary 7.29 (i), and the last line uses the fact that ε\varepsilon is small enough. ∎

The above lemmata and the triangle inequality yield the following:

We have that D(M(E/S)ME)D2=O(1+χ2(μS,p))  .\left\|D\left(M-(|E|/|S^{\prime}|)M_{E}\right)D\right\|_{2}=O(1+\chi^{2}(\mu^{S^{\prime}},p))\;.

We are now ready to analyze the two cases of the algorithm Filter-Product.

2.4 The Case of Small Spectral Norm

We begin by bounding various χ2\chi^{2}-distances by the spectral norm of appropriate matrices.

Let μE,μL\mu^{E},\mu^{L} be the mean vector of EE and LL, respectively. Then, χ2(μS,μE)DMED2\chi^{2}(\mu^{S^{\prime}},\mu^{E})\leq\|DM_{E}D\|_{2} and χ2(μS,μL)DMLD2.\chi^{2}(\mu^{S^{\prime}},\mu^{L})\leq\|DM_{L}D\|_{2}.

We prove the first inequality, the proof of the second being very similar.

We can now prove that the output in Step 10 has the desired guarantee:

We have that χ2(μS,p)2εDMD2+O(ε).\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq 2\sqrt{\varepsilon\|DMD\|_{2}}+O(\sqrt{\varepsilon}).

Since S=(SL)E,S^{\prime}=(S\setminus L)\cup E, we have that SμS=SμS+EμELμL.|S^{\prime}|\mu^{S^{\prime}}=|S|\mu^{S}+|E|\mu^{E}-|L|\mu^{L}. Recalling that L,EL,E are disjoint, the latter implies that

First note that, by Lemma 7.28, χ2(μS,μS)χ2(μS,p)O(ε/d).|\sqrt{\chi^{2}(\mu^{S^{\prime}},\mu^{S})}-\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}|\leq O(\varepsilon/d). Lemma 7.35 and Corollary 7.34 give that

For ε\varepsilon sufficiently small, we have that the χ2(μS,p)\sqrt{\chi^{2}(\mu^{S^{\prime}},p)} terms satisfy

Recalling that E/SΔ(S,S)S/S(5/2)ε|E|/|S^{\prime}|\leq\Delta(S,S^{\prime})|S|/|S^{\prime}|\leq(5/2)\varepsilon, we now have:

Let δ:=3ελ\delta:=3\sqrt{\varepsilon|\lambda|}. For some universal constant CC, if δCεlog(1/ε),\delta\leq C\sqrt{\varepsilon\log(1/\varepsilon)}, then χ2(μS,p)O(εlog(1/ε)).\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq O(\sqrt{\varepsilon\log(1/\varepsilon)}). Otherwise, we have χ2(μS,p)δ.\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq\delta.

If CC is sufficiently large, when δ>Cεlog(1/ε),\delta>C\sqrt{\varepsilon\log(1/\varepsilon)}, this O(ε)O(\sqrt{\varepsilon}) is at most Cεlog(1/ε)/6.C\sqrt{\varepsilon\log(1/\varepsilon)}/6.

By Corollary 7.37, we have that χ2(μS,p)O(εlog(1/ε)).\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq O(\sqrt{\varepsilon\log(1/\varepsilon)}). Thus, by Corollary 7.25, the total variation distance between the product distributions with means pp and μS\mu^{S^{\prime}} is O(εlog(1/ε)).O(\sqrt{\varepsilon\log(1/\varepsilon)}).

2.5 The Case of Large Spectral Norm

We next need to show the correctness of the algorithm if it returns a filter, If we reach this step, then we have DMD2=Ω(1),\|DMD\|_{2}=\Omega(1), indeed vDMDvT=Ω(1),|v^{\prime}DMDv^{\prime T}|=\Omega(1), and by Corollary 7.37, it follows that χ2(μS,p)δ\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq\delta where δ:=3εDMD2.\delta:=3\sqrt{\varepsilon\|DMD\|_{2}}.

Since v2=1,\|v^{\prime}\|_{2}=1, DvDv^{\prime} satisfies i=1d(Dv)i2μiS(1μiS)=i=1mvi2=1.\sum_{i=1}^{d}(Dv^{\prime})_{i}^{2}\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})=\sum_{i=1}^{m}v_{i}^{\prime 2}=1. Thus, we can apply Corollary 7.29 to it.

Let a=maxxSvxμSa=\max_{x\in S^{\prime}}|v^{\ast}\cdot x-\mu^{S^{\prime}}|. Firstly, we look at the expected number of samples we reject:

Next, we look at the expected number of false positive samples we reject. If we write S=SLES^{\prime\prime}=S\cup L^{\prime}\setminus E^{\prime} for disjoint multisets LL^{\prime} and EE^{\prime}, then these are the elements of LLL^{\prime}\setminus L. We have:

Agnostically Learning Mixtures of Two Balanced Binary Products, via Filters

Unfortunately, bounds on the top absolute eigenvalue do not translate as well into bounds on the total variation distance of our estimate to the true distribution, as they did in all previous cases (e.g., if the top absolute eigenvalue is small in the case of learning the mean of a Gaussian with identity covariance, we can just use the empirical mean, etc). In fact, an eigenvalue λ\lambda could just mean that pp and qq differ by λ\sqrt{\lambda} along the direction vv. However, we can proceed by zeroing out the diagonals. If uuTuu^{T} has any large value along the diagonal, this operation can itself produce large eigenvalues. So, this strategy only works when u\|u\|_{\infty} is appropriately bounded. When u\|u\|_{\infty} is large, there is a separate strategy to deal with large entries in uu by guessing a coordinate whose value is large and conditioning on it, and once again setting up a modified eigenvalue problem. Our overall algorithm then follows from balancing all of these different cases, and we describe the technical components in more detail in the next subsection.

This section is devoted to the proof of the following theorem:

Recall that our overall approach is based on two strategies that succeed under different assumptions. Our first algorithm (Section 8.2) assumes that there exists a coordinate in which the means of the two component product distributions differ by a substantial amount. Under this assumption, we can use the empirical mean vectors conditioned on this coordinate being and 11. We show that the difference between these conditional mean vectors is almost parallel to the difference between the mean vectors of the product distributions. Considering eigenvectors perpendicular to this difference will prove a critical part of the analysis of this case. Our second algorithm (Section 8.3) succeeds under the assumption that the mean vectors of the two product distributions are close in all coordinates. This assumption allows us to zero out the diagonal of the covariance matrix without introducing too much error.

We note that together these algorithms will cover all possible cases. Our final algorithm runs all of these procedures in parallel, obtaining a polynomial number of candidate hypothesis distributions, such that at least one is sufficiently close to Π\Pi. We then run the tournament described by Lemma 2.23 in order to find a particular candidate that is sufficiently close to the target. To ensure that all the distributions returned are in some finite set M\mathcal{M}, we round each of the probabilities of each of the products to the nearest multiple of ε/d\varepsilon/d, and similarly round the mixing weight to the nearest multiple of ε\varepsilon. This introduces at most O(ε)O(\varepsilon) additional error.

2 Mixtures of Products Whose Means Differ Significantly in One Coordinate

We will use the following notation. Let Π\Pi be a mixture of two cc-balanced binary product distributions. We will write Π\Pi as αP+(1α)Q,\alpha P+(1-\alpha)Q, where P,QP,Q are binary product distributions with mean vectors p,qp,q, and α\alpha\in. In this subsection, we prove the following theorem:

For simplicity of the analysis, we will assume without loss of generality that i=d,i^{\ast}=d, unless otherwise specified. First, we determine some conditions under which our sample set will be sufficient. We start by recalling our condition of a good set for a balanced binary product distribution:

We will also need this to hold after conditioning on the last coordinate.

Finally, we define the notion of a good set for a mixture of two balanced products.

Let Π=αP+(1α)Q\Pi=\alpha P+(1-\alpha)Q be a mixture of two binary product distributions. We say that a multiset SS of elements of {0,1}d\{0,1\}^{d} is (ε,i)(\varepsilon,i)-good with respect to Π\Pi if we can write S=SPSQS=S_{P}\cup S_{Q}, where SPS_{P} is (ε,i)(\varepsilon,i)-good with respect to PP, SQS_{Q} is (ε,i)(\varepsilon,i)-good with respect to QQ, and SPSαε/d2|\frac{|S_{P}|}{|S|}-\alpha|\leq\varepsilon/d^{2}.

We now show that taking random samples from Π\Pi produces such a set with high probability.

Let Π=αP+(1α)Q\Pi=\alpha P+(1-\alpha)Q be a mixture of binary product distributions, where P,QP,Q are binary product distributions with mean vectors p,qp,q. Let SS be a set obtained by taking Ω(d4log(1/τ)/ε13/6)\Omega(d^{4}\log(1/\tau)/\varepsilon^{13/6}) independent samples from Π\Pi. Then, with probability at least 1τ1-\tau, SS is (ε,i)(\varepsilon,i)-good with respect to Π\Pi for all i[d]i\in[d].

The proof of this lemma is deferred to Section E.

We claim that given a good set with an ε\varepsilon-fraction of its entries corrupted, we can still determine Π\Pi from it. In particular, this is achieved by iterating the following proposition.

We note that iteratively applying this algorithm until it outputs a set RR of mixtures gives Theorem 8.2.

Notation. All vectors in this section should be assumed to be over the first d1d-1 coordinates only. We will write pdp_{-d} and qdq_{-d} for the first d1d-1 coordinates of pp and qq, but for other vectors we will use the similar notation to that used elsewhere to denote (d1)(d-1)-dimensional vectors.

The algorithm, written in terms of ii^{\ast} instead of dd for generality, is as follows:

We now proceed to prove correctness. We note that given S=SPSQS=S_{P}\cup S_{Q}, we can write

where SPSPS^{\prime}_{P}\subseteq S_{P}, SQSQS^{\prime}_{Q}\subseteq S_{Q} and EE is disjoint from SPSPS_{P}\setminus S^{\prime}_{P} and SQSQS_{Q}\setminus S^{\prime}_{Q}. Thus, we have

We next need some basic lemmata relating the means of some of these distributions.

We next show that μ(1)μ(0)\mu^{(1)}-\mu^{(0)} is approximately parallel to pdqdp_{-d}-q_{-d}. Note that if we had S=SS=S^{\prime} and μSP=pd,μSQ=qd\mu^{S_{P}}=p_{-d},\mu^{S_{Q}}=q_{-d}, then μ(1)μ(0)\mu^{(1)}-\mu^{(0)} would be a multiple of pdqdp_{-d}-q_{-d}. Since SS is ε\varepsilon-good, we can bound the error introduced by μSPp\mu^{S_{P}}-p, μSQqd\mu^{S_{Q}}-q_{-d} and Lemma 8.8 allow us to bound the error in taking μSP,μSQ\mu^{S^{\prime}_{P}},\mu^{S^{\prime}_{Q}} instead of pd,qdp_{-d},q_{-d}. However, we still have terms in the conditional means of EE:

For some scalars a=O(ε)a=O(\varepsilon), b0=O(E0/S)b^{0}=O(|E^{0}|/|S^{\prime}|), b1=O(E1/S)b^{1}=O(|E^{1}|/|S^{\prime}|), we have

where EjE^{j} is the subset of EE with last entry jj, μEj\mu^{E^{j}} is the mean of EjE^{j} with dthd^{th} coordinate removed.

Let SPj,SQj,Ej,SjS_{P}^{\prime j},S_{Q}^{\prime j},E^{j},S^{\prime j} denote the subset of the appropriate set in which the last coordinate is jj. Let μSPj,μSQj,μEj\mu^{{S^{\prime}_{P}}^{j}},\mu^{{S^{\prime}_{Q}}^{j}},\mu^{E^{j}} denote the means of SPj,SQjS_{P}^{\prime j},S_{Q}^{\prime j}, and EjE^{j} with the last entry truncated, respectively.

Taking the means of the subsets of SjS^{\prime j}, we find that

Therefore, using this and Lemma 8.8, we have that

Since Sj=SPj+SQj+Ej|S^{\prime j}|=|S_{P}^{\prime j}|+|S_{Q}^{\prime j}|+|E^{j}|, we have:

Thus, the sum of the coefficients of the pdp_{-d} and qdq_{-d} terms in Equation (40) is E0S1E1S0|E^{0}||S^{\prime 1}|-|E^{1}||S^{\prime 0}|, which is bounded in absolute value by ESO(εS2)|E||S^{\prime}|\leq O(\varepsilon|S|^{2}). Meanwhile, the pdp_{-d} coefficient of Equation (40) has:

Noting that (E1S0E0S1)α=O(εS2)(|E^{1}||S^{\prime 0}|-|E^{0}||S^{\prime 1}|)\alpha=O(\varepsilon|S^{\prime}|^{2}) and (E1S0E0S1)(1α)=O(εS2)(|E^{1}||S^{\prime 0}|-|E^{0}||S^{\prime 1}|)(1-\alpha)=O(\varepsilon|S^{\prime}|^{2}), we can write Equation (40) as:

We write μΠ=αpd+(1α)qd\mu^{\Pi}=\alpha p_{-d}+(1-\alpha)q_{-d} and so, dividing by S2|S^{\prime}|^{2} and recalling that E/SO(ε)|E|/|S^{\prime}|\leq O(\varepsilon), we get

If μΠ=μ\mu^{\Pi}=\mu, then we would be done. So, we must bound the error introduced by making this substitution. We can express μ\mu as

Substituting this into Equation (41), gives the lemma. ∎

We now show that, for any vector vv perpendicular to uu, if the variance of SS^{\prime} in the vv-direction is small, then vpdv\cdot p_{-d} and vqdv\cdot q_{-d} are both approximately vμv\cdot\mu.

For any vv with v2=1\|v\|_{2}=1, vu=0v\cdot u=0, we have that v(pdμ)δ|v\cdot(p_{-d}-\mu)|\leq\delta and v(qdμ)δ|v\cdot(q_{-d}-\mu)|\leq\delta for δ:=C(ε1/6Σ2+ε2/3log(1/ε))\delta:=C(\varepsilon^{1/6}\|\Sigma\|_{2}+\varepsilon^{2/3}\log(1/\varepsilon)) for a sufficiently large constant CC as defined in the algorithm.

Next, since vu=0v\cdot u=0, we have by Lemma 8.9 that

However, we have that Sμ=SPμSp+SQμSq+EμE+SO(εlog(1/ε))|S^{\prime}|\mu=|S^{\prime}_{P}|\mu^{S^{\prime}_{p}}+|S^{\prime}_{Q}|\mu^{S^{\prime}_{q}}+|E|\mu^{E}+|S^{\prime}|O(\varepsilon\log(1/\varepsilon)), and so

Inserting our assumptions that α(1α)ε1/6/2\alpha(1-\alpha)\geq\varepsilon^{1/6}/2, and pdqdε1/6p_{d}-q_{d}\geq\varepsilon^{1/6} gives

We can now show that if we return R,R, some distribution returned is close to Π\Pi. First, we show that there are points on LL close to pdp_{-d} and qdq_{-d}.

It is clear that even discretizing cc and dd, we can still find such a pair that satisfies this condition.

There are p,qLp^{\prime},q^{\prime}\in L such that pdp2,qdq2δ+O(ε1/6)  .\|p_{-d}-p^{\prime}\|_{2},\|q_{-d}-q^{\prime}\|_{2}\leq\delta+O(\varepsilon^{1/6})\;.

Similarly, we show that there is a qLq^{\prime}\in L such that qq2δ+O(ε1/6)\|q-q^{\prime}\|_{2}\leq\delta+O(\varepsilon^{1/6}). ∎

Now, we are ready to analyze the second part of our algorithm. The basic idea will be to show that if λ\lambda is large, then a large fraction of the variance in the vv-direction is due to points in EE.

Since SPS_{P} and SQS_{Q} are ε\varepsilon-good, we have that

Since λ\lambda is larger than a sufficiently large constant, this completes the proof. ∎

We next show that the threshold T>0T>0 required by our algorithm exists.

If λΩ(1)\lambda\geq\Omega(1), there exists a T>0T>0 such that

Assume for the sake of contradiction that this is not the case, i.e., that for all T>0T>0 we have that

Using the fact that ESE\subset S^{\prime}, this implies that for all T>0T>0

Finally, we show that SS^{\prime\prime} is closer to SS than SS^{\prime} was.

If the algorithm returns SS^{\prime\prime} then Δ(S,S)Δ(S,S)2ε/d\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.

Since SSS^{\prime\prime}\subset S, we can write S=SPSQES^{\prime\prime}=S^{\prime\prime}_{P}\cup S^{\prime\prime}_{Q}\cup E^{\prime\prime} for SPSPS^{\prime\prime}_{P}\subseteq S^{\prime}_{P}, SQSQS^{\prime\prime}_{Q}\subseteq S_{Q} and EEE^{\prime\prime}\subset E, where EE^{\prime\prime} has disjoint support from SPSPS^{\prime\prime}_{P}\setminus S_{P} and SQSQS^{\prime\prime}_{Q}\setminus S_{Q}. Thus, we need to show that

By Lemma 8.10, we have that v(μpd)δ|v^{\ast}\cdot(\mu-p_{-d})|\leq\delta and so

Since SS is (ε,d)(\varepsilon,d)-good, we have

We get the same bound for XuSQX\in_{u}S_{Q}, and so

Since LPLQSL^{\prime\prime}_{P}\cup L^{\prime\prime}_{Q}\subseteq S but any x(SPSP)(SQSQ)x\in(S^{\prime}_{P}\setminus S^{\prime\prime}_{P})\cup(S^{\prime}_{Q}\setminus S^{\prime\prime}_{Q}) has v(xμ)T+δv^{\ast}\cdot(x-\mu)\geq T+\delta, we have that

3 Mixtures of Products Whose Means Are Close in Every Coordinate

In this section, we prove the following theorem:

We will assume without loss of generality that α1/2.\alpha\leq 1/2. We may also assume that α>10δ10ε\alpha>10\delta\geq 10\varepsilon since otherwise, we can make use of our algorithm for learning a single product distribution.

In this context, we require the following slightly different definition of a good set:

Let SS be a multiset in {0,1}d\{0,1\}^{d}. We say that SS is ε\varepsilon-good for the mixture Π\Pi if there exists a partition S=SPSQS=S_{P}\cup S_{Q} such that SPSαε\left|\frac{|S_{P}|}{|S|}-\alpha\right|\leq\varepsilon and that SPS_{P} and SQS_{Q} are ε/6\varepsilon/6-good for the component product distributions PP and QQ, respectively.

If Π\Pi has mixing weights δα1δ\delta\leq\alpha\leq 1-\delta, with probability at least 1τ1-\tau, a set SS of Ω(d4log1/τ/(ε2δ))\Omega(d^{4}\log 1/\tau/(\varepsilon^{2}\delta)) samples drawn from Π\Pi is good for Π\Pi.

Our theorem will follow from the following proposition:

Before we present the algorithm, we give one final piece of notation. For SS a set of points, we let \mboxCov(S)\mbox{Cov}(S) be the sample covariance matrix of SS and \mboxCov0(S)\mbox{Cov}_{0}(S) be the sample covariance matrix with zeroed out diagonal. Our algorithm is presented in detailed pseudocode in Algorithm 16.

To analyze this algorithm, we begin with a few preliminaries. Firstly, we recall that S=SPSQS=S_{P}\cup S_{Q}. We can write S=SPSQES^{\prime}=S^{\prime}_{P}\cup S^{\prime}_{Q}\cup E, where SPSPS^{\prime}_{P}\subset S_{P}, SQSQS^{\prime}_{Q}\subset S_{Q}, and

Let μSP\mu^{S^{\prime}_{P}} and μSQ\mu^{S^{\prime}_{Q}} be the sample means of SPS^{\prime}_{P} and SQS^{\prime}_{Q}, respectively.

We have that αpμSP2,(1α)qμSQ2=O(εlog(1/ε))  .\alpha\|p-\mu^{S^{\prime}_{P}}\|_{2},(1-\alpha)\|q-\mu^{S^{\prime}_{Q}}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)})\;.

We will require that the matrix \mboxCov0(S)\mbox{Cov}_{0}(S^{\prime}) is close to being PSD. The proof of this fact is rather technical and we defer it to the appendix.

Let TT be the multiset obtained from SS^{\prime} by replacing all points of SPS^{\prime}_{P} with copies of μSP\mu^{S^{\prime}_{P}} and all points of SQS^{\prime}_{Q} with copies of μSQ\mu^{S^{\prime}_{Q}}. Then, \mboxCov0(S)\mboxCov(T)2=O(δ2)  .\|\mbox{Cov}_{0}(S^{\prime})-\mbox{Cov}(T)\|_{2}=O(\delta^{2})\;.

We are now prepared to show that the first return condition outputs a correct answer. We begin by showing that vectors uu with large inner products with μSPμ\mu^{S^{\prime}_{P}}-\mu or μSQμ\mu^{S^{\prime}_{Q}}-\mu correspond to large eigenvectors of \mboxCov0(S)\mbox{Cov}_{0}(S^{\prime}).

Using Lemma 8.22, we have uT\mboxCov0(S)u=VarXuT(uX)+O(δ2)u22u^{T}\mbox{Cov}_{0}(S^{\prime})u=\operatorname*{Var}_{X\in_{u}T}(u\cdot X)+O(\delta^{2})\|u\|_{2}^{2}. From the definition of TT it follows that

Next, we show that, if there is only one large eigenvalue of \mboxCov0(S)\mbox{Cov}_{0}(S^{\prime}), the means in question are both close to a given line.

Next, we analyze the second case of the algorithm. We must show that Step 8 will find an rr and tt. First, we claim that there is a θ\theta which makes rr nearly perpendicular to μSpμSQ\mu^{S^{\prime}_{p}}-\mu^{S^{\prime}_{Q}}.

There exists a r=(cosθ)u+(sinθ)vr=(\cos\theta)u^{\ast}+(\sin\theta)v^{\ast}, with θ\theta a multiple of δ2/d\delta^{2}/d, that has

Let z=(μSPμSQ)z=(\mu^{S^{\prime}_{P}}-\mu^{S^{\prime}_{Q}}). If uz=0u^{\ast}\cdot z=0, then θ=0\theta=0 suffices. Otherwise, we take θ=cot1(vzuz).\theta^{\prime}=\cot^{-1}(\frac{v^{\ast}\cdot z}{u^{\ast}\cdot z}). Then, let θ\theta be the nearest multiple of δ2/d\delta^{2}/d to θ\theta^{\prime}. Note that cosθcosθ,sinθsinθθθ|\cos\theta-\cos\theta^{\prime}|,|\sin\theta-\sin\theta^{\prime}|\leq|\theta-\theta^{\prime}| and uz,vzz2d|u^{\ast}\cdot z|,|v^{\ast}\cdot z|\leq\sqrt{\|z\|_{2}}\leq\sqrt{d}. Then, we have

We now need to show that for this rr, Step 8 will find a tt. For this rr, rμSPr\cdot\mu^{S^{\prime}_{P}} and rμSQr\cdot\mu^{S^{\prime}_{Q}} are close. We need to show that EE contains many elements xx whose rxr\cdot x is far from these. We can express this in terms of TT:

Let rr be a unit vector in ru,vr\in\langle u^{\ast},v^{\ast}\rangle with r(μSPμSQ)δ2/d.|r\cdot(\mu^{S^{\prime}_{P}}-\mu^{S^{\prime}_{Q}})|\leq\delta^{2}/\sqrt{d}. Then, there is a t>1t>1 such that

for sufficiently large CC and we also have that rT\mboxCov0(S)rr^{T}\mbox{Cov}_{0}(S^{\prime})r is positive.

Suppose for a contradiction that this lemma does not hold. Then, since ETE\subset T, we have

Since we assumed that δ2Ω(εlog(1/ε)\delta^{2}\geq\Omega(\varepsilon\log(1/\varepsilon), this is a contradiction.

To get a similar result for SS^{\prime}, we first need to show that SPS^{\prime}_{P} and SQS^{\prime}_{Q} are suitably concentrated about their means:

If t1+2log6/εt\geq 1+\sqrt{2\log 6/\varepsilon}, this is strictly less than 2ε/3.2\varepsilon/3.

Since r(μSQμSP)δ2/d1/2,|r\cdot(\mu^{S_{Q}}-\mu^{S_{P}})|\leq\delta^{2}/\sqrt{d}\leq 1/2, we have

Noting that 1(SP+SQ)/S=E/S4ε/31-(|S^{\prime}_{P}|+|S^{\prime}_{Q}|)/|S^{\prime}|=|E|/|S^{\prime}|\geq 4\varepsilon/3, we have

for ε\varepsilon sufficiently small. If t1+2log6/εt\geq 1+\sqrt{2\log 6/\varepsilon}, this expression is (5/2)(ε/6)+ε/5d2ε/3(5/2)(\varepsilon/6)+\varepsilon/5d\leq 2\varepsilon/3. ∎

Now we can finally show that a tt exists for this rr, so Step 8 will succeed:

There is a t1+2log(9/ε)t\geq 1+2\sqrt{\log(9/\varepsilon)} such that

By Lemma 8.27, there exists a t1t\geq 1 such that

Using the definition of TT, the points when x=μSPx=\mu^{S^{\prime}_{P}} or x=μSQx=\mu^{S^{\prime}_{Q}} do not contribute to this probability so all points in TT that satisfy r(xμSP)>2tr\cdot(x-\mu^{S^{\prime}_{P}})>2t come from EE. Since ESE\subset S^{\prime} and S=T|S^{\prime}|=|T|, we have

Noting that E/S4ε/3|E|/|S^{\prime}|\leq 4\varepsilon/3, all except a 4ε/34\varepsilon/3 fraction of points xTx\in T have r(xμSP)=O(δ2)r\cdot(x-\mu^{S^{\prime}_{P}})=O(\delta^{2}). So, 4ε/312exp((t1)2/4)4\varepsilon/3\geq 12\exp(-(t-1)^{2}/4). Therefore, t1+2log(9/ε)t\geq 1+2\sqrt{\log(9/\varepsilon)}.

Thus, by Lemma 8.28, we have (1E/S)PrXuSPSQ(r(XμSP)>t)<2ε/3(1-|E|/|S^{\prime}|)\Pr_{X\in_{u}S^{\prime}_{P}\cup S^{\prime}_{Q}}\left(r\cdot(X-\mu^{S^{\prime}_{P}})>t\right)<2\varepsilon/3. Again, using that E/S4ε/3|E|/|S^{\prime}|\leq 4\varepsilon/3, we have that

Consequently, if xx satisfies r(xμSP)>2tr\cdot(x-\mu^{S^{\prime}_{P}})>2t, then it satisfies PrYuS(r(xY)t)<2ε\Pr_{Y\in_{u}S^{\prime}}\left(r\cdot(x-Y)\leq t\right)<2\varepsilon. Substituting this condition into Equation (43) gives the lemma. ∎

Again we need to show that any filter does not remove too many points of SS. We need to show this for an arbitrary rr, not just one nearly parallel to μSPμSQ\mu^{S^{\prime}_{P}}-\mu^{S^{\prime}_{Q}}.

For any unit vector rr^{\prime} and t2log(1/ε)t\geq 2\sqrt{\log(1/\varepsilon)}, we have

Every point xx with r(xp)t/2|r^{\prime}\cdot(x-p)|\leq t/2 has r(xy)t|r^{\prime}\cdot(x-y)\leq t| for all yy with r(yp)t/2|r^{\prime}\cdot(y-p)|\leq t/2. Thus, for xx with r(xp)t/2|r^{\prime}\cdot(x-p)|\leq t/2, we have

When t2log(1/ε)t\geq 2\sqrt{\log(1/\varepsilon)}, we have

Thus, we have PrYuS(r(xY)t)4ε>2ε.\Pr_{Y\in_{u}S^{\prime}}(r\cdot(x-Y)\leq t)\geq 4\varepsilon>2\varepsilon.

But inequality (44) gives a bound on the number of xx in SPS_{P} that do not satisfy this condition. That is,

Similarly, every point xx with r(xq)t/2|r^{\prime}\cdot(x-q)|\leq t/2 has

Dividing by S|S^{\prime}| and noting that S(1+2ε)S(3/2)S|S|\leq(1+2\varepsilon)|S^{\prime}|\leq(3/2)|S^{\prime}| completes the proof. ∎

Now, we can show that the filter improves Δ(S,S),\Delta(S,S^{\prime\prime}), and such that the algorithm is correct in the filter case.

If we reach Step 9 and return SS^{\prime\prime}, then Δ(S,S)Δ(S,S)2ε/d.\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.

We can write S=SPSQES^{\prime\prime}=S^{\prime\prime}_{P}\cup S^{\prime\prime}_{Q}\cup E^{\prime\prime}, where EE^{\prime\prime} has disjoint support from SPSPS_{P}\setminus S^{\prime\prime}_{P} and SQSQS_{Q}\setminus S^{\prime\prime}_{Q}. Note that, since we have SSS^{\prime\prime}\subset S^{\prime}, we can define these sets such that SPSPS^{\prime\prime}_{P}\subseteq S^{\prime}_{P}, SQSQS^{\prime\prime}_{Q}\subseteq S^{\prime}_{Q} and EEE^{\prime\prime}\subseteq E. We assume that we do. Now we have that

In Step 8, we found a vector rr and t1+2log(1/ε)t\geq 1+2\sqrt{\log(1/\varepsilon)} such that

Then in Step 9, we remove at least a 12exp(t2/4)+3ε/d12\exp(-t^{2}/4)+3\varepsilon/d fraction of points. That is,

The fact that t1+2log(1/ε)t\geq 1+2\sqrt{\log(1/\varepsilon)} allows us to use Lemma 8.30, with r=rr^{\prime}=r, yielding that:

since SS(1Δ(S,S))(12ε)S5S/6|S^{\prime}|\geq|S|(1-\Delta(S,S^{\prime}))\geq(1-2\varepsilon)|S|\geq 5|S|/6. ∎

Acknowledgements

J.L. would like to thank Michael B. Cohen and Samuel B. Hopkins for some very helpful discussions. We thank the reviewers for their detailed feedback, and Lili Su for pointing out an error in the proof of Lemma 5.3.

References

Appendix A Deferred Proofs from Section 4

This section contains deferred proofs of several concentration inequalities.

Therefore, by the triangle inequality, we have

Observe that the first term on the right hand side does not depend on the choice of JJ. Let E1E_{1} denote the event that

By Lemma 2.22, this happens with probability 1τ1-\tau so long as

For any J[n]J\subset[n] so that J=(1ε)n|J|=(1-\varepsilon)n, let E2(J)E_{2}(J) denote the event that

Fix any such JJ. By multiplying both sides by ρ=(1ε)/ε\rho=(1-\varepsilon)/\varepsilon, the event E2(J)E_{2}(J) is equivalent to the event that

Let A,BA,B be as in Lemma 2.22. Observe that ρδ1=Ω(log1/ε)1\rho\delta_{1}=\Omega(\log 1/\varepsilon)\geq 1 for ε\varepsilon sufficiently small. Then, by Lemma 2.22, we have that for any fixed JJ,

Let H(ε)H(\varepsilon) denote the binary entropy function. We now have

as claimed, where (a) follows by a union bound over all sets JJ of size (1ε)N(1-\varepsilon)N, (b) follows from the bound log(nεn)εH(ε)\log\binom{n}{\varepsilon n}\leq\varepsilon H(\varepsilon), (c) follows since H(ε)=O(εlog1/ε)H(\varepsilon)=O(\varepsilon\log 1/\varepsilon) as ε0\varepsilon\rightarrow 0, (d) follows from our choice of δ1\delta_{1}, and (e) follows from our choice of nn. This completes the proof. \hfill\qed\hfill\qed

Proof of Theorem 4.12: We first recall Isserlis’ theorem, which we will require in this proof.

where the \sum\prod is over all matchings of {1,,k}\{1,\ldots,k\}.

Since AA is symmetric, it has a eigenvalue expansion A=i=1dλiuiuiTA=\sum_{i=1}^{d}\lambda_{i}u_{i}u_{i}^{T}, which immediately implies that v=i=1dλiuiuiv=\sum_{i=1}^{d}\lambda_{i}u_{i}\otimes u_{i}. Let XN(0,Σ)X\sim\operatorname{\mathcal{N}}(0,\Sigma). We compute the quadratic form:

where the last line follows by invoking Isserlis’s theorem. We now manage both sums individually. We have

Proof of Corollary 4.9: Let Sm={S[N]:S=m}\mathfrak{S}_{m}=\{S\subseteq[N]:|S|=m\} denote the set of subsets of [N][N] of size mm. The same Bernstein-style analysis as in the proof of Lemma 4.3 yields that there exist universal constants A,BA,B so that:

Thus, union bounding over all m{1,,εN}m\in\{1,\ldots,\varepsilon N\} yields that

by the same manipulations as in the proof of Lemma 4.3. \hfill\qed\hfill\qed

This follows immediately from Lemmas 5.17 and 5.20.

Appendix B Deferred Proofs from Section 5

Proof of Lemma 5.3: Let N=Ω((d/ε2)polylog(d/ετ))N=\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau)) be the number of samples drawn from GG. For (i), the probability that a coordinate of a sample is at least 2νlog(Nd/3τ)\sqrt{2\nu\log(Nd/3\tau)} is at most τ/3dN\tau/3dN by Fact 5.6. By a union bound, the probability that all coordinates of all samples are smaller than 2νlog(Nd/3τ)\sqrt{2\nu\log(Nd/3\tau)} is at least 1τ/31-\tau/3. In this case, x22νdlog(Nd/3τ)=O(dνlog(Nν/τ))\|x\|_{2}\leq\sqrt{2\nu d\log(Nd/3\tau)}=O(\sqrt{d\nu\log(N\nu/\tau)}).

After translating by μG\mu^{G}, we note that (iii) follows immediately from Lemma 2.21 and (iv) follows from Theorem 5.50 of [Ver10], as long as N=Ω(ν4dlog(1/τ)/ε2)N=\Omega(\nu^{4}d\log(1/\tau)/\varepsilon^{2}), with probability at least 1τ/31-\tau/3. It remains to show that, conditioned on (i), (ii) holds with probability at least 1τ/31-\tau/3.

To simplify some expressions, let δ:=ε/(log(dlogd/ετ))\delta:=\varepsilon/(\log(d\log d/\varepsilon\tau)) and R=Cdlog(S/τ)R=C\sqrt{d\log(|S|/\tau)}. We need to show that for all unit vectors vv and all 0TR0\leq T\leq R that

Firstly, we show that for all unit vectors vv and T>0T>0

with probability at least 1τ/61-\tau/6. Since the VC-dimension of the set of all halfspaces is d+1d+1, this follows from the VC inequality [DL01], since we have more than Ω(d/(δ/(10νlog(1/δ))2)\Omega(d/(\delta/(10\nu\log(1/\delta))^{2}) samples. We thus only need to consider the case when T10νln(1/δ)T\geq\sqrt{10\nu\ln(1/\delta)}.

For any fixed unit vector vv and T>10νln(1/δ)T>\sqrt{10\nu\ln(1/\delta)}, except with probability exp(Nδ/(6Cν))\exp(-N\delta/(6C\nu)), we have that

Let EE be the event that v(XμG)>T|v\cdot(X-\mu^{G})|>T. Since GG is sub-gaussian, Fact 5.6 yields that PrG[E]=PrYG[v(XμG)>T]exp(T2/(2ν))\Pr_{G}[E]=\Pr_{Y\sim G}[|v\cdot(X-\mu^{G})|>T]\leq\exp(-T^{2}/(2\nu)). Note that, thanks to our assumption on TT, we have that Texp(T2/(4ν))/2CT\leq\exp(T^{2}/(4\nu))/2C, and therefore T2PrG[E]exp(T2/(4ν))/2Cδ/2CT^{2}\Pr_{G}[E]\leq\exp(-T^{2}/(4\nu))/2C\leq\delta/2C.

where (a) follows from sub-gaussianity, (b) follows from our choice of TT, and (c) comes from the fact that 1+xex1+x\leq e^{x} for all xx.

Thus, if δ\delta is a sufficiently small constant and CC is sufficiently large, this yields the desired bound. ∎

Now let C\mathcal{C} be a 1/21/2-cover in Euclidean distance for the set of unit vectors of size 2O(d)2^{O(d)}. By a union bound, for all vCv^{\prime}\in\mathcal{C} and TT^{\prime} a power of 2 between 4νln(1/δ)\sqrt{4\nu\ln(1/\delta)} and RR, we have that

Then, by a union bound, (46) holds simultaneously for all unit vectors vv and all 0TR0\leq T\leq R, with probability a least 1τ/31-\tau/3. This completes the proof. \hfill\qed\hfill\qed

B.2 Proof of Lemma 5.16

Let P2=UTΛUP_{2}=U^{T}\Lambda U, where UU is orthogonal and Λ\Lambda is diagonal be an eigen-decomposition of the symmetric matrix P2P_{2}. Then, p(x)=(UΣ1/2x)TP2(UΣ1/2x)p(x)=(U\Sigma^{-1/2}x)^{T}P_{2}(U\Sigma^{-1/2}x). Let XGX\sim G and Y=UΣ1/2XY=U\Sigma^{-1/2}X. Then, YN(0,I)Y\sim\mathcal{N}(0,I) and p(X)=iλiYi2+p0p(X)=\sum_{i}\lambda_{i}Y_{i}^{2}+p_{0} for independent Gaussians YiY_{i}. Thus, p(X)p(X) follows a generalized χ2\chi^{2}-distribution.

Let Z=iaiYi2Z=\sum_{i}a_{i}Y_{i}^{2}, where YiY_{i} are independent random variables distributed as N(0,1)\mathcal{N}(0,1). Let aa be the vector with coordinates aia_{i}. Then,

Noting that 2a=1+a(1a)21+a2\sqrt{a}=1+a-(1-\sqrt{a})^{2}\leq 1+a for a>0a>0, we have

Applying this for p(x)-p(x) instead of p(x)p(x) and putting these together, we get

Substituting t=T/3P2F1/3t=T/3\|P_{2}\|_{F}-1/3, and 2P2F2=VarXG(p(X))2\|P_{2}\|_{F}^{2}=\operatorname*{Var}_{X\sim G}(p(X)) gives:

The final property is a consequence of the following anti-concentration inequality:

B.3 Proof of Lemma 5.17

Proof of Lemma 5.17: Firstly, we note that it suffices to prove this for the case Σ=I\Sigma=I, since for XN(0,Σ)X\sim\operatorname{\mathcal{N}}(0,\Sigma), Y=Σ1/2XY=\Sigma^{-1/2}X is distributed as N(0,I)\operatorname{\mathcal{N}}(0,I), and all the conditions transform to those for G=N(0,I)G=\operatorname{\mathcal{N}}(0,I) under this transformation.

Condition 1 follows by standard concentration bounds on x22\|x\|_{2}^{2}. Condition 2 follows by estimating the entry-wise error between \mboxCov(S)\mbox{Cov}(S) and II. These two conditions hold by Lemma 5.3, since they follow from (i), (iii), and (iv) of (ε,τ)(\varepsilon,\tau) goodness in the sense of Definition 5.2.

Condition 4 is substantially the most difficult of these conditions to prove. Naively, we would want to find a cover of all possible pp and all possible TT, and bound the probability that the desired condition fails. Unfortunately, the best a priori bound on Pr(p(G)>T)\Pr(|p(G)|>T) are on the order of exp(T)\exp(-T). As our cover would need to be of size 2d22^{d^{2}} or so, to make this work with T=dT=d, we would require on the order of d3d^{3} samples in order to make this argument work.

However, we will note that this argument is sufficient to cover the case of T<10log(1/ε)log2(d/ε)T<10\log(1/\varepsilon)\log^{2}(d/\varepsilon).

Let Pk\mathcal{P}_{k} be the set of even, mean-, degree-22 polynomials, such that the associated matrix AA satisfies:

Note that for pPkp\in\mathcal{P}_{k} that p(x)x2/k+k|p(x)|\leq|x|^{2}/\sqrt{k}+\sqrt{k}.

Importantly, any polynomial can be written in terms of these sets.

Let AA be the associated matrix to pp. Note that AF=Varp=1\|A\|_{F}=\operatorname*{Var}{p}=1. Let AkA_{k} be the matrix corresponding to the top kk eigenvalues of AA. We now let p1p_{1} be the polynomial associated to A1/2A_{1}/2, p2p_{2} be associated to (A2A1)/2(A_{2}-A_{1})/2, p4p_{4} be associated to (A4A2)/2(A_{4}-A_{2})/2, and so on. It is clear that p=2(p1+p2++p2t+pd)p=2(p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d}). It is also clear that the matrix associated to pkp_{k} has rank at most kk. If the matrix associated to pkp_{k} had an eigenvalue more than 1/k1/\sqrt{k}, it would need to be the case that the k/2ndk/2^{nd} largest eigenvalue of AA had size at least 2/k2/\sqrt{k}. This is impossible since the sum of the squares of the eigenvalues of AA is at most 11.

We will also need covers of each of these sets Pk\mathcal{P}_{k}. We will assume that condition (1) holds, i.e., that x2R\|x\|_{2}\leq\sqrt{R}, where R=O(dlog(d/ετ))R=O(d\log(d/\varepsilon\tau)). Under this condition, p(x)p(x) cannot be too large and this affects how small a variance polynomial we can ignore.

For each kk, there exists a set CkPk\mathcal{C}_{k}\subset\mathcal{P}_{k} such that

For each pPkp\in\mathcal{P}_{k} there exists a qCkq\in\mathcal{C}_{k} such that Var(p(G)q(G))1/R2d2\operatorname*{Var}(p(G)-q(G))\leq 1/R^{2}d^{2}.

We note that any such pp is associated to a matrix AA of the form A=i=1kλiviviTA=\sum_{i=1}^{k}\lambda_{i}v_{i}v_{i}^{T}, for λi[0,1/k]\lambda_{i}\in[0,1/\sqrt{k}] and viv_{i} orthonormal. It suffices to let qq correspond to the matrix A=i=1kμiwiwiTA^{\prime}=\sum_{i=1}^{k}\mu_{i}w_{i}w_{i}^{T} for with λiμi<1/R2d3|\lambda_{i}-\mu_{i}|<1/R^{2}d^{3} and viwi<1/R2d3|v_{i}-w_{i}|<1/R^{2}d^{3} for all ii. It is easy to let μi\mu_{i} and wiw_{i} range over covers of the interval and the sphere with appropriate errors. This gives a set of possible qq’s of size 2O(dklogR)2^{O(dk\log R)} as desired. Unfortunately, some of these qq will not be in Pk\mathcal{P}_{k} as they will have eigenvalues that are too large. However, this is easily fixed by replacing each such qq by the closest element of Pk\mathcal{P}_{k}. This completes our proof. ∎

We next will show that these covers are sufficient to express any polynomial.

Combining the above two lemmata we have that any such pp can be written as

where qkq_{k} above is in Ck\mathcal{C}_{k} and Var[pk(G)]<1/R2d2\operatorname*{Var}[p_{k}(G)]<1/R^{2}d^{2}. Thus, p=p1+p2++p2t+pdp^{\prime}=p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d} has Var[p(G)]O(1/R2)\operatorname*{Var}[p^{\prime}(G)]\leq O(1/R^{2}). This completes the proof. ∎

The key observation now is that if p(x)T|p(x)|\geq T for x2d/ε\|x\|_{2}\leq\sqrt{d/\varepsilon}, then writing p=q1+q2+q4++qd+pp=q_{1}+q_{2}+q_{4}+\ldots+q_{d}+p^{\prime} as above, it must be the case that qk(x)>(T1)/(2log(d))|q_{k}(x)|>(T-1)/(2\log(d)) for some kk. Therefore, to prove our main result, it suffices to show that, with high probability over the choice of SS, for any T10log(1/ε)log2(d/ε)T\geq 10\log(1/\varepsilon)\log^{2}(d/\varepsilon) and any qCkq\in\mathcal{C}_{k} for some kk, that PrxuS(q(x)>T/(2log(d)))<ε/(2T2log2(T)log(d))\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log(d)). Equivalently, it suffices to show that for T10log(1/ε)log(d/ε)T\geq 10\log(1/\varepsilon)\log(d/\varepsilon) it holds PrxuS(q(x)>T/(2log(d)))<ε/(2T2log2(T)log2(d))\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log^{2}(d)). Note that this holds automatically for T>RT>R, as p(x)p(x) cannot possibly be that large for x2R\|x\|_{2}\leq\sqrt{R}. Furthermore, note that losing a constant factor in the probability, it suffices to show this only for TT a power of 22.

Therefore, it suffices to show for every kdk\leq d, every qCkq\in\mathcal{C}_{k} and every R/kTlog(1/ε)logRR/\sqrt{k}\gg T\gg\log(1/\varepsilon)\log R that with probability at least 1τ2Ω(dklogR)1-{\tau}2^{-\Omega(dk\log R)} over the choice of SS we have that PrxuS(q(x)>T)ε/(T2log4(R))\Pr_{x\in_{u}S}(|q(x)|>T)\ll\varepsilon/(T^{2}\log^{4}(R)). However, by the Hanson-Wright inequality, we have that

Therefore, by Chernoff bounds, the probability that more than a ε/(T2log4R)\varepsilon/(T^{2}\log^{4}R)-fraction of the elements of SS satisfy this property is at most

Appendix C Deferred Proofs from Section 6

Construct an auxiliary graph on kk vertices, where we put an edge between nodes rir_{i} and ri+1r_{i+1}. By the above, there must be a path from jj to jj^{\prime} in this graph. Since this graph has kk nodes, there must be a path of length at most kk from jj to jj^{\prime}; moreover, by the above, we know that this implies that μjμj22O(kdlogk/ε)\|\mu_{j}-\mu_{j^{\prime}}\|_{2}^{2}\leq O(kd\log k/\varepsilon).

Finally, the fourth property follows from the same argument as the proof of the third. \hfill\qed\hfill\qed

Proof of Lemma 6.9: Let C=i=1Nwi(Xiμ)(Xiμ)TIC=\sum_{i=1}^{N}w_{i}(X_{i}-\mu)(X_{i}-\mu)^{T}-I. Let vv be the top eigenvector of

In particular, since vT(i=1Nwi(Xiμ)(Xiμ)T(I+Q))vckh(k,γ,δ)\left|v^{T}\left(\sum_{i=1}^{N}w_{i}(X_{i}-\mu)(X_{i}-\mu)^{T}-(I+Q)\right)v\right|\geq ckh(k,\gamma,\delta), we must have

Let U=[v,u1,,ud1]U=[v,u_{1},\ldots,u_{d-1}] be an d×kd\times k matrix with orthonormal columns, where the columns span the set of vectors {(μjμ) : j[k]}{v}\left\{(\mu_{j}-\mu)\ :\ j\in[k]\right\}\cup\{v\}. We note the rank of this set is at most kk due to the definition of μ\mu.

Using the dual characterization of the Schatten top-kk norm, we have that

Observe that since \mboxspan(Q)\mboxspan(U)\mbox{span}(Q)\subseteq\mbox{span}(U), we have

Proof of Lemma 6.10: By Fact 4.2 and (34) we have i=GwiwgXiμ2k1/2δ2\|\sum_{i=G}\frac{w_{i}}{w_{g}}X_{i}-\mu\|_{2}\leq k^{1/2}\delta_{2}. Now, by the triangle inequality, we have

Using the fact that variance is nonnegative we have

where in the last inequality we have used Fact 4.2 and (33). Hence altogether this implies that

Once more, let Δ=μμ^\Delta=\mu-\hat{\mu} and expand the formula for MM:

Suppose that w=ww=w^{*}. Then MTki[k]γj+ckh(k,γ,δ1)2\|M\|_{T_{k}}\leq\sum_{i\in[k]}\gamma_{j}+\frac{ckh(k,\gamma,\delta_{1})}{2}.

ww^{\ast} are the weights that are uniform on the uncorrupted points. Because E2εNE\leq 2\varepsilon N, we have that wSN,εw^{\ast}\in S_{N,\varepsilon}. Using (33), this implies that wCf(k,γ,δ1)w^{*}\in\mathcal{C}_{f(k,\gamma,\delta_{1})}. By Corollary 6.11, Δ2O(εlog1/ε)\|\Delta\|_{2}\leq O(\varepsilon\sqrt{\log 1/\varepsilon}).

Suppose that w∉Cckh(k,γ,δ)w\not\in\mathcal{C}_{ckh(k,\gamma,\delta)}. Then MTk>i[k]γj+ckh(k,γ,δ1)2\|M\|_{T_{k}}>\sum_{i\in[k]}\gamma_{j}+\frac{ckh(k,\gamma,\delta_{1})}{2}.

We split into two cases. In the first case, Δ22ckh(k,γ,δ)10\|\Delta\|_{2}^{2}\leq\frac{ckh(k,\gamma,\delta)}{10}. By Lemma 6.9, we have that

In the other case, Δ22ckh(k,γ,δ)10\|\Delta\|_{2}^{2}\geq\frac{ckh(k,\gamma,\delta)}{10}. Recall that Q=j[k]αj(μjμ)(μjμ)TQ=\sum_{j\in[k]}\alpha_{j}(\mu_{j}-\mu)(\mu_{j}-\mu)^{T} from (29). Write MM as follows:

Now taking the Schatten top-kk norm of MM, we have

The first inequality is the triangle inequality, the second is by (33) and Fact 4.2, the third is because the summed matrices are positive semidefinite, the fourth follows from Lemma 6.10, and the fifth holds for all cc sufficiently large. ∎

If Δ22ckh(k,γ,δ)10\|\Delta\|_{2}^{2}\leq\frac{ckh(k,\gamma,\delta)}{10}, then

But this is true for cc sufficiently large, as h(k,γ,δ)δ\sqrt{h(k,\gamma,\delta)}\geq\sqrt{\delta}. Therefore,

where the second inequality holds since Λ>j[k]γj+ckh(k,γ,δ)2\Lambda>\sum_{j\in[k]}\gamma_{j}+\frac{ckh(k,\gamma,\delta)}{2}.

On the other hand, consider when Δ22ckh(k,γ,δ)10\|\Delta\|_{2}^{2}\geq\frac{ckh(k,\gamma,\delta)}{10}. By (47), we know that

The first and third terms are immediately dominated by Ω(Δ22ε)\Omega\left(\frac{\|\Delta\|^{2}_{2}}{\varepsilon}\right), it remains to show that

Or equivalently, k1/2δε=O(Δ2).k^{1/2}\delta\varepsilon=O\left(\|\Delta\|_{2}\right). This follows since

Appendix D Deferred Proofs from Section 7

Proof of Lemma 7.21: By Lemma 7.6 applied with ε:=ε3/2/10d\varepsilon^{\prime}:=\varepsilon^{3/2}/10d in place of ε,\varepsilon, since we have Ω(d4log(1/τ)/ε2)\Omega(d^{4}\log(1/\tau)/\varepsilon^{\prime 2}) samples from P,P, with probability at least 1τ,1-\tau, the set SS is such that for all affine functions L,L, it holds PrXuS(L(X)0)PrXP(L(X)0)ε/d.|\Pr_{X\in_{u}S}(L(X)\geq 0)-\Pr_{X\sim P}(L(X)\geq 0)|\leq\varepsilon^{\prime}/d. We henceforth condition on this event.

Let CTC_{T} be the event that all coordinates in TT take their most common value. For a single coordinate i,i, the probability that it does not take its most common value, min{pi,1pi},\min\{p_{i},1-p_{i}\}, satisfies

Thus, by a union bound, we have that PrP(CT)3/5.\Pr_{P}(C_{T})\geq 3/5. Let #T(x)\#_{T}(x) be the number of coordinates of xx in TT which do not have their most common value, and observe that #T(x)\#_{T}(x) is an affine function of x.x. Noting that for x{0,1}dx\in\{0,1\}^{d}, we have that 1#T(x)>01-\#_{T}(x)>0 if and only if CTC_{T} holds for x,x, it follows that PrS(CT)PrP[CT]ε/d.|\Pr_{S}(C_{T})-\Pr_{P}[C_{T}]|\leq\varepsilon^{\prime}/d. Hence, we have that PrS(CT)1/2.\Pr_{S}(C_{T})\geq 1/2.

Note that for x{0,1}dx\in\{0,1\}^{d}, we have that LT(x)>0L_{T}(x)>0 if and only if L(x)>0L(x)>0 and CTC_{T} holds for x.x. Therefore, we can write

This completes the proof of Lemma 7.21. \hfill\qed\hfill\qed

Appendix E Deferred Proofs from Section 8

Proof of Lemma 8.6: Let SPSS_{P}\subseteq S be the set of samples drawn from PP and SQSS_{Q}\subseteq S be the set of samples drawn from QQ. Firstly, we note that by a Chernoff bound, SP/SαO(ε/d2)\left||S_{P}|/|S|-\alpha\right|\leq O(\varepsilon/d^{2}) with probability 1τ/31-\tau/3. Assuming this holds, it follows that SP(α/2)S(ε1/6/2)S=Ω(d4log(1/τ)/ε2)|S_{P}|\geq(\alpha/2)|S|\geq(\varepsilon^{1/6}/2)|S|=\Omega(d^{4}\log(1/\tau)/\varepsilon^{2}). Similarly, SQ(1α)S/2Ω(d4log(1/τ)/ε2)|S_{Q}|\geq(1-\alpha)|S|/2\geq\Omega(d^{4}\log(1/\tau)/\varepsilon^{2}). By Lemma 7.6 applied with ε:=(c2/4)ε\varepsilon^{\prime}:=(c^{2}/4)\cdot\varepsilon in place of ε\varepsilon, since we have Ω((d4+d2log(τ))/ε2)\Omega((d^{4}+d^{2}\log(\tau))/\varepsilon^{\prime 2}) samples, with probability 1τ/31-\tau/3, the set SPS_{P} is ε\varepsilon^{\prime}-good for PP, i.e., it satisfies that for all affine functions LL, PrXuSP(L(X)>0)PrXP(L(X)>0)ε/d.|\Pr_{X\in_{u}S_{P}}(L(X)>0)-\Pr_{X\sim P}(L(X)>0)|\leq\varepsilon^{\prime}/d. We show that assuming SS is ε\varepsilon^{\prime}-good, it is (ε,i)(\varepsilon,i) good for each 1id1\leq i\leq d.

Note that PrXP[Xi=1]=pic\Pr_{X\sim P}[X_{i}=1]=p_{i}\geq c and PrXP[Xi=0]=1pic\Pr_{X\sim P}[X_{i}=0]=1-p_{i}\geq c. Since PrXP[Xi=1]PrXuSP[Xi=1]c2ε/(4d)|\Pr_{X\sim P}[X_{i}=1]-\Pr_{X\in_{u}S_{P}}[X_{i}=1]|\leq c^{2}\varepsilon/(4d), it follows that PrXuS[Xi=1]c/2\Pr_{X\in_{u}S}[X_{i}=1]\geq c/2. For any affine function LL, define L(0)(x):=L(x)xi(maxyL(y))L^{(0)}(x):=L(x)-x_{i}(\max_{y}|L(y)|) and L(1)(x):=L(x)(1xi)(maxyL(y))L^{(1)}(x):=L(x)-(1-x_{i})(\max_{y}|L(y)|). Then, we have the following:

So, we have that SPS_{P} is (ε,i)(\varepsilon,i) good for PP for all 1id1\leq i\leq d with probability 1τ/31-\tau/3 . Similarly, SQS_{Q} is (ε,i)(\varepsilon,i) good for QQ for all 1id1\leq i\leq d with probability 1τ/31-\tau/3. Thus, we have that SP/Sαε/d2||S_{P}|/|S|-\alpha|\leq\varepsilon/d^{2}, SPS_{P} is (ε,i)(\varepsilon,i) good for PP and SQS_{Q} is (ε,i)(\varepsilon,i) good for QQ for all 1id1\leq i\leq d with probability 1τ1-\tau. That is, SS is (ε,i)(\varepsilon,i)-good for Π\Pi for all 1id1\leq i\leq d with probability at least 1τ1-\tau. \hfill\qed\hfill\qed

Proof of Lemma 8.19: Let SPSS_{P}\subseteq S be the set of samples drawn from PP and SQSS_{Q}\subseteq S be the set of samples drawn from QQ. Firstly, we note that by a Chernoff bound, SP/SαO(ε/d2)||S_{P}|/|S|-\alpha|\leq O(\varepsilon/d^{2}) with probability at least 1τ/31-\tau/3. Assuming this holds, SP(α/2)SδS=Ω(d4log(1/τ)/ε2)|S_{P}|\geq(\alpha/2)|S|\geq\delta|S|=\Omega(d^{4}\log(1/\tau)/\varepsilon^{2}). Similarly, SQ(1α)S/2Ω(d4log(1/τ)/ε2)|S_{Q}|\geq(1-\alpha)|S|/2\geq\Omega(d^{4}\log(1/\tau)/\varepsilon^{2}).

By Lemma 7.6 applied with ε:=ε/6\varepsilon^{\prime}:=\varepsilon/6, since we have Ω(d4log(1/τ)/ε2)\Omega(d^{4}\log(1/\tau)/\varepsilon^{\prime 2}) samples, with probability at least 1τ/31-\tau/3, the set SPS_{P} is ε\varepsilon-good for PP. Similarly, with probability at least 1τ/31-\tau/3, the set SQS_{Q} is ε\varepsilon-good for QQ. Thus, with probability 1τ1-\tau, we have that SPSαε\left|\frac{|S_{P}|}{|S|}-\alpha\right|\leq\varepsilon and that SPS_{P} and SQS_{Q} are ε\varepsilon-good for PP and QQ respectively. \hfill\qed\hfill\qed

Noting that the mean of TT is μ\mu and T=S|T|=|S^{\prime}|, we have:

Since PP and QQ are product distributions, \mboxCov(SP)\mbox{Cov}(S^{\prime}_{P}) and \mboxCov(SQ)\mbox{Cov}(S^{\prime}_{Q}) can have large diagonal elements but small off-diagonal ones. On the other hand, we bound the elements on the diagonal of \mboxCov(T)\mbox{Cov}(T), but \mboxCov(T)2\|\mbox{Cov}(T)\|_{2} may still be large due to off-diagonal elements.

By the triangle inequality, and Equation (48) with zeroed diagonal, we have:

We will bound each of these terms separately.

Note that \mboxCov0(T)\mboxCov(T)\mbox{Cov}_{0}(T)-\mbox{Cov}(T) is a diagonal matrix and its non-zero entries are

Note that μ\mu satisfies Sμ=SPμSP+SQμSQ+EμE|S^{\prime}|\mu=|S^{\prime}_{P}|\mu^{S^{\prime}_{P}}+|S^{\prime}_{Q}|\mu^{S^{\prime}_{Q}}+|E|\mu^{E}. Since SE=SP+SQ,|S^{\prime}|-|E|=|S^{\prime}_{P}|+|S^{\prime}_{Q}|, we have (SE)(μμSP)=SQ(μSQμSP)+E(μEμ).(|S^{\prime}|-|E|)(\mu-\mu^{S^{\prime}_{P}})=|S^{\prime}_{Q}|(\mu^{S^{\prime}_{Q}}-\mu^{S^{\prime}_{P}})+|E|(\mu^{E}-\mu). Using that SE=(1+O(ε))S|S^{\prime}|-|E|=(1+O(\varepsilon))|S|, SQ=(1α)SO(ε)|S_{Q}^{\prime}|=(1-\alpha)|S|-O(\varepsilon), EO(ε)S|E|\leq O(\varepsilon)|S|, we have

Since SS is ε\varepsilon-good for Π\Pi, it follows that μSPpε/d\|\mu^{S_{P}}-p\|_{\infty}\leq\varepsilon/d and μSQqε/d.\|\mu^{S_{Q}}-q\|_{\infty}\leq\varepsilon/d. Also,

Finally, pqδ.\|p-q\|_{\infty}\leq\delta. Thus, by the triangle inequality, we get

We have the following sequence of inequalities:

It remains to bound the (SPS)\mboxCov0(SP)2+(SQS)\mboxCov0(SQ)2\left(\frac{|S^{\prime}_{P}|}{|S^{\prime}|}\right)\|\mbox{Cov}_{0}(S^{\prime}_{P})\|_{2}+\left(\frac{|S^{\prime}_{Q}|}{|S^{\prime}|}\right)\|\mbox{Cov}_{0}(S^{\prime}_{Q})\|_{2} terms in (49). To analyze the first of these terms, note that \mboxCov0(P)=0.\mbox{Cov}_{0}(P)=\mathbf{0}. We have that

Note that since SS is good, the (i,j)(i,j)-th entry of \mboxCov(SP)\mboxCov(P)\mbox{Cov}(S_{P})-\mbox{Cov}(P) has absolute value at most ε/d.\varepsilon/d. Thus,

Therefore, combining the above we have that

By the assumption on δ\delta in Theorem 8.17, δ2=Ω(εlog(1/ε))\delta^{2}=\Omega(\varepsilon\log(1/\varepsilon)), and the proof is complete. ∎