Robust Estimators in High Dimensions without the Computational Intractability

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, Alistair Stewart

Introduction

A central goal of machine learning is to design efficient algorithms for fitting a model to a collection of observations. In recent years, there has been considerable progress on a variety of problems in this domain, including algorithms with provable guarantees for learning mixture models [FOS08, KMV10, MV10, BS10, HK13], phylogenetic trees [CGG02, MR05], HMMs [AHK12], topic models [AGM12, AGHK13], and independent component analysis [AGMS15]. These algorithms crucially rely on the assumption that the observations were actually generated by a model in the family. However, this simplifying assumption is not meant to be exactly true, and it is an important direction to explore what happens when it holds only in an approximate sense. In this work, we study the following family of questions:

This is a natural formalization of the problem of designing robust and efficient algorithms for distribution estimation. We refer to it as (proper) agnostic distribution learning and we refer to the samples as being $\varepsilon$ -corrupted. This family of problems has its roots in many fields, including statistics, machine learning, and theoretical computer science. Within computational learning theory, it is related to the agnostic learning model of Haussler [Hau92] and Kearns, Schapire, and Sellie [KSS94], where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible, among all functions in some given class. In the even more challenging malicious noise model [Val85, KL93], an adversary is allowed to corrupt both the labels and the samples. A major difference with our setting is that these models apply to supervised learning problems, while here we will work in an unsupervised setting.

Within statistics and machine learning, inference problems like Question 1.1 are often termed “estimation under model misspecification.” The usual prescription is to use the maximum likelihood estimator [Hub67, Whi82], which is unfortunately hard to compute in general. Even ignoring computational considerations, the maximum likelihood estimator is only guaranteed to converge to the distribution $P^{\prime}$ in $\mathcal{D}$ that is closest (in Kullback-Leibler divergence) to the distribution from which the observations are generated. This is problematic because such a distribution is not necessarily close to $P$ at all.

A branch of statistics – called robust statistics [HR09, HRRS86] – aims to tackle questions like the one above. The usual formalization is in terms of breakdown point, which (informally) is the fraction of observations that an adversary would need to control to be able to completely corrupt an estimator. In low-dimensions, this leads to the prescription that one should use the empirical median instead of the empirical mean to robustly estimate the mean of a distribution, and interquartile range for robust estimates of the variance. In high-dimensions, the Tukey median [Tuk75] is a high-dimensional analogue of the median that, although provably robust, is hard to compute [JP78]. Similar hardness results have been shown [Ber06, HM13] for essentially all known estimators in robust statistics.

Is high-dimensional agnostic distribution learning even possible, algorithmically? The difficulty is that corruptions are often hard to detect in high dimensions, and could bias the natural estimator by dimension-dependent factors. In this work, we study agnostic distribution learning for a number of fundamental classes of distributions: $(1)$ a single Gaussian, $(2)$ a product distribution on the hypercube $\{0,1\}^{d}$ , $(3)$ mixtures of two product distributions (under a natural balancedness condition), and $(4)$ mixtures of $k$ Gaussians with spherical covariances. Prior to our work, all known efficient algorithms (e.g., [LT15, BD15]) for these classes required the error guarantee, $f(\varepsilon,d)$ , to depend polynomially in the dimension $d$ . Hence, previous efficient estimators could only tolerate at most a $1/\operatorname{poly}(d)$ fraction of errors. In this work, we obtain the first efficient algorithms for the aforementioned problems, where $f(\varepsilon,d)$ is completely independent of $d$ and depends polynomially (often, nearly linearly) in the fraction $\varepsilon$ of corrupted samples. Our work is just a first step in this direction, and there are many exciting questions left to explore.

2 Our Techniques

All of our algorithms are based on a common recipe. The first question to address is the following: Even if we were given a candidate hypothesis $P^{\prime}$ , how could we test if it is $\varepsilon$ -close in total variation distance to $P$ ? The usual way to certify closeness is to exhibit a coupling between $P$ and $P^{\prime}$ that marginally samples from both distributions, where the samples produced from each agree with probability $1-\varepsilon$ . However, we have no control over the process by which samples are generated from $P$ , in order to produce such a coupling. And even then, the way that an adversary decides to corrupt samples can introduce complex statistical dependencies.

Unfortunately, in our agnostic setting, we cannot afford for (1) to depend on the dimension $d$ at all. Any such dependence would appear in the error guarantee of our algorithm. Instead, the starting point of our algorithms is a notion of parameter distance that satisfies

Given our notion of parameter distance satisfying (2), our main ingredient is an efficient method for robustly estimating the parameters. We provide two algorithmic approaches which are based on similar principles. Our first approach is faster, requiring only approximate eigenvalue computations. Our second approach relies on convex programming and achieves slightly better sample complexity, in some cases matching the information-theoretic limit. Notably, either approach can be used to give all of our concrete learning applications with nearly identical sample complexity and error guarantees. In what follows, we specialize to the problem of robustly learning the mean $\mu$ of a Gaussian whose covariance is promised to be the identity, which we will use to illustrate how both approaches operate. We emphasize that what is needed to learn the parameters in more general settings requires many additional ideas.

Our second algorithmic approach relies on convex programming. Here, instead of rejecting corrupted samples, we compute appropriate weights $w_{i}$ for the samples $X_{i}$ , such that the weighted empirical average $\widehat{\mu}_{w}=\mathop{\textstyle\sum}_{i=1}^{N}w_{i}X_{i}$ is close to $\mu$ . We work with the convex set:

We prove that any set of weights in $\mathcal{C}_{\delta}$ yields a good estimate $\widehat{\mu}_{w}=\mathop{\textstyle\sum}_{i=1}^{N}w_{i}X_{i}$ in the obvious way. The catch is that the set $\mathcal{C}_{\delta}$ is defined based on $\mu$ , which is unknown. Nevertheless, it turns out that we can use the same type of spectral arguments that underlie the filtering approach to design an approximate separation oracle for $\mathcal{C}_{\delta}$ . Combined with standard results in convex optimization, this yields an algorithm for robustly estimating $\mu$ .

The third and final ingredient is some new concentration bounds. In both of the approaches above, at best we are hoping that we can remove all of the corrupted points and be left with only the uncorrupted ones, and then use standard estimators (e.g., the empirical average) on them. However, an adversary could have removed an $\varepsilon$ -fraction of the samples in a way that biases the empirical average of the remaining uncorrupted samples. What we need are concentration bounds that show for sufficiently large $N$ , for samples $X_{1},X_{2},\ldots,X_{N}$ from a Gaussian with mean $\mu$ and identity covariance, that every set of $(1-\varepsilon)N$ samples produces a good estimate for $\mu$ . In some cases, we can derive such concentration bounds by appealing to known concentration inequalities and taking a union bound. However, in other cases (e.g., concentration bounds for degree-two polynomials of Gaussian random variables) the existing concentration bounds are not strong enough, and we need other arguments to prove that every set of $(1-\varepsilon)N$ samples produces a good estimate.

3 Our Results

We give the first efficient algorithms for agnostically learning several important distribution classes with dimension-independent error guarantees. Our first main result is for a single arbitrary Gaussian with mean $\mu$ and covariance $\Sigma$ , which we denote by $\operatorname{\mathcal{N}}(\mu,\Sigma)$ . In the previous subsection, we described our convex programming approach for learning the mean vector when the covariance is promised to be the identity. A technically more involved version of the technique can handle the case of zero mean and unknown covariance. More specifically, consider the following convex set, where $\Sigma$ is the unknown covariance matrix and $\left\|\cdot\right\|_{F}$ is the Frobenius norm:

We design an approximate separation oracle for this unknown convex set, by analyzing the spectral properties of the fourth moment tensor of a Gaussian. Combining these two intermediate results, we obtain our first main result (below). Throughout this paper, we will abuse notation and write $N\geq\widetilde{\Omega}(f(d,\varepsilon,\tau))$ when referring to our sample complexity, to signify that our algorithm works if $N\geq Cf(d,\varepsilon,\tau)\mbox{polylog}(f(d,\varepsilon,\tau))$ for a large enough universal constant $C$ .

We can alternatively establish Theorem 1.2 via our filtering technique. See Section 5. In the first version of our paper, our analysis required $N\gtrsim d^{3}\log^{2}(1/\tau)/\varepsilon^{2}$ samples. In [DKK+17], we showed that a simple adaptation of our algorithm and analysis achieves the improved sample complexity above, which is information-theoretically optimal up to logarithmic factors. We have incorporated this modification (along with the analysis) into this version of the paper, for the sake of completeness.

For the sake of simplicity in the presentation, we did not make an effort to optimize the sample complexity of our robust estimators in the above setting. We note that methods similar to the analysis of the Gaussian setting can lead to near-optimal sample complexity in this setting as well. We also remark that for the case of balanced binary product distributions, our algorithm achieves an error of $O(\varepsilon\sqrt{\log(1/\varepsilon)}).$

Interestingly enough, the above two distribution classes are trivial to learn in the noiseless case, but in the agnostic setting the learning problem turns out to be surprisingly challenging. Using additional ideas, we are able to generalize our agnostic learning algorithms to mixtures of the above classes under some natural conditions. We note that even in the noiseless case, learning mixtures of the above families is non-trivial. First, we study $2$ -mixtures of $c$ -balanced products, which stipulates that the coordinates of the mean vector of each component are in the range $(c,1-c)$ . We prove:

This generalizes the algorithm of Freund and Mansour [FM99] to the agnostic setting. An interesting open question is to improve the $\varepsilon$ -dependence in the above bound to (nearly) linear, or to remove the assumption of balancedness and obtain an agnostic algorithm for mixtures of two arbitrary product distributions.

Finally, we give an agnostic learning algorithm for mixtures of spherical Gaussians.

In total, these results give new robust and computationally efficient estimators for several well-studied distribution learning problems that can tolerate a constant fraction of errors independent of the dimension. This points to an interesting new direction of making robust statistics algorithmic. The general recipe we have developed here gives us reason to be optimistic about many other problems in this domain.

4 Discussion and Related Work

Our results fit in the framework of density estimation and parameter learning which are both classical problems in statistics with a rich history (see e.g., [BBBB72, DG85, Sil86, Sco92, DL01]). While these problems have been studied for several decades by different communities, the computational complexity of learning is still not well understood, even for some surprisingly simple distribution families. Most textbook estimators are hard to compute in general, especially in high-dimensional settings. In the past few decades, a rich body of work within theoretical computer science has focused on designing computationally efficient distribution learning algorithms. In a seminal work, Kearns, Mansour, Ron, Rubinfeld, Schapire, and Sellie [KMR+94] initiated a systematic investigation of the computational complexity of distribution learning. Since then, efficient learning algorithms have been developed for a wide range of distributions in both low and high-dimensions [Das99, FM99, AK01, VW02, CGG02, MR05, BV08, KMV10, MV10, BS10, DDS12, CDSS13, DDO+13, CDSS14a, CDSS14b, HP15, ADLS17, DDS15b, DDKT16, DKS16b, DKS16a].

We will be particularly interested in efficient learning algorithms for mixtures of high-dimensional Gaussians and mixtures of product distributions, as this is the focus of our algorithmic results in the agnostic setting. In a pioneering work, Dasgupta [Das99] introduced the problem of parameter estimation of a Gaussian mixture to theoretical computer science, and gave the first provably efficient algorithms under the assumption that the components are suitably well-separated. Subsequently, a number of works improved these separation conditions [AK01, VW02, BV08] and ultimately removing them entirely [KMV10, MV10, BS10]. In another line of work, Freund and Mansour [FM99] gave the first polynomial time algorithm for properly learning mixtures of two binary product distributions. This algorithm was substantially generalized to phylogenetic trees [CGG02] and to mixtures of any constant number of discrete product distributions [FOS08]. Given the vast body of work on high-dimensional distribution learning, there is a plethora of problems where one could hope to reconcile robustness and computational efficiency. Thus far, the only setting where robust and efficient algorithms are known is on one-dimensional distribution families, where brute-force search or some form of polynomial regression often works. In contrast, essentially nothing is known about efficient agnostic distribution learning in the high-dimensional setting that we study here.

Question 1.1 also resembles learning in the presence of malicious errors [Val85, KL93]. There, an algorithm is given samples from a distribution along with their labels according to an unknown target function. The adversary is allowed to corrupt an $\varepsilon$ -fraction of both the samples and their labels. A sequence of works studied the problem of learning a homogeneous halfspace with malicious noise in the setting where the underlying distribution is a Gaussian [Ser01, Ser03, KLS09], culminating in the work of Awasthi, Balcan, and Long [ABL17], who gave an efficient algorithm that finds a halfspace with agreement $O(\varepsilon)$ . There is no direct connection between their problem and ours, especially since one is a supervised learning problem and the other is unsupervised. We note however that there is an interesting technical parallel in that the work [KLS09] also uses spectral methods to detect outliers. Both their work and our algorithm for agnostically learning the mean are based on the intuition that an adversary can only substantially bias the empirical mean if the corruptions are correlated along some direction. More specifically, [KLS09] produce a “hard” filter which leads to errors that scale logarithmically with the dimension, even in a weaker corruption model than ours. Our algorithms need to handle many significant conceptual and technical complications that arise when working with higher moments or other distribution families.

Another connection is to the work on robust principal component analysis (PCA). PCA is a transformation that (among other things) is often justified as being able to find the affine transformation $Y=\Sigma^{-1/2}(X-\mu)$ that would place a collection of Gaussian random variables in isotropic position. One can think of our results on agnostically learning a Gaussian as a type of robust PCA that tolerates gross corruptions, where entire samples are corrupted. This is different than other variants of the problem where random sets of coordinates of the points are corrupted [CLMW11], or where the uncorrupted points were assumed to lie in a low-dimensional subspace to begin with [ZL14, LMTZ15]. Finally, Brubaker [Bru09] studied the problem of clustering samples from a well-separated mixture of Gaussians in the presence of adversarial noise. The goal of [Bru09] was to separate the Gaussian components from each other, while the adversarial points are allowed to end up in any of clusters. Our work is orthogonal to [Bru09], since even if such a clustering is given, the problem still remains to estimate the parameters of each component.

5 Concurrent and Subsequent Work

After the initial publication of our results [DKK+16], there has been a flurry of recent work on robust high-dimensional estimation. Diakonikolas, Kane, and Stewart [DKS16c] studied the problem of learning the parameters of a graphical model in the presence of noise, when given its graph theoretic structure. Charikar, Steinhardt, and Valiant [CSV17] developed algorithms that can tolerate a fraction of corruptions greater than a half, under the weaker goal of outputting a small list of candidate hypotheses that contains a parameter set close to the true values. Balakrishnan, Du, Li, and Singh [Li17, DBS17, BDLS17] studied sparse mean and covariance estimation in the presence of noise obtaining computationally efficient robust algorithms with sample complexity sublinear in the dimension. Diakonikolas, Kane, and Stewart [DKS17] proved statistical query lower bounds providing evidence that the error guarantees of our robust mean and covariance estimation algorithms are best possible, within constant factors, for efficient algorithms. In a subsequent paper [DKK+17], we obtained improved bounds on the sample complexity of our algorithms, which are optimal up to polylogarithmic factors. For the sake of completeness, we include these improved sample bounds in the present version of this paper. In the same work [DKK+17], we showed that our algorithmic approach easily extends to obtain dimension-independent robustness guarantees under much weaker distributional assumptions, and gave a practical demonstration of the efficacy of our robust algorithms on both real and synthetic data.

Since the initial submission of the journal version of this paper, there has been a substantial amount of work on robust high-dimensional estimation in a variety of settings. Diakonikolas, Kane, and Stewart [DKS18a] studied PAC learning of geometric concept classes (including low-degree polynomial threshold functions and intersections of halfspaces) in the same corruption model as ours, obtaining the first dimension-independent error guarantees for these classes. Steinhardt, Charikar, and Valiant [SCV18] focused on deterministic conditions of a dataset which allow robust estimation to be possible. In our initial publication, we gave explicit deterministic conditions in various settings; by focusing directly on this goal, [SCV18] somewhat relaxed some of these assumptions. Meister and Valiant [MV17] studied learning in a crowdsourcing model, where the fraction of honest workers may be very small (similar to [CSV17]). Qiao and Valiant [QV18] considered robust estimation of discrete distributions in a setting where we have several sources (a fraction of which are adversarial) who each provide a batch of samples. A number of simultaneous works [KSS18, HL18, DKS18b] investigated robust mean estimation in even more general settings, and apply their techniques to learning mixtures of spherical Gaussians under minimal separation conditions. Finally, several concurrent results study robustness in supervised learning tasks [PSBR18, KKM18, DKK+18], including regression and SVM problems. Despite all of this rapid progress, there are still many interesting theoretical and practical questions left to explore.

6 Organization

The structure of this paper is as follows: In Section 2, we introduce basic notation and a number of useful facts that will be required throughout the paper, as well as the formal definition of our adversary model. In Section 3, we discuss several natural approaches to high-dimensional agnostic learning, all of which lose polynomial factors that depend on the dimension, in terms of their error guarantee.

The main body of the paper is in Sections 4–8. Sections 4 and 6 illustrate our convex programming framework, while Sections 5, 7, and 8 illustrate our filter framework. More specifically, in Sections 4 and 5, we analyze the setting of a single Gaussian with unknown mean and unknown covariance, using our convex programming and filter frameworks, respectively. In Section 6, we generalize the convex programming method to obtain an agnostic algorithm for mixtures of spherical Gaussians with unknown means. In Section 7, we apply our filter techniques to a binary product distribution, and generalize these in Section 8 to obtain an agnostic learning algorithm for a mixture of two balanced binary product distributions.

We note that for some of the more advanced applications of our frameworks, the technical details can get in the way of the fundamental ideas. For the reader who is interested in seeing the details of our most basic application of the convex programming framework, we recommend reading the case a Gaussian with unknown mean, in Section 4.3. Similarly, for the filter framework, we suggest either the Gaussian with unknown mean in Section 5.1 or the balanced product distribution in Section 7.1.

Preliminaries

Throughout this paper, if $v$ is a vector, we will let $\|v\|_{2}$ denote its Euclidean norm. If $M$ is a matrix, we will let $\|M\|_{2}$ denote its spectral norm, and $\|M\|_{F}$ denote its Frobenius norm. We will also let $\preceq$ and $\succeq$ denote the PSD ordering on matrices. For a discrete distribution $P$ , we will denote by $P(x)$ the probability mass at point $x$ . For a continuous distribution, let it denote the probability density function at $x$ . Let $S$ be a multiset over $\{0,1\}^{d}$ . We will write $X\in_{u}S$ to denote that $X$ is drawn from the empirical distribution defined by $S$ . Throughout the paper, we let $\otimes$ denote the Kronecker product of matrices.

As a measure of distance between distributions, we will use the notion of total variation distance:

2 Types of Adversaries

In this paper, we will consider a powerful model for agnostic distribution learning that generalizes many other existing models. The standard setup involves an oblivious adversary who chooses a distribution that is close in total variation distance to an unknown distribution in some class $\mathcal{D}$ .

Given $\varepsilon>0$ , and a class of distributions $\mathcal{D}$ , the full adversary operates as follows: The algorithm specifies some number of samples $m$ . The adversary generates $m$ samples $X_{1},X_{2},\ldots,X_{m}$ from some (unknown) distribution $D\in\mathcal{D}$ . It then draws $m^{\prime}$ from an appropriate distribution. This distribution is allowed to depend on $X_{1},X_{2},\ldots,X_{m}$ , but when marginalized over the $m$ samples satisfies $m^{\prime}\sim\mbox{Bin}(m,\varepsilon)$ . The adversary is allowed to inspect the samples, removes $m^{\prime}$ of them, and replaces them with arbitrary points. The set of $m$ points is given (in any order) to the algorithm.

We rely on the following well-known fact:

Now we can describe how the full adversary can corrupt samples from $D$ to get samples distributed according to $P$ .

The full adversary can simulate any oblivious adversary.

We draw $m$ samples $X_{1},X_{2},\ldots,X_{m}$ from $D$ . We delete each sample $X_{i}$ independently with probability $\varepsilon_{2}$ and replace it with an independent sample from $N_{2}$ . This gives a set of samples $Y_{1},Y_{2},\ldots,Y_{m}$ that are independently sampled from $(1-\varepsilon_{2})D+\varepsilon_{2}N_{2}$ . Since the distributions $(1-\varepsilon_{1})P+\varepsilon_{1}N_{1}$ and $(1-\varepsilon_{2})D+\varepsilon_{2}N_{2}$ are identical, we can couple them to independent samples $Z_{1},Z_{2},\ldots,Z_{m}$ from $(1-\varepsilon_{1})P+\varepsilon_{1}N_{1}$ . Now each sample $Z_{i}$ that came from $N_{1}$ , we can delete and replace with an independent sample from $P$ . The result is a set of samples that are independently sampled from $P$ where we have made $m^{\prime}$ edits and marginally $m^{\prime}\sim\mbox{Bin}(m,\varepsilon_{1}+\varepsilon_{2})$ , although $m^{\prime}$ has and needs to have some dependence on the original samples from $D$ . ∎

The challenge in working with the full adversary is that even the samples that came from $D$ can have biases. The adversary can now choose how to remove uncorrupted points in a careful way so as to compensate for certain other biases that he introduces using the corrupted points.

Throughout this paper, we will make use of the following notation and terminology:

We say a set of samples $X_{1},X_{2},\ldots,X_{m}$ is an $\varepsilon$ -corrupted set of samples generated by the oblivious (resp. full) adversary if it is generated by the process described above in the definition of the oblivious (resp. full) adversary. If it was generated by the full adversary, we let $G\subseteq[m]$ denote the indices of the uncorrupted samples, and we let $E\subseteq[m]$ denote the indices of the corrupted samples.

In this paper, we will give a number of algorithms for agnostic distribution learning that work in the full adversary model. In our analysis, we will identify a set of events that ensure the algorithm succeeds and will bound the probability that any of these events does not occur when $m$ is suitably large. We will often explicitly invoke the assumption that $|E|\leq 2\varepsilon m$ . We can do this even though the number of points that are corrupted is itself a random variable, because by the Chernoff bound, as long as $m\geq O\left(\frac{\log 1/\tau}{\varepsilon}\right)$ , we know that $|E|\leq 2\varepsilon m$ holds with probability at least $1-O(\tau)$ . Thus, making the assumption that $|E|\leq 2\varepsilon m$ costs us an additional additive $O(\tau)$ term in our union bound, when bounding the failure probability of our algorithms.

3 Distributions of Interest

One object of study in this paper is the Gaussian (or Normal) distribution.

A Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ with mean $\mu$ and covariance $\Sigma$ is the distribution with probability density function

We will also be interested in binary product distributions.

A (binary) product distribution is a probability distribution over $\{0,1\}^{d}$ whose coordinate random variables are independent. Note that a binary product distribution is completely determined by its mean vector.

We will also be interested in mixtures of such distributions.

A mixture $P$ of distributions $P_{1},\dots,P_{k}$ with mixing weights $\alpha_{1},\dots,\alpha_{k}$ is the distribution defined by

where $\alpha_{j}\geq 0$ for all $j$ and $\sum_{j\in[k]}\alpha_{j}=1$ .

4 Bounds on TV Distance

The Kullback-Liebler divergence (also known as relative entropy, information gain, or information divergence) is a well-known measure of distance between two distributions.

The primary interest we have in this quantity is the fact that (1) the KL divergence between two Gaussians has a closed form expression, and (2) it can be related (often with little loss) to the total variation distance between the Gaussians. The first statement is expressed in the fact below:

Let $\operatorname{\mathcal{N}}(\mu_{1},\Sigma_{1})$ and $\operatorname{\mathcal{N}}(\mu_{2},\Sigma_{2})$ be two Gaussians such that $\det(\Sigma_{1}),\det(\Sigma_{2})\neq 0$ . Then

The second statement is encapsulated in the well-known Pinsker’s inequality:

With this we can show the following two useful lemmas, which allow us to relate parameter distance between two Gaussians to their total variation distance. The first bounds the total variation distance between two Gaussians with identity covariance in terms of the Euclidean distance between the means:

In the case where $\Sigma_{1}=\Sigma_{2}=I$ , (3) simplifies to

Pinsker’s inequality (Theorem 2.12) then implies that

The second bounds the total variation distance between two mean zero Gaussians in terms of the Frobenius norm of the difference between their covariance matrices:

Let $\delta>0$ be sufficiently small. Let $\Sigma_{1},\Sigma_{2}$ such that $\|I-\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}\|_{F}=\delta$ . Then,

Let $M=\Sigma_{2}^{-1/2}\Sigma_{1}\Sigma_{2}^{-1/2}$ . Then (3) simplifies to

Since both terms in the last line are rotationally invariant, we may assume without loss of generality that $M$ is diagonal. Let $M=\mbox{diag}(1+\lambda_{1},\ldots,1+\lambda_{d})$ . Thus, the KL divergence between the two distributions is given exactly by $\frac{1}{2}\sum_{i=1}^{d}\left(\lambda_{i}-\log(1+\lambda_{i})\right),$ where we are guaranteed that $(\sum_{i=1}^{d}\lambda_{i}^{2})^{1/2}=\delta$ . By the second order Taylor approximation to $\ln(1+x)$ , for $x$ small, we have that for $\delta$ sufficiently small,

Our algorithm for agnostically learning an arbitrary Gaussian will be based on solving two intermediate problems: (1) We are given samples from $\operatorname{\mathcal{N}}(\mu,I)$ and our goal is to learn $\mu$ , and (2) We are given samples from $\operatorname{\mathcal{N}}(0,\Sigma)$ and our goal is to learn $\Sigma$ . The above bounds on total variation distance will allow us to conclude that our estimate is close in total variation distance to the unknown Gaussian distribution in each of the two settings.

We note the following folklore sample complexity bounds for learning a Gaussian in the non-agnostic setting.

$N=\Theta\left(\frac{d+\log(1/\tau)}{\varepsilon^{2}}\right)$ samples are both necessary and sufficient to learn a $d$ -dimensional Gaussian with unknown mean and known covariance to total variation distance $\varepsilon$ with probability $1-\tau$ .

$N=\Theta\left(\frac{d^{2}+\log(1/\tau)}{\varepsilon^{2}}\right)$ samples are both necessary and sufficient to learn a $d$ -dimensional Gaussian with unknown mean and covariance to total variation distance $\varepsilon$ with probability $1-\tau$ .

We will also need the following lemma bounding the total variation distance between two product distributions:

Let $P,Q$ be binary product distributions with mean vectors $p,q\in(0,1)^{d}.$ We have that

The elementary inequality $2\sqrt{ab}=a+b-(\sqrt{a}-\sqrt{b})^{2},$ $a,b>0,$ gives that

where the last inequality follows from the union bound. ∎

5 Additional Concentration Lemmata

In this section, we list a number of standard concentration inequalities for nice random variables which we will frequently use throughout this paper. The proofs of these results are standard and omitted, see e.g., [Ver10] for a more thorough treatment of these results.

The first is a Chernoff bound for bounded random variables.

Let $Z_{1},\ldots,Z_{d}$ be independent random variables with $Z_{i}$ supported on $[a_{i},b_{i}].$ Let $Z=\sum_{i=1}^{d}Z_{i}$ . Then for any $T>0$ ,

We will also require the following tail bounds for Gaussians and quadratic forms of Gaussians:

By standard union bound arguments (see e.g. [Ver10]), we obtain the following concentration results for the empirical mean and covariance of a set of Gaussian vectors:

Let $n$ be a positive integer. Let $D$ be a sub-gaussian distribution with mean and covariance $I$ . Let $Y_{i}\sim D$ be independent, for $i=1,\ldots,n$ . Then, there exist universal constants $A,B>0$ so that for all $t>0$ , we have

With the same setup as in Lemma 2.21, there exist universal constants $A,B>0$ so that for all $t>0$ , we have

6 Agnostic Hypothesis Selection

Several of our algorithms will return a polynomial-sized list of hypotheses at least one of which is guaranteed to be close to the target distribution. Usually (e.g., in a non-agnostic setting), one could use a polynomial number of additional samples to run a tournament to identify the candidate hypothesis that is (roughly) the closest to the target distribution. In the discussion that follows, we will refer to these additional samples as test samples. Such hypothesis selection algorithms have been extensively studied [Yat85, DL96, DL97, DL01, DK14, AJOS14, SOAJ14, DDS15a, DDS15b]. Unfortunately, against a strong adversary we run into a serious technical complication: the training samples and test samples are not necessarily independent. Moreover even if we randomly partition our samples in training and test, a priori there are an unbounded set of possible hypotheses that the training phase could output, and when we analyze the tournament we cannot condition on the list of hypotheses and assume that the test samples are sampled anew. Our approach is to require our original algorithm to return only hypotheses from some finite set of possibilities, and as we will see this mitigates the problem.

As a simple corollary of the agnostic tournament, observe that this allows us to do agnostic learning without knowing the precise error rate $\varepsilon$ . Throughout the paper, we assume the algorithm knows $\varepsilon$ , and guarantees that the output will have error which is at most $O(f(\varepsilon))$ . However, if the algorithm is not given this information, and instead given an $\eta$ and asked to return something with error at most $O(f(\varepsilon+\eta))$ , we may simply grid over $\{\eta,(1+\gamma)\eta,(1+\gamma)^{2}\eta,\ldots,1\}$ (here $\gamma$ is some arbitrary constant that governs a tradeoff between runtime and accuracy), run our algorithm with $\varepsilon$ set to each element in this set, and perform hypothesis selection via Tournament. Then it is not hard to see that we are guaranteed to output something which has error at most $O(f(\varepsilon+(1+\gamma)\eta))$ .

Firstly, we randomly choose a subset of $N$ of our samples and a disjoint subset of $C(\log(|\mathcal{M}|)+\log(1/\tau))/\varepsilon^{2}$ of our samples for some sufficiently large $C$ . Note that with high probability over our randomization, at most a $2\varepsilon$ -fraction of samples from each subset are corrupted. Thus, we may instead consider the stronger adversary who sees a set $S_{1}$ of $N$ independent samples from $\Pi$ and another set, $S_{2}$ , of $C(\log(|\mathcal{M}|)+\log(1/\tau))/\varepsilon^{2}$ samples from $\Pi$ and can arbitrary corrupt a $2\varepsilon$ -fraction of each, giving sets $S^{\prime}_{1}$ , $S^{\prime}_{2}$ .

With probability at least $1-\tau/2$ over $S_{1}$ , the original algorithm run on $S^{\prime}_{1}$ returns a set $\mathcal{L}$ satisfying the desired properties.

We claim that with probability at least $1-\tau/2$ over the choice of $S_{2}$ , we have for any $P,Q\in\mathcal{M}$ :

This follows by Chernoff bounds and a union bound over the $|\mathcal{M}|^{2}$ possibilities for $P$ and $Q$ . Since the total variation distance between the uniform distributions over $S_{2}$ and $S^{\prime}_{2}$ is at most $2\varepsilon$ , we also have for $S^{\prime}_{2}$ that

Therefore, if we throw away any $Q$ in our list for which there is a $P$ in our list such that

Some Natural Approaches, and Why They Fail

Many of the agnostic distribution learning problems that we study are so natural that one would immediately wonder why simpler approaches do not work. Here we detail some other plausible approaches, and what causes them to lose dimension-dependent factors (if they have any guarantees at all!). For the discussion that follows, we note that by Fact 2.13 in order to achieve an estimate that is $O(\varepsilon)$ -close in total variation distance (for a Gaussian when $\mu$ is unknown and $\Sigma=I$ ) it is necessary and sufficient that $\|\hat{\mu}-\mu\|_{2}=O(\varepsilon)$ .

One plausible approach for robust mean estimation in high dimensions is to agnostically learn along each coordinate separately. For instance, if our goal is to agnostically learn the mean of a Gaussian with known covariance $I$ , we could try to learn each coordinate of the mean separately. But since an $\varepsilon$ -fraction of the samples are corrupted, our estimate can be off by $\varepsilon$ in each coordinate and would be off by $\varepsilon\sqrt{d}$ in high dimensions.

Maximum Likelihood

Given a set of samples $X_{1},\ldots,X_{N}$ , and a class of distributions $\mathcal{D}$ , the maximum likelihood estimator (MLE) is the distribution $F\in\mathcal{D}$ that maximizes $\prod_{i=1}^{N}F(X_{i})$ . Equivalently, $F$ minimizes the negative log likelihood (NLL), which is given by

and so the $\mu$ which minimizes $\textrm{NLL}(\operatorname{\mathcal{N}}(\mu,I),X_{1},\ldots,X_{N})$ is the mean of the samples $X_{i}$ , since for any set of vectors $v_{1},\ldots,v_{N}$ , the average of the $v_{i}$ ’s is the minimizer of the function $h(x)=\sum_{i=1}^{N}\|v_{i}-x\|_{2}^{2}$ . Hence, if an adversary places an $\varepsilon$ -fraction of the points at some very large distance, then the estimate for the mean would need to move considerably in that direction. By placing the corruptions further and further away, the MLE can be an arbitrarily bad estimate. That is, even though it is well known [Hub67, Whi82] that the MLE converges to the distribution $F\in\mathcal{D}$ that is closest in KL-divergence to the distribution from which our samples were generated (i.e., after the adversary has added corruptions), $F$ is not necessarily close to the uncorrupted distribution.

Geometric Median

In one dimension, it is well-known that the median provides a provably robust estimate for the mean in a number of settings. The mean of a set of points $a_{1},\ldots,a_{N}$ is the minimizer of the function $f(x)=\sum_{i=1}^{N}(a_{i}-x)^{2}$ , and in contrast the median is the minimizer of the function $f(x)=\sum_{i=1}^{N}|a_{i}-x|$ . In higher dimensions, there are many natural definitions for the median that generalize the one-dimensional case. The Tukey median is one such notion, but as we discussed it is hard to compute [JP78] and the best known algorithms run in time exponential in $d$ . Motivated by this, the geometric median is another high-dimensional notion of a median. It often achieves better robustness than the mean, and can be computed quickly [CLM+16]. The formal definition is:

Unfortunately, this notion of median still incurs an error containing a factor of $O(\sqrt{d})$ :

Given a set $S$ of $N=\Omega\left(\frac{d+\log(1/\tau)}{\varepsilon^{2}}\right)$ samples from $\operatorname{\mathcal{N}}(0,I)$ , then with probability at least $1-\tau$ , there exists a corruption $S^{\prime}$ of $S$ , such that:

Agnostically Learning a Gaussian, via Convex Programming

In this section we give a polynomial time algorithm to agnostically learn a single Gaussian up to error $\widetilde{O}(\varepsilon)$ . Our approach is based on the following ingredients: First, in Section 4.1, we define the set $S_{N,\varepsilon}$ , which will be a key algorithmic object in our framework. In Section 4.2 we give key, new concentration bounds on certain statistics of Gaussians. We will make crucial use of these concentration bounds throughout this section. In Section 4.3 we give an algorithm to agnostically learn a Gaussian with unknown mean and whose covariance is promised to be the identity via convex programming. This will be an important subroutine in our overall algorithm, and it also helps to illustrate our algorithmic approach without many of the additional complications that arise in our later applications. In Section 4.4 we show how to robustly learn a Gaussian with mean zero and unknown covariance again via convex programming. Finally, in Section 4.5 we show how to combine these two intermediate results to get our overall algorithm.

An important algorithmic object for us will be the following set:

For any $\frac{1}{2}>\varepsilon>0$ and any integer $N$ , let

and so we see that this set is designed to capture the notion of selecting a set of $(1-2\varepsilon)N$ samples from $N$ samples.

Given $w\in S_{N,\varepsilon}$ we will use the following notation

to denote the total weight on good and bad points respectively. The following facts are immediate from $|E|\leq 2\varepsilon N$ and the properties of $S_{N,\varepsilon}$ .

If $w\in S_{N,\varepsilon}$ and $|E|\leq 2\varepsilon N$ , then $w_{b}\leq\frac{2\varepsilon}{1-2\varepsilon}$ . Moreover, the renormalized weights $w^{\prime}$ on good points given by $w^{\prime}_{i}=\frac{w_{i}}{w_{g}}$ for all $i\in G$ , and $w^{\prime}_{i}=0$ otherwise, satisfy $w^{\prime}\in S_{N,4\varepsilon}$ .

2 Concentration Inequalities

Throughout this section and in Section 6 we will make use of various concentration bounds on low moments of Gaussian random variables. Some are well-known, and others are new but follow from known bounds and appropriate union bound arguments.

We will also be interested in how well various statistics of Gaussians concentrate around their expectation, when we take the worst-case set of weights in $S_{N,\varepsilon}$ . This is more subtle than standard settings such as Lemma 2.21 or Lemma 2.22 because as we take more samples, any fixed statistic (e.g. taking the uniform distribution over the samples) concentrates better but the size of $S_{N,\varepsilon}$ (e.g. the number of sets of $(1-2\varepsilon)N$ samples) grows too. We defer the proofs to Appendix A. The first concerns the behavior of the empirical covariance:

Fix $\varepsilon\leq 1/2$ and $\tau\leq 1$ . There is a $\delta_{1}=O(\varepsilon\log 1/\varepsilon)$ such that if $Y_{1},\ldots,Y_{N}$ are independent samples from $\operatorname{\mathcal{N}}(0,I)$ and $N=\Omega\left(\frac{d+\log(1/\tau)}{\delta_{1}^{2}}\right)$ , then

A nearly identical argument (Using Hoeffding instead of Bernstein in the proof of Theorem 5.50 in [Ver10]) yields:

Fix $\varepsilon$ and $\tau$ as above. There is a $\delta_{2}=O(\varepsilon\sqrt{\log 1/\varepsilon})$ such that if $Y_{1},\ldots,Y_{N}$ are independent samples from $\operatorname{\mathcal{N}}(0,I)$ and $N=\Omega\left(\frac{d+\log(1/\tau)}{\delta_{2}^{2}}\right)$ , then

Note that by Cauchy-Schwarz, this implies:

Fix $\tau>0$ . Let $X_{1},\ldots,X_{N}\sim\operatorname{\mathcal{N}}(0,I)$ . Then, with probability $1-\tau$ , we have that $\|X_{i}\|_{2}\leq O\left(\sqrt{d\log(N/\tau)}\right)$ for all $i=1,\ldots,N$ .

2.2 Estimation Error in the Frobenius Norm

Let $X_{1},...,X_{N}$ be $N$ i.i.d. samples from $\operatorname{\mathcal{N}}(0,I)$ . In this section we demonstrate a tight bound on how many samples are necessary such that the sample covariance is close to $I$ in Frobenius norm. Let $\widehat{\Sigma}$ denote the empirical covariance, defined to be

By self-duality of the Frobenius norm, we know that

Since there is a $1/4$ -net over all PSD matrices with Frobenius norm $1$ of size $9^{d^{2}}$ (see e.g. Lemma 1.18 in [RH17]), the Vershynin-type union bound argument combined with Lemma 2.20 immediately gives us the following:

There exist universal constants $A,B>0$ so that for all $t>0$ , we have

By the argument as used in the proof of Lemma 4.3, we obtain:

Fix $\varepsilon,\tau>0$ . There is a $\delta_{1}=O(\varepsilon\log 1/\varepsilon)$ such that if $X_{1},\ldots,X_{N}$ are independent samples from $\operatorname{\mathcal{N}}(0,I)$ , with

Since the proof is essentially identical to the proof of Lemma 4.3, we omit the proof. However, we note that in fact, the proof technique there can be used to show something slightly stronger, which we will require later. The technique actually shows that if we take any set of size at most $\varepsilon N$ , and take the uniform weights over that set, then the empirical covariance is not too far away from the truth. More formally:

Fix $\varepsilon,\tau>0$ . There is a $\delta_{2}=O(\varepsilon\log 1/\varepsilon)$ such that if $X_{1},\ldots,X_{N}$ are independent samples from $\operatorname{\mathcal{N}}(0,I)$ , with

2.3 Understanding the Fourth Moment Tensor

Our algorithms will be based on understanding the behavior of the fourth moment tensor of a Gaussian when restricted to various subspaces. Let $\otimes$ denote the Kronecker product on matrices. We will make crucial use of the following definition:

We will also require the following definitions:

and let $\Pi_{S}$ and $\Pi_{\operatorname{{\mathcal{S}}}^{\perp}}$ denote the projection operators onto $\operatorname{{\mathcal{S}}}$ and $\operatorname{{\mathcal{S}}}^{\perp}$ respectively. Finally let

In fact, the projection of $v=M^{\flat}$ onto $\operatorname{{\mathcal{S}}}$ where $M$ is symmetric can be written out explicitly. Namely, it is given by

The key result in this section is the following:

It is important to note that the two terms above are not the same; the first term is high rank, but the second term is rank one. The proof of this theorem will require Isserlis’ theorem, and is deferred to Appendix A.

2.4 Concentration of the Fourth Moment Tensor

We also need to show that the fourth moment tensor concentrates:

Fix $\varepsilon,\tau>0$ . Let $Y_{i}\sim\operatorname{\mathcal{N}}(0,I)$ be independent, for $i=1,\ldots,N$ , where we set

To do so will require somewhat more sophisticated techniques than the ones used so far to bound spectral deviations. At a high level, this is because fourth moments of Gaussians have a sufficiently larger variance that the union bound techniques used so far are insufficient. However, we will show that the tails of degree four polynomials of Gaussians still sufficiently concentrate such that removing points cannot change the mean by too much. The proof requires slightly fancy machinery and appears in Appendix B.

3 Finding the Mean, Using a Separation Oracle

In this section, we consider the problem of approximating $\mu$ given $N$ samples from $\operatorname{\mathcal{N}}(\mu,I)$ in the full adversary model. Our algorithm will be based on working with the following convex set:

It is not hard to show that $\mathcal{C}_{\delta}$ is non-empty for reasonable values of $\delta$ (and we will show this later). Moreover we will show that for any set of weights $w$ in $\mathcal{C}_{\delta}$ , the empirical average

will be a good estimate for $\mu$ . The challenge is that since $\mu$ itself is unknown, there is not an obvious way to design a separation oracle for $\mathcal{C}_{\delta}$ even though it is convex. Our algorithm will run in two basic steps. First, it will run a very naive outlier detection to remove any points which are more than $O(\sqrt{d})$ away from the good points. These points are sufficiently far away that a very basic test can detect them. Then, with the remaining points, it will use the approximate separation oracle given below to approximately optimize with respect to $C_{\delta}$ . It will then take the outputted set of weights and output the empirical mean with these weights. We will explain these steps in detail below.

Our results will hold under the following deterministic conditions:

The concentration bounds we gave earlier were exactly bounds on the failure probability of either of these conditions, albeit for $S_{N,\varepsilon}$ instead of $S_{N,4\varepsilon}$ .

The first step of our algorithm will be to remove points which have distance which is much larger than $O(\sqrt{d})$ from the mean. Our algorithm is very naive: it computes all pairwise distances between points, and throws away all points which have distance more than $O(\sqrt{d})$ from more than a $2\varepsilon$ -fraction of the remaining points.

Suppose that (7) holds. Then NaivePrune removes no uncorrupted points, and moreover, if $X_{i}$ is not removed by NaivePrune, we have $\|X_{i}-\mu\|_{2}\leq O\left(\sqrt{d\log(N/\tau)}\right)$ .

That no uncorrupted point is removed follows directly from (7) and the fact that there can be at most $2\varepsilon N$ corrupted points. Similarly, if $X_{i}$ is not removed by NaivePrune, that means there must be an uncorrupted $X_{j}$ such that $\|X_{i}-X_{j}\|_{2}\leq O(\sqrt{d\log(N/\tau)})$ . Then the desired property follows from (7) and a triangle inequality. ∎

Henceforth, for simplicity we shall assume that no point was removed by NaivePrune, and that for all $i=1,\ldots,N$ , we have $\|X_{i}-\mu\|_{2}<O(\sqrt{d\log(N/\tau)})$ . Otherwise, we can simply work with the pruned set, and it is evident that nothing changes.

3.2 The Separation Oracle

Our main result in this section is an approximate separation oracle for $\mathcal{C}_{\delta}$ . Throughout this section, let $w\in S_{N,\varepsilon}$ and set $\widehat{\mu}=\sum_{i=1}^{N}w_{i}X_{i}$ . Moreover, let $\Delta=\mu-\widehat{\mu}$ . Our first step is to show that any set of weights that does not yield a good estimate for $\mu$ cannot be in the set $\mathcal{C}_{\delta}$ :

Suppose that (8)-(9) holds. Suppose that $\|\Delta\|_{2}=\Omega(\sqrt{\varepsilon\delta_{1}})=\Omega(\varepsilon\log 1/\varepsilon)$ . Then

By Fact 4.2 and (9) we have $\|\sum_{i\in G}\frac{w_{i}}{w_{g}}X_{i}-\mu\|_{2}\leq\delta_{2}$ . Now by the triangle inequality we have

Using the fact that the variance is nonnegative we have

where in the last inequality we have used Fact 4.2 and (8). Hence altogether this implies that

As a corollary, we find that any set of weights in $\mathcal{C}_{\delta}$ immediately yields a good estimate for $\mu$ :

Suppose that (8) and (9) hold. Let $w\in\mathcal{C}_{\delta}$ for $\delta=O(\varepsilon\log 1/\varepsilon)$ . Then

Our main result in this section is an approximate separation oracle for $\mathcal{C}_{\delta}$ with $\delta=O(\varepsilon\log 1/\varepsilon)$ .

Fix $\varepsilon>0$ , and let $\delta=O(\varepsilon\log 1/\varepsilon).$ Suppose that (8) and (9) hold. Let $w^{\ast}$ denote the weights which are uniform on the uncorrupted points. Then there is a constant $c$ and an algorithm such that:

(Completeness) If $w=w^{\ast}$ , then it outputs “YES”.

We remark that these two facts imply that for any $\tau>0$ , the ellipsoid method with this separation oracle will output a $w^{\prime}$ such that $\|w-w^{\prime}\|_{\infty}<\varepsilon/(N\sqrt{d\log(N/\tau)})$ , for some $w\in\mathcal{C}_{c\delta}$ in $\operatorname{poly}(d,1/\varepsilon,\log 1/\tau)$ steps.

The conditions that the separation oracle given here satisfy are slightly weaker than the traditional guarantees, given, for instance, in [GLS88]. However, the correctness of the ellipsoid algorithm with this separation oracle follows because outside $C_{c\delta}$ , the separation oracle acts exactly as a separation oracle for $w^{*}$ . Thus, as long as the algorithm continues to query points outside of $C_{c\delta}$ , the action of the algorithm is equivalent to one with a separation oracle for $w^{*}$ . Moreover, the behavior of the algorithm is such that it will never exclude $w^{*}$ , even if queries are made within $C_{c\delta}$ . From these two conditions, it is clear from the classical theory presented in [GLS88] that the ellipsoid method satisfies the guarantees given above.

The separation oracle is given in Algorithm 2. Next, we prove correctness for our approximate separation oracle:

Again, let $\Delta=\mu-\widehat{\mu}$ , and let $M=\sum_{i=1}^{N}w_{i}Y_{i}Y_{i}^{T}-I$ . By expanding out the formula for $M$ , we get:

Suppose $w=w^{\ast}$ . Then $\|M\|_{2}<\frac{c}{2}\delta$ .

Recall that $w^{\ast}$ are the weights that are uniform on the uncorrupted points. Because $|E|\leq 2\varepsilon N$ we have that $w^{\ast}\in S_{N,\varepsilon}$ . We can now use (8) to conclude that $w^{\ast}\in\mathcal{C}_{\delta_{1}}$ . Now by Corollary 4.16 we have that $\|\Delta\|_{2}\leq O(\varepsilon\sqrt{\log 1/\varepsilon})$ . Thus

Suppose that $w\not\in C_{c\delta}$ . Then $|\lambda|>\frac{c}{2}\delta$ .

Let us now split into two cases. If $\|\Delta\|_{2}\leq\sqrt{c\delta/10}$ , then the first term above is at least $c\delta$ by definition and we can conclude that $|\lambda|>c\delta/2$ . On the other hand, if $\|\Delta\|_{2}\geq\sqrt{c\delta/10}$ , by Lemma 4.15, we have that

which for sufficiently small $\varepsilon$ also yields $|\lambda|>c\delta/2$ . ∎

Suppose $\|\Delta\|_{2}\leq\sqrt{c\delta/10}$ . By (11) we immediately have:

since $\lambda>c\delta/2$ . On the other hand, if $\|\Delta\|_{2}\geq\sqrt{c\delta/10}$ then by (10) we have $\lambda=\Omega\left(\frac{\|\Delta\|_{2}^{2}}{\varepsilon}\right)$ . Putting it all together we have:

where in the last line we used the fact that $\lambda>\Omega\left(\frac{\|\Delta\|_{2}^{2}}{\varepsilon}\right)$ , and $\|\Delta\|_{2}^{2}\geq\Omega(\varepsilon^{2}\log 1/\varepsilon)$ . This now completes the proof. ∎

3.3 The Full Algorithm

This separation oracle, along with the classical theory of convex optimization [GLS88], implies that we have shown the following:

Fix $\varepsilon,\tau>0$ , and let $\delta=O(\varepsilon\sqrt{\log 1/\varepsilon})$ . Let $X_{1},\ldots,X_{N}$ be an $\varepsilon$ -corrupted set of points satisfying (8)-(9), for $\delta_{1}\leq\delta$ and $\delta_{2}\leq\delta\sqrt{\log 1/\varepsilon}$ . Let $c$ be a sufficiently large constant. Then, there is an algorithm $\textsc{LearnApproxMean}(\varepsilon,\tau,X_{1},\ldots,X_{N})$ which runs in time $\operatorname{poly}(N,d,1/\varepsilon,\log 1/\tau)$ , and outputs a set of weights $w^{\prime}\in S_{N,\varepsilon}$ such that there is a $w\in C_{c\delta}$ such that $\|w-w^{\prime}\|_{\infty}\leq\varepsilon/(N\sqrt{d\log(N/\tau)})$ .

This algorithm, while an extremely powerful primitive, is technically not sufficient. However, given this, the full algorithm is not too difficult to state: simply run NaivePrune, then optimize over $C_{c\delta}$ using this separation oracle, and get some $w$ which is approximately in $C_{c\delta}$ . Then, output $\sum_{i=1}^{N}w_{i}X_{i}$ . For completeness, the pseudocode for the algorithm is given below. In the pseudocode, we assume that $\textsc{Ellipsoid}(\textsc{SeparationOracleUnknownMean},\varepsilon^{\prime})$ is a convex optimization routine, which given the SeparationOracleUnknownMean separation oracle and a target error $\varepsilon^{\prime}$ , outputs a $w^{\prime}$ such that $\|w-w^{\prime}\|_{\infty}\leq\varepsilon^{\prime}$ . From the classical theory of optimization, we know such a routine exists and runs in polynomial time.

Fix $\varepsilon,\tau>0$ , and let $\delta=O(\varepsilon\sqrt{\log 1/\varepsilon})$ . Let $X_{1},\ldots,X_{N}$ be an $\varepsilon$ -corrupted set of samples, where

Let $\widehat{\mu}$ be the output of $\textsc{LearnMean}(\varepsilon,\tau,X_{1},\ldots,X_{N})$ . Then with probability $1-\tau$ , we have $\|\widehat{\mu}-\mu\|_{2}\leq\delta$ .

By Fact 4.6, Lemma 4.3, and Lemma 4.4, we know that (7)-(9) hold with probability $1-\tau$ , with $\delta_{1},\delta_{2}\leq\delta$ . Condition on the event that this event holds. After NaivePrune, by Fact 4.14 we may assume that no uncorrupted points are removed, and all points satisfy $\|X_{i}-\mu\|_{2}\leq O(\sqrt{d\log(N/\tau)})$ . Let $w^{\prime}$ be the output of the algorithm, and let $w\in C_{c\delta}$ be such that $\|w-w^{\prime}\|_{\infty}<\varepsilon/(N\sqrt{d\log(N/\tau)})$ . By Corollary 4.16, we know that $\|\sum_{i=1}^{N}w_{i}X_{i}-\mu\|_{2}\leq O(\delta)$ . Hence, we have

so the entire error is at most $O(\delta)$ , as claimed. ∎

4 Finding the Covariance, Using a Separation Oracle

In this section, we consider the problem of approximating $\Sigma$ given $N$ samples from $\operatorname{\mathcal{N}}(0,\Sigma)$ in the full adversary model. Let $U_{i}=\Sigma^{-1/2}X_{i}$ such that if $X_{i}\sim\operatorname{\mathcal{N}}(0,\Sigma)$ then $U_{i}\sim\operatorname{\mathcal{N}}(0,I)$ . Moreover let $Z_{i}=U_{i}^{\otimes 2}$ . Our approach will parallel the one given earlier in Section 4.3. Again, we will work with a convex set

and our goal is to design an approximate separation oracle. Our results in this section will rely on the following deterministic conditions:

for all $w\in S_{N,\varepsilon}$ , and all sets $T\subseteq G$ of size $|T|\leq 2\varepsilon N$ . As before, by Fact 4.2, the renormalized weights over the uncorrupted points are in $S_{N,4\varepsilon}$ . Hence, we can appeal to Fact 4.6, Corollary 4.8, Corollary 4.9, and Theorem 4.13 with $S_{N,4\varepsilon}$ instead of $S_{N,\varepsilon}$ to bound the probability that this event does not hold. Let $w^{\ast}$ be the set of weights which are uniform over the uncorrupted points; by (13) for $\delta\geq\Omega(\varepsilon\sqrt{\log 1/\varepsilon})$ we have that $w^{\ast}\in C_{\delta}$ .

Let $\delta=O(\varepsilon\log 1/\varepsilon)$ . Suppose that (13), (14), and 15 hold for $\delta_{1},\delta_{2}\leq O(\delta)$ and $\delta_{3}\leq O(\delta\log 1/\varepsilon)$ . Then, there is a constant $c$ and an algorithm such that, given any input $w\in S_{N,\varepsilon}$ we have:

(Completeness) If $w=w^{\ast}$ , the algorithm outputs “YES”.

As in the case of learning an unknown mean, by the classical theory of convex optimization this implies that we will find a point $w$ such that $\|w-w^{\prime}\|_{\infty}\leq\frac{\varepsilon}{\operatorname{poly}(N)}$ for some $w^{\prime}\in C_{c\delta}$ , using polynomially many calls to this oracle. We make this more precise in the following subsubsection.

The pseudocode for the (approximate) separation oracle is given in Algorithm 4. Observe briefly that this algorithm does indeed run in polynomial time. Lines 2-7 require only taking top eigenvalues and eigenvectors, and so can be done in polynomial time. For any $\xi\in\{-1,+1\}$ , line 8 can be run by sorting the samples by $w_{i}\left(\frac{\|Y_{i}\|^{2}}{\sqrt{d}}-\sqrt{d}\right)$ and seeing if there is a subset of the top $2\varepsilon N$ samples satisfying the desired condition, and line 9 can be executed similarly.

We now turn our attention to proving the correctness of this separation oracle. We require the following technical lemmata.

Fix $\delta<1$ and suppose that $M$ is symmetric. If $\|M-I\|_{F}\geq\delta$ then $\|M^{-1}-I\|_{F}\geq\frac{\delta}{2}$ .

We will prove this lemma in the contrapositive, by showing that if $\|M^{-1}-I\|_{F}<\frac{\delta}{2}$ then $\|M-I\|_{F}<\delta$ . Since the Frobenius norm is rotationally invariant, we may assume that $M^{-1}=\mbox{diag}(1+\nu_{1},\ldots,1+\nu_{d})$ , where by assumption $\sum\nu_{i}^{2}<\delta^{2}/4$ . By our assumption that $\delta<1$ , we have $|\nu_{i}|\leq 1/2$ for all $i$ . Thus

where we have used the inequality $|1-\frac{1}{1+x}|\leq|2x|$ which holds for all $|x|\leq 1/2$ . This completes the proof. ∎

Let $N_{1},\ldots,N_{d}$ be the columns of $N$ . Then

so the desired result follows by taking square roots of both sides. ∎

By the definition of $\|\cdot\|_{\operatorname{{\mathcal{S}}}}$ , we have

By self duality of the Frobenius norm, we know that

since $I^{\flat}\in\operatorname{{\mathcal{S}}}^{\perp}$ . The result now follows. ∎

Let us first prove completeness. Observe that by Theorem 4.12, we know that restricted to $\operatorname{{\mathcal{S}}}$ , we have that $M_{4}=2I$ . Therefore, by (15) we will not output a hyperplane in line 7. Moreover, by (14), we will not output a hyperplane in line 8. This proves completeness.

Thus it suffices to show soundness. Suppose that $w\not\in\mathcal{C}_{c\delta}$ . We will make use of the following elementary fact:

Let $A=\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2}$ and $B=\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2}$ . Then

In particular $A^{-1}=\Sigma^{1/2}\widehat{\Sigma}^{-1}\Sigma^{1/2}$ . Using this expression and the fact that all the matrices involved are symmetric, we can write

where in the third line we have used the fact that the trace of a product of matrices is preserved under cyclic shifts. ∎

Assume (13) holds with $\delta_{1}\leq O(\delta)$ and assume furthermore that $\|A-I\|_{F}\geq c\delta$ . Then, if we let $\delta^{\prime}=\frac{(1-\varepsilon)c}{2}\delta=\Theta(\delta)$ , we have

Let $A,B$ be as in Fact 4.28. Combining Lemma 4.25 and Fact 4.28 we have

We can rewrite (13) as the expression $\sum_{i\in G}w_{i}X_{i}X_{i}^{T}=w_{g}\Sigma^{1/2}(I+R)\Sigma^{1/2}$ where $R$ is symmetric and satisfies $\|R\|_{F}\leq\delta_{1}$ . By the definition of $\widehat{\Sigma}$ we have that $\sum_{i=1}^{N}w_{i}Y_{i}Y_{i}^{T}=I$ , and so

by applying Lemma 4.26. And putting it all together we have

It is easily verified that for $c>10$ , we have that for all $\delta$ , if $\|\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2}-I\|_{F}\geq c\delta$ , then

where $\delta^{\prime}=\frac{c(1-\varepsilon)}{2}\delta=\Theta(\delta)$ . The desired result then follows from the Pythagorean theorem. ∎

Claim 4.29 tells us that if $w\not\in C_{c\delta},$ we know that one of the terms in (17) must be at least $\frac{1}{2}\delta^{\prime}$ . We first show that if the first term is large, then the algorithm outputs a separating hyperplane:

Assume that (13)-(15) hold with $\delta_{1},\delta_{2}\leq O(\delta)$ and $\delta_{3}\leq O(\delta\log 1/\varepsilon)$ . Moreover, suppose that

Then the algorithm outputs a hyperplane in line 7, and moreover, it is a separating hyperplane.

Let us first show that given these conditions, then the algorithm indeed outputs a hyperplane in line 7. Since $I^{\flat}\in S^{\perp}$ , the first term is just equal to $\left\|\sum_{i\in E}w_{i}Z_{i}\right\|_{S}$ . But this implies that there is some $M^{\flat}\in S$ such that $\|M^{\flat}\|_{2}=\|M\|_{F}=1$ and such that

The $w_{i}/w_{b}$ are a set of weights satisfying the conditions of Claim 4.24 and so this implies that

Let $\widetilde{\Sigma}=\widehat{\Sigma}^{-1}\Sigma$ . By Theorem 4.12 and (15), we have that

since it is easily verified that $\delta\|\widetilde{\Sigma}\|^{2}\leq O(\|\widetilde{\Sigma}-I\|_{F})$ as long as $\|\widetilde{\Sigma}-I\|_{F}\geq\Omega(\delta)$ , which it is by (17).

Equations 18 and 19 then together imply that

Now let us assume that the first term on the LHS is less than $\frac{1}{2}\delta^{\prime}$ , such that the algorithm does not necessarily output a hyperplane in line 7. Thus, the second term on the LHS of Equation 16 is at least $\frac{1}{2}\delta^{\prime}$ . We now show that this implies that this implies that the algorithm will output a separating hyperplane in line 9.

Assume that (13)-(15) hold. Moreover, suppose that

Then the algorithm outputs a hyperplane in line 9, and moreover, it is a separating hyperplane.

By the definition of $\operatorname{{\mathcal{S}}}^{\perp}$ , the assumption implies that

which is equivalent to the condition that

for some $\xi\in\{-1,1\}$ . In particular, the algorithm will output a hyperplane

in Step 9, where $S$ is some set of size at most $\varepsilon N$ , and $\lambda=O(\delta^{\prime})$ . Since it will not affect anything, for without loss of generality let us assume that $\xi=1$ . The other case is symmetrical.

where $\|A\|_{F}=O\left(\delta\frac{N}{|T|}\right)$ . Hence,

as long as $\delta^{\prime}\geq O(\delta)$ . By self-duality of the Frobenius norm, using the test matrix $\frac{1}{\sqrt{d}}I$ , this implies that

These two claims in conjunction directly imply the correctness of the theorem. ∎

As before, this separation oracle and the classical theory of convex optimization [GLS88] shows that we have demonstrated an algorithm FindApproxCovariance with the following properties:

Fix $\varepsilon,\tau>0$ , and let $\delta=O(\varepsilon\log 1/\varepsilon)$ . Let $c>0$ be a universal constant which is sufficiently large. Let $X_{1},\ldots,X_{N}$ be an $\varepsilon$ -corrupted set of points satisfying (13-(15), for $\delta_{1},\delta_{2}\leq O(\delta)$ and $\delta_{3}\leq O(\delta\log 1/\varepsilon)$ . Then $\textsc{FindApproxCovariance}(\varepsilon,\tau,X_{1},\ldots,X_{N})$ runs in time $\operatorname{poly}(N,d,1/\varepsilon,\log 1/\tau)$ , and outputs a $u$ such that there is some $w\in C_{c\delta}$ such that $\|w-u\|_{\infty}\leq\varepsilon/(Nd\log(N/\tau))$ .

As before, this is not quite sufficient to actually recover the covariance robustly. Naively, we would just like to output $\sum_{i=1}^{N}u_{i}X_{i}X_{i}^{T}$ . However, this can run into issues if there are points $X_{i}$ such that $\|\Sigma^{-1/2}X_{i}\|_{2}$ is extremely large. We show here that we can postprocess the $u$ such that we can weed out these points. First, observe that we have the following lemma:

Assume $X_{1},\ldots,X_{N}$ satisfy (13). Let $w\in S_{N,\varepsilon}$ . Then

This follows since by (13), we have that $\sum_{i\in G}w_{i}X_{i}X_{i}^{T}\succeq w_{g}(1-\delta_{1})\Sigma\succeq(1-O(\delta_{1}))\Sigma$ . The lemma then follows since $\sum_{i\in E}w_{i}X_{i}X_{i}^{T}\succeq 0$ always. ∎

Let $u$ be such that there is $w\in C_{c\delta}$ such that $\|u-w\|_{\infty}\leq\varepsilon/(Nd\log(N/\tau))$ . Then, $\sum_{i=1}^{N}u^{-}_{i}X_{i}X_{i}^{T}\preceq(1+O(\delta))\Sigma$ .

By the definition of $C_{c\delta}$ , we must have that $\sum_{i=1}^{N}w_{i}X_{i}X_{i}^{T}\preceq(1+c\delta)\Sigma$ . Moreover, we must have $\widetilde{u}^{-}_{i}\leq w_{i}$ for every index $i\in[N]$ . Thus we have that $\sum_{i=1}^{N}\widetilde{u}^{-}_{i}w_{i}X_{i}X_{i}^{T}\preceq(1+c\delta)\Sigma$ , and hence $\sum_{i=1}^{N}u^{-}_{i}w_{i}X_{i}X_{i}^{T}\preceq(1+c\delta)\Sigma$ , since $\sum_{i=1}^{N}u^{-}_{i}w_{i}X_{i}X_{i}^{T}\preceq(1+O(\varepsilon))\sum_{i=1}^{N}\widetilde{u}^{-}_{i}w_{i}X_{i}X_{i}^{T}$ . ∎

We now give the full algorithm. The algorithm proceeds as follows: first run FindApproxCovariance to get some set of weights $u$ which is close to some element of $C_{c\delta}$ . We then compute the empirical covariance $\Sigma_{1}=\sum_{i=1}^{N}u_{i}X_{i}X_{i}^{T}$ with the weights $u$ , and remove any points which have $\|\Sigma_{1}^{-1/2}X_{i}\|_{2}^{2}$ which are too large. We shall show that this removes no good points, and removes all corrupted points which have $\|\Sigma^{-1/2}X_{i}\|_{2}^{2}$ which are absurdly large. We then rerun FindApproxCovariance with this pruned set of points, and output the empirical covariance with the output of this second run. Formally, we give the pseudocode for the algorithm in Algorithm 5.

We now show that this algorithm is correct.

Let $1/2\geq\varepsilon>0$ , and $\tau>0$ . Let $\delta=O(\varepsilon\log 1/\varepsilon)$ . Let $X_{1},\ldots,X_{N}$ be a $\varepsilon$ -corrupted set of samples from $\operatorname{\mathcal{N}}(0,\Sigma)$ where

Let $\widehat{\Sigma}$ be the output of $\textsc{LearnCovariance}(\varepsilon,\tau,X_{1},\ldots,X_{N})$ . Then with probability $1-\tau$ , $\|\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2}-I\|_{F}\leq O(\delta)$ .

We first condition on the event that we satisfy (12)-(15) with $\delta_{1},\delta_{2}\leq O(\delta)$ and $\delta_{3}\leq O(\delta\log 1/\varepsilon)$ . By our choice of $N$ , Fact 4.6, Corollary 4.7, Corollary 4.9, and Theorem 4.13, and a union bound, we know that this event happens with probability $1-\tau$ .

By Theorem 4.32, Lemma 4.33, and Lemma 4.34, we have that since $\varepsilon$ is sufficiently small,

In particular, this implies that for every vector $X_{i}$ , we have

Therefore, by (12), we know that in line 6, we never throw out any uncorrupted points, and moreover, if $X_{i}$ is corrupted with $\|\Sigma^{-1/2}X_{i}\|_{2}^{2}\geq\Omega(d\log N/\tau)$ , then it is thrown out. Thus, let $S^{\prime}$ be the set of pruned points. Because no uncorrupted point is thrown out, we have that $|S^{\prime}|\geq(1-2\varepsilon)N$ , and moreover, this set of points still satisfies (13)-(15)Technically, the samples satisfy a slightly different set of conditions since we may have thrown out some corrupted points, and so in particular the number of samples may have changed, but the meaning should be clear. and moreover, for ever $i\in S^{\prime}$ , we have $\|\Sigma^{-1/2}X_{i}\|_{2}^{2}\leq O(d\log N/\tau)$ . Therefore, by Theorem 4.32, we have that there is some $u^{\prime\prime}\in C_{c|I|}$ such that $\|u^{\prime}-u^{\prime\prime}\|_{\infty}<\varepsilon/(Nd\log(N/\tau))$ . But now if $\widehat{\Sigma}=\sum_{i\in|I|}u^{\prime}_{i}X_{i}X_{i}^{T}$ , we have

5 Learning an Arbitrary Gaussian Agnostically

We have shown how to agnostically learn the mean of a Gaussian with known covariance, and we have shown how to agnostically learn the covariance of a mean zero Gaussian. In this section, we show how to use these two in conjunction to agnostically learn an arbitrary Gaussian. Throughout, let $X_{1},\ldots,X_{N}$ be an $\varepsilon$ -corrupted set of samples from $\operatorname{\mathcal{N}}(\mu,\Sigma)$ , where both $\mu$ and $\Sigma$ are unknown. We will set

We first show a simple trick which, at the price of doubling the amount of error, allows us to assume that the mean is zero, without changing the covariance. We do so as follows: for each $i=1,\ldots,N/2$ , let $X_{i}^{\prime}=(X_{i}-X_{N/2+i})/\sqrt{2}$ . Observe that if both $X_{i}$ and $X_{N/2+i}$ are uncorrupted, then $X_{i}^{\prime}\sim\operatorname{\mathcal{N}}(0,\Sigma)$ . Moreover, observe that $X_{i}^{\prime}$ is corrupted only if either $X_{i}$ or $X_{N/2+i}$ is corrupted. Then we see that if $X_{1},\ldots,X_{N}$ is $\varepsilon$ -corrupted, then the $X^{\prime}_{1},\ldots,X^{\prime}_{N/2}$ is a $N/2$ -sized set of samples which is $2\varepsilon$ -corrupted. Thus, by using the results from Section 4.4, with probability $1-\tau$ , we can recover a $\widehat{\Sigma}$ such that

which in particular by Corollary 2.14, implies that

5.2 From Unknown Mean, Approximate Covariance, to Approximate Recovery

For each $X_{i}$ , let $X^{\prime\prime}_{i}=\widehat{\Sigma}^{-1/2}X_{i}$ . Then, for $X_{i}$ which is not corrupted, we have that $X_{i}^{\prime\prime}\sim\operatorname{\mathcal{N}}(\widehat{\Sigma}^{-1/2}\mu,\Sigma_{1})$ , where $\Sigma_{1}=\widehat{\Sigma}^{-1/2}\Sigma\widehat{\Sigma}^{-1/2}$ . By Corollary 2.14 and Lemma 4.25, if (20) holds, then we have

By Claim 2.5, this means that if (20) holds, the uncorrupted set of $X_{i}^{\prime\prime}$ can be treated as an $O(\varepsilon\log 1/\varepsilon)$ -corrupted set of samples from $\operatorname{\mathcal{N}}(\widehat{\Sigma}^{-1/2}\mu,I)$ . Thus, if (20) holds, the entire set of samples $X^{\prime\prime}_{1},\ldots,X^{\prime\prime}_{m}$ is a $O(\varepsilon\log 1/\varepsilon)$ -corrupted set of samples from $\operatorname{\mathcal{N}}(\widehat{\Sigma}^{-1/2}\mu,I)$ . Then, by using results from Section 4.3, with probability $1-\tau$ , assuming that 20 holds, we can recover a $\widehat{\mu}$ such that $\|\widehat{\mu}-\widehat{\Sigma}^{-1/2}\mu\|_{2}\leq O(\varepsilon\log^{3/2}(1/\varepsilon))$ . Thus, by Corollary 2.13, this implies that

which in conjunction with (21), implies that

and thus by following this procedure, whose formal pseudocode is given in Algorithm 6, we have shown the following:

Fix $\varepsilon,\tau>0$ . Let $X_{1},\ldots,X_{N}$ be an $\varepsilon$ -corrupted set of samples from $\operatorname{\mathcal{N}}(\mu,\Sigma)$ , where $\mu,\Sigma$ are both unknown, and

There is a polynomial-time algorithm $\textsc{RecoverRobustGaussian}(\varepsilon,\tau,X_{1},\ldots,X_{N})$ which with probability $1-\tau$ , outputs a $\widehat{\Sigma},\widehat{\mu}$ such that

Agnostically Learning a Gaussian, via Filters

In this section, we use our filter technique to give an agnostic learning algorithm for an unknown mean Gaussian with known covariance matrix. More specifically, we prove:

Notation. We will denote $\mu^{S}=\frac{1}{|S|}\sum_{X\in S}X$ and $M_{S}=\frac{1}{|S|}\sum_{X\in S}(X-\mu^{G})(X-\mu^{G})^{T}$ for the sample mean and modified sample covariance matrix of the set $S$ .

We start by defining our notion of good sample, i.e, a set of conditions on the uncorrupted set of samples under which our algorithm will succeed.

For all $x\in S$ we have $\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log(|S|/\tau)})$ .

We have that $\|\mu^{S}-\mu^{G}\|_{2}\leq\varepsilon.$

We have that $\left\|M_{S}-I\right\|_{2}\leq{\varepsilon}.$

We show in Appendix B that a sufficiently large set of independent samples from $G$ is $(\varepsilon,\tau)$ -good (with respect to $G$ ) with high probability. Specifically, we prove:

Let $G$ be a Gaussian distribution with identity covariance, and $\varepsilon,\tau>0.$ If the multiset $S$ is obtained by taking $\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau))$ independent samples from $G,$ it is $(\varepsilon,\tau)$ -good with respect to $G$ with probability at least $1-\tau.$

We require the following definition that quantifies the extent to which a multiset has been corrupted:

Given finite multisets $S$ and $S^{\prime}$ we let $\Delta(S,S^{\prime})$ be the size of the symmetric difference of $S$ and $S^{\prime}$ divided by the cardinality of $S.$

As in the convex program case, we will first use NaivePrune to remove points which are far from the mean. Then, we iterate the algorithm whose performance guarantee is given by the following:

A mean vector $\widehat{\mu}$ such that $\|\widehat{\mu}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$

We start by showing how Theorem 5.1 follows easily from Proposition 5.5.

By the definition of $\Delta(S,S^{\prime}),$ since $S^{\prime}$ has been obtained from $S$ by corrupting an $\varepsilon$ -fraction of the points in $S,$ we have that $\Delta(S,S^{\prime})\leq 2\varepsilon.$ By Lemma 5.3, the set $S$ of uncorrupted samples is $(\varepsilon,\tau)$ -good with respect to $G$ with probability at least $1-\tau.$ We henceforth condition on this event.

Since $S$ is $(\varepsilon,\tau)$ -good, all $x\in S$ have $\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log|S|/\tau})$ . Thus, the NaivePrune procedure does not remove from $S^{\prime}$ any member of $S$ . Hence, its output, $S^{\prime\prime}$ , has $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})$ and for any $x\in S^{\prime\prime}$ , there is a $y\in S$ with $\|x-y\|_{2}\leq O(\sqrt{d\log|S|/\tau})$ . By the triangle inequality, for any $x,z\in S^{\prime\prime}$ , $\|x-z\|_{2}\leq O(\sqrt{d\log|S|/\tau})=O(\sqrt{d\log(d/\varepsilon\tau}))$ .

Then, we iteratively apply the Filter-Gaussian-Unknown-Mean procedure of Proposition 5.5 until it terminates returning a mean vector $\mu$ with $\|\widehat{\mu}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$ We claim that we need at most $O(\alpha)$ iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets $S_{i}^{\prime},$ such that $\Delta(S,S_{i}^{\prime})\leq\Delta(S,S^{\prime})-i\cdot\varepsilon/{\alpha}.$ Thus, if we do not output the empirical mean in the first $2\alpha$ iterations, in the next iteration there are no outliers left. Hence in the next iteration it is impossible for the algorithm to output a subset satisfying Condition (ii) of Proposition 5.5, so it must output a mean vector satisfying (i), as desired. ∎

In this subsection, we describe the efficient algorithm establishing Proposition 5.5 and prove its correctness. Our algorithm calculates the empirical mean vector $\mu^{S^{\prime}}$ and empirical covariance matrix $\Sigma$ . If the matrix $\Sigma$ has no large eigenvalues, it returns $\mu^{S^{\prime}}.$ Otherwise, it uses the eigenvector $v^{\ast}$ corresponding to the maximum magnitude eigenvalue of $\Sigma$ and the mean vector $\mu^{S^{\prime}}$ to define a filter. Our efficient filtering procedure is presented in detailed pseudocode below.

1.2 Proof of Correctness of Filter-Gaussian-Unknown-Mean

We define $\mu^{G},\mu^{S},\mu^{S^{\prime}},\mu^{L},$ and $\mu^{E}$ to be the means of $G,S,S^{\prime},L,$ and $E$ , respectively.

Our analysis will make essential use of the following matrices:

Our analysis will hinge on proving the important claim that $\Sigma-I$ is approximately $(|E|/|S^{\prime}|)M_{E}$ . This means two things for us. First, it means that if the positive errors align in some direction (causing $M_{E}$ to have a large eigenvalue), there will be a large eigenvalue in $\Sigma-I$ . Second, it says that any large eigenvalue of $\Sigma-I$ will correspond to an eigenvalue of $M_{E}$ , which will give an explicit direction in which many error points are far from the empirical mean.

Useful Structural Lemmas. We will use the following simple fact about the concentration of Gaussian random variables:

We begin by noting that we have concentration bounds on $G$ and therefore, on $S$ due to its goodness.

The first line is Fact 5.6, and the former follows from it using the goodness of $S$ . ∎

By using the above fact, we obtain the following simple claim:

This follows from Fact 5.7 upon noting that $|w\cdot(X-\mu^{S^{\prime}})|>T+\|\mu^{S^{\prime}}-\mu^{G}\|_{2}$ only if $|w\cdot(X-\mu^{G})|>T$ . ∎

We can use the above facts to prove concentration bounds for $L$ . In particular, we have the following lemma:

We have that $\|M_{L}\|_{2}={O\left(\log(|S|/|L|)+\varepsilon|S|/|L|\right)}.$

For unit vectors $v$ , the RHS is bounded from above as follows:

where the second line follows from the fact that $\|v\|_{2}=1$ , $L\subset S$ , and $S$ satisfies condition (i) of Definition 5.2; the third line follows from (22); and the fourth line follows from Fact 5.7. ∎

As a corollary, we can relate the matrices $M_{S^{\prime}}$ and $M_{E}$ , in spectral norm:

We have that $M_{S^{\prime}}-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon))$ , where the $O(\varepsilon\log(1/\varepsilon))$ term denotes a matrix of spectral norm $O(\varepsilon\log(1/\varepsilon))$ .

By definition, we have that $|S^{\prime}|M_{S^{\prime}}=|S|M_{S}-|L|M_{L}+|E|M_{E}$ . Thus, we can write

where the second line uses the fact that $1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon$ , the goodness of $S$ (condition (iv) in Definition 5.2), and Lemma 5.9. Specifically, Lemma 5.9 implies that $(|L|/|S^{\prime}|)\|M_{L}\|_{2}=O(\varepsilon\log(1/\varepsilon))$ . Therefore, we have that

We now establish a similarly useful bound on the difference between the mean vectors:

Since $S$ is a good set, by condition (iii) of Definition 5.2, we have $\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon)$ . Since $1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon$ , it follows that $(|S|/|S^{\prime}|)\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon)$ . Using the valid inequality $\|M_{L}\|_{2}\geq\|\mu^{L}-\mu^{G}\|_{2}^{2}$ and Lemma 5.9, we obtain that $\|\mu^{L}-\mu^{G}\|_{2}\leq O\left(\sqrt{\log(|S|/|L|)}+\sqrt{\varepsilon|S|/|L|}\right)$ . Therefore,

as desired. This completes the proof of the lemma. ∎

By combining the above, we can conclude that $\Sigma-I$ is approximately proportional to $M_{E}$ . More formally, we obtain the following corollary:

We have $\Sigma-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon))+O(|E|/|S^{\prime}|)^{2}\|M_{E}\|_{2}$ , where the additive terms denote matrices of appropriately bounded spectral norm.

By definition, we can write $\Sigma-I=M_{S^{\prime}}-I-(\mu^{S^{\prime}}-\mu^{G})(\mu^{S^{\prime}}-\mu^{G})^{T}.$ Using Corollary 5.10 and Lemma 5.11, we obtain:

where the second line follows from the valid inequality $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}$ . This completes the proof. ∎

On the other hand, since $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}$ , Lemma 5.11 gives that

Case of Large Spectral Norm. We next show the correctness of the algorithm when it returns a filter in Step 7.

Moreover, using the inequality $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}$ and Lemma 5.11 as above, we get that

Suppose for the sake of contradiction that for all $T>0$ we have that

Using (24), we obtain that for all $T>0$ we have that

for some universal constant $C^{\prime\prime}$ .

We now have the following sequence of inequalities:

Combined with (23), we obtain $\lambda^{\ast}=O(\varepsilon\log(1/\varepsilon))$ , which is a contradiction if $C$ is sufficiently large. Therefore, it must be the case that for some value of $T$ the condition in Step 7 is satisfied.

Recall that $S^{\prime}=(S\setminus L)\cup E,$ with $E$ and $L$ disjoint multisets such that $L\subset S.$ We can similarly write $S^{\prime\prime}=(S\setminus L^{\prime})\cup E^{\prime},$ with $L^{\prime}\supseteq L$ and $E^{\prime}\subset E.$ Since

it suffices to show that $|E\setminus E^{\prime}|\geq|L^{\prime}\setminus L|+\varepsilon|S|/{\alpha}.$ Note that $|L^{\prime}\setminus L|$ is the number of points rejected by the filter that lie in $S\cap S^{\prime}.$ Note that the fraction of elements of $S$ that are removed to produce $S^{\prime\prime}$ (i.e., satisfy $|v^{\ast}\cdot(x-\mu^{S^{\prime}})|>T+\delta$ ) is at most $2\exp(-T^{2}/2)+\varepsilon/{\alpha}$ . This follows from Claim 5.8 and the fact that $T=O(\sqrt{d\log(d/\varepsilon\tau)})$ .

Hence, it holds that $|L^{\prime}\setminus L|\leq(2\exp(-T^{2}/2)+\varepsilon/{\alpha})|S|.$ On the other hand, Step 7 of the algorithm ensures that the fraction of elements of $S^{\prime}$ that are rejected by the filter is at least $8\exp(-T^{2}/2)+8\varepsilon/{\alpha})$ . Note that $|E\setminus E^{\prime}|$ is the number of points rejected by the filter that lie in $S^{\prime}\setminus S.$ Therefore, we can write:

where the second line uses the fact that $|S^{\prime}|\geq|S|/2$ and the last line uses the fact that $|L^{\prime}\setminus L|/|S|\leq 2\exp(-T^{2}/2)+\varepsilon/{\alpha}.$ Noting that $\log(d/\varepsilon\tau)\geq 1$ , this completes the proof of the claim. ∎

2 Learning a Gaussian With Unknown Covariance

In this subsection, we use our filter technique to agnostically learn a Gaussian with zero mean vector and unknown covariance. By combining the algorithms of the current and the previous subsections, as in our convex programming approach (Section 4.5), we obtain a filter-based algorithm to agnostically learn an arbitrary unknown Gaussian.

The main result of this subsection is the following theorem:

Let $G\sim\operatorname{\mathcal{N}}(0,\Sigma)$ be a Gaussian in $d$ dimensions with mean and unknown covariance, and let $\varepsilon,\tau>0$ . Let $S$ be an $\varepsilon$ -corrupted set of samples from $G$ of size $\Omega((d^{2}/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon{\tau}))$ . There exists an efficient algorithm that, given $S$ and $\varepsilon$ , returns the parameters of a Gaussian distribution $G^{\prime}\sim\operatorname{\mathcal{N}}(0,\widehat{\Sigma})$ such that with probability at least ${1-\tau}$ , it holds $\|I-\Sigma^{-1/2}\widehat{\Sigma}\Sigma^{-1/2}\|_{F}=O(\varepsilon\log(1/\varepsilon)).$

As in the previous subsection, we will need a condition on $S$ under which our algorithm will succeed.

For all $x\in S$ , $x^{T}\Sigma^{-1}x<O(d\log(|S|/\tau))$ .

We have that $\|\Sigma^{-1/2}\mbox{Cov}(S)\Sigma^{-1/2}-I\|_{F}=O(\varepsilon)$ .

For all even degree- $2$ polynomials $p$ , we have that $\operatorname*{Var}(p(S))=\operatorname*{Var}(p(G))(1+O(\varepsilon))$ .

Let us first note some basic properties of such polynomials on a normal distribution. The proof of this lemma is deferred to Section B.

$\operatorname*{Var}[p(X)]=2\|P_{2}\|_{F}^{2}$ and

For all $\delta>0$ , $\Pr(|p(X)|\leq\delta^{2})\leq O(\delta).$

We note that, if $S$ is obtained by taking random samples from $G$ , then $S$ is good with high probability. The proof of this lemma is also deferred to Section B.

Let $G$ be a $d$ -dimensional Gaussian with mean and let $\varepsilon,\tau>0$ . Let $N$ be a sufficiently large constant multiple of $d^{2}\log^{5}(d/\varepsilon\tau)/\varepsilon^{2}$ . Then a set $S$ of $N$ independent samples from $G$ is $(\varepsilon,\tau)$ -good with respect to $G$ with probability at least $1-\tau$ .

As in Definition 5.4, $\Delta(S,S^{\prime})$ is the size of the symmetric difference of $S$ and $S^{\prime}$ divided by $|S|$ .

The basic thrust of our algorithm is as follows: By Lemma 5.17, with high probability we have that $S$ is $(\varepsilon,\tau)$ -good with respect to $G$ . The algorithm is then handed a new set $S^{\prime}$ such that $\Delta(S,S^{\prime})\leq 2\varepsilon|S|$ . The algorithm will run in stages. In each stage, the algorithm will either output $G^{\prime}$ or will return a new set $S^{\prime\prime}$ such that $\Delta(S,S^{\prime\prime})<\Delta(S,S^{\prime})$ . In the latter case, the algorithm will recurse on $S^{\prime\prime}$ . We formalize this idea below:

Given Proposition 5.18, the proof of Theorem 5.14 is straightforward. By Lemma 5.17 the original set $S$ is $(\varepsilon,\tau)$ -good with respect to $G$ with probability at least $1-\tau$ . Then, $S^{\prime}$ satisfies the hypotheses of Proposition 5.18. We then repeatedly iterate the algorithm from Proposition 5.18 until it outputs a distribution $G^{\prime}$ close to $G$ . This must eventually happen because at every step the distance between $S$ and the set returned by the algorithm decreases by at least $1$ .

The function Find-max-poly uses similar notation to SeparationOracleUnknownCovariance, such that Filter-Gaussian-Unknown-Covariance and SeparationOracleUnknownCovariance can be more easily compared.

Let us first show that Find-max-poly is correct.

Algorithm Find-max-poly is correct and Filter-Gaussian-Unknown-Covariance runs time $\operatorname{poly}(d\log\tau/\varepsilon).$

Note that since the covariance matrix of $S^{\prime}$ is $\Sigma^{\prime}\;,$ we have

We let $T^{\prime}$ be the multiset of $y=\Sigma^{-1/2}x$ for $x\in S^{\prime}$ and $U^{\prime}$ the multiset of $z=y^{\otimes 2}$ for $y$ in $T^{\prime}$ . Recall that $P_{2}^{\flat}=\sqrt{2}v$ . We thus have

Thus, the $p(x)$ that maximizes $Q_{S^{\prime}}(p)$ is given by the unit vector $v$ that maximizes $v^{T}T_{S^{\prime}}v$ subject to $v^{\sharp}$ being symmetric.

Let $v^{\prime}=v^{\sharp T\flat}$ . Note that $v^{T}T_{S^{\prime}}v=v^{\prime T}T_{S^{\prime}}v^{\prime}$ by symmetries of $T_{S^{\prime}}$ . Thus, by linearity, $v^{\prime\prime}=v/2+v^{\prime}/2$ also has $v^{\prime\prime T}T_{S^{\prime}}v^{\prime\prime}=v^{T}T_{S^{\prime}}v$ . However, if $v^{\sharp}$ is not symmetric, $v^{\prime\prime}$ has $\|v^{\prime\prime}\|_{2}<1$ . Thus, the unit vector $v^{\prime\prime}/\|v^{\prime\prime}\|_{2}$ achieves a higher value of the bilinear form. Consequently, $v^{\ast\sharp}$ is symmetric.

To prove Lemma 5.20, we will require the following:

This holds essentially because the distribution of $p(X)$ for $X\in S$ is close to that for $p(G)$ , which has rapidly decaying tails. Therefore, throwing away an $\varepsilon$ -fraction of the mass cannot change the value of the variance by very much. In particular, we have that

Note that, since the matrix inner product is an inner product,

Let $p(x)$ denote the quadratic polynomial

As a corollary of this we note that $\Sigma^{\prime}$ cannot be too much smaller than $\Sigma$ .

The first step in verifying correctness is to note that if our algorithm returns on Step 6 that it does so correctly.

If our algorithm returns on Step 6, then $\Delta(S,S^{\prime\prime})<\Delta(S,S^{\prime}).$

This is clearly true if we can show that all $x$ removed have $x\not\in S$ . However, this follows because $(\Sigma^{\prime})^{-1}\leq 2\Sigma^{-1}$ , and therefore, by $(\varepsilon,\tau)$ -goodness, all $x\in S$ satisfy

By Corollary 2.14, it suffices to show that

Therefore, we will have an appropriate bound unless $\|I-\Sigma^{-1/2}\Sigma_{E}\Sigma^{-1/2}\|_{F}=\Omega(\log(1/\varepsilon)).$

Next, note that there is a matrix $M$ with $\|M\|_{F}=1$ such that

Indeed we can take $M=(I-\Sigma^{-1/2}\Sigma_{E}\Sigma^{-1/2})/\|I-\Sigma^{-1/2}\Sigma_{E}\Sigma^{-1/2}\|_{F}$ . Thus, there is a symmetric $M$ such that this holds.

This shows that if the algorithm returns in this step, it does so correctly. ∎

Next, we need to show that if the algorithm reaches Step 13 that such a $T$ exists.

If the algorithm reaches Step 13, then there exists a $T>1$ such that

Before we begin, we will need the following critical Lemma:

We note that since $\operatorname*{Var}_{X\sim G^{\prime}}(p(G^{\prime}))=Q_{G^{\prime}}(p)=1$ , we just need to show that the variance with respect to $G$ instead of $G^{\prime}$ is not too much larger. This will essentially be because the covariance matrix of $G$ cannot be much bigger than the covariance matrix of $G^{\prime}$ by Corollary 5.22.

We claim that if $A,B$ are matrices, then $\|AB\|_{F}\leq\|A\|_{2}\|B\|_{F}$ . If $B_{j}$ are the columns of $B$ , then we have $\|AB\|_{F}^{2}=\sum_{j}\|AB_{j}\|_{2}^{2}\leq\|A\|_{2}^{2}\sum_{j}\|B_{j}\|_{2}^{2}=(\|A\|_{2}\|B\|_{F})^{2}$ . Similarly for rows, we have $\|AB\|_{F}\leq\|A\|_{F}\|B\|_{2}$ .

Next, we need to consider $\mu$ . In particular, we note that by the similarity of $S$ and $S^{\prime}$ , $\mu$ must be between the $40$ and $60$ percentiles of values of $p(X)$ for $X\in S$ . However, since $S$ is $(\varepsilon,\tau)$ -good, this must be between the $30$ and $70$ percentiles of $p(G)$ . Therefore, by Cantelli’ s inequality,

In particular, we have that $\operatorname*{Var}_{X\in_{u}S^{\prime}}(p(X))=Q_{S^{\prime}}(p)\geq 1+C\varepsilon\ln^{2}(1/\varepsilon)$ , which means that

Hence, for $C$ sufficiently large, it must be the case that

For a sufficiently large $C$ , this contradicts Equation (28). ∎

Finally, we need to verify that if our algorithm returns output in Step 14, that it is correct.

If the algorithm returns during Step 14, then $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-\varepsilon/(d\log(N/\tau))^{2}.$

We note that it is sufficient to show that $|E\setminus S^{\prime\prime}|>|(S\setminus L)\setminus S^{\prime\prime}|$ . In particular, it suffices to show that

On the other hand, using (27) and the $\varepsilon$ -goodness of $S$ , we have that

Agnostically Learning a Mixture of Spherical Gaussians, via Convex Programming

In this section, we give an algorithm to agnostically learn a mixture of $k$ Gaussians with identical spherical covariance matrices up to error $\widetilde{O}(\operatorname{poly}(k)\cdot\sqrt{\varepsilon})$ . Let $\mathcal{M}=\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\sigma^{2}I)$ be the unknown $k$ -GMM each of whose components are spherical. For $X\sim\mathcal{M}$ , we write $X\sim_{j}\mathcal{M}$ if $X$ was drawn from the $j$ th component of $\mathcal{M}$ .

Our main result of this section is the following theorem:

There is an algorithm which with probability $1-\tau$ , outputs a distribution $\mathcal{M}^{\prime}$ such that

The running time of the algorithm is $\operatorname{poly}(d,1/\varepsilon,\log(1/\tau))^{k^{2}}$ .

Our overall approach will be a combination of our method for agnostically learning a single Gaussian and recent work on properly learning mixtures of multivariate spherical Gaussians [SOAJ14, LS17]. At a high level, this recent work relies upon the empirical covariance matrix giving an accurate estimate of the overall covariance matrix in order to locate the subspace in which the component mean lie. However, as we have observed already, the empirical moments do not necessarily give good approximations of the true moments in the agnostic setting. Therefore, we will use our separation oracle framework to approximate the covariance matrix, and the rest of the arguments follow similarly as previous methods.

The organization of this section will be as follows. We define some of the notation we will be using and the Schatten top- $k$ norm in Section 6.1. Section 6.2 states the various concentration inequalities we require. In Section 6.3, we go over our overall algorithm in more detail. Section 6.4 describes a first naive clustering step, which deals with components which are very well separated. Section 6.5 contains details on our separation oracle approach, allowing us to approximate the true covariance. Section 6.6 describes our spectral clustering approach to cluster components with means separated more than $\Omega_{k}(\log 1/\varepsilon)$ . In Section 6.7, we describe how to exhaustively search over a particular subspace to obtain a good estimate for the component means. In Section 6.8, we go over how to limit the set of hypotheses in order to satisfy the conditions of Lemma 2.23. For clarity of exposition, all of the above describe the algorithm assuming all $\sigma_{j}^{2}$ are equal. In Section 6.9, we discuss the changes to algorithm which are required to handle unequal variances.

For conciseness, many of the proofs are deferred to Section C.

Recall the definition of $S_{N,\varepsilon}$ from Section 4.1, which we will use extensively in this section. We will use the notation $\mu=\sum_{j\in[k]}\alpha_{j}\mu_{j}$ to denote the mean of the unknown GMM. Also, we define parameters $\gamma_{j}=\alpha_{j}\|\mu_{j}-\mu\|_{2}^{2}$ and let $\gamma=\max_{j}\gamma_{j}$ . And for ease of notation, let

to denote the covariance of the unknown GMM. Our algorithm for learning spherical $k$ -GMMs will rely heavily on the following, non-standard norm:

i.e., it is the sum of the top- $k$ singular values of $M$ .

It is easily verified that $\|\cdot\|_{T_{k}}$ has a dual characterization

where the maxima is taken over all $X$ with orthonormal columns. From this, it is easy to see that the Schatten top- $k$ norm is indeed a norm, as its name suggests:

$\|M\|_{T_{k}}$ is a norm on symmetric matrices.

2 Concentration Inequalities

In this section, we will establish some concentration inequalities that we will need for our algorithm for agnostically learning mixtures of spherical Gaussians. Recall the notation as described in Section 6.1. The following two concentration lemmata follow from the same proofs as for Lemmata 42 and 44 in [LS17].

Fix $\varepsilon,\delta>0$ . If $Y_{1},\dots,Y_{N}$ are independent samples from the GMM with PDF $\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\Sigma_{j})$ where $\alpha_{j}\geq\Omega(\varepsilon)$ for all $j$ , and $N=\Omega\left(\frac{d+\log{(k/\delta)}}{\varepsilon^{2}}\right)$ then with probability at least $1-O(\delta)$ ,

where $Q$ is defined as in equation (29).

From the same techniques as before, we get the same sort of union bounds as usual over the weight vectors:

Fix $\varepsilon\leq 1/2$ and $\tau\leq 1$ . There is a $\delta=O(\varepsilon\sqrt{\log 1/\varepsilon})$ such that if $Y_{1},\dots,Y_{N}$ are independent samples from the GMM with PDF $\sum_{j\in[k]}\alpha_{j}\mathcal{N}(\mu_{j},\Sigma_{j})$ where $\alpha_{j}\geq\Omega(\varepsilon)$ for all $j$ , and $N=\Omega\left(\frac{d+\log{(k/\tau)}}{\delta_{1}^{2}}\right)$ , then

where $Q$ is defined as in equation (29).

3 Algorithm

Our algorithm is based on the following deterministic conditions:

(32) follows from basic Gaussian concentration, and (33) and (34) follow from the results in Section 6.2 for $N$ sufficiently large. Note that these trivially imply similar conditions for the Schatten top- $k$ norm, at the cost of an additional factor of $k$ on the right-hand side of the inequalities. For the rest of this section, let $\delta=\max(\delta_{1},\delta_{2})$ .

At this point, we are ready to apply our separation oracle framework. In particular, we will find a weight vector $w$ over the points such that

for some choice of $\eta$ . The set of such weights is convex, and concentration implies that the true weight vector will have this property. Furthermore, we can describe a separation oracle given any weight vector not contained in this set (as long as $\eta$ is not too small). At this point, we use classical convex programming methods to find a vector which satisfies these conditions. Further details are provided in Section 6.5.

After this procedure, Lemma 6.13 shows that the weighted empirical covariance is spectrally close to the true covariance matrix. We are now in the same regime as [SOAJ14], which obtains their results as a consequence of the empirical covariance concentrating about the true covariance matrix. Thus, we will appeal to their analysis, highlighting the differences between our approach and theirs. We note that [LS17] also follows a similar approach and the interested reader may also adapt their arguments instead.

First, if $\gamma$ is sufficiently large (i.e., $\Omega_{k}(\log(1/\varepsilon))$ ), this implies a separation condition between some component mean and the mixture’s mean. This allows us to cluster the points further, using a spectral method. We take the top eigenvector of the weighted empirical covariance matrix and project in this direction, using the sign of the result as a classifier. In contrast to previous work, which requires that no points are misclassified, we can tolerate $\operatorname{poly}(\varepsilon/k)$ misclassifications, since our algorithms are agnostic. This crucially allows us to avoid a dependence on $d$ in our overall agnostic learning guarantee. Further details are provided in Section 6.6.

Finally, if $\gamma$ is sufficiently small, we may perform an exhaustive search. The span of the means is in the span of the top $k-1$ eigenvectors of the true covariance matrix, which we can approximate with our weighted empirical covariance matrix. Since $\gamma$ is small, by trying all points within a sufficiently tight mesh, we can guess a set of candidate means which are sufficiently close to the true means. Combining the approximations to the means with Corollary 2.13 and the triangle inequality, we can guarantee that at least one of our guesses is sufficiently close to the true distribution. Additional details are provided in Section 6.7.

To conclude our algorithm, we can apply Lemma 2.23. We note that this hypothesis selection problem has been studied before (see, e.g., [DL01, DK14]), but we must adapt it for our agnostic setting. This allows us to select a hypothesis which is sufficiently close to the true distribution, thus concluding the proof. We note that the statement of Lemma 2.23 requires the hypotheses to come from some fixed finite set, while there are an infinite number of Gaussian mixture models. In Section 6.8, we discuss how to limit the number of hypotheses based on the set of uncorrupted samples in order to satisfy the conditions of Lemma 2.23.

4 Naive Clustering

We give a very naive clustering algorithm, the generalization of NaivePrune, which recursively allows us to cluster components if they are extremely far away. The algorithm is very simple: for each $X_{i}$ , add all points within distance $O(d\log(k/\varepsilon))$ to a cluster $S_{i}$ . Let $\mathcal{C}$ be the set of clusters which contain at least $4\varepsilon N$ points, and let the final clustering be $C_{1},\ldots,C_{k^{\prime}}$ be formed by merging clusters in $\mathcal{C}$ if they overlap, and stopping if no clusters overlap. We give the pseudocode in Algorithm 10.

We prove here that this process (which may throw away points) throws away only at most a $\varepsilon$ fraction of good points, and moreover, the resulting clustering only misclassifies at most an $O(\varepsilon)$ -fraction of the good points, assuming (32).

Thus, we have that by applying this algorithm, given an $\varepsilon$ -corrupted set of samples from $\mathcal{M}$ , we may cluster them in a way which misclassifies at most an $\varepsilon/k$ fraction of the samples from any component, and such that within each cluster, the means of the associated components differ by at most $dk\log k/\varepsilon$ . Thus, each separate cluster is simply a $\varepsilon$ -corrupted set of samples from the mixture restricted to the components within that cluster; moreover, the number of components in each cluster must be strictly smaller than $k$ . Therefore, we may simply recursively apply our algorithm on these clusters to agnostically learn the mixture for each cluster, since if $k=1$ , this is a single Gaussian, which we know how to learn agnostically.

Thus, for the remainder of this section, let us assume that for all $j,j^{\prime}$ , we have $\|\mu_{j}-\mu_{j}^{\prime}\|_{2}^{2}\leq O(dk\log 1/\varepsilon)$ . Moreover, we may assume that there are no points $j,j^{\prime}$ (corrupted or uncorrupted), such that $\|X_{j}-X_{j^{\prime}}\|_{2}^{2}\geq\Omega(dk\log 1/\varepsilon)$ .

5 Estimating the Covariance Using Convex Programming

In this section, we will apply our separation oracle framework to estimate the covariance matrix. While in the non-agnostic case, the empirical covariance will approximate the actual covariance, this is not necessarily true in our case. As such, we will focus on determining a weight vector over the samples such that the weighted empirical covariance is a good estimate for the true covariance.

We first define the convex set for which we want an interior point:

In Section 6.5.1, we prove lemmata indicating important properties of this set. In Section 6.5.2, we give a separation oracle for this convex set. We conclude with Lemma 6.13, which shows that we have obtained an accurate estimate of the true covariance.

We start by proving the following lemma, which states that for any weight vector which is not in our set, the weighted empirical covariance matrix is noticeably larger than it should be (in Schatten top- $k$ norm).

Suppose that (33) holds, and $w\not\in\mathcal{C}_{ckh(k,\gamma,\delta)}$ . Then

We also require the following lemma, which shows that if a set of weights poorly approximates $\mu$ , then it is not in our convex set.

Suppose that (33) and (34) hold. Let $w\in S_{m,\varepsilon}$ and set $\widehat{\mu}=\sum_{i=1}^{m}w_{i}X_{i}$ and $\Delta=\mu-\widehat{\mu}$ . Furthermore, suppose that $\|\Delta\|_{2}\geq\Omega(h(k,\gamma,\delta))$ . Then

By contraposition, if a set of weights is in our set, then it provides a good approximation for $\mu$ :

Suppose that (33) and (34) hold. Let $w\in\mathcal{C}_{h(k,\gamma,\delta)}$ for $\delta=\Omega(\varepsilon\log 1/\varepsilon)$ . Then

5.2 Separation Oracle

In this section, we provide a separation oracle for $\mathcal{C}_{\eta}$ . In particular, we have the following theorem:

Fix $\varepsilon>0$ , and let $\delta=\Omega(\varepsilon\log 1/\varepsilon).$ Suppose that (33) and (34) hold. Let $w^{\ast}$ denote the weights which are uniform on the uncorrupted points. Then there is a constant $c$ and an algorithm such that:

(Completeness) If $w=w^{\ast}$ , then it outputs “YES”.

These two facts imply that the ellipsoid method with this separation oracle will terminate in $\operatorname{poly}(d,1/\varepsilon)$ steps, and moreover, will with high probability output a $w^{\prime}$ such that $\|w-w^{\prime}\|_{\infty}\leq\varepsilon/(Ndk\log 1/\varepsilon)$ for some $w\in\mathcal{C}_{ckh(k,\gamma,\delta)}$ . Moreover, it will do so in polynomially many iterations.

After running this procedure, we technically do not have a set of weights in $\mathcal{C}_{ckh(k,\gamma,\delta)}$ . But by the same argument as in Section 4.3, because the maximum distance between two points within any cluster is bounded, and we have the guarantee that $\|X_{i}-X_{j}\|^{2}\leq O(dk\log 1/\varepsilon)$ for all $i,j$ , we may assume we have a set of weights satisfying

We require the following lemma, describing the accuracy of the empirical covariance matrix with the obtained weights.

Let $\widehat{\mu}=\sum_{i=1}^{N}w_{i}X_{i}$ . After running the algorithm above, we have a vector $w$ such that

By triangle inequality and Corollary 6.11,

6 Spectral Clustering

Now that we have a good estimate of the true covariance matrix, we will perform spectral clustering while $\gamma$ is sufficiently large. We will adapt Lemma 6 from [SOAJ14], giving the following lemma:

Given a weight vector $w$ as output by Algorithm 11, if $\gamma\geq\Omega(\operatorname{poly}(k)\cdot\log 1/\varepsilon)$ , there exists an algorithm which produces a unit vector $v$ with the following guarantees:

There exists a non-trival partition of $[k]$ into $S_{0}$ and $S_{1}$ such that $v^{T}\mu_{j}>0$ for all $j\in S_{0}$ and $v^{T}\mu_{j}<0$ for all $j\in S_{1}$ ;

The probability of a sample being misclassified is at most $O(\operatorname{poly}(\varepsilon/k))$ , where a misclassification is defined as a sample $X$ generated from a component in $S_{0}$ having $v^{T}X<0$ , or a sample generated from a component in $S_{1}$ having $v^{T}X>0$ .

The algorithm will be as follows. Let $v$ be the top eigenvector of

For a sample $X$ , cluster it based on the sign of $v^{T}X$ . After performing this clustering, recursively perform our algorithm from the start on the two clusters.

The proof is very similar to that of Lemma 6 in [SOAJ14]. Their main concentration lemma is Lemma 30, which states that they obtain a good estimate of the true covariance matrix, akin to our Lemma 6.13. Lemma 31 argues that the largest eigenvector of this estimate is highly correlated with the top eigenvector of the true covariance matrix. Since $\gamma$ is large, this implies there is a large margin between the mean and the hyperplane. However, by standard Gaussian tail bounds, the probability of a sample landing on the opposite side of this hyperplane is small.

We highlight the main difference between our approach and theirs. For their clustering step, they require that no sample is misclustered with high probability. As such, they may perform spectral clustering while $\gamma=\Omega\left(\operatorname{poly}(k)\cdot\log(d/\varepsilon)\right)$ . We note that, in the next step of our algorithm, we will perform an exhaustive search. This will result in an approximation which depends on the value of $\gamma$ at the start of the step, and as such, using the same approach as them would result in an overall approximation which depends logarithmically on the dimension.

We may avoid paying this cost by noting that our algorithm is agnostic. They require that no sample is misclustered with high probability, while our algorithm tolerates that a $\operatorname{poly}(\varepsilon/k)$ -fraction of points are misclustered. As such, we can continue spectral clustering until $\gamma=O\left(\operatorname{poly}(k)\cdot\log(1/\varepsilon)\right)$ .

7 Exhaustive Search

The final stage of the algorithm is when we know that all $\gamma_{i}$ ’s are sufficiently small. We can directly apply the following lemma:

Given a weight vector $w$ as output by Algorithm 11, then the projection of $\frac{\mu_{j}-\mu}{\|\mu_{j}-\mu\|_{2}}$ onto the space orthogonal to the span of the top $k-1$ eigenvectors of

By taking a $k$ -wise Cartesian product of this set, we are guaranteed to obtain a vector which has this guarantee simultaneously for all $\mu_{j}$ .

8 Applying the Tournament Lemma

In this section, we discuss details about how to apply our hypothesis selection algorithm. First, in Section 6.8.1, we describe how to guess the mixing weights and the variance of the components. Then in Section 6.8.2, we discuss how to ensure our hypotheses come from some fixed finite set, in order to deal with technicalities which arise when performing hypothesis selection with our adversary model.

The majority of our algorithm is focused on generating guesses for the means of the Gaussians. In this section, we guess the remaining parameters: the mixing weights and the variance. While most of these guessing arguments are standard, we emphasize that we reap an additional benefit because our algorithm is agnostic. In particular, most algorithms must deal with error incurred due to misspecification of the parameters. Since our algorithm is agnostic, we can pretend the misspecified parameter is the true one, at the cost of increasing the value of the agnostic parameter $\varepsilon$ . If our misspecified parameters are accurate enough, the agnostic learning guarantee remains unchanged.

Guessing the mixing weights is fairly straightforward. For some $\nu=\operatorname{poly}(\varepsilon/k)$ sufficiently small, our algorithm generates a set of at most $(1/\nu)^{k}=\operatorname{poly}(k/\varepsilon)^{k}$ possible mixing weights by guessing the values $\{0,\varepsilon,\varepsilon+\nu,\varepsilon+2\nu,\dots,1-\nu,1\}$ for each $\alpha_{j}$ . Note that we may assume each weight is at least $\varepsilon$ , since components with weights less than this can be specified arbitrarily at a total cost of $O(k\varepsilon)$ in total variation distance.

Next, we need to guess the variance $\sigma^{2}$ of the components. To accomplish this, we will take $k+1$ samples (hoping to find only uncorrupted ones) and compute the minimum distance between any pair of them. Since we assume $k\ll 1/\varepsilon$ , we can repeatedly draw $k+1$ samples until we have the guarantee that at least one set is uncorrupted. If none of the $k+1$ samples are corrupted, then at least two of them came from the same component, and in our high-dimensional setting the distance between any pair of samples from the same component concentrates around $\sqrt{2d}\sigma$ . After rescaling this distance, we can then multiplicatively enumerate around this value with granularity $\operatorname{poly}(\varepsilon/dk)$ to get an estimate for $\sigma^{2}$ that is sufficiently good for our purposes. Applying Corollary 2.14 bounds the cost of this misspecification by $O(\varepsilon)$ . By rescaling the points, we may assume that $\sigma^{2}=1$ .

8.2 Pruning Our Hypotheses

In this section, we describe how to prune our set of hypotheses in order to apply Lemma 2.23. Recall that this lemma requires our hypotheses to come from some fixed finite set, rather than the potentially infinite set of GMM hypotheses. We describe how to prune and discretize the set of hypotheses obtained during the rest of the algorithm to satisfy this condition. For the purposes of this section, a hypothesis will be a $k$ -tuple of $d$ -dimensional points, corresponding only to the means of the components. While the candidate mixing weights already come from a fixed finite set (so no further work is needed), the unknown variance must be handled similarly to the means. The details for handling the variance are similar to (and simpler than) those for handling the means, and are omitted.

More precisely, this section will describe a procedure to generate a set of hypotheses $\mathcal{M}$ , which is exponentially large in $k$ and $d$ , efficiently searchable, and comes from a finite set of hypotheses which are fixed with respect to the true distribution. Then, given our set of hypotheses generated by the main algorithm (which is exponentially large in $k$ but polynomial in $d$ ), we iterate over this set, either replacing each hypothesis with a “close” hypothesis from $\mathcal{M}$ (i.e., one which is within $O(\varepsilon)$ total variation distance), or discarding the hypothesis if none exists. Finally, we run the tournament procedure of Lemma 2.23 on the resulting set of hypotheses.

At a high level, the approach will be as follows. We will take a small set of samples, and remove any samples from this set which are clear outliers (due to having too few nearby neighbors). This will give us a set of points, each of which are within a reasonable distance from some component mean. Taking a union of balls around these samples will give us a space that is a subset of a union of (larger) balls centered at the component centers. We take a discrete mesh over this space to obtain a fixed finite set of possible means, and round each hypothesis such that its means are within this set.

We start by taking $N=O(k\log(1/\tau)/\varepsilon^{2})$ samples, which is sufficient to ensure that the number of (uncorrupted) samples from component $j$ will be $(w_{j}\pm\Theta(\varepsilon))N$ for all $j\in[k]$ with probability $1-O(\tau)$ . Recall that we are assuming that $w_{j}=\Omega(\varepsilon)$ for all $j$ , as all other components may be defined arbitrarily at the cost of $O(k\varepsilon)$ in total variation distance. This implies that even after corruption, each component has generated at least $\varepsilon N$ uncorrupted samples.

By standard Gaussian concentration bounds, we know that if $N$ samples are taken from a Gaussian, the maximum distance between a sample and the Gaussian’s mean will be at most $\zeta=O(\sqrt{d\log(N/\tau)})$ with probability $1-\tau$ . Assume this condition holds, and thus each component’s mean will have at least $\varepsilon N$ points within distance $\zeta$ . We prune our set of samples by removing any point with fewer than $\varepsilon N$ other points at distance less than $2\zeta$ . This will not remove any uncorrupted points, by the above assumption, and triangle inequality. However, this will remove any corrupted points at distance at least $3\zeta$ from all component means, due to the fact that the adversary may only move an $\varepsilon$ -fraction of the points, and reverse triangle inequality.

Now, we consider the union of the balls of radius $3\zeta$ centered at each of the remaining points. This set contains all of the component means, and is also a subset of the union of the balls of radius $6\zeta$ centered at the component means. We discretize this set by taking its intersection with a lattice of side-length $\frac{\varepsilon}{k\sqrt{d}}$ . We note that any two points in this discretization are at distance at most $\varepsilon/k$ . By a volume argument, the number of points in the intersection is at most $k\left(\frac{12\zeta k\sqrt{d}}{\varepsilon}\right)^{d}$ . Each hypothesis will be described by the $k$ -wise Cartesian product of these points, giving us a set $\mathcal{M}$ of at most $k^{k}\left(\frac{12\zeta k\sqrt{d}}{\varepsilon}\right)^{kd}$ hypotheses.

Given a set of hypotheses $\mathcal{H}$ from the main algorithm, we prune it using $\mathcal{M}$ as a reference. For each $h\in\mathcal{H}$ , we see if there exists some $h^{\prime}\in\mathcal{M}$ such that the means in $h$ are at distance at most $\varepsilon/k$ from the corresponding means in $h^{\prime}$ .We observe that the complexity of this step is polynomial in $d$ and $k$ , not exponential, if one searches for the nearest lattice point in the sphere surrounding each unpruned sample, rather than performing a naive linear scan over the entire list. If such an $h^{\prime}$ exists, we replace $h$ with $h^{\prime}$ – otherwise, $h$ is simply removed. By Corollary 2.13 and the triangle inequality, this replacement incurs a cost of $O(\varepsilon)$ in total variation distance. At this point, the conditions of Lemma 2.23 are satisfied and we may run this procedure to select a sufficiently accurate hypothesis.

9 Handling Unequal Variances

In this section, we describe the changes required to allow the algorithm to handle different variances for the Gaussians. The main idea is to find the minimum variance of any component and perform clustering so we only have uncorrupted samples from Gaussians with variances within some known, polynomially-wide interval. This allows us to grid within this interval in order to guess the variances, and the rest of the algorithm proceeds with minor changes.

The first step is to locate the minimum variance of any component. Again using standard Gaussian concentration, in sufficiently high dimensions, if $N$ samples are taken from a Gaussian with variance $\sigma^{2}I$ , the distance between any two samples will be concentrated around $\sigma(\sqrt{2d}-\Theta(d^{1/4}))$ . With this in hand, we use the following procedure to estimate the minimum variance. For each sample $i$ , record the distance to the $(\varepsilon N+1)$ st closest sample. We take the $(\varepsilon N+1)$ st smallest of these values, rescale it by $1/\sqrt{2d}$ , and similar to before, guess around it using a multiplicative $(1+\operatorname{poly}(\varepsilon/kd))$ grid, which will give us an estimate $\hat{\sigma}_{min}^{2}$ for the smallest variance. We note that discarding the smallest $\varepsilon N$ fraction of the points prevents this statistic from being grossly corrupted by the adversary. For the remainder of this section, assume that $\sigma_{min}^{2}$ is known exactly.

At this point, we partition the points into those that come from components with small variance, and those with large variance. We will rely upon the following concentration inequality from [SOAJ14], which gives us the distance between samples from different components:

Given $N$ samples from a collection of Gaussian distributions, with probability $1-O(\tau)$ , the following holds for every pair of samples $X,Y$ :

where $X\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2}I)$ and $Y\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2}I)$ .

This procedure works due to the following argument. When we compute $H_{1}$ , we are guaranteed that it will contain all samples from components with variance $\sigma_{\min}^{2}$ , by the upper bound in Lemma 6.16. However, it may also contain samples from other components – in particular, those with variance at most $\gamma\sigma_{\min}^{2}$ , for

After this clustering step, the algorithm follows similarly to before. The main difference is in the convex programming steps and concentration bounds. For instance, before, we considered the set

Now, to reflect the different expression for the covariance of the GMM, we replace $\sigma^{2}I$ with $\sum_{j\in[k]}\alpha_{j}\sigma_{j}^{2}I$ ; for example:

We note that since all variances in each cluster are off by a factor of at most $e$ , this will only affect our concentration and agnostic guarantees by a constant factor.

Agnostically Learning Binary Product Distributions, via Filters

where $\Sigma$ is the modified empirical covariance. Ultimately, we show that this yields an estimate that is $\widetilde{O}(\sqrt{\varepsilon})$ close in total variation distance. See Section 7.2 for further details.

The main result of this section is the following theorem:

Let $P$ be a binary product distribution in $d$ dimensions and $\varepsilon,\tau>0.$ Let $S$ be a multiset of $\Theta(d^{4}\log(1/\tau)/\varepsilon^{2})$ independent samples from $P$ and $S^{\prime}$ be a multiset obtained by arbitrarily changing an $\varepsilon$ -fraction of the points in $S.$ There exists a polynomial time algorithm that returns a product distribution $P^{\prime}$ such that, with probability at least $1-\tau$ , we have $\|p-p^{\prime}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}),$ where $p$ and $p^{\prime}$ are the mean vectors of $P$ and $P^{\prime}$ respectively.

For $0<c<1/2,$ we say that a binary product distribution is $c$ -balanced if the expectation of each coordinate is in $[c,1-c].$

For $c$ -balanced binary product distributions, we have the following corollary of Lemma 2.17:

We start by defining a condition on the uncorrupted set of samples $S,$ under which our algorithm will succeed.

The following simple lemma shows that a sufficiently large set of independent samples from $P$ is $\varepsilon$ -good (with respect to $P$ ) with high probability.

Let $P$ be an arbitrary distribution on $\{0,1\}^{d}$ and $\varepsilon,\tau>0.$ If the multiset $S$ is obtained by taking $\Omega((d^{4}+d^{2}\log(1/\tau))/\varepsilon^{2})$ independent samples from $P,$ it is $\varepsilon$ -good with respect to $P$ with probability at least $1-\tau.$

Recall (see Definition 5.4) that $\Delta(S,S^{\prime})$ is the size of the symmetric difference of $S$ and $S^{\prime}$ divided by the cardinality of $S.$

Our agnostic learning algorithm establishing Theorem 7.1 is obtained by repeated application of the efficient procedure whose performance guarantee is given in the following proposition:

Let $P$ be a binary product distribution with mean vector $p$ and $\varepsilon>0$ be sufficiently small. Let $S$ be $\varepsilon$ -good with respect to $P$ , and $S^{\prime}$ be any multiset with $\Delta(S,S^{\prime})\leq 2\varepsilon.$ There exists a polynomial time algorithm Filter-Balanced-Product that, given $S^{\prime}$ and $\varepsilon>0,$ returns one of the following:

A mean vector $p^{\prime}$ such that $\|p-p^{\prime}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$

A multiset $S^{\prime\prime}\subseteq S^{\prime}$ such that $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.$

We start by showing how Theorem 7.1 follows easily from Proposition 7.7.

The proof of Theorem 7.1 is very similar to that of Theorem 5.1, however, we include it here for completeness. By the definition of $\Delta(S,S^{\prime}),$ since $S^{\prime}$ has been obtained from $S$ by corrupting an $\varepsilon$ -fraction of the points in $S,$ we have that $\Delta(S,S^{\prime})\leq 2\varepsilon.$ By Lemma 7.6, the set $S$ of uncorrupted samples is $\varepsilon$ -good with respect to $P$ with probability at least $1-\tau.$ We henceforth condition on this event.

Our algorithm iteratively applies the Filter-Balanced-Product procedure of Proposition 7.7 until it terminates returning a mean vector $p^{\prime}$ with $\|p-p^{\prime}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$ We claim that we need at most $d+1$ iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets $S_{0}=S^{\prime},S_{1}^{\prime},\ldots,$ such that $\Delta(S,S_{i}^{\prime})\leq\Delta(S,S^{\prime})-i\cdot(2\varepsilon/d).$ Thus, if the algorithm does not terminate in the first $d$ iterations, we have $S^{\prime}_{d}=S,$ and in the next iteration we output the sample mean of $S$ . ∎

In this section, we describe the efficient procedure establishing Proposition 7.7 followed by its proof of correctness. Our algorithm Filter-Balanced-Product is very simple: We consider the empirical distribution defined by the (corrupted) sample multiset $S^{\prime}.$ We calculate its mean vector $\mu^{S^{\prime}}$ and covariance matrix $M$ . If the matrix $M$ has no large eigenvalues, we return $\mu^{S^{\prime}}.$ Otherwise, we use the eigenvector $v^{\ast}$ corresponding to the maximum magnitude eigenvalue $\lambda^{\ast}$ of $M$ and the mean vector $\mu^{S^{\prime}}$ to define a filter. We zero out the diagonal elements of the covariance matrix for the following reason: The diagonal elements could contribute up to $\Omega(1)$ to the spectral norm, even without noise. This would prevent us from obtaining the desired error of $\widetilde{O}(\varepsilon).$ Our efficient filtering procedure is presented in detailed pseudocode below.

Tightness of our Analysis. We remark that the analysis of our filter-based algorithm is tight, and more generally our bound of $O(\varepsilon\sqrt{\log(1/\varepsilon)})$ is a bottleneck for filter-based approaches.

More specifically, we note that our algorithm will never successfully add points back to $S$ after they have been removed by the adversary. Therefore, if an $\varepsilon$ -fraction of the points in $S$ are changed, our algorithm may be able to remove these outliers from $S^{\prime},$ but will not be able to replace them with their original values. These changed values can alter the sample mean by as much as $\Omega(\varepsilon\sqrt{\log(1/\varepsilon)}).$

The rest of this section is devoted to the proof of correctness of algorithm Filter-Balanced-Product.

1.2 Setup and Basic Structural Lemmas

By definition, there exist disjoint multisets $L,E,$ of points in $\{0,1\}^{d}$ , where $L\subset S,$ such that $S^{\prime}=(S\setminus L)\cup E.$ With this notation, we can write $\Delta(S,S^{\prime})=\frac{|L|+|E|}{|S|}.$ Our assumption $\Delta(S,S^{\prime})\leq 2\varepsilon$ is equivalent to $|L|+|E|\leq 2\varepsilon\cdot|S|,$ and the definition of $S^{\prime}$ directly implies that $(1-2\varepsilon)|S|\leq|S^{\prime}|\leq(1+2\varepsilon)|S|.$ Throughout the proof, we assume that $\varepsilon$ is a sufficiently small constant. Our analysis will make essential use of the following matrices:

Our first claim follows from the Chernoff bound and the definition of a good set:

The following sequence of lemmata bound from above the spectral norms of the associated matrices. Our first simple lemma says that the (diagonally reduced) empirical covariance matrix $M_{S}$ , where $S$ is the set of uncorrupted samples drawn from the binary product distribution $P,$ is a good approximation to the matrix $M_{P},$ in spectral norm.

If $S$ is $\varepsilon$ -good, $\|M_{P}-M_{S}\|_{2}\leq O(\varepsilon).$

It suffices to show that $|(M_{P})_{i,j}-(M_{S})_{i,j}|\leq O(\varepsilon/d)$ for all $i\neq j$ . Then, we have that

A similar expression holds for $M_{S}$ except with probabilities for $X\in_{u}S.$ Since $S$ is $\varepsilon$ -good with respect to $P$ , we have $|(M_{P})_{i,j}-(M_{S})_{i,j}|\leq\varepsilon/d+\mu^{S^{\prime}}_{i}\varepsilon/d+\mu^{S^{\prime}}_{j}\varepsilon/d\leq 3\varepsilon/d$ . This completes the proof. ∎

As a simple consequence of the above lemma, we obtain the following:

If $S$ is $\varepsilon$ -good, $\|M-(1/|S^{\prime}|)(|S|M_{P}+|E|M_{E}-|L|M_{L})\|_{2}=O(\varepsilon).$

First note that we can write $|S^{\prime}|M=|S|M_{S}+|E|M_{E}^{0}-|L|M_{L}^{0},$ where $M_{E}^{0}$ and $M_{L}^{0}$ are obtained from $M_{E}$ and $M_{L}$ by zeroing out the diagonal. Observe that $|E|+|L|=O(\varepsilon)|S^{\prime}|.$ This follows from the assumption that $\Delta(S,S^{\prime})\leq 2\varepsilon$ and the definition of $S^{\prime}.$ Now note that the matrices $M_{E}-M_{E}^{0}$ and $M_{L}-M_{L}^{0}$ are diagonal with entries at most $1,$ and thus have spectral norm at most $1.$ The claim now follows from Lemma 7.9. ∎

We have that $\|M_{P}\|_{2}\leq\|\mu^{S^{\prime}}-p\|_{2}^{2}.$

Note that $(M_{P})_{i,j}=(\mu^{S^{\prime}}_{i}-p_{i})(\mu^{S^{\prime}}_{j}-p_{j})$ for $i\neq j$ and otherwise. Therefore, $M_{P}$ is the difference of $(\mu^{S^{\prime}}-p)(\mu^{S^{\prime}}-p)^{T}$ and the diagonal matrix with entries $(\mu^{S^{\prime}}_{i}-p_{i})^{2}.$ This in turn implies that

Note that both bounding matrices have spectral norm at most $\|\mu^{S^{\prime}}-p\|_{2}^{2}$ , hence so does $M_{P}.$ ∎

The following lemma, bounding from above the spectral norm of $M_{L},$ is the main structural result of this section. This is the core result needed to establish that the subtractive error cannot change the sample mean by much:

We have that $\|M_{L}\|_{2}=O(\log(|S|/|L|)+\|\mu^{S^{\prime}}-p\|_{2}^{2}+\varepsilon\cdot|S|/|L|),$ hence

Since $L\subseteq S$ , for any $x\in\{0,1\}^{d}$ , we have that

The RHS is in turn bounded from above as follows:

where the second line follows from (35) and the third line follows from Claim 7.8. This establishes the first part of the lemma.

The bound $(|L|/|S|)\|M_{L}\|_{2}=O(\varepsilon\log(1/\varepsilon)+\varepsilon\|\mu^{S^{\prime}}-p\|_{2}^{2})$ follows from the previously established bound using the monotonicity of the function $x\log(1/x),$ and the fact that $|L|/|S|\leq 2\varepsilon.$ The observation $|S|/|S^{\prime}|\leq 1+2\varepsilon\leq 2$ completes the proof of the second part of the lemma. ∎

Claim 7.10 combined with Lemmas 7.11 and 7.12 and the triangle inequality yield the following:

We have that $\|M-(|E|/|S^{\prime}|)M_{E}\|_{2}=O(\varepsilon\log(1/\varepsilon)+\|\mu^{S^{\prime}}-p\|_{2}^{2}).$

We are now ready to analyze the two cases of the algorithm Filter-Balanced-Product.

1.3 The Case of Small Spectral Norm

We start by analyzing the case where the mean vector $\mu^{S^{\prime}}$ is returned. This corresponds to the case that the spectral norm of $M$ is appropriately small, namely $\|M\|_{2}\leq O(\varepsilon\log(1/\varepsilon)).$ We start with the following simple claim:

Let $\mu^{E},\mu^{L}$ be the mean vectors of $E$ and $L$ respectively. Then, $\|\mu^{E}-\mu^{S^{\prime}}\|_{2}^{2}\leq\|M_{E}\|_{2}$ and $\|\mu^{L}-\mu^{S^{\prime}}\|_{2}^{2}\leq\|M_{L}\|_{2}.$

We prove the first inequality, the proof of the second being identical. Note that $M_{E}$ is a symmetric matrix, so $\|M_{E}\|_{2}=\max_{\|v\|_{2}=1}|v^{T}M_{E}v|.$ Moreover, for any vector $v$ we have that

Let $w=\mu^{E}-\mu^{S^{\prime}}$ and take $v=w/\|w\|_{2}.$ We conclude that $\|M_{E}\|_{2}\geq\|w\|_{2}^{2},$ as desired. ∎

The following crucial lemma, bounding from above the distance $\|\mu^{S^{\prime}}-p\|_{2}$ as a function of $\varepsilon$ and $\|M\|_{2},$ will be important for both this and the following subsections.

We have that $\|\mu^{S^{\prime}}-p\|_{2}\leq 2\sqrt{\varepsilon\|M\|_{2}}+O(\varepsilon\sqrt{\log(1/\varepsilon)}).$

First we observe that the mean vector $\mu^{S}$ of the uncorrupted sample set $S$ is close to $p.$ Since $S$ is $\varepsilon$ -good, this follows from the fact that for any $i\in[d],$ we have

Therefore, we get that $\|\mu^{S}-p\|_{2}\leq\varepsilon/\sqrt{d}.$

Consider $\mu^{E}$ and $\mu^{L}$ , the mean vectors of $E$ and $L$ , respectively. By definition, we have that

and thus by the triangle inequality we obtain

Therefore, we have the following sequence of inequalities:

where the first line follows from the triangle inequality, the second uses Claim 7.14, while the third uses Lemma 7.12 and Corollary 7.13. Finally, the last couple of lines assume that $\varepsilon$ is sufficiently small. The proof of Lemma 7.15 is now complete. ∎

We can now deduce the correctness of Step 7 of the algorithm Filter-Balanced-Product, since for $\|M\|_{2}\leq O(\varepsilon\log(1/\varepsilon)),$ Lemma 7.15 directly implies that $\|\mu^{S^{\prime}}-p\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$

1.4 The Case of Large Spectral Norm

We next show the correctness of the algorithm Filter-Balanced-Product if it returns a filter (rejecting an appropriate subset of $S^{\prime}$ ) in Step 8. This corresponds to the case that $\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon),$ for a sufficiently large universal constant $C>0.$ We will show that the multiset $S^{\prime\prime}\subset S^{\prime}$ computed in Step 8 satisfies $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.$

We start by noting that, as a consequence of Lemma 7.15, we have the following:

We have that $\|\mu^{S^{\prime}}-p\|_{2}\leq\delta:=3\sqrt{\varepsilon\|M\|_{2}}.$

By Lemma 7.15, we have that $\|\mu^{S^{\prime}}-p\|_{2}\leq 2\delta/3+O(\varepsilon\sqrt{\log(1/\varepsilon)}).$ Recalling that $\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon),$ if $C>0$ is sufficiently large, the term $O(\varepsilon\sqrt{\log(1/\varepsilon)})$ is at most $\delta/3.$ ∎

By construction, $v^{\ast}$ is the unit eigenvector corresponding to the maximum magnitude eigenvalue of $M.$ Thus, we have $(v^{\ast})^{T}Mv^{\ast}=\|M\|_{2}=\delta^{2}/(9\varepsilon).$ We thus obtain that

where the equality holds by definition, and the inequality follows from Corollary 7.13 and Claim 7.16 using the fact that $\varepsilon$ is sufficiently small and the constant $C$ is sufficiently large (noting that the constant in the RHS of Corollary 7.13 does not depend on $C$ ).

We show that (36) implies the existence of a $T>0$ with the properties specified in Step 8 of the algorithm Filter-Balanced-Product. More specifically, we have the following crucial lemma:

If $\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon),$ for a sufficiently large constant $C>0,$ there exists a $T>0$ satisfying the property in Step 8 of the algorithm Filter-Balanced-Product, i.e., such that

Assume for the sake of contradiction that this is not the case, i.e., that for all $T>0$ we have that

Since $E\subseteq S^{\prime},$ for all $x\in\{0,1\}^{d}$ , we have that $|S^{\prime}|\Pr_{X\in_{u}S^{\prime}}[X=x]\geq|E|\Pr_{Y\in_{u}E}[Y=x].$ This fact combined with (37) implies that for all $T>0$

Using (36) and (38), we have the following sequence of inequalities:

This yields the desired contradiction recalling that the assumption $\|M\|_{2}\geq C\varepsilon\log(1/\varepsilon)$ and the definition of $\delta$ imply that $\delta\geq C^{\prime}\varepsilon\sqrt{\log(1/\varepsilon)}$ for an appropriately large $C^{\prime}>0.$ ∎

The following simple claim completes the proof of Proposition 7.7:

We have that $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d\;.$

it suffices to show that $|E\setminus E^{\prime}|\geq|L^{\prime}\setminus L|+2\varepsilon|S|/d.$ Note that $|L^{\prime}\setminus L|$ is the number of points rejected by the filter that lie in $S\cap S^{\prime}.$ By Claim 7.8 and Claim 7.16, it follows that the fraction of elements $x\in S$ that are removed to produce $S^{\prime\prime}$ (i.e., satisfy $|v^{\ast}\cdot(x-\mu_{S^{\prime}})|>T+\delta$ ) is at most $2\exp(-T^{2}/2)+\varepsilon/d.$ Hence, it holds that $|L^{\prime}\setminus L|\leq(2\exp(-T^{2}/2)+\varepsilon/d)|S|.$ On the other hand, Step 8 of the algorithm ensures that the fraction of elements of $S^{\prime}$ that are rejected by the filter is at least $8\exp(-T^{2}/2)+8\varepsilon/d.$ Note that $|E\setminus E^{\prime}|$ is the number of points rejected by the filter that lie in $S^{\prime}\setminus S.$ Therefore, we can write:

where the second line uses the fact that $|S^{\prime}|\geq|S|/2$ and the last line uses the fact that $|L^{\prime}\setminus L|/|S|\leq(2\exp(-T^{2}/2)+\varepsilon/d).$ This completes the proof of the claim. ∎

2 Agnostically Learning Arbitrary Binary Product Distributions

In this subsection, we build on the approach of the previous subsection to show the following:

Similarly to the case of balanced product distributions, we will require a notion of a “good” set for our distribution. For technical reasons, the definition in this setting turns out to be more complicated. Roughly speaking, this is to allow us to ignore coordinates for which the small fraction of errors is sufficient to drastically change the sample mean.

We note that a sufficiently large set of samples from $P$ will satisfy the above properties with high probability:

If $S$ is obtained by taking $\Omega(d^{6}\log(1/\tau)/\varepsilon^{3})$ independent samples from $P,$ it is $(\varepsilon,1/5)$ -good with respect to $P$ with probability at least $9/10.$

The proof of this lemma is deferred to Section D.

We will also require a notion of the number of coordinates on which $S$ non-trivially depends:

Similarly to the balanced case, our algorithm is obtained by repeated application of an efficient filter procedure, whose precise guarantee is described below.

Let $P$ be a binary product distribution in $d$ dimensions and $\varepsilon>0.$ Suppose that $S$ is an $(\varepsilon,\eta)$ -good multiset with respect to $P$ with $\eta>10\varepsilon$ and $S^{\prime}$ be any multiset with $\Delta(S,S^{\prime})\leq 20\varepsilon.$ There exists a polynomial time algorithm which, given $\varepsilon$ and $S^{\prime}$ , returns one of the following:

A multiset $S^{\prime\prime}\subset S^{\prime}$ of elements of $\{0,1\}^{d}$ such that there exists a product distribution $\widetilde{P}$ with mean ${\widetilde{p}}$ and a multiset $\widetilde{S}$ that is $(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})$ -good with respect to $\widetilde{P}$ such that

Our agnostic learning algorithm is then obtained by iterating this procedure. We can prove Theorem 7.19 given Proposition 7.23.

When $\tau\leq 1/5$ , we will need to draw fresh $\varepsilon$ -corrupted samples and repeat this procedure $O(\log(1/\tau))$ times, and then one of the resulting output distributions is within total variation distance $O(\sqrt{\varepsilon\log(1/\varepsilon)})$ with probability at least $1-\tau/2$ . Then we use the agnostic hypothesis selection procedure of Lemma 2.23. ∎

In this section, we describe and analyze the efficient routine establishing Proposition 7.23. Our efficient filtering procedure is presented in detailed pseudocode below.

This completes the description of the algorithm. We now proceed to prove correctness.

2.2 Chi-Squared Distance and Basic Reductions

As previously mentioned, our algorithm will use the $\chi^{2}$ -distance between the mean vectors as a proxy for the total variation distance between two binary product distributions. Since the mean vector of the target distribution is not known to us, we will not be able to use the symmetric definition of the $\chi^{2}$ -distance used in Lemma 2.17 We will instead require the following asymmetric version of the $\chi^{2}$ -distance:

The following fact follows directly from Lemma 2.17.

There are two problems with using the $\chi^{2}$ distance between the mean vectors as a proxy for the total variation distance. The first is that the $\chi^{2}$ -distance between the means is a very loose approximation of the total variational distance when the means are close to or $1$ in some coordinate. To circumvent this obstacle, we remove such coordinates via an appropriate pre-processing in Step 8. The second is that the above asymmetric notion of the $\chi^{2}$ -distance may be quite far from the symmetric definition. To overcome this issue, it suffices to have that $q_{i}=O(p_{i})$ and $1-q_{i}=O(1-p_{i}).$ To ensure this condition is satisfied, we appropriately modify the target product distribution (that we aim to be close to). Next, we will show how we deal with these problems in detail.

The next reduction will be slightly more complicated and depends on the following idea: Suppose that there is a new product distribution ${\widetilde{P}}$ with mean ${\widetilde{p}}$ and an $(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})$ -good multiset ${\widetilde{S}}$ for ${\widetilde{P}}$ such that

Then, it suffices to show that our algorithm works for ${\widetilde{P}}$ and ${\widetilde{S}}$ instead of $P$ and $S$ (note that the input to the algorithm, $S^{\prime}$ and $\varepsilon$ in the same in either case). This is because the conditions imposed by the output in this case would be strictly stronger. In particular, we may assume that $\mu^{S^{\prime}}_{i}\geq p_{i}/3$ for all $i$ :

There is a product distribution ${\widetilde{P}}$ whose mean vector ${\widetilde{p}}$ satisfies $\mu^{S^{\prime}}_{i}\geq{\widetilde{p}}_{i}/3$ and $1-\mu^{S^{\prime}}_{i}\geq(1-{\widetilde{p}}_{i}/3)$ for all $i$ , and a set ${\widetilde{S}}\subseteq S$ that is $(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})$ -good for ${\widetilde{P}}$ and satisfies

If all coordinates $i$ have $\mu^{S^{\prime}}_{i}\geq p_{i}/3$ and $1-\mu^{S^{\prime}}_{i}\geq(1-p_{i}/3),$ then we can take ${\widetilde{P}}=P$ and ${\widetilde{S}}=S.$

Suppose that the $i^{th}$ coordinate has $\mu^{S^{\prime}}_{i}<p_{i}/3.$ Let ${\widetilde{P}}$ be the product whose mean vector ${\widetilde{p}}$ has ${\widetilde{p}}_{i}=0$ and ${\widetilde{p}}_{j}=p_{j}$ for $j\neq i.$ Let ${\widetilde{S}}$ be obtained by removing from $S$ all of the entries with $1$ in the $i^{th}$ -coordinate. Then, we claim that ${\widetilde{S}}$ is $(\varepsilon,\eta-p_{i})$ -good for ${\widetilde{P}}$ and has $\Delta({\widetilde{S}},S^{\prime})+p_{i}/5\leq\Delta(S,S^{\prime}).$ Note that here we have $\|p-{\widetilde{p}}\|_{1}=p_{i}.$

First, we show that ${\widetilde{S}}$ is $(\varepsilon,\eta-p_{i})$ -good for ${\widetilde{P}}.$ For any affine function $L(x)$ and set $T\subseteq[d]$ with $\sum_{j\in T}{\widetilde{p}}_{j}(1-{\widetilde{p}}_{j})\leq\eta-p_{i},$ we need to show that

Let ${\widetilde{T}}=T\cup\{i\}.$ We may or may not have $i\in T$ but, from the definition of ${\widetilde{p}}$ ,

Moreover, note that $S_{{\widetilde{T}}}={\widetilde{S}}_{T}$ and $P_{{\widetilde{T}}}={\widetilde{P}}_{T}.$ Thus, ${\widetilde{S}}$ is $(\varepsilon,\eta-p_{i})$ -good for ${\widetilde{P}}.$

Next, we show that $\Delta({\widetilde{S}},S^{\prime})+p_{i}/5\leq\Delta(S,S^{\prime}).$ We write $S={\widetilde{S}}\setminus{\widetilde{L}}\cup{\widetilde{E}}.$ We write $S_{1},L_{1},S^{\prime}_{1}$ for the subset of $S,L,S^{\prime}$ respectively, where the $i^{th}$ coordinate is $1.$ Since $S$ is $(\varepsilon,\eta)$ -good for $P,$ we have that $|\mu^{S}_{i}-p_{i}|\leq\varepsilon^{3/2}/d^{2}.$ Recall that we are already assuming that ${\widetilde{p}}_{i}\geq\varepsilon/d.$ Thus, $\mu^{S}_{i}\geq 29p_{i}/30.$ Therefore, we have that $|S_{1}|\geq 29p_{i}|S|/30.$ On the other hand, we have that $|S^{\prime}_{1}|\leq p_{i}|S^{\prime}|/3\leq 11p_{i}|S|/30.$ Thus, $|L_{1}|=|S_{1}\setminus S^{\prime}_{1}|\geq 18p_{i}|S|/30.$ This means that $p_{i}=O(\Delta({\widetilde{S}},S^{\prime}))=O(\varepsilon).$ However, ${\widetilde{E}}=E\cup S^{\prime}_{1}$ and ${\widetilde{L}}=L\setminus L_{1}$ . This gives

Similarly, suppose that instead the $i^{th}$ -coordinate has $1-\mu^{S^{\prime}}_{i}<(1-p_{i})/3.$ Let ${\widetilde{P}}$ be the product whose mean vector ${\widetilde{p}}$ has ${\widetilde{p}}_{i}=1$ and ${\widetilde{p}}_{j}=p_{j}$ for $j\neq i.$ Let ${\widetilde{S}}$ be obtained by removing from $S$ all of the entries with in the $i^{th}$ -coordinate. Then, by a similar proof we have that ${\widetilde{S}}$ is $(\varepsilon,\eta-(1-p_{i}))$ -good for ${\widetilde{P}}$ and has $\Delta({\widetilde{S}},S^{\prime})+(1-p_{i})/5\leq\Delta(S,S^{\prime}).$ Note that here we have $\|p-{\widetilde{p}}\|_{1}=1-p_{i}.$

By an easy induction, we can set all coordinates $i$ with $\mu^{S^{\prime}}_{i}\geq{\widetilde{p}}_{i}/3$ and $1-\mu^{S^{\prime}}_{i}\geq(1-{\widetilde{p}}_{i}/3)$ to or $1$ respectively, giving an ${\widetilde{S}}$ and ${\widetilde{P}}$ such that ${\widetilde{S}}$ is $(\varepsilon,\eta-\|p-{\widetilde{p}}\|_{1})$ -good for ${\widetilde{P}}$ and

In conclusion, throughout the rest of the proof we may and will assume that for all $i,$

$\varepsilon/d\leq\mu^{S^{\prime}}_{i}\leq 1-\varepsilon/d.$

$\mu^{S^{\prime}}_{i}\geq p_{i}/3$ and $1-\mu^{S^{\prime}}_{i}\geq(1-p_{i})/3.$

2.3 Setup and Basic Structural Lemmas

As in the balanced case, we can write $S^{\prime}=(S\setminus L)\cup E$ for disjoint multisets $L$ and $E.$ Similarly, we define the following matrices:

Note that we no longer zero-out the diagonals of $M_{P}$ and $M_{S}$ . This will turn out to allow us to more naturally relate spectral properties of these matrices to the $\chi^{2}$ -distance between the means. We start with the following simple claim:

$\operatorname*{Var}_{X\sim P}[v\cdot X]\leq 9$ and $|v\cdot(p-\mu^{S^{\prime}})|\leq\sqrt{\chi^{2}(\mu^{S^{\prime}},p)},$ and

$\Pr_{X\sim P}\left(|v\cdot X-\mu^{S^{\prime}}|\geq T+\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\right)\leq 9/T^{2}\;.$

where we used that $p_{i}\leq 3\mu^{S^{\prime}}_{i},$ $(1-p_{i})\leq 3(1-\mu^{S^{\prime}}_{i})$ and the assumption in the claim statement. For the second part of (i), note that

where the first inequality is Cauchy-Schwarz, and the second follows from the assumption in the claim statement that $\sum_{i=1}^{d}v_{i}^{2}\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})\leq 1.$ This proves (i).

To prove (ii), we note that Chebyshev’s inequality gives

where the second inequality follows from (i). To complete the proof note the inequality

where we used the triangle inequality and the second part of (i). ∎

Let $\mbox{Cov}[S]$ denote the sample covariance matrix with respect to $S,$ and $\mbox{Cov}[P]$ denote the covariance matrix of $P.$ We will need the following lemma:

$\left|\sqrt{\chi^{2}(\mu^{S^{\prime}},\mu^{S})}-\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\right|\leq O(\varepsilon/d),$ and

$\|D\left(\mbox{Cov}[S]-\mbox{Cov}[P]\right)D\|_{2}\leq O(\sqrt{\varepsilon})\;.$

For (i): Since $S$ is good, for any $i\in[d],$ we have

Therefore, by the triangle inequality we get

where the second inequality uses the fact that $\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})\geq\varepsilon/(2d).$

For (ii): Since $S$ is good, for any $i,j\in[d],$ we have

Combined with the bound $|\mu^{S}_{i}-p_{i}|\leq\varepsilon^{3/2}/d^{2}$ above, this gives

Note that $\|D\|_{2}=\max_{i}\left(1/\sqrt{\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})}\right)\leq\sqrt{2d/\varepsilon}.$ Therefore,

Combining Claim 7.27 and Lemma 7.28 we obtain:

$\operatorname*{Var}_{X\in_{u}S}[v\cdot X]\leq 10$ and $|v\cdot(\mu^{S}-\mu^{S^{\prime}})|\leq\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}+O(\varepsilon/d),$ and

$\Pr_{X\in_{u}S}\left(|v\cdot X-\mu^{S^{\prime}}|\geq T+\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\right)\leq 9/T^{2}+\varepsilon^{3/2}/d^{2}\;.$

where the second line uses Lemma 7.28 (ii), and the assumption $\|D^{-1}v\|_{2}^{2}=\sum_{i=1}^{d}v_{i}^{2}\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})\leq 1,$ and the third line holds for small enough $\varepsilon.$ Thus, using Claim 7.27 (i), we get that

By the Cauchy-Schwarz inequality and Lemma 7.28, we get

Part (ii) follows directly from Claim 7.27 (ii) and the assumption that $S$ is good for $P.$ ∎

We have that $\|D(M_{S}-M_{P})D\|_{2}\leq O(\sqrt{\varepsilon})$ .

We can show that $|(M_{S})_{i,j}-(M_{P})_{i,j}|\leq O(\varepsilon^{3/2}/d^{2})$ for all $i,j\in[d]$ , by expanding the LHS in terms of the differences of linear threshold functions on $S$ and $P$ in the same way as the proof of Lemma 7.28. Thus,

Finally note that $\|D\|_{2}=\max_{i}\left(1/\sqrt{\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})}\right)\leq\sqrt{2d/\varepsilon}$ , and so

We have that $\left\|D(|S^{\prime}|M-|S|M_{P}-|E|M_{E}+|L|M_{L})D\right\|_{2}=O(|S^{\prime}|\cdot\sqrt{\varepsilon})\;.$

This follows from Lemma 7.30 combined with the fact that $|S^{\prime}|M=|S|M_{S}+|E|M_{E}-|L|M_{L}$ and the observation $|S|\leq|S^{\prime}|/(1-2\varepsilon)\leq 2|S^{\prime}|.$ ∎

We have that $\|DM_{P}D\|_{2}\leq 9+\chi^{2}(\mu^{S^{\prime}},p).$

The following crucial lemma bounds from above the contribution to the error from $L$ :

The spectral norm $\|DM_{L}D\|_{2}=O(|S^{\prime}|/|L|+\chi^{2}(\mu^{S^{\prime}},p)).$

where the first line uses the triangle inequality, the second line uses Claim 7.27 (i), the third line follows from the fact that $L\subseteq S$ , the fourth line uses Corollary 7.29 (i), and the last line uses the fact that $\varepsilon$ is small enough. ∎

The above lemmata and the triangle inequality yield the following:

We have that $\left\|D\left(M-(|E|/|S^{\prime}|)M_{E}\right)D\right\|_{2}=O(1+\chi^{2}(\mu^{S^{\prime}},p))\;.$

We are now ready to analyze the two cases of the algorithm Filter-Product.

2.4 The Case of Small Spectral Norm

We begin by bounding various $\chi^{2}$ -distances by the spectral norm of appropriate matrices.

Let $\mu^{E},\mu^{L}$ be the mean vector of $E$ and $L$ , respectively. Then, $\chi^{2}(\mu^{S^{\prime}},\mu^{E})\leq\|DM_{E}D\|_{2}$ and $\chi^{2}(\mu^{S^{\prime}},\mu^{L})\leq\|DM_{L}D\|_{2}.$

We prove the first inequality, the proof of the second being very similar.

We can now prove that the output in Step 10 has the desired guarantee:

We have that $\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq 2\sqrt{\varepsilon\|DMD\|_{2}}+O(\sqrt{\varepsilon}).$

Since $S^{\prime}=(S\setminus L)\cup E,$ we have that $|S^{\prime}|\mu^{S^{\prime}}=|S|\mu^{S}+|E|\mu^{E}-|L|\mu^{L}.$ Recalling that $L,E$ are disjoint, the latter implies that

First note that, by Lemma 7.28, $|\sqrt{\chi^{2}(\mu^{S^{\prime}},\mu^{S})}-\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}|\leq O(\varepsilon/d).$ Lemma 7.35 and Corollary 7.34 give that

For $\varepsilon$ sufficiently small, we have that the $\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}$ terms satisfy

Recalling that $|E|/|S^{\prime}|\leq\Delta(S,S^{\prime})|S|/|S^{\prime}|\leq(5/2)\varepsilon$ , we now have:

Let $\delta:=3\sqrt{\varepsilon|\lambda|}$ . For some universal constant $C$ , if $\delta\leq C\sqrt{\varepsilon\log(1/\varepsilon)},$ then $\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq O(\sqrt{\varepsilon\log(1/\varepsilon)}).$ Otherwise, we have $\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq\delta.$

If $C$ is sufficiently large, when $\delta>C\sqrt{\varepsilon\log(1/\varepsilon)},$ this $O(\sqrt{\varepsilon})$ is at most $C\sqrt{\varepsilon\log(1/\varepsilon)}/6.$ ∎

By Corollary 7.37, we have that $\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq O(\sqrt{\varepsilon\log(1/\varepsilon)}).$ Thus, by Corollary 7.25, the total variation distance between the product distributions with means $p$ and $\mu^{S^{\prime}}$ is $O(\sqrt{\varepsilon\log(1/\varepsilon)}).$ ∎

2.5 The Case of Large Spectral Norm

We next need to show the correctness of the algorithm if it returns a filter, If we reach this step, then we have $\|DMD\|_{2}=\Omega(1),$ indeed $|v^{\prime}DMDv^{\prime T}|=\Omega(1),$ and by Corollary 7.37, it follows that $\sqrt{\chi^{2}(\mu^{S^{\prime}},p)}\leq\delta$ where $\delta:=3\sqrt{\varepsilon\|DMD\|_{2}}.$

Since $\|v^{\prime}\|_{2}=1,$ $Dv^{\prime}$ satisfies $\sum_{i=1}^{d}(Dv^{\prime})_{i}^{2}\mu^{S^{\prime}}_{i}(1-\mu^{S^{\prime}}_{i})=\sum_{i=1}^{m}v_{i}^{\prime 2}=1.$ Thus, we can apply Corollary 7.29 to it.

Let $a=\max_{x\in S^{\prime}}|v^{\ast}\cdot x-\mu^{S^{\prime}}|$ . Firstly, we look at the expected number of samples we reject:

Next, we look at the expected number of false positive samples we reject. If we write $S^{\prime\prime}=S\cup L^{\prime}\setminus E^{\prime}$ for disjoint multisets $L^{\prime}$ and $E^{\prime}$ , then these are the elements of $L^{\prime}\setminus L$ . We have:

Agnostically Learning Mixtures of Two Balanced Binary Products, via Filters

Unfortunately, bounds on the top absolute eigenvalue do not translate as well into bounds on the total variation distance of our estimate to the true distribution, as they did in all previous cases (e.g., if the top absolute eigenvalue is small in the case of learning the mean of a Gaussian with identity covariance, we can just use the empirical mean, etc). In fact, an eigenvalue $\lambda$ could just mean that $p$ and $q$ differ by $\sqrt{\lambda}$ along the direction $v$ . However, we can proceed by zeroing out the diagonals. If $uu^{T}$ has any large value along the diagonal, this operation can itself produce large eigenvalues. So, this strategy only works when $\|u\|_{\infty}$ is appropriately bounded. When $\|u\|_{\infty}$ is large, there is a separate strategy to deal with large entries in $u$ by guessing a coordinate whose value is large and conditioning on it, and once again setting up a modified eigenvalue problem. Our overall algorithm then follows from balancing all of these different cases, and we describe the technical components in more detail in the next subsection.

This section is devoted to the proof of the following theorem:

Recall that our overall approach is based on two strategies that succeed under different assumptions. Our first algorithm (Section 8.2) assumes that there exists a coordinate in which the means of the two component product distributions differ by a substantial amount. Under this assumption, we can use the empirical mean vectors conditioned on this coordinate being and $1$ . We show that the difference between these conditional mean vectors is almost parallel to the difference between the mean vectors of the product distributions. Considering eigenvectors perpendicular to this difference will prove a critical part of the analysis of this case. Our second algorithm (Section 8.3) succeeds under the assumption that the mean vectors of the two product distributions are close in all coordinates. This assumption allows us to zero out the diagonal of the covariance matrix without introducing too much error.

We note that together these algorithms will cover all possible cases. Our final algorithm runs all of these procedures in parallel, obtaining a polynomial number of candidate hypothesis distributions, such that at least one is sufficiently close to $\Pi$ . We then run the tournament described by Lemma 2.23 in order to find a particular candidate that is sufficiently close to the target. To ensure that all the distributions returned are in some finite set $\mathcal{M}$ , we round each of the probabilities of each of the products to the nearest multiple of $\varepsilon/d$ , and similarly round the mixing weight to the nearest multiple of $\varepsilon$ . This introduces at most $O(\varepsilon)$ additional error.

2 Mixtures of Products Whose Means Differ Significantly in One Coordinate

We will use the following notation. Let $\Pi$ be a mixture of two $c$ -balanced binary product distributions. We will write $\Pi$ as $\alpha P+(1-\alpha)Q,$ where $P,Q$ are binary product distributions with mean vectors $p,q$ , and $\alpha\in$ . In this subsection, we prove the following theorem:

For simplicity of the analysis, we will assume without loss of generality that $i^{\ast}=d,$ unless otherwise specified. First, we determine some conditions under which our sample set will be sufficient. We start by recalling our condition of a good set for a balanced binary product distribution:

We will also need this to hold after conditioning on the last coordinate.

Finally, we define the notion of a good set for a mixture of two balanced products.

Let $\Pi=\alpha P+(1-\alpha)Q$ be a mixture of two binary product distributions. We say that a multiset $S$ of elements of $\{0,1\}^{d}$ is $(\varepsilon,i)$ -good with respect to $\Pi$ if we can write $S=S_{P}\cup S_{Q}$ , where $S_{P}$ is $(\varepsilon,i)$ -good with respect to $P$ , $S_{Q}$ is $(\varepsilon,i)$ -good with respect to $Q$ , and $|\frac{|S_{P}|}{|S|}-\alpha|\leq\varepsilon/d^{2}$ .

We now show that taking random samples from $\Pi$ produces such a set with high probability.

Let $\Pi=\alpha P+(1-\alpha)Q$ be a mixture of binary product distributions, where $P,Q$ are binary product distributions with mean vectors $p,q$ . Let $S$ be a set obtained by taking $\Omega(d^{4}\log(1/\tau)/\varepsilon^{13/6})$ independent samples from $\Pi$ . Then, with probability at least $1-\tau$ , $S$ is $(\varepsilon,i)$ -good with respect to $\Pi$ for all $i\in[d]$ .

The proof of this lemma is deferred to Section E.

We claim that given a good set with an $\varepsilon$ -fraction of its entries corrupted, we can still determine $\Pi$ from it. In particular, this is achieved by iterating the following proposition.

We note that iteratively applying this algorithm until it outputs a set $R$ of mixtures gives Theorem 8.2.

Notation. All vectors in this section should be assumed to be over the first $d-1$ coordinates only. We will write $p_{-d}$ and $q_{-d}$ for the first $d-1$ coordinates of $p$ and $q$ , but for other vectors we will use the similar notation to that used elsewhere to denote $(d-1)$ -dimensional vectors.

The algorithm, written in terms of $i^{\ast}$ instead of $d$ for generality, is as follows:

We now proceed to prove correctness. We note that given $S=S_{P}\cup S_{Q}$ , we can write

where $S^{\prime}_{P}\subseteq S_{P}$ , $S^{\prime}_{Q}\subseteq S_{Q}$ and $E$ is disjoint from $S_{P}\setminus S^{\prime}_{P}$ and $S_{Q}\setminus S^{\prime}_{Q}$ . Thus, we have

We next need some basic lemmata relating the means of some of these distributions.

We next show that $\mu^{(1)}-\mu^{(0)}$ is approximately parallel to $p_{-d}-q_{-d}$ . Note that if we had $S=S^{\prime}$ and $\mu^{S_{P}}=p_{-d},\mu^{S_{Q}}=q_{-d}$ , then $\mu^{(1)}-\mu^{(0)}$ would be a multiple of $p_{-d}-q_{-d}$ . Since $S$ is $\varepsilon$ -good, we can bound the error introduced by $\mu^{S_{P}}-p$ , $\mu^{S_{Q}}-q_{-d}$ and Lemma 8.8 allow us to bound the error in taking $\mu^{S^{\prime}_{P}},\mu^{S^{\prime}_{Q}}$ instead of $p_{-d},q_{-d}$ . However, we still have terms in the conditional means of $E$ :

For some scalars $a=O(\varepsilon)$ , $b^{0}=O(|E^{0}|/|S^{\prime}|)$ , $b^{1}=O(|E^{1}|/|S^{\prime}|)$ , we have

where $E^{j}$ is the subset of $E$ with last entry $j$ , $\mu^{E^{j}}$ is the mean of $E^{j}$ with $d^{th}$ coordinate removed.

Let $S_{P}^{\prime j},S_{Q}^{\prime j},E^{j},S^{\prime j}$ denote the subset of the appropriate set in which the last coordinate is $j$ . Let $\mu^{{S^{\prime}_{P}}^{j}},\mu^{{S^{\prime}_{Q}}^{j}},\mu^{E^{j}}$ denote the means of $S_{P}^{\prime j},S_{Q}^{\prime j}$ , and $E^{j}$ with the last entry truncated, respectively.

Taking the means of the subsets of $S^{\prime j}$ , we find that

Therefore, using this and Lemma 8.8, we have that

Since $|S^{\prime j}|=|S_{P}^{\prime j}|+|S_{Q}^{\prime j}|+|E^{j}|$ , we have:

Thus, the sum of the coefficients of the $p_{-d}$ and $q_{-d}$ terms in Equation (40) is $|E^{0}||S^{\prime 1}|-|E^{1}||S^{\prime 0}|$ , which is bounded in absolute value by $|E||S^{\prime}|\leq O(\varepsilon|S|^{2})$ . Meanwhile, the $p_{-d}$ coefficient of Equation (40) has:

Noting that $(|E^{1}||S^{\prime 0}|-|E^{0}||S^{\prime 1}|)\alpha=O(\varepsilon|S^{\prime}|^{2})$ and $(|E^{1}||S^{\prime 0}|-|E^{0}||S^{\prime 1}|)(1-\alpha)=O(\varepsilon|S^{\prime}|^{2})$ , we can write Equation (40) as:

We write $\mu^{\Pi}=\alpha p_{-d}+(1-\alpha)q_{-d}$ and so, dividing by $|S^{\prime}|^{2}$ and recalling that $|E|/|S^{\prime}|\leq O(\varepsilon)$ , we get

If $\mu^{\Pi}=\mu$ , then we would be done. So, we must bound the error introduced by making this substitution. We can express $\mu$ as

Substituting this into Equation (41), gives the lemma. ∎

We now show that, for any vector $v$ perpendicular to $u$ , if the variance of $S^{\prime}$ in the $v$ -direction is small, then $v\cdot p_{-d}$ and $v\cdot q_{-d}$ are both approximately $v\cdot\mu$ .

For any $v$ with $\|v\|_{2}=1$ , $v\cdot u=0$ , we have that $|v\cdot(p_{-d}-\mu)|\leq\delta$ and $|v\cdot(q_{-d}-\mu)|\leq\delta$ for $\delta:=C(\varepsilon^{1/6}\|\Sigma\|_{2}+\varepsilon^{2/3}\log(1/\varepsilon))$ for a sufficiently large constant $C$ as defined in the algorithm.

Next, since $v\cdot u=0$ , we have by Lemma 8.9 that

However, we have that $|S^{\prime}|\mu=|S^{\prime}_{P}|\mu^{S^{\prime}_{p}}+|S^{\prime}_{Q}|\mu^{S^{\prime}_{q}}+|E|\mu^{E}+|S^{\prime}|O(\varepsilon\log(1/\varepsilon))$ , and so

Inserting our assumptions that $\alpha(1-\alpha)\geq\varepsilon^{1/6}/2$ , and $p_{d}-q_{d}\geq\varepsilon^{1/6}$ gives

We can now show that if we return $R,$ some distribution returned is close to $\Pi$ . First, we show that there are points on $L$ close to $p_{-d}$ and $q_{-d}$ .

It is clear that even discretizing $c$ and $d$ , we can still find such a pair that satisfies this condition.

There are $p^{\prime},q^{\prime}\in L$ such that $\|p_{-d}-p^{\prime}\|_{2},\|q_{-d}-q^{\prime}\|_{2}\leq\delta+O(\varepsilon^{1/6})\;.$

Similarly, we show that there is a $q^{\prime}\in L$ such that $\|q-q^{\prime}\|_{2}\leq\delta+O(\varepsilon^{1/6})$ . ∎

Now, we are ready to analyze the second part of our algorithm. The basic idea will be to show that if $\lambda$ is large, then a large fraction of the variance in the $v$ -direction is due to points in $E$ .

Since $S_{P}$ and $S_{Q}$ are $\varepsilon$ -good, we have that

Since $\lambda$ is larger than a sufficiently large constant, this completes the proof. ∎

We next show that the threshold $T>0$ required by our algorithm exists.

If $\lambda\geq\Omega(1)$ , there exists a $T>0$ such that

Assume for the sake of contradiction that this is not the case, i.e., that for all $T>0$ we have that

Using the fact that $E\subset S^{\prime}$ , this implies that for all $T>0$

Finally, we show that $S^{\prime\prime}$ is closer to $S$ than $S^{\prime}$ was.

If the algorithm returns $S^{\prime\prime}$ then $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d$ .

Since $S^{\prime\prime}\subset S$ , we can write $S^{\prime\prime}=S^{\prime\prime}_{P}\cup S^{\prime\prime}_{Q}\cup E^{\prime\prime}$ for $S^{\prime\prime}_{P}\subseteq S^{\prime}_{P}$ , $S^{\prime\prime}_{Q}\subseteq S_{Q}$ and $E^{\prime\prime}\subset E$ , where $E^{\prime\prime}$ has disjoint support from $S^{\prime\prime}_{P}\setminus S_{P}$ and $S^{\prime\prime}_{Q}\setminus S_{Q}$ . Thus, we need to show that

By Lemma 8.10, we have that $|v^{\ast}\cdot(\mu-p_{-d})|\leq\delta$ and so

Since $S$ is $(\varepsilon,d)$ -good, we have

We get the same bound for $X\in_{u}S_{Q}$ , and so

Since $L^{\prime\prime}_{P}\cup L^{\prime\prime}_{Q}\subseteq S$ but any $x\in(S^{\prime}_{P}\setminus S^{\prime\prime}_{P})\cup(S^{\prime}_{Q}\setminus S^{\prime\prime}_{Q})$ has $v^{\ast}\cdot(x-\mu)\geq T+\delta$ , we have that

3 Mixtures of Products Whose Means Are Close in Every Coordinate

In this section, we prove the following theorem:

We will assume without loss of generality that $\alpha\leq 1/2.$ We may also assume that $\alpha>10\delta\geq 10\varepsilon$ since otherwise, we can make use of our algorithm for learning a single product distribution.

In this context, we require the following slightly different definition of a good set:

Let $S$ be a multiset in $\{0,1\}^{d}$ . We say that $S$ is $\varepsilon$ -good for the mixture $\Pi$ if there exists a partition $S=S_{P}\cup S_{Q}$ such that $\left|\frac{|S_{P}|}{|S|}-\alpha\right|\leq\varepsilon$ and that $S_{P}$ and $S_{Q}$ are $\varepsilon/6$ -good for the component product distributions $P$ and $Q$ , respectively.

If $\Pi$ has mixing weights $\delta\leq\alpha\leq 1-\delta$ , with probability at least $1-\tau$ , a set $S$ of $\Omega(d^{4}\log 1/\tau/(\varepsilon^{2}\delta))$ samples drawn from $\Pi$ is good for $\Pi$ .

Our theorem will follow from the following proposition:

Before we present the algorithm, we give one final piece of notation. For $S$ a set of points, we let $\mbox{Cov}(S)$ be the sample covariance matrix of $S$ and $\mbox{Cov}_{0}(S)$ be the sample covariance matrix with zeroed out diagonal. Our algorithm is presented in detailed pseudocode in Algorithm 16.

To analyze this algorithm, we begin with a few preliminaries. Firstly, we recall that $S=S_{P}\cup S_{Q}$ . We can write $S^{\prime}=S^{\prime}_{P}\cup S^{\prime}_{Q}\cup E$ , where $S^{\prime}_{P}\subset S_{P}$ , $S^{\prime}_{Q}\subset S_{Q}$ , and

Let $\mu^{S^{\prime}_{P}}$ and $\mu^{S^{\prime}_{Q}}$ be the sample means of $S^{\prime}_{P}$ and $S^{\prime}_{Q}$ , respectively.

We have that $\alpha\|p-\mu^{S^{\prime}_{P}}\|_{2},(1-\alpha)\|q-\mu^{S^{\prime}_{Q}}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)})\;.$

We will require that the matrix $\mbox{Cov}_{0}(S^{\prime})$ is close to being PSD. The proof of this fact is rather technical and we defer it to the appendix.

Let $T$ be the multiset obtained from $S^{\prime}$ by replacing all points of $S^{\prime}_{P}$ with copies of $\mu^{S^{\prime}_{P}}$ and all points of $S^{\prime}_{Q}$ with copies of $\mu^{S^{\prime}_{Q}}$ . Then, $\|\mbox{Cov}_{0}(S^{\prime})-\mbox{Cov}(T)\|_{2}=O(\delta^{2})\;.$

We are now prepared to show that the first return condition outputs a correct answer. We begin by showing that vectors $u$ with large inner products with $\mu^{S^{\prime}_{P}}-\mu$ or $\mu^{S^{\prime}_{Q}}-\mu$ correspond to large eigenvectors of $\mbox{Cov}_{0}(S^{\prime})$ .

Using Lemma 8.22, we have $u^{T}\mbox{Cov}_{0}(S^{\prime})u=\operatorname*{Var}_{X\in_{u}T}(u\cdot X)+O(\delta^{2})\|u\|_{2}^{2}$ . From the definition of $T$ it follows that

Next, we show that, if there is only one large eigenvalue of $\mbox{Cov}_{0}(S^{\prime})$ , the means in question are both close to a given line.

Next, we analyze the second case of the algorithm. We must show that Step 8 will find an $r$ and $t$ . First, we claim that there is a $\theta$ which makes $r$ nearly perpendicular to $\mu^{S^{\prime}_{p}}-\mu^{S^{\prime}_{Q}}$ .

There exists a $r=(\cos\theta)u^{\ast}+(\sin\theta)v^{\ast}$ , with $\theta$ a multiple of $\delta^{2}/d$ , that has

Let $z=(\mu^{S^{\prime}_{P}}-\mu^{S^{\prime}_{Q}})$ . If $u^{\ast}\cdot z=0$ , then $\theta=0$ suffices. Otherwise, we take $\theta^{\prime}=\cot^{-1}(\frac{v^{\ast}\cdot z}{u^{\ast}\cdot z}).$ Then, let $\theta$ be the nearest multiple of $\delta^{2}/d$ to $\theta^{\prime}$ . Note that $|\cos\theta-\cos\theta^{\prime}|,|\sin\theta-\sin\theta^{\prime}|\leq|\theta-\theta^{\prime}|$ and $|u^{\ast}\cdot z|,|v^{\ast}\cdot z|\leq\sqrt{\|z\|_{2}}\leq\sqrt{d}$ . Then, we have

We now need to show that for this $r$ , Step 8 will find a $t$ . For this $r$ , $r\cdot\mu^{S^{\prime}_{P}}$ and $r\cdot\mu^{S^{\prime}_{Q}}$ are close. We need to show that $E$ contains many elements $x$ whose $r\cdot x$ is far from these. We can express this in terms of $T$ :

Let $r$ be a unit vector in $r\in\langle u^{\ast},v^{\ast}\rangle$ with $|r\cdot(\mu^{S^{\prime}_{P}}-\mu^{S^{\prime}_{Q}})|\leq\delta^{2}/\sqrt{d}.$ Then, there is a $t>1$ such that

for sufficiently large $C$ and we also have that $r^{T}\mbox{Cov}_{0}(S^{\prime})r$ is positive.

Suppose for a contradiction that this lemma does not hold. Then, since $E\subset T$ , we have

Since we assumed that $\delta^{2}\geq\Omega(\varepsilon\log(1/\varepsilon)$ , this is a contradiction.

To get a similar result for $S^{\prime}$ , we first need to show that $S^{\prime}_{P}$ and $S^{\prime}_{Q}$ are suitably concentrated about their means:

If $t\geq 1+\sqrt{2\log 6/\varepsilon}$ , this is strictly less than $2\varepsilon/3.$

Since $|r\cdot(\mu^{S_{Q}}-\mu^{S_{P}})|\leq\delta^{2}/\sqrt{d}\leq 1/2,$ we have

Noting that $1-(|S^{\prime}_{P}|+|S^{\prime}_{Q}|)/|S^{\prime}|=|E|/|S^{\prime}|\geq 4\varepsilon/3$ , we have

for $\varepsilon$ sufficiently small. If $t\geq 1+\sqrt{2\log 6/\varepsilon}$ , this expression is $(5/2)(\varepsilon/6)+\varepsilon/5d\leq 2\varepsilon/3$ . ∎

Now we can finally show that a $t$ exists for this $r$ , so Step 8 will succeed:

There is a $t\geq 1+2\sqrt{\log(9/\varepsilon)}$ such that

By Lemma 8.27, there exists a $t\geq 1$ such that

Using the definition of $T$ , the points when $x=\mu^{S^{\prime}_{P}}$ or $x=\mu^{S^{\prime}_{Q}}$ do not contribute to this probability so all points in $T$ that satisfy $r\cdot(x-\mu^{S^{\prime}_{P}})>2t$ come from $E$ . Since $E\subset S^{\prime}$ and $|S^{\prime}|=|T|$ , we have

Noting that $|E|/|S^{\prime}|\leq 4\varepsilon/3$ , all except a $4\varepsilon/3$ fraction of points $x\in T$ have $r\cdot(x-\mu^{S^{\prime}_{P}})=O(\delta^{2})$ . So, $4\varepsilon/3\geq 12\exp(-(t-1)^{2}/4)$ . Therefore, $t\geq 1+2\sqrt{\log(9/\varepsilon)}$ .

Thus, by Lemma 8.28, we have $(1-|E|/|S^{\prime}|)\Pr_{X\in_{u}S^{\prime}_{P}\cup S^{\prime}_{Q}}\left(r\cdot(X-\mu^{S^{\prime}_{P}})>t\right)<2\varepsilon/3$ . Again, using that $|E|/|S^{\prime}|\leq 4\varepsilon/3$ , we have that

Consequently, if $x$ satisfies $r\cdot(x-\mu^{S^{\prime}_{P}})>2t$ , then it satisfies $\Pr_{Y\in_{u}S^{\prime}}\left(r\cdot(x-Y)\leq t\right)<2\varepsilon$ . Substituting this condition into Equation (43) gives the lemma. ∎

Again we need to show that any filter does not remove too many points of $S$ . We need to show this for an arbitrary $r$ , not just one nearly parallel to $\mu^{S^{\prime}_{P}}-\mu^{S^{\prime}_{Q}}$ .

For any unit vector $r^{\prime}$ and $t\geq 2\sqrt{\log(1/\varepsilon)}$ , we have

Every point $x$ with $|r^{\prime}\cdot(x-p)|\leq t/2$ has $|r^{\prime}\cdot(x-y)\leq t|$ for all $y$ with $|r^{\prime}\cdot(y-p)|\leq t/2$ . Thus, for $x$ with $|r^{\prime}\cdot(x-p)|\leq t/2$ , we have

When $t\geq 2\sqrt{\log(1/\varepsilon)}$ , we have

Thus, we have $\Pr_{Y\in_{u}S^{\prime}}(r\cdot(x-Y)\leq t)\geq 4\varepsilon>2\varepsilon.$

But inequality (44) gives a bound on the number of $x$ in $S_{P}$ that do not satisfy this condition. That is,

Similarly, every point $x$ with $|r^{\prime}\cdot(x-q)|\leq t/2$ has

Dividing by $|S^{\prime}|$ and noting that $|S|\leq(1+2\varepsilon)|S^{\prime}|\leq(3/2)|S^{\prime}|$ completes the proof. ∎

Now, we can show that the filter improves $\Delta(S,S^{\prime\prime}),$ and such that the algorithm is correct in the filter case.

If we reach Step 9 and return $S^{\prime\prime}$ , then $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})-2\varepsilon/d.$

We can write $S^{\prime\prime}=S^{\prime\prime}_{P}\cup S^{\prime\prime}_{Q}\cup E^{\prime\prime}$ , where $E^{\prime\prime}$ has disjoint support from $S_{P}\setminus S^{\prime\prime}_{P}$ and $S_{Q}\setminus S^{\prime\prime}_{Q}$ . Note that, since we have $S^{\prime\prime}\subset S^{\prime}$ , we can define these sets such that $S^{\prime\prime}_{P}\subseteq S^{\prime}_{P}$ , $S^{\prime\prime}_{Q}\subseteq S^{\prime}_{Q}$ and $E^{\prime\prime}\subseteq E$ . We assume that we do. Now we have that

In Step 8, we found a vector $r$ and $t\geq 1+2\sqrt{\log(1/\varepsilon)}$ such that

Then in Step 9, we remove at least a $12\exp(-t^{2}/4)+3\varepsilon/d$ fraction of points. That is,

The fact that $t\geq 1+2\sqrt{\log(1/\varepsilon)}$ allows us to use Lemma 8.30, with $r^{\prime}=r$ , yielding that:

since $|S^{\prime}|\geq|S|(1-\Delta(S,S^{\prime}))\geq(1-2\varepsilon)|S|\geq 5|S|/6$ . ∎

Acknowledgements

J.L. would like to thank Michael B. Cohen and Samuel B. Hopkins for some very helpful discussions. We thank the reviewers for their detailed feedback, and Lili Su for pointing out an error in the proof of Lemma 5.3.

References

Appendix A Deferred Proofs from Section 4

This section contains deferred proofs of several concentration inequalities.

Therefore, by the triangle inequality, we have

Observe that the first term on the right hand side does not depend on the choice of $J$ . Let $E_{1}$ denote the event that

By Lemma 2.22, this happens with probability $1-\tau$ so long as

For any $J\subset[n]$ so that $|J|=(1-\varepsilon)n$ , let $E_{2}(J)$ denote the event that

Fix any such $J$ . By multiplying both sides by $\rho=(1-\varepsilon)/\varepsilon$ , the event $E_{2}(J)$ is equivalent to the event that

Let $A,B$ be as in Lemma 2.22. Observe that $\rho\delta_{1}=\Omega(\log 1/\varepsilon)\geq 1$ for $\varepsilon$ sufficiently small. Then, by Lemma 2.22, we have that for any fixed $J$ ,

Let $H(\varepsilon)$ denote the binary entropy function. We now have

as claimed, where (a) follows by a union bound over all sets $J$ of size $(1-\varepsilon)N$ , (b) follows from the bound $\log\binom{n}{\varepsilon n}\leq\varepsilon H(\varepsilon)$ , (c) follows since $H(\varepsilon)=O(\varepsilon\log 1/\varepsilon)$ as $\varepsilon\rightarrow 0$ , (d) follows from our choice of $\delta_{1}$ , and (e) follows from our choice of $n$ . This completes the proof. $\hfill\qed$

Proof of Theorem 4.12: We first recall Isserlis’ theorem, which we will require in this proof.

where the $\sum\prod$ is over all matchings of $\{1,\ldots,k\}$ .

Since $A$ is symmetric, it has a eigenvalue expansion $A=\sum_{i=1}^{d}\lambda_{i}u_{i}u_{i}^{T}$ , which immediately implies that $v=\sum_{i=1}^{d}\lambda_{i}u_{i}\otimes u_{i}$ . Let $X\sim\operatorname{\mathcal{N}}(0,\Sigma)$ . We compute the quadratic form:

where the last line follows by invoking Isserlis’s theorem. We now manage both sums individually. We have

Proof of Corollary 4.9: Let $\mathfrak{S}_{m}=\{S\subseteq[N]:|S|=m\}$ denote the set of subsets of $[N]$ of size $m$ . The same Bernstein-style analysis as in the proof of Lemma 4.3 yields that there exist universal constants $A,B$ so that:

Thus, union bounding over all $m\in\{1,\ldots,\varepsilon N\}$ yields that

by the same manipulations as in the proof of Lemma 4.3. $\hfill\qed$

This follows immediately from Lemmas 5.17 and 5.20.

Appendix B Deferred Proofs from Section 5

Proof of Lemma 5.3: Let $N=\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau))$ be the number of samples drawn from $G$ . For (i), the probability that a coordinate of a sample is at least $\sqrt{2\nu\log(Nd/3\tau)}$ is at most $\tau/3dN$ by Fact 5.6. By a union bound, the probability that all coordinates of all samples are smaller than $\sqrt{2\nu\log(Nd/3\tau)}$ is at least $1-\tau/3$ . In this case, $\|x\|_{2}\leq\sqrt{2\nu d\log(Nd/3\tau)}=O(\sqrt{d\nu\log(N\nu/\tau)})$ .

After translating by $\mu^{G}$ , we note that (iii) follows immediately from Lemma 2.21 and (iv) follows from Theorem 5.50 of [Ver10], as long as $N=\Omega(\nu^{4}d\log(1/\tau)/\varepsilon^{2})$ , with probability at least $1-\tau/3$ . It remains to show that, conditioned on (i), (ii) holds with probability at least $1-\tau/3$ .

To simplify some expressions, let $\delta:=\varepsilon/(\log(d\log d/\varepsilon\tau))$ and $R=C\sqrt{d\log(|S|/\tau)}$ . We need to show that for all unit vectors $v$ and all $0\leq T\leq R$ that

Firstly, we show that for all unit vectors $v$ and $T>0$

with probability at least $1-\tau/6$ . Since the VC-dimension of the set of all halfspaces is $d+1$ , this follows from the VC inequality [DL01], since we have more than $\Omega(d/(\delta/(10\nu\log(1/\delta))^{2})$ samples. We thus only need to consider the case when $T\geq\sqrt{10\nu\ln(1/\delta)}$ .

For any fixed unit vector $v$ and $T>\sqrt{10\nu\ln(1/\delta)}$ , except with probability $\exp(-N\delta/(6C\nu))$ , we have that

Let $E$ be the event that $|v\cdot(X-\mu^{G})|>T$ . Since $G$ is sub-gaussian, Fact 5.6 yields that $\Pr_{G}[E]=\Pr_{Y\sim G}[|v\cdot(X-\mu^{G})|>T]\leq\exp(-T^{2}/(2\nu))$ . Note that, thanks to our assumption on $T$ , we have that $T\leq\exp(T^{2}/(4\nu))/2C$ , and therefore $T^{2}\Pr_{G}[E]\leq\exp(-T^{2}/(4\nu))/2C\leq\delta/2C$ .

where (a) follows from sub-gaussianity, (b) follows from our choice of $T$ , and (c) comes from the fact that $1+x\leq e^{x}$ for all $x$ .

Thus, if $\delta$ is a sufficiently small constant and $C$ is sufficiently large, this yields the desired bound. ∎

Now let $\mathcal{C}$ be a $1/2$ -cover in Euclidean distance for the set of unit vectors of size $2^{O(d)}$ . By a union bound, for all $v^{\prime}\in\mathcal{C}$ and $T^{\prime}$ a power of 2 between $\sqrt{4\nu\ln(1/\delta)}$ and $R$ , we have that

Then, by a union bound, (46) holds simultaneously for all unit vectors $v$ and all $0\leq T\leq R$ , with probability a least $1-\tau/3$ . This completes the proof. $\hfill\qed$

B.2 Proof of Lemma 5.16

Let $P_{2}=U^{T}\Lambda U$ , where $U$ is orthogonal and $\Lambda$ is diagonal be an eigen-decomposition of the symmetric matrix $P_{2}$ . Then, $p(x)=(U\Sigma^{-1/2}x)^{T}P_{2}(U\Sigma^{-1/2}x)$ . Let $X\sim G$ and $Y=U\Sigma^{-1/2}X$ . Then, $Y\sim\mathcal{N}(0,I)$ and $p(X)=\sum_{i}\lambda_{i}Y_{i}^{2}+p_{0}$ for independent Gaussians $Y_{i}$ . Thus, $p(X)$ follows a generalized $\chi^{2}$ -distribution.

Let $Z=\sum_{i}a_{i}Y_{i}^{2}$ , where $Y_{i}$ are independent random variables distributed as $\mathcal{N}(0,1)$ . Let $a$ be the vector with coordinates $a_{i}$ . Then,

Noting that $2\sqrt{a}=1+a-(1-\sqrt{a})^{2}\leq 1+a$ for $a>0$ , we have

Applying this for $-p(x)$ instead of $p(x)$ and putting these together, we get

Substituting $t=T/3\|P_{2}\|_{F}-1/3$ , and $2\|P_{2}\|_{F}^{2}=\operatorname*{Var}_{X\sim G}(p(X))$ gives:

The final property is a consequence of the following anti-concentration inequality:

B.3 Proof of Lemma 5.17

Proof of Lemma 5.17: Firstly, we note that it suffices to prove this for the case $\Sigma=I$ , since for $X\sim\operatorname{\mathcal{N}}(0,\Sigma)$ , $Y=\Sigma^{-1/2}X$ is distributed as $\operatorname{\mathcal{N}}(0,I)$ , and all the conditions transform to those for $G=\operatorname{\mathcal{N}}(0,I)$ under this transformation.

Condition 1 follows by standard concentration bounds on $\|x\|_{2}^{2}$ . Condition 2 follows by estimating the entry-wise error between $\mbox{Cov}(S)$ and $I$ . These two conditions hold by Lemma 5.3, since they follow from (i), (iii), and (iv) of $(\varepsilon,\tau)$ goodness in the sense of Definition 5.2.

Condition 4 is substantially the most difficult of these conditions to prove. Naively, we would want to find a cover of all possible $p$ and all possible $T$ , and bound the probability that the desired condition fails. Unfortunately, the best a priori bound on $\Pr(|p(G)|>T)$ are on the order of $\exp(-T)$ . As our cover would need to be of size $2^{d^{2}}$ or so, to make this work with $T=d$ , we would require on the order of $d^{3}$ samples in order to make this argument work.

However, we will note that this argument is sufficient to cover the case of $T<10\log(1/\varepsilon)\log^{2}(d/\varepsilon)$ .

Let $\mathcal{P}_{k}$ be the set of even, mean-, degree- $2$ polynomials, such that the associated matrix $A$ satisfies:

Note that for $p\in\mathcal{P}_{k}$ that $|p(x)|\leq|x|^{2}/\sqrt{k}+\sqrt{k}$ .

Importantly, any polynomial can be written in terms of these sets.

Let $A$ be the associated matrix to $p$ . Note that $\|A\|_{F}=\operatorname*{Var}{p}=1$ . Let $A_{k}$ be the matrix corresponding to the top $k$ eigenvalues of $A$ . We now let $p_{1}$ be the polynomial associated to $A_{1}/2$ , $p_{2}$ be associated to $(A_{2}-A_{1})/2$ , $p_{4}$ be associated to $(A_{4}-A_{2})/2$ , and so on. It is clear that $p=2(p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d})$ . It is also clear that the matrix associated to $p_{k}$ has rank at most $k$ . If the matrix associated to $p_{k}$ had an eigenvalue more than $1/\sqrt{k}$ , it would need to be the case that the $k/2^{nd}$ largest eigenvalue of $A$ had size at least $2/\sqrt{k}$ . This is impossible since the sum of the squares of the eigenvalues of $A$ is at most $1$ .

We will also need covers of each of these sets $\mathcal{P}_{k}$ . We will assume that condition (1) holds, i.e., that $\|x\|_{2}\leq\sqrt{R}$ , where $R=O(d\log(d/\varepsilon\tau))$ . Under this condition, $p(x)$ cannot be too large and this affects how small a variance polynomial we can ignore.

For each $k$ , there exists a set $\mathcal{C}_{k}\subset\mathcal{P}_{k}$ such that

For each $p\in\mathcal{P}_{k}$ there exists a $q\in\mathcal{C}_{k}$ such that $\operatorname*{Var}(p(G)-q(G))\leq 1/R^{2}d^{2}$ .

We note that any such $p$ is associated to a matrix $A$ of the form $A=\sum_{i=1}^{k}\lambda_{i}v_{i}v_{i}^{T}$ , for $\lambda_{i}\in[0,1/\sqrt{k}]$ and $v_{i}$ orthonormal. It suffices to let $q$ correspond to the matrix $A^{\prime}=\sum_{i=1}^{k}\mu_{i}w_{i}w_{i}^{T}$ for with $|\lambda_{i}-\mu_{i}|<1/R^{2}d^{3}$ and $|v_{i}-w_{i}|<1/R^{2}d^{3}$ for all $i$ . It is easy to let $\mu_{i}$ and $w_{i}$ range over covers of the interval and the sphere with appropriate errors. This gives a set of possible $q$ ’s of size $2^{O(dk\log R)}$ as desired. Unfortunately, some of these $q$ will not be in $\mathcal{P}_{k}$ as they will have eigenvalues that are too large. However, this is easily fixed by replacing each such $q$ by the closest element of $\mathcal{P}_{k}$ . This completes our proof. ∎

We next will show that these covers are sufficient to express any polynomial.

Combining the above two lemmata we have that any such $p$ can be written as

where $q_{k}$ above is in $\mathcal{C}_{k}$ and $\operatorname*{Var}[p_{k}(G)]<1/R^{2}d^{2}$ . Thus, $p^{\prime}=p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d}$ has $\operatorname*{Var}[p^{\prime}(G)]\leq O(1/R^{2})$ . This completes the proof. ∎

The key observation now is that if $|p(x)|\geq T$ for $\|x\|_{2}\leq\sqrt{d/\varepsilon}$ , then writing $p=q_{1}+q_{2}+q_{4}+\ldots+q_{d}+p^{\prime}$ as above, it must be the case that $|q_{k}(x)|>(T-1)/(2\log(d))$ for some $k$ . Therefore, to prove our main result, it suffices to show that, with high probability over the choice of $S$ , for any $T\geq 10\log(1/\varepsilon)\log^{2}(d/\varepsilon)$ and any $q\in\mathcal{C}_{k}$ for some $k$ , that $\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log(d))$ . Equivalently, it suffices to show that for $T\geq 10\log(1/\varepsilon)\log(d/\varepsilon)$ it holds $\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log^{2}(d))$ . Note that this holds automatically for $T>R$ , as $p(x)$ cannot possibly be that large for $\|x\|_{2}\leq\sqrt{R}$ . Furthermore, note that losing a constant factor in the probability, it suffices to show this only for $T$ a power of $2$ .

Therefore, it suffices to show for every $k\leq d$ , every $q\in\mathcal{C}_{k}$ and every $R/\sqrt{k}\gg T\gg\log(1/\varepsilon)\log R$ that with probability at least $1-{\tau}2^{-\Omega(dk\log R)}$ over the choice of $S$ we have that $\Pr_{x\in_{u}S}(|q(x)|>T)\ll\varepsilon/(T^{2}\log^{4}(R))$ . However, by the Hanson-Wright inequality, we have that

Therefore, by Chernoff bounds, the probability that more than a $\varepsilon/(T^{2}\log^{4}R)$ -fraction of the elements of $S$ satisfy this property is at most

Appendix C Deferred Proofs from Section 6

Construct an auxiliary graph on $k$ vertices, where we put an edge between nodes $r_{i}$ and $r_{i+1}$ . By the above, there must be a path from $j$ to $j^{\prime}$ in this graph. Since this graph has $k$ nodes, there must be a path of length at most $k$ from $j$ to $j^{\prime}$ ; moreover, by the above, we know that this implies that $\|\mu_{j}-\mu_{j^{\prime}}\|_{2}^{2}\leq O(kd\log k/\varepsilon)$ .

Finally, the fourth property follows from the same argument as the proof of the third. $\hfill\qed$

Proof of Lemma 6.9: Let $C=\sum_{i=1}^{N}w_{i}(X_{i}-\mu)(X_{i}-\mu)^{T}-I$ . Let $v$ be the top eigenvector of

In particular, since $\left|v^{T}\left(\sum_{i=1}^{N}w_{i}(X_{i}-\mu)(X_{i}-\mu)^{T}-(I+Q)\right)v\right|\geq ckh(k,\gamma,\delta)$ , we must have

Let $U=[v,u_{1},\ldots,u_{d-1}]$ be an $d\times k$ matrix with orthonormal columns, where the columns span the set of vectors $\left\{(\mu_{j}-\mu)\ :\ j\in[k]\right\}\cup\{v\}$ . We note the rank of this set is at most $k$ due to the definition of $\mu$ .

Using the dual characterization of the Schatten top- $k$ norm, we have that

Observe that since $\mbox{span}(Q)\subseteq\mbox{span}(U)$ , we have

Proof of Lemma 6.10: By Fact 4.2 and (34) we have $\|\sum_{i=G}\frac{w_{i}}{w_{g}}X_{i}-\mu\|_{2}\leq k^{1/2}\delta_{2}$ . Now, by the triangle inequality, we have

Using the fact that variance is nonnegative we have

where in the last inequality we have used Fact 4.2 and (33). Hence altogether this implies that

Once more, let $\Delta=\mu-\hat{\mu}$ and expand the formula for $M$ :

Suppose that $w=w^{*}$ . Then $\|M\|_{T_{k}}\leq\sum_{i\in[k]}\gamma_{j}+\frac{ckh(k,\gamma,\delta_{1})}{2}$ .

$w^{\ast}$ are the weights that are uniform on the uncorrupted points. Because $E\leq 2\varepsilon N$ , we have that $w^{\ast}\in S_{N,\varepsilon}$ . Using (33), this implies that $w^{*}\in\mathcal{C}_{f(k,\gamma,\delta_{1})}$ . By Corollary 6.11, $\|\Delta\|_{2}\leq O(\varepsilon\sqrt{\log 1/\varepsilon})$ .

Suppose that $w\not\in\mathcal{C}_{ckh(k,\gamma,\delta)}$ . Then $\|M\|_{T_{k}}>\sum_{i\in[k]}\gamma_{j}+\frac{ckh(k,\gamma,\delta_{1})}{2}$ .

We split into two cases. In the first case, $\|\Delta\|_{2}^{2}\leq\frac{ckh(k,\gamma,\delta)}{10}$ . By Lemma 6.9, we have that

In the other case, $\|\Delta\|_{2}^{2}\geq\frac{ckh(k,\gamma,\delta)}{10}$ . Recall that $Q=\sum_{j\in[k]}\alpha_{j}(\mu_{j}-\mu)(\mu_{j}-\mu)^{T}$ from (29). Write $M$ as follows:

Now taking the Schatten top- $k$ norm of $M$ , we have

The first inequality is the triangle inequality, the second is by (33) and Fact 4.2, the third is because the summed matrices are positive semidefinite, the fourth follows from Lemma 6.10, and the fifth holds for all $c$ sufficiently large. ∎

If $\|\Delta\|_{2}^{2}\leq\frac{ckh(k,\gamma,\delta)}{10}$ , then

But this is true for $c$ sufficiently large, as $\sqrt{h(k,\gamma,\delta)}\geq\sqrt{\delta}$ . Therefore,

where the second inequality holds since $\Lambda>\sum_{j\in[k]}\gamma_{j}+\frac{ckh(k,\gamma,\delta)}{2}$ .

On the other hand, consider when $\|\Delta\|_{2}^{2}\geq\frac{ckh(k,\gamma,\delta)}{10}$ . By (47), we know that

The first and third terms are immediately dominated by $\Omega\left(\frac{\|\Delta\|^{2}_{2}}{\varepsilon}\right)$ , it remains to show that

Or equivalently, $k^{1/2}\delta\varepsilon=O\left(\|\Delta\|_{2}\right).$ This follows since

Appendix D Deferred Proofs from Section 7

Proof of Lemma 7.21: By Lemma 7.6 applied with $\varepsilon^{\prime}:=\varepsilon^{3/2}/10d$ in place of $\varepsilon,$ since we have $\Omega(d^{4}\log(1/\tau)/\varepsilon^{\prime 2})$ samples from $P,$ with probability at least $1-\tau,$ the set $S$ is such that for all affine functions $L,$ it holds $|\Pr_{X\in_{u}S}(L(X)\geq 0)-\Pr_{X\sim P}(L(X)\geq 0)|\leq\varepsilon^{\prime}/d.$ We henceforth condition on this event.

Let $C_{T}$ be the event that all coordinates in $T$ take their most common value. For a single coordinate $i,$ the probability that it does not take its most common value, $\min\{p_{i},1-p_{i}\},$ satisfies

Thus, by a union bound, we have that $\Pr_{P}(C_{T})\geq 3/5.$ Let $\#_{T}(x)$ be the number of coordinates of $x$ in $T$ which do not have their most common value, and observe that $\#_{T}(x)$ is an affine function of $x.$ Noting that for $x\in\{0,1\}^{d}$ , we have that $1-\#_{T}(x)>0$ if and only if $C_{T}$ holds for $x,$ it follows that $|\Pr_{S}(C_{T})-\Pr_{P}[C_{T}]|\leq\varepsilon^{\prime}/d.$ Hence, we have that $\Pr_{S}(C_{T})\geq 1/2.$

Note that for $x\in\{0,1\}^{d}$ , we have that $L_{T}(x)>0$ if and only if $L(x)>0$ and $C_{T}$ holds for $x.$ Therefore, we can write

This completes the proof of Lemma 7.21. $\hfill\qed$

Appendix E Deferred Proofs from Section 8

Proof of Lemma 8.6: Let $S_{P}\subseteq S$ be the set of samples drawn from $P$ and $S_{Q}\subseteq S$ be the set of samples drawn from $Q$ . Firstly, we note that by a Chernoff bound, $\left||S_{P}|/|S|-\alpha\right|\leq O(\varepsilon/d^{2})$ with probability $1-\tau/3$ . Assuming this holds, it follows that $|S_{P}|\geq(\alpha/2)|S|\geq(\varepsilon^{1/6}/2)|S|=\Omega(d^{4}\log(1/\tau)/\varepsilon^{2})$ . Similarly, $|S_{Q}|\geq(1-\alpha)|S|/2\geq\Omega(d^{4}\log(1/\tau)/\varepsilon^{2})$ . By Lemma 7.6 applied with $\varepsilon^{\prime}:=(c^{2}/4)\cdot\varepsilon$ in place of $\varepsilon$ , since we have $\Omega((d^{4}+d^{2}\log(\tau))/\varepsilon^{\prime 2})$ samples, with probability $1-\tau/3$ , the set $S_{P}$ is $\varepsilon^{\prime}$ -good for $P$ , i.e., it satisfies that for all affine functions $L$ , $|\Pr_{X\in_{u}S_{P}}(L(X)>0)-\Pr_{X\sim P}(L(X)>0)|\leq\varepsilon^{\prime}/d.$ We show that assuming $S$ is $\varepsilon^{\prime}$ -good, it is $(\varepsilon,i)$ good for each $1\leq i\leq d$ .

Note that $\Pr_{X\sim P}[X_{i}=1]=p_{i}\geq c$ and $\Pr_{X\sim P}[X_{i}=0]=1-p_{i}\geq c$ . Since $|\Pr_{X\sim P}[X_{i}=1]-\Pr_{X\in_{u}S_{P}}[X_{i}=1]|\leq c^{2}\varepsilon/(4d)$ , it follows that $\Pr_{X\in_{u}S}[X_{i}=1]\geq c/2$ . For any affine function $L$ , define $L^{(0)}(x):=L(x)-x_{i}(\max_{y}|L(y)|)$ and $L^{(1)}(x):=L(x)-(1-x_{i})(\max_{y}|L(y)|)$ . Then, we have the following:

So, we have that $S_{P}$ is $(\varepsilon,i)$ good for $P$ for all $1\leq i\leq d$ with probability $1-\tau/3$ . Similarly, $S_{Q}$ is $(\varepsilon,i)$ good for $Q$ for all $1\leq i\leq d$ with probability $1-\tau/3$ . Thus, we have that $||S_{P}|/|S|-\alpha|\leq\varepsilon/d^{2}$ , $S_{P}$ is $(\varepsilon,i)$ good for $P$ and $S_{Q}$ is $(\varepsilon,i)$ good for $Q$ for all $1\leq i\leq d$ with probability $1-\tau$ . That is, $S$ is $(\varepsilon,i)$ -good for $\Pi$ for all $1\leq i\leq d$ with probability at least $1-\tau$ . $\hfill\qed$

Proof of Lemma 8.19: Let $S_{P}\subseteq S$ be the set of samples drawn from $P$ and $S_{Q}\subseteq S$ be the set of samples drawn from $Q$ . Firstly, we note that by a Chernoff bound, $||S_{P}|/|S|-\alpha|\leq O(\varepsilon/d^{2})$ with probability at least $1-\tau/3$ . Assuming this holds, $|S_{P}|\geq(\alpha/2)|S|\geq\delta|S|=\Omega(d^{4}\log(1/\tau)/\varepsilon^{2})$ . Similarly, $|S_{Q}|\geq(1-\alpha)|S|/2\geq\Omega(d^{4}\log(1/\tau)/\varepsilon^{2})$ .

By Lemma 7.6 applied with $\varepsilon^{\prime}:=\varepsilon/6$ , since we have $\Omega(d^{4}\log(1/\tau)/\varepsilon^{\prime 2})$ samples, with probability at least $1-\tau/3$ , the set $S_{P}$ is $\varepsilon$ -good for $P$ . Similarly, with probability at least $1-\tau/3$ , the set $S_{Q}$ is $\varepsilon$ -good for $Q$ . Thus, with probability $1-\tau$ , we have that $\left|\frac{|S_{P}|}{|S|}-\alpha\right|\leq\varepsilon$ and that $S_{P}$ and $S_{Q}$ are $\varepsilon$ -good for $P$ and $Q$ respectively. $\hfill\qed$

Noting that the mean of $T$ is $\mu$ and $|T|=|S^{\prime}|$ , we have:

Since $P$ and $Q$ are product distributions, $\mbox{Cov}(S^{\prime}_{P})$ and $\mbox{Cov}(S^{\prime}_{Q})$ can have large diagonal elements but small off-diagonal ones. On the other hand, we bound the elements on the diagonal of $\mbox{Cov}(T)$ , but $\|\mbox{Cov}(T)\|_{2}$ may still be large due to off-diagonal elements.

By the triangle inequality, and Equation (48) with zeroed diagonal, we have:

We will bound each of these terms separately.

Note that $\mbox{Cov}_{0}(T)-\mbox{Cov}(T)$ is a diagonal matrix and its non-zero entries are

Note that $\mu$ satisfies $|S^{\prime}|\mu=|S^{\prime}_{P}|\mu^{S^{\prime}_{P}}+|S^{\prime}_{Q}|\mu^{S^{\prime}_{Q}}+|E|\mu^{E}$ . Since $|S^{\prime}|-|E|=|S^{\prime}_{P}|+|S^{\prime}_{Q}|,$ we have $(|S^{\prime}|-|E|)(\mu-\mu^{S^{\prime}_{P}})=|S^{\prime}_{Q}|(\mu^{S^{\prime}_{Q}}-\mu^{S^{\prime}_{P}})+|E|(\mu^{E}-\mu).$ Using that $|S^{\prime}|-|E|=(1+O(\varepsilon))|S|$ , $|S_{Q}^{\prime}|=(1-\alpha)|S|-O(\varepsilon)$ , $|E|\leq O(\varepsilon)|S|$ , we have

Since $S$ is $\varepsilon$ -good for $\Pi$ , it follows that $\|\mu^{S_{P}}-p\|_{\infty}\leq\varepsilon/d$ and $\|\mu^{S_{Q}}-q\|_{\infty}\leq\varepsilon/d.$ Also,

Finally, $\|p-q\|_{\infty}\leq\delta.$ Thus, by the triangle inequality, we get

We have the following sequence of inequalities:

It remains to bound the $\left(\frac{|S^{\prime}_{P}|}{|S^{\prime}|}\right)\|\mbox{Cov}_{0}(S^{\prime}_{P})\|_{2}+\left(\frac{|S^{\prime}_{Q}|}{|S^{\prime}|}\right)\|\mbox{Cov}_{0}(S^{\prime}_{Q})\|_{2}$ terms in (49). To analyze the first of these terms, note that $\mbox{Cov}_{0}(P)=\mathbf{0}.$ We have that

Note that since $S$ is good, the $(i,j)$ -th entry of $\mbox{Cov}(S_{P})-\mbox{Cov}(P)$ has absolute value at most $\varepsilon/d.$ Thus,

Therefore, combining the above we have that

By the assumption on $\delta$ in Theorem 8.17, $\delta^{2}=\Omega(\varepsilon\log(1/\varepsilon))$ , and the proof is complete. ∎