Robust Estimators in High Dimensions without the Computational Intractability
Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, Alistair Stewart
Introduction
A central goal of machine learning is to design efficient algorithms for fitting a model to a collection of observations. In recent years, there has been considerable progress on a variety of problems in this domain, including algorithms with provable guarantees for learning mixture models [FOS08, KMV10, MV10, BS10, HK13], phylogenetic trees [CGG02, MR05], HMMs [AHK12], topic models [AGM12, AGHK13], and independent component analysis [AGMS15]. These algorithms crucially rely on the assumption that the observations were actually generated by a model in the family. However, this simplifying assumption is not meant to be exactly true, and it is an important direction to explore what happens when it holds only in an approximate sense. In this work, we study the following family of questions:
This is a natural formalization of the problem of designing robust and efficient algorithms for distribution estimation. We refer to it as (proper) agnostic distribution learning and we refer to the samples as being -corrupted. This family of problems has its roots in many fields, including statistics, machine learning, and theoretical computer science. Within computational learning theory, it is related to the agnostic learning model of Haussler [Hau92] and Kearns, Schapire, and Sellie [KSS94], where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible, among all functions in some given class. In the even more challenging malicious noise model [Val85, KL93], an adversary is allowed to corrupt both the labels and the samples. A major difference with our setting is that these models apply to supervised learning problems, while here we will work in an unsupervised setting.
Within statistics and machine learning, inference problems like Question 1.1 are often termed “estimation under model misspecification.” The usual prescription is to use the maximum likelihood estimator [Hub67, Whi82], which is unfortunately hard to compute in general. Even ignoring computational considerations, the maximum likelihood estimator is only guaranteed to converge to the distribution in that is closest (in Kullback-Leibler divergence) to the distribution from which the observations are generated. This is problematic because such a distribution is not necessarily close to at all.
A branch of statistics – called robust statistics [HR09, HRRS86] – aims to tackle questions like the one above. The usual formalization is in terms of breakdown point, which (informally) is the fraction of observations that an adversary would need to control to be able to completely corrupt an estimator. In low-dimensions, this leads to the prescription that one should use the empirical median instead of the empirical mean to robustly estimate the mean of a distribution, and interquartile range for robust estimates of the variance. In high-dimensions, the Tukey median [Tuk75] is a high-dimensional analogue of the median that, although provably robust, is hard to compute [JP78]. Similar hardness results have been shown [Ber06, HM13] for essentially all known estimators in robust statistics.
Is high-dimensional agnostic distribution learning even possible, algorithmically? The difficulty is that corruptions are often hard to detect in high dimensions, and could bias the natural estimator by dimension-dependent factors. In this work, we study agnostic distribution learning for a number of fundamental classes of distributions: a single Gaussian, a product distribution on the hypercube , mixtures of two product distributions (under a natural balancedness condition), and mixtures of Gaussians with spherical covariances. Prior to our work, all known efficient algorithms (e.g., [LT15, BD15]) for these classes required the error guarantee, , to depend polynomially in the dimension . Hence, previous efficient estimators could only tolerate at most a fraction of errors. In this work, we obtain the first efficient algorithms for the aforementioned problems, where is completely independent of and depends polynomially (often, nearly linearly) in the fraction of corrupted samples. Our work is just a first step in this direction, and there are many exciting questions left to explore.
2 Our Techniques
All of our algorithms are based on a common recipe. The first question to address is the following: Even if we were given a candidate hypothesis , how could we test if it is -close in total variation distance to ? The usual way to certify closeness is to exhibit a coupling between and that marginally samples from both distributions, where the samples produced from each agree with probability . However, we have no control over the process by which samples are generated from , in order to produce such a coupling. And even then, the way that an adversary decides to corrupt samples can introduce complex statistical dependencies.
Unfortunately, in our agnostic setting, we cannot afford for (1) to depend on the dimension at all. Any such dependence would appear in the error guarantee of our algorithm. Instead, the starting point of our algorithms is a notion of parameter distance that satisfies
Given our notion of parameter distance satisfying (2), our main ingredient is an efficient method for robustly estimating the parameters. We provide two algorithmic approaches which are based on similar principles. Our first approach is faster, requiring only approximate eigenvalue computations. Our second approach relies on convex programming and achieves slightly better sample complexity, in some cases matching the information-theoretic limit. Notably, either approach can be used to give all of our concrete learning applications with nearly identical sample complexity and error guarantees. In what follows, we specialize to the problem of robustly learning the mean of a Gaussian whose covariance is promised to be the identity, which we will use to illustrate how both approaches operate. We emphasize that what is needed to learn the parameters in more general settings requires many additional ideas.
Our second algorithmic approach relies on convex programming. Here, instead of rejecting corrupted samples, we compute appropriate weights for the samples , such that the weighted empirical average is close to . We work with the convex set:
We prove that any set of weights in yields a good estimate in the obvious way. The catch is that the set is defined based on , which is unknown. Nevertheless, it turns out that we can use the same type of spectral arguments that underlie the filtering approach to design an approximate separation oracle for . Combined with standard results in convex optimization, this yields an algorithm for robustly estimating .
The third and final ingredient is some new concentration bounds. In both of the approaches above, at best we are hoping that we can remove all of the corrupted points and be left with only the uncorrupted ones, and then use standard estimators (e.g., the empirical average) on them. However, an adversary could have removed an -fraction of the samples in a way that biases the empirical average of the remaining uncorrupted samples. What we need are concentration bounds that show for sufficiently large , for samples from a Gaussian with mean and identity covariance, that every set of samples produces a good estimate for . In some cases, we can derive such concentration bounds by appealing to known concentration inequalities and taking a union bound. However, in other cases (e.g., concentration bounds for degree-two polynomials of Gaussian random variables) the existing concentration bounds are not strong enough, and we need other arguments to prove that every set of samples produces a good estimate.
3 Our Results
We give the first efficient algorithms for agnostically learning several important distribution classes with dimension-independent error guarantees. Our first main result is for a single arbitrary Gaussian with mean and covariance , which we denote by . In the previous subsection, we described our convex programming approach for learning the mean vector when the covariance is promised to be the identity. A technically more involved version of the technique can handle the case of zero mean and unknown covariance. More specifically, consider the following convex set, where is the unknown covariance matrix and is the Frobenius norm:
We design an approximate separation oracle for this unknown convex set, by analyzing the spectral properties of the fourth moment tensor of a Gaussian. Combining these two intermediate results, we obtain our first main result (below). Throughout this paper, we will abuse notation and write when referring to our sample complexity, to signify that our algorithm works if for a large enough universal constant .
We can alternatively establish Theorem 1.2 via our filtering technique. See Section 5. In the first version of our paper, our analysis required samples. In [DKK+17], we showed that a simple adaptation of our algorithm and analysis achieves the improved sample complexity above, which is information-theoretically optimal up to logarithmic factors. We have incorporated this modification (along with the analysis) into this version of the paper, for the sake of completeness.
For the sake of simplicity in the presentation, we did not make an effort to optimize the sample complexity of our robust estimators in the above setting. We note that methods similar to the analysis of the Gaussian setting can lead to near-optimal sample complexity in this setting as well. We also remark that for the case of balanced binary product distributions, our algorithm achieves an error of
Interestingly enough, the above two distribution classes are trivial to learn in the noiseless case, but in the agnostic setting the learning problem turns out to be surprisingly challenging. Using additional ideas, we are able to generalize our agnostic learning algorithms to mixtures of the above classes under some natural conditions. We note that even in the noiseless case, learning mixtures of the above families is non-trivial. First, we study -mixtures of -balanced products, which stipulates that the coordinates of the mean vector of each component are in the range . We prove:
This generalizes the algorithm of Freund and Mansour [FM99] to the agnostic setting. An interesting open question is to improve the -dependence in the above bound to (nearly) linear, or to remove the assumption of balancedness and obtain an agnostic algorithm for mixtures of two arbitrary product distributions.
Finally, we give an agnostic learning algorithm for mixtures of spherical Gaussians.
In total, these results give new robust and computationally efficient estimators for several well-studied distribution learning problems that can tolerate a constant fraction of errors independent of the dimension. This points to an interesting new direction of making robust statistics algorithmic. The general recipe we have developed here gives us reason to be optimistic about many other problems in this domain.
4 Discussion and Related Work
Our results fit in the framework of density estimation and parameter learning which are both classical problems in statistics with a rich history (see e.g., [BBBB72, DG85, Sil86, Sco92, DL01]). While these problems have been studied for several decades by different communities, the computational complexity of learning is still not well understood, even for some surprisingly simple distribution families. Most textbook estimators are hard to compute in general, especially in high-dimensional settings. In the past few decades, a rich body of work within theoretical computer science has focused on designing computationally efficient distribution learning algorithms. In a seminal work, Kearns, Mansour, Ron, Rubinfeld, Schapire, and Sellie [KMR+94] initiated a systematic investigation of the computational complexity of distribution learning. Since then, efficient learning algorithms have been developed for a wide range of distributions in both low and high-dimensions [Das99, FM99, AK01, VW02, CGG02, MR05, BV08, KMV10, MV10, BS10, DDS12, CDSS13, DDO+13, CDSS14a, CDSS14b, HP15, ADLS17, DDS15b, DDKT16, DKS16b, DKS16a].
We will be particularly interested in efficient learning algorithms for mixtures of high-dimensional Gaussians and mixtures of product distributions, as this is the focus of our algorithmic results in the agnostic setting. In a pioneering work, Dasgupta [Das99] introduced the problem of parameter estimation of a Gaussian mixture to theoretical computer science, and gave the first provably efficient algorithms under the assumption that the components are suitably well-separated. Subsequently, a number of works improved these separation conditions [AK01, VW02, BV08] and ultimately removing them entirely [KMV10, MV10, BS10]. In another line of work, Freund and Mansour [FM99] gave the first polynomial time algorithm for properly learning mixtures of two binary product distributions. This algorithm was substantially generalized to phylogenetic trees [CGG02] and to mixtures of any constant number of discrete product distributions [FOS08]. Given the vast body of work on high-dimensional distribution learning, there is a plethora of problems where one could hope to reconcile robustness and computational efficiency. Thus far, the only setting where robust and efficient algorithms are known is on one-dimensional distribution families, where brute-force search or some form of polynomial regression often works. In contrast, essentially nothing is known about efficient agnostic distribution learning in the high-dimensional setting that we study here.
Question 1.1 also resembles learning in the presence of malicious errors [Val85, KL93]. There, an algorithm is given samples from a distribution along with their labels according to an unknown target function. The adversary is allowed to corrupt an -fraction of both the samples and their labels. A sequence of works studied the problem of learning a homogeneous halfspace with malicious noise in the setting where the underlying distribution is a Gaussian [Ser01, Ser03, KLS09], culminating in the work of Awasthi, Balcan, and Long [ABL17], who gave an efficient algorithm that finds a halfspace with agreement . There is no direct connection between their problem and ours, especially since one is a supervised learning problem and the other is unsupervised. We note however that there is an interesting technical parallel in that the work [KLS09] also uses spectral methods to detect outliers. Both their work and our algorithm for agnostically learning the mean are based on the intuition that an adversary can only substantially bias the empirical mean if the corruptions are correlated along some direction. More specifically, [KLS09] produce a “hard” filter which leads to errors that scale logarithmically with the dimension, even in a weaker corruption model than ours. Our algorithms need to handle many significant conceptual and technical complications that arise when working with higher moments or other distribution families.
Another connection is to the work on robust principal component analysis (PCA). PCA is a transformation that (among other things) is often justified as being able to find the affine transformation that would place a collection of Gaussian random variables in isotropic position. One can think of our results on agnostically learning a Gaussian as a type of robust PCA that tolerates gross corruptions, where entire samples are corrupted. This is different than other variants of the problem where random sets of coordinates of the points are corrupted [CLMW11], or where the uncorrupted points were assumed to lie in a low-dimensional subspace to begin with [ZL14, LMTZ15]. Finally, Brubaker [Bru09] studied the problem of clustering samples from a well-separated mixture of Gaussians in the presence of adversarial noise. The goal of [Bru09] was to separate the Gaussian components from each other, while the adversarial points are allowed to end up in any of clusters. Our work is orthogonal to [Bru09], since even if such a clustering is given, the problem still remains to estimate the parameters of each component.
5 Concurrent and Subsequent Work
After the initial publication of our results [DKK+16], there has been a flurry of recent work on robust high-dimensional estimation. Diakonikolas, Kane, and Stewart [DKS16c] studied the problem of learning the parameters of a graphical model in the presence of noise, when given its graph theoretic structure. Charikar, Steinhardt, and Valiant [CSV17] developed algorithms that can tolerate a fraction of corruptions greater than a half, under the weaker goal of outputting a small list of candidate hypotheses that contains a parameter set close to the true values. Balakrishnan, Du, Li, and Singh [Li17, DBS17, BDLS17] studied sparse mean and covariance estimation in the presence of noise obtaining computationally efficient robust algorithms with sample complexity sublinear in the dimension. Diakonikolas, Kane, and Stewart [DKS17] proved statistical query lower bounds providing evidence that the error guarantees of our robust mean and covariance estimation algorithms are best possible, within constant factors, for efficient algorithms. In a subsequent paper [DKK+17], we obtained improved bounds on the sample complexity of our algorithms, which are optimal up to polylogarithmic factors. For the sake of completeness, we include these improved sample bounds in the present version of this paper. In the same work [DKK+17], we showed that our algorithmic approach easily extends to obtain dimension-independent robustness guarantees under much weaker distributional assumptions, and gave a practical demonstration of the efficacy of our robust algorithms on both real and synthetic data.
Since the initial submission of the journal version of this paper, there has been a substantial amount of work on robust high-dimensional estimation in a variety of settings. Diakonikolas, Kane, and Stewart [DKS18a] studied PAC learning of geometric concept classes (including low-degree polynomial threshold functions and intersections of halfspaces) in the same corruption model as ours, obtaining the first dimension-independent error guarantees for these classes. Steinhardt, Charikar, and Valiant [SCV18] focused on deterministic conditions of a dataset which allow robust estimation to be possible. In our initial publication, we gave explicit deterministic conditions in various settings; by focusing directly on this goal, [SCV18] somewhat relaxed some of these assumptions. Meister and Valiant [MV17] studied learning in a crowdsourcing model, where the fraction of honest workers may be very small (similar to [CSV17]). Qiao and Valiant [QV18] considered robust estimation of discrete distributions in a setting where we have several sources (a fraction of which are adversarial) who each provide a batch of samples. A number of simultaneous works [KSS18, HL18, DKS18b] investigated robust mean estimation in even more general settings, and apply their techniques to learning mixtures of spherical Gaussians under minimal separation conditions. Finally, several concurrent results study robustness in supervised learning tasks [PSBR18, KKM18, DKK+18], including regression and SVM problems. Despite all of this rapid progress, there are still many interesting theoretical and practical questions left to explore.
6 Organization
The structure of this paper is as follows: In Section 2, we introduce basic notation and a number of useful facts that will be required throughout the paper, as well as the formal definition of our adversary model. In Section 3, we discuss several natural approaches to high-dimensional agnostic learning, all of which lose polynomial factors that depend on the dimension, in terms of their error guarantee.
The main body of the paper is in Sections 4–8. Sections 4 and 6 illustrate our convex programming framework, while Sections 5, 7, and 8 illustrate our filter framework. More specifically, in Sections 4 and 5, we analyze the setting of a single Gaussian with unknown mean and unknown covariance, using our convex programming and filter frameworks, respectively. In Section 6, we generalize the convex programming method to obtain an agnostic algorithm for mixtures of spherical Gaussians with unknown means. In Section 7, we apply our filter techniques to a binary product distribution, and generalize these in Section 8 to obtain an agnostic learning algorithm for a mixture of two balanced binary product distributions.
We note that for some of the more advanced applications of our frameworks, the technical details can get in the way of the fundamental ideas. For the reader who is interested in seeing the details of our most basic application of the convex programming framework, we recommend reading the case a Gaussian with unknown mean, in Section 4.3. Similarly, for the filter framework, we suggest either the Gaussian with unknown mean in Section 5.1 or the balanced product distribution in Section 7.1.
Preliminaries
Throughout this paper, if is a vector, we will let denote its Euclidean norm. If is a matrix, we will let denote its spectral norm, and denote its Frobenius norm. We will also let and denote the PSD ordering on matrices. For a discrete distribution , we will denote by the probability mass at point . For a continuous distribution, let it denote the probability density function at . Let be a multiset over . We will write to denote that is drawn from the empirical distribution defined by . Throughout the paper, we let denote the Kronecker product of matrices.
As a measure of distance between distributions, we will use the notion of total variation distance:
2 Types of Adversaries
In this paper, we will consider a powerful model for agnostic distribution learning that generalizes many other existing models. The standard setup involves an oblivious adversary who chooses a distribution that is close in total variation distance to an unknown distribution in some class .
Given , and a class of distributions , the full adversary operates as follows: The algorithm specifies some number of samples . The adversary generates samples from some (unknown) distribution . It then draws from an appropriate distribution. This distribution is allowed to depend on , but when marginalized over the samples satisfies . The adversary is allowed to inspect the samples, removes of them, and replaces them with arbitrary points. The set of points is given (in any order) to the algorithm.
We rely on the following well-known fact:
Now we can describe how the full adversary can corrupt samples from to get samples distributed according to .
The full adversary can simulate any oblivious adversary.
We draw samples from . We delete each sample independently with probability and replace it with an independent sample from . This gives a set of samples that are independently sampled from . Since the distributions and are identical, we can couple them to independent samples from . Now each sample that came from , we can delete and replace with an independent sample from . The result is a set of samples that are independently sampled from where we have made edits and marginally , although has and needs to have some dependence on the original samples from . ∎
The challenge in working with the full adversary is that even the samples that came from can have biases. The adversary can now choose how to remove uncorrupted points in a careful way so as to compensate for certain other biases that he introduces using the corrupted points.
Throughout this paper, we will make use of the following notation and terminology:
We say a set of samples is an -corrupted set of samples generated by the oblivious (resp. full) adversary if it is generated by the process described above in the definition of the oblivious (resp. full) adversary. If it was generated by the full adversary, we let denote the indices of the uncorrupted samples, and we let denote the indices of the corrupted samples.
In this paper, we will give a number of algorithms for agnostic distribution learning that work in the full adversary model. In our analysis, we will identify a set of events that ensure the algorithm succeeds and will bound the probability that any of these events does not occur when is suitably large. We will often explicitly invoke the assumption that . We can do this even though the number of points that are corrupted is itself a random variable, because by the Chernoff bound, as long as , we know that holds with probability at least . Thus, making the assumption that costs us an additional additive term in our union bound, when bounding the failure probability of our algorithms.
3 Distributions of Interest
One object of study in this paper is the Gaussian (or Normal) distribution.
A Gaussian distribution with mean and covariance is the distribution with probability density function
We will also be interested in binary product distributions.
A (binary) product distribution is a probability distribution over whose coordinate random variables are independent. Note that a binary product distribution is completely determined by its mean vector.
We will also be interested in mixtures of such distributions.
A mixture of distributions with mixing weights is the distribution defined by
where for all and .
4 Bounds on TV Distance
The Kullback-Liebler divergence (also known as relative entropy, information gain, or information divergence) is a well-known measure of distance between two distributions.
The primary interest we have in this quantity is the fact that (1) the KL divergence between two Gaussians has a closed form expression, and (2) it can be related (often with little loss) to the total variation distance between the Gaussians. The first statement is expressed in the fact below:
Let and be two Gaussians such that . Then
The second statement is encapsulated in the well-known Pinsker’s inequality:
With this we can show the following two useful lemmas, which allow us to relate parameter distance between two Gaussians to their total variation distance. The first bounds the total variation distance between two Gaussians with identity covariance in terms of the Euclidean distance between the means:
In the case where , (3) simplifies to
Pinsker’s inequality (Theorem 2.12) then implies that
The second bounds the total variation distance between two mean zero Gaussians in terms of the Frobenius norm of the difference between their covariance matrices:
Let be sufficiently small. Let such that . Then,
Let . Then (3) simplifies to
Since both terms in the last line are rotationally invariant, we may assume without loss of generality that is diagonal. Let . Thus, the KL divergence between the two distributions is given exactly by where we are guaranteed that . By the second order Taylor approximation to , for small, we have that for sufficiently small,
Our algorithm for agnostically learning an arbitrary Gaussian will be based on solving two intermediate problems: (1) We are given samples from and our goal is to learn , and (2) We are given samples from and our goal is to learn . The above bounds on total variation distance will allow us to conclude that our estimate is close in total variation distance to the unknown Gaussian distribution in each of the two settings.
We note the following folklore sample complexity bounds for learning a Gaussian in the non-agnostic setting.
samples are both necessary and sufficient to learn a -dimensional Gaussian with unknown mean and known covariance to total variation distance with probability .
samples are both necessary and sufficient to learn a -dimensional Gaussian with unknown mean and covariance to total variation distance with probability .
We will also need the following lemma bounding the total variation distance between two product distributions:
Let be binary product distributions with mean vectors We have that
The elementary inequality gives that
where the last inequality follows from the union bound. ∎
5 Additional Concentration Lemmata
In this section, we list a number of standard concentration inequalities for nice random variables which we will frequently use throughout this paper. The proofs of these results are standard and omitted, see e.g., [Ver10] for a more thorough treatment of these results.
The first is a Chernoff bound for bounded random variables.
Let be independent random variables with supported on Let . Then for any ,
We will also require the following tail bounds for Gaussians and quadratic forms of Gaussians:
By standard union bound arguments (see e.g. [Ver10]), we obtain the following concentration results for the empirical mean and covariance of a set of Gaussian vectors:
Let be a positive integer. Let be a sub-gaussian distribution with mean and covariance . Let be independent, for . Then, there exist universal constants so that for all , we have
With the same setup as in Lemma 2.21, there exist universal constants so that for all , we have
6 Agnostic Hypothesis Selection
Several of our algorithms will return a polynomial-sized list of hypotheses at least one of which is guaranteed to be close to the target distribution. Usually (e.g., in a non-agnostic setting), one could use a polynomial number of additional samples to run a tournament to identify the candidate hypothesis that is (roughly) the closest to the target distribution. In the discussion that follows, we will refer to these additional samples as test samples. Such hypothesis selection algorithms have been extensively studied [Yat85, DL96, DL97, DL01, DK14, AJOS14, SOAJ14, DDS15a, DDS15b]. Unfortunately, against a strong adversary we run into a serious technical complication: the training samples and test samples are not necessarily independent. Moreover even if we randomly partition our samples in training and test, a priori there are an unbounded set of possible hypotheses that the training phase could output, and when we analyze the tournament we cannot condition on the list of hypotheses and assume that the test samples are sampled anew. Our approach is to require our original algorithm to return only hypotheses from some finite set of possibilities, and as we will see this mitigates the problem.
As a simple corollary of the agnostic tournament, observe that this allows us to do agnostic learning without knowing the precise error rate . Throughout the paper, we assume the algorithm knows , and guarantees that the output will have error which is at most . However, if the algorithm is not given this information, and instead given an and asked to return something with error at most , we may simply grid over (here is some arbitrary constant that governs a tradeoff between runtime and accuracy), run our algorithm with set to each element in this set, and perform hypothesis selection via Tournament. Then it is not hard to see that we are guaranteed to output something which has error at most .
Firstly, we randomly choose a subset of of our samples and a disjoint subset of of our samples for some sufficiently large . Note that with high probability over our randomization, at most a -fraction of samples from each subset are corrupted. Thus, we may instead consider the stronger adversary who sees a set of independent samples from and another set, , of samples from and can arbitrary corrupt a -fraction of each, giving sets ,.
With probability at least over , the original algorithm run on returns a set satisfying the desired properties.
We claim that with probability at least over the choice of , we have for any :
This follows by Chernoff bounds and a union bound over the possibilities for and . Since the total variation distance between the uniform distributions over and is at most , we also have for that
Therefore, if we throw away any in our list for which there is a in our list such that
Some Natural Approaches, and Why They Fail
Many of the agnostic distribution learning problems that we study are so natural that one would immediately wonder why simpler approaches do not work. Here we detail some other plausible approaches, and what causes them to lose dimension-dependent factors (if they have any guarantees at all!). For the discussion that follows, we note that by Fact 2.13 in order to achieve an estimate that is -close in total variation distance (for a Gaussian when is unknown and ) it is necessary and sufficient that .
One plausible approach for robust mean estimation in high dimensions is to agnostically learn along each coordinate separately. For instance, if our goal is to agnostically learn the mean of a Gaussian with known covariance , we could try to learn each coordinate of the mean separately. But since an -fraction of the samples are corrupted, our estimate can be off by in each coordinate and would be off by in high dimensions.
Maximum Likelihood
Given a set of samples , and a class of distributions , the maximum likelihood estimator (MLE) is the distribution that maximizes . Equivalently, minimizes the negative log likelihood (NLL), which is given by
and so the which minimizes is the mean of the samples , since for any set of vectors , the average of the ’s is the minimizer of the function . Hence, if an adversary places an -fraction of the points at some very large distance, then the estimate for the mean would need to move considerably in that direction. By placing the corruptions further and further away, the MLE can be an arbitrarily bad estimate. That is, even though it is well known [Hub67, Whi82] that the MLE converges to the distribution that is closest in KL-divergence to the distribution from which our samples were generated (i.e., after the adversary has added corruptions), is not necessarily close to the uncorrupted distribution.
Geometric Median
In one dimension, it is well-known that the median provides a provably robust estimate for the mean in a number of settings. The mean of a set of points is the minimizer of the function , and in contrast the median is the minimizer of the function . In higher dimensions, there are many natural definitions for the median that generalize the one-dimensional case. The Tukey median is one such notion, but as we discussed it is hard to compute [JP78] and the best known algorithms run in time exponential in . Motivated by this, the geometric median is another high-dimensional notion of a median. It often achieves better robustness than the mean, and can be computed quickly [CLM+16]. The formal definition is:
Unfortunately, this notion of median still incurs an error containing a factor of :
Given a set of samples from , then with probability at least , there exists a corruption of , such that:
Agnostically Learning a Gaussian, via Convex Programming
In this section we give a polynomial time algorithm to agnostically learn a single Gaussian up to error . Our approach is based on the following ingredients: First, in Section 4.1, we define the set , which will be a key algorithmic object in our framework. In Section 4.2 we give key, new concentration bounds on certain statistics of Gaussians. We will make crucial use of these concentration bounds throughout this section. In Section 4.3 we give an algorithm to agnostically learn a Gaussian with unknown mean and whose covariance is promised to be the identity via convex programming. This will be an important subroutine in our overall algorithm, and it also helps to illustrate our algorithmic approach without many of the additional complications that arise in our later applications. In Section 4.4 we show how to robustly learn a Gaussian with mean zero and unknown covariance again via convex programming. Finally, in Section 4.5 we show how to combine these two intermediate results to get our overall algorithm.
An important algorithmic object for us will be the following set:
For any and any integer , let
and so we see that this set is designed to capture the notion of selecting a set of samples from samples.
Given we will use the following notation
to denote the total weight on good and bad points respectively. The following facts are immediate from and the properties of .
If and , then . Moreover, the renormalized weights on good points given by for all , and otherwise, satisfy .
2 Concentration Inequalities
Throughout this section and in Section 6 we will make use of various concentration bounds on low moments of Gaussian random variables. Some are well-known, and others are new but follow from known bounds and appropriate union bound arguments.
We will also be interested in how well various statistics of Gaussians concentrate around their expectation, when we take the worst-case set of weights in . This is more subtle than standard settings such as Lemma 2.21 or Lemma 2.22 because as we take more samples, any fixed statistic (e.g. taking the uniform distribution over the samples) concentrates better but the size of (e.g. the number of sets of samples) grows too. We defer the proofs to Appendix A. The first concerns the behavior of the empirical covariance:
Fix and . There is a such that if are independent samples from and , then
A nearly identical argument (Using Hoeffding instead of Bernstein in the proof of Theorem 5.50 in [Ver10]) yields:
Fix and as above. There is a such that if are independent samples from and , then
Note that by Cauchy-Schwarz, this implies:
Fix and as above. There is a such that if are independent samples from and , then
Fix . Let . Then, with probability , we have that for all .
2.2 Estimation Error in the Frobenius Norm
Let be i.i.d. samples from . In this section we demonstrate a tight bound on how many samples are necessary such that the sample covariance is close to in Frobenius norm. Let denote the empirical covariance, defined to be
By self-duality of the Frobenius norm, we know that
Since there is a -net over all PSD matrices with Frobenius norm of size (see e.g. Lemma 1.18 in [RH17]), the Vershynin-type union bound argument combined with Lemma 2.20 immediately gives us the following:
There exist universal constants so that for all , we have
By the argument as used in the proof of Lemma 4.3, we obtain:
Fix . There is a such that if are independent samples from , with
Since the proof is essentially identical to the proof of Lemma 4.3, we omit the proof. However, we note that in fact, the proof technique there can be used to show something slightly stronger, which we will require later. The technique actually shows that if we take any set of size at most , and take the uniform weights over that set, then the empirical covariance is not too far away from the truth. More formally:
Fix . There is a such that if are independent samples from , with
2.3 Understanding the Fourth Moment Tensor
Our algorithms will be based on understanding the behavior of the fourth moment tensor of a Gaussian when restricted to various subspaces. Let denote the Kronecker product on matrices. We will make crucial use of the following definition:
We will also require the following definitions:
and let and denote the projection operators onto and respectively. Finally let
In fact, the projection of onto where is symmetric can be written out explicitly. Namely, it is given by
The key result in this section is the following:
It is important to note that the two terms above are not the same; the first term is high rank, but the second term is rank one. The proof of this theorem will require Isserlis’ theorem, and is deferred to Appendix A.
2.4 Concentration of the Fourth Moment Tensor
We also need to show that the fourth moment tensor concentrates:
Fix . Let be independent, for , where we set
To do so will require somewhat more sophisticated techniques than the ones used so far to bound spectral deviations. At a high level, this is because fourth moments of Gaussians have a sufficiently larger variance that the union bound techniques used so far are insufficient. However, we will show that the tails of degree four polynomials of Gaussians still sufficiently concentrate such that removing points cannot change the mean by too much. The proof requires slightly fancy machinery and appears in Appendix B.
3 Finding the Mean, Using a Separation Oracle
In this section, we consider the problem of approximating given samples from in the full adversary model. Our algorithm will be based on working with the following convex set:
It is not hard to show that is non-empty for reasonable values of (and we will show this later). Moreover we will show that for any set of weights in , the empirical average
will be a good estimate for . The challenge is that since itself is unknown, there is not an obvious way to design a separation oracle for even though it is convex. Our algorithm will run in two basic steps. First, it will run a very naive outlier detection to remove any points which are more than away from the good points. These points are sufficiently far away that a very basic test can detect them. Then, with the remaining points, it will use the approximate separation oracle given below to approximately optimize with respect to . It will then take the outputted set of weights and output the empirical mean with these weights. We will explain these steps in detail below.
Our results will hold under the following deterministic conditions:
The concentration bounds we gave earlier were exactly bounds on the failure probability of either of these conditions, albeit for instead of .
The first step of our algorithm will be to remove points which have distance which is much larger than from the mean. Our algorithm is very naive: it computes all pairwise distances between points, and throws away all points which have distance more than from more than a -fraction of the remaining points.
Suppose that (7) holds. Then NaivePrune removes no uncorrupted points, and moreover, if is not removed by NaivePrune, we have .
That no uncorrupted point is removed follows directly from (7) and the fact that there can be at most corrupted points. Similarly, if is not removed by NaivePrune, that means there must be an uncorrupted such that . Then the desired property follows from (7) and a triangle inequality. ∎
Henceforth, for simplicity we shall assume that no point was removed by NaivePrune, and that for all , we have . Otherwise, we can simply work with the pruned set, and it is evident that nothing changes.
3.2 The Separation Oracle
Our main result in this section is an approximate separation oracle for . Throughout this section, let and set . Moreover, let . Our first step is to show that any set of weights that does not yield a good estimate for cannot be in the set :
Suppose that (8)-(9) holds. Suppose that . Then
By Fact 4.2 and (9) we have . Now by the triangle inequality we have
Using the fact that the variance is nonnegative we have
where in the last inequality we have used Fact 4.2 and (8). Hence altogether this implies that
As a corollary, we find that any set of weights in immediately yields a good estimate for :
Suppose that (8) and (9) hold. Let for . Then
Our main result in this section is an approximate separation oracle for with .
Fix , and let Suppose that (8) and (9) hold. Let denote the weights which are uniform on the uncorrupted points. Then there is a constant and an algorithm such that:
(Completeness) If , then it outputs “YES”.
We remark that these two facts imply that for any , the ellipsoid method with this separation oracle will output a such that , for some in steps.
The conditions that the separation oracle given here satisfy are slightly weaker than the traditional guarantees, given, for instance, in [GLS88]. However, the correctness of the ellipsoid algorithm with this separation oracle follows because outside , the separation oracle acts exactly as a separation oracle for . Thus, as long as the algorithm continues to query points outside of , the action of the algorithm is equivalent to one with a separation oracle for . Moreover, the behavior of the algorithm is such that it will never exclude , even if queries are made within . From these two conditions, it is clear from the classical theory presented in [GLS88] that the ellipsoid method satisfies the guarantees given above.
The separation oracle is given in Algorithm 2. Next, we prove correctness for our approximate separation oracle:
Again, let , and let . By expanding out the formula for , we get:
Suppose . Then .
Recall that are the weights that are uniform on the uncorrupted points. Because we have that . We can now use (8) to conclude that . Now by Corollary 4.16 we have that . Thus
Suppose that . Then .
Let us now split into two cases. If , then the first term above is at least by definition and we can conclude that . On the other hand, if , by Lemma 4.15, we have that
which for sufficiently small also yields . ∎
Suppose . By (11) we immediately have:
since . On the other hand, if then by (10) we have . Putting it all together we have:
where in the last line we used the fact that , and . This now completes the proof. ∎
3.3 The Full Algorithm
This separation oracle, along with the classical theory of convex optimization [GLS88], implies that we have shown the following:
Fix , and let . Let be an -corrupted set of points satisfying (8)-(9), for and . Let be a sufficiently large constant. Then, there is an algorithm which runs in time , and outputs a set of weights such that there is a such that .
This algorithm, while an extremely powerful primitive, is technically not sufficient. However, given this, the full algorithm is not too difficult to state: simply run NaivePrune, then optimize over using this separation oracle, and get some which is approximately in . Then, output . For completeness, the pseudocode for the algorithm is given below. In the pseudocode, we assume that is a convex optimization routine, which given the SeparationOracleUnknownMean separation oracle and a target error , outputs a such that . From the classical theory of optimization, we know such a routine exists and runs in polynomial time.
Fix , and let . Let be an -corrupted set of samples, where
Let be the output of . Then with probability , we have .
By Fact 4.6, Lemma 4.3, and Lemma 4.4, we know that (7)-(9) hold with probability , with . Condition on the event that this event holds. After NaivePrune, by Fact 4.14 we may assume that no uncorrupted points are removed, and all points satisfy . Let be the output of the algorithm, and let be such that . By Corollary 4.16, we know that . Hence, we have
so the entire error is at most , as claimed. ∎
4 Finding the Covariance, Using a Separation Oracle
In this section, we consider the problem of approximating given samples from in the full adversary model. Let such that if then . Moreover let . Our approach will parallel the one given earlier in Section 4.3. Again, we will work with a convex set
and our goal is to design an approximate separation oracle. Our results in this section will rely on the following deterministic conditions:
for all , and all sets of size . As before, by Fact 4.2, the renormalized weights over the uncorrupted points are in . Hence, we can appeal to Fact 4.6, Corollary 4.8, Corollary 4.9, and Theorem 4.13 with instead of to bound the probability that this event does not hold. Let be the set of weights which are uniform over the uncorrupted points; by (13) for we have that .
Let . Suppose that (13), (14), and 15 hold for and . Then, there is a constant and an algorithm such that, given any input we have:
(Completeness) If , the algorithm outputs “YES”.
As in the case of learning an unknown mean, by the classical theory of convex optimization this implies that we will find a point such that for some , using polynomially many calls to this oracle. We make this more precise in the following subsubsection.
The pseudocode for the (approximate) separation oracle is given in Algorithm 4. Observe briefly that this algorithm does indeed run in polynomial time. Lines 2-7 require only taking top eigenvalues and eigenvectors, and so can be done in polynomial time. For any , line 8 can be run by sorting the samples by and seeing if there is a subset of the top samples satisfying the desired condition, and line 9 can be executed similarly.
We now turn our attention to proving the correctness of this separation oracle. We require the following technical lemmata.
Fix and suppose that is symmetric. If then .
We will prove this lemma in the contrapositive, by showing that if then . Since the Frobenius norm is rotationally invariant, we may assume that , where by assumption . By our assumption that , we have for all . Thus
where we have used the inequality which holds for all . This completes the proof. ∎
Let be the columns of . Then
so the desired result follows by taking square roots of both sides. ∎
By the definition of , we have
By self duality of the Frobenius norm, we know that
since . The result now follows. ∎
Let us first prove completeness. Observe that by Theorem 4.12, we know that restricted to , we have that . Therefore, by (15) we will not output a hyperplane in line 7. Moreover, by (14), we will not output a hyperplane in line 8. This proves completeness.
Thus it suffices to show soundness. Suppose that . We will make use of the following elementary fact:
Let and . Then
In particular . Using this expression and the fact that all the matrices involved are symmetric, we can write
where in the third line we have used the fact that the trace of a product of matrices is preserved under cyclic shifts. ∎
Assume (13) holds with and assume furthermore that . Then, if we let , we have
Let be as in Fact 4.28. Combining Lemma 4.25 and Fact 4.28 we have
We can rewrite (13) as the expression where is symmetric and satisfies . By the definition of we have that , and so
by applying Lemma 4.26. And putting it all together we have
It is easily verified that for , we have that for all , if , then
where . The desired result then follows from the Pythagorean theorem. ∎
Claim 4.29 tells us that if we know that one of the terms in (17) must be at least . We first show that if the first term is large, then the algorithm outputs a separating hyperplane:
Assume that (13)-(15) hold with and . Moreover, suppose that
Then the algorithm outputs a hyperplane in line 7, and moreover, it is a separating hyperplane.
Let us first show that given these conditions, then the algorithm indeed outputs a hyperplane in line 7. Since , the first term is just equal to . But this implies that there is some such that and such that
The are a set of weights satisfying the conditions of Claim 4.24 and so this implies that
Let . By Theorem 4.12 and (15), we have that
since it is easily verified that as long as , which it is by (17).
Equations 18 and 19 then together imply that
Now let us assume that the first term on the LHS is less than , such that the algorithm does not necessarily output a hyperplane in line 7. Thus, the second term on the LHS of Equation 16 is at least . We now show that this implies that this implies that the algorithm will output a separating hyperplane in line 9.
Assume that (13)-(15) hold. Moreover, suppose that
Then the algorithm outputs a hyperplane in line 9, and moreover, it is a separating hyperplane.
By the definition of , the assumption implies that
which is equivalent to the condition that
for some . In particular, the algorithm will output a hyperplane
in Step 9, where is some set of size at most , and . Since it will not affect anything, for without loss of generality let us assume that . The other case is symmetrical.
where . Hence,
as long as . By self-duality of the Frobenius norm, using the test matrix , this implies that
These two claims in conjunction directly imply the correctness of the theorem. ∎
As before, this separation oracle and the classical theory of convex optimization [GLS88] shows that we have demonstrated an algorithm FindApproxCovariance with the following properties:
Fix , and let . Let be a universal constant which is sufficiently large. Let be an -corrupted set of points satisfying (13-(15), for and . Then runs in time , and outputs a such that there is some such that .
As before, this is not quite sufficient to actually recover the covariance robustly. Naively, we would just like to output . However, this can run into issues if there are points such that is extremely large. We show here that we can postprocess the such that we can weed out these points. First, observe that we have the following lemma:
Assume satisfy (13). Let . Then
This follows since by (13), we have that . The lemma then follows since always. ∎
Let be such that there is such that . Then, .
By the definition of , we must have that . Moreover, we must have for every index . Thus we have that , and hence , since . ∎
We now give the full algorithm. The algorithm proceeds as follows: first run FindApproxCovariance to get some set of weights which is close to some element of . We then compute the empirical covariance with the weights , and remove any points which have which are too large. We shall show that this removes no good points, and removes all corrupted points which have which are absurdly large. We then rerun FindApproxCovariance with this pruned set of points, and output the empirical covariance with the output of this second run. Formally, we give the pseudocode for the algorithm in Algorithm 5.
We now show that this algorithm is correct.
Let , and . Let . Let be a -corrupted set of samples from where
Let be the output of . Then with probability , .
We first condition on the event that we satisfy (12)-(15) with and . By our choice of , Fact 4.6, Corollary 4.7, Corollary 4.9, and Theorem 4.13, and a union bound, we know that this event happens with probability .
By Theorem 4.32, Lemma 4.33, and Lemma 4.34, we have that since is sufficiently small,
In particular, this implies that for every vector , we have
Therefore, by (12), we know that in line 6, we never throw out any uncorrupted points, and moreover, if is corrupted with , then it is thrown out. Thus, let be the set of pruned points. Because no uncorrupted point is thrown out, we have that , and moreover, this set of points still satisfies (13)-(15)Technically, the samples satisfy a slightly different set of conditions since we may have thrown out some corrupted points, and so in particular the number of samples may have changed, but the meaning should be clear. and moreover, for ever , we have . Therefore, by Theorem 4.32, we have that there is some such that . But now if , we have
5 Learning an Arbitrary Gaussian Agnostically
We have shown how to agnostically learn the mean of a Gaussian with known covariance, and we have shown how to agnostically learn the covariance of a mean zero Gaussian. In this section, we show how to use these two in conjunction to agnostically learn an arbitrary Gaussian. Throughout, let be an -corrupted set of samples from , where both and are unknown. We will set
We first show a simple trick which, at the price of doubling the amount of error, allows us to assume that the mean is zero, without changing the covariance. We do so as follows: for each , let . Observe that if both and are uncorrupted, then . Moreover, observe that is corrupted only if either or is corrupted. Then we see that if is -corrupted, then the is a -sized set of samples which is -corrupted. Thus, by using the results from Section 4.4, with probability , we can recover a such that
which in particular by Corollary 2.14, implies that
5.2 From Unknown Mean, Approximate Covariance, to Approximate Recovery
For each , let . Then, for which is not corrupted, we have that , where . By Corollary 2.14 and Lemma 4.25, if (20) holds, then we have
By Claim 2.5, this means that if (20) holds, the uncorrupted set of can be treated as an -corrupted set of samples from . Thus, if (20) holds, the entire set of samples is a -corrupted set of samples from . Then, by using results from Section 4.3, with probability , assuming that 20 holds, we can recover a such that . Thus, by Corollary 2.13, this implies that
which in conjunction with (21), implies that
and thus by following this procedure, whose formal pseudocode is given in Algorithm 6, we have shown the following:
Fix . Let be an -corrupted set of samples from , where are both unknown, and
There is a polynomial-time algorithm which with probability , outputs a such that
Agnostically Learning a Gaussian, via Filters
In this section, we use our filter technique to give an agnostic learning algorithm for an unknown mean Gaussian with known covariance matrix. More specifically, we prove:
Notation. We will denote and for the sample mean and modified sample covariance matrix of the set .
We start by defining our notion of good sample, i.e, a set of conditions on the uncorrupted set of samples under which our algorithm will succeed.
For all we have .
We have that
We have that
We show in Appendix B that a sufficiently large set of independent samples from is -good (with respect to ) with high probability. Specifically, we prove:
Let be a Gaussian distribution with identity covariance, and If the multiset is obtained by taking independent samples from it is -good with respect to with probability at least
We require the following definition that quantifies the extent to which a multiset has been corrupted:
Given finite multisets and we let be the size of the symmetric difference of and divided by the cardinality of
As in the convex program case, we will first use NaivePrune to remove points which are far from the mean. Then, we iterate the algorithm whose performance guarantee is given by the following:
A mean vector such that
We start by showing how Theorem 5.1 follows easily from Proposition 5.5.
By the definition of since has been obtained from by corrupting an -fraction of the points in we have that By Lemma 5.3, the set of uncorrupted samples is -good with respect to with probability at least We henceforth condition on this event.
Since is -good, all have . Thus, the NaivePrune procedure does not remove from any member of . Hence, its output, , has and for any , there is a with . By the triangle inequality, for any , .
Then, we iteratively apply the Filter-Gaussian-Unknown-Mean procedure of Proposition 5.5 until it terminates returning a mean vector with We claim that we need at most iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets such that Thus, if we do not output the empirical mean in the first iterations, in the next iteration there are no outliers left. Hence in the next iteration it is impossible for the algorithm to output a subset satisfying Condition (ii) of Proposition 5.5, so it must output a mean vector satisfying (i), as desired. ∎
In this subsection, we describe the efficient algorithm establishing Proposition 5.5 and prove its correctness. Our algorithm calculates the empirical mean vector and empirical covariance matrix . If the matrix has no large eigenvalues, it returns Otherwise, it uses the eigenvector corresponding to the maximum magnitude eigenvalue of and the mean vector to define a filter. Our efficient filtering procedure is presented in detailed pseudocode below.
1.2 Proof of Correctness of Filter-Gaussian-Unknown-Mean
We define and to be the means of and , respectively.
Our analysis will make essential use of the following matrices:
Our analysis will hinge on proving the important claim that is approximately . This means two things for us. First, it means that if the positive errors align in some direction (causing to have a large eigenvalue), there will be a large eigenvalue in . Second, it says that any large eigenvalue of will correspond to an eigenvalue of , which will give an explicit direction in which many error points are far from the empirical mean.
Useful Structural Lemmas. We will use the following simple fact about the concentration of Gaussian random variables:
We begin by noting that we have concentration bounds on and therefore, on due to its goodness.
The first line is Fact 5.6, and the former follows from it using the goodness of . ∎
By using the above fact, we obtain the following simple claim:
This follows from Fact 5.7 upon noting that only if . ∎
We can use the above facts to prove concentration bounds for . In particular, we have the following lemma:
We have that
For unit vectors , the RHS is bounded from above as follows:
where the second line follows from the fact that , , and satisfies condition (i) of Definition 5.2; the third line follows from (22); and the fourth line follows from Fact 5.7. ∎
As a corollary, we can relate the matrices and , in spectral norm:
We have that , where the term denotes a matrix of spectral norm .
By definition, we have that . Thus, we can write
where the second line uses the fact that , the goodness of (condition (iv) in Definition 5.2), and Lemma 5.9. Specifically, Lemma 5.9 implies that . Therefore, we have that
We now establish a similarly useful bound on the difference between the mean vectors:
Since is a good set, by condition (iii) of Definition 5.2, we have . Since , it follows that . Using the valid inequality and Lemma 5.9, we obtain that . Therefore,
as desired. This completes the proof of the lemma. ∎
By combining the above, we can conclude that is approximately proportional to . More formally, we obtain the following corollary:
We have , where the additive terms denote matrices of appropriately bounded spectral norm.
By definition, we can write Using Corollary 5.10 and Lemma 5.11, we obtain:
where the second line follows from the valid inequality . This completes the proof. ∎
On the other hand, since , Lemma 5.11 gives that
Case of Large Spectral Norm. We next show the correctness of the algorithm when it returns a filter in Step 7.
Moreover, using the inequality and Lemma 5.11 as above, we get that
Suppose for the sake of contradiction that for all we have that
Using (24), we obtain that for all we have that
for some universal constant .
We now have the following sequence of inequalities:
Combined with (23), we obtain , which is a contradiction if is sufficiently large. Therefore, it must be the case that for some value of the condition in Step 7 is satisfied.
Recall that with and disjoint multisets such that We can similarly write with and Since
it suffices to show that Note that is the number of points rejected by the filter that lie in Note that the fraction of elements of that are removed to produce (i.e., satisfy ) is at most . This follows from Claim 5.8 and the fact that .
Hence, it holds that On the other hand, Step 7 of the algorithm ensures that the fraction of elements of that are rejected by the filter is at least . Note that is the number of points rejected by the filter that lie in Therefore, we can write:
where the second line uses the fact that and the last line uses the fact that Noting that , this completes the proof of the claim. ∎
2 Learning a Gaussian With Unknown Covariance
In this subsection, we use our filter technique to agnostically learn a Gaussian with zero mean vector and unknown covariance. By combining the algorithms of the current and the previous subsections, as in our convex programming approach (Section 4.5), we obtain a filter-based algorithm to agnostically learn an arbitrary unknown Gaussian.
The main result of this subsection is the following theorem:
Let be a Gaussian in dimensions with mean and unknown covariance, and let . Let be an -corrupted set of samples from of size . There exists an efficient algorithm that, given and , returns the parameters of a Gaussian distribution such that with probability at least , it holds
As in the previous subsection, we will need a condition on under which our algorithm will succeed.
For all , .
We have that .
For all even degree- polynomials , we have that .
Let us first note some basic properties of such polynomials on a normal distribution. The proof of this lemma is deferred to Section B.
and
For all ,
We note that, if is obtained by taking random samples from , then is good with high probability. The proof of this lemma is also deferred to Section B.
Let be a -dimensional Gaussian with mean and let . Let be a sufficiently large constant multiple of . Then a set of independent samples from is -good with respect to with probability at least .
As in Definition 5.4, is the size of the symmetric difference of and divided by .
The basic thrust of our algorithm is as follows: By Lemma 5.17, with high probability we have that is -good with respect to . The algorithm is then handed a new set such that . The algorithm will run in stages. In each stage, the algorithm will either output or will return a new set such that . In the latter case, the algorithm will recurse on . We formalize this idea below:
Given Proposition 5.18, the proof of Theorem 5.14 is straightforward. By Lemma 5.17 the original set is -good with respect to with probability at least . Then, satisfies the hypotheses of Proposition 5.18. We then repeatedly iterate the algorithm from Proposition 5.18 until it outputs a distribution close to . This must eventually happen because at every step the distance between and the set returned by the algorithm decreases by at least .
The function Find-max-poly uses similar notation to SeparationOracleUnknownCovariance, such that Filter-Gaussian-Unknown-Covariance and SeparationOracleUnknownCovariance can be more easily compared.
Let us first show that Find-max-poly is correct.
Algorithm Find-max-poly is correct and Filter-Gaussian-Unknown-Covariance runs time
Note that since the covariance matrix of is we have
We let be the multiset of for and the multiset of for in . Recall that . We thus have
Thus, the that maximizes is given by the unit vector that maximizes subject to being symmetric.
Let . Note that by symmetries of . Thus, by linearity, also has . However, if is not symmetric, has . Thus, the unit vector achieves a higher value of the bilinear form. Consequently, is symmetric.
To prove Lemma 5.20, we will require the following:
This holds essentially because the distribution of for is close to that for , which has rapidly decaying tails. Therefore, throwing away an -fraction of the mass cannot change the value of the variance by very much. In particular, we have that
Note that, since the matrix inner product is an inner product,
Let denote the quadratic polynomial
As a corollary of this we note that cannot be too much smaller than .
The first step in verifying correctness is to note that if our algorithm returns on Step 6 that it does so correctly.
If our algorithm returns on Step 6, then
This is clearly true if we can show that all removed have . However, this follows because , and therefore, by -goodness, all satisfy
By Corollary 2.14, it suffices to show that
Therefore, we will have an appropriate bound unless
Next, note that there is a matrix with such that
Indeed we can take . Thus, there is a symmetric such that this holds.
This shows that if the algorithm returns in this step, it does so correctly. ∎
Next, we need to show that if the algorithm reaches Step 13 that such a exists.
If the algorithm reaches Step 13, then there exists a such that
Before we begin, we will need the following critical Lemma:
We note that since , we just need to show that the variance with respect to instead of is not too much larger. This will essentially be because the covariance matrix of cannot be much bigger than the covariance matrix of by Corollary 5.22.
We claim that if are matrices, then . If are the columns of , then we have . Similarly for rows, we have .
Next, we need to consider . In particular, we note that by the similarity of and , must be between the and percentiles of values of for . However, since is -good, this must be between the and percentiles of . Therefore, by Cantelli’ s inequality,
In particular, we have that , which means that
Hence, for sufficiently large, it must be the case that
For a sufficiently large , this contradicts Equation (28). ∎
Finally, we need to verify that if our algorithm returns output in Step 14, that it is correct.
If the algorithm returns during Step 14, then
We note that it is sufficient to show that . In particular, it suffices to show that
On the other hand, using (27) and the -goodness of , we have that
Agnostically Learning a Mixture of Spherical Gaussians, via Convex Programming
In this section, we give an algorithm to agnostically learn a mixture of Gaussians with identical spherical covariance matrices up to error . Let be the unknown -GMM each of whose components are spherical. For , we write if was drawn from the th component of .
Our main result of this section is the following theorem:
There is an algorithm which with probability , outputs a distribution such that
The running time of the algorithm is .
Our overall approach will be a combination of our method for agnostically learning a single Gaussian and recent work on properly learning mixtures of multivariate spherical Gaussians [SOAJ14, LS17]. At a high level, this recent work relies upon the empirical covariance matrix giving an accurate estimate of the overall covariance matrix in order to locate the subspace in which the component mean lie. However, as we have observed already, the empirical moments do not necessarily give good approximations of the true moments in the agnostic setting. Therefore, we will use our separation oracle framework to approximate the covariance matrix, and the rest of the arguments follow similarly as previous methods.
The organization of this section will be as follows. We define some of the notation we will be using and the Schatten top- norm in Section 6.1. Section 6.2 states the various concentration inequalities we require. In Section 6.3, we go over our overall algorithm in more detail. Section 6.4 describes a first naive clustering step, which deals with components which are very well separated. Section 6.5 contains details on our separation oracle approach, allowing us to approximate the true covariance. Section 6.6 describes our spectral clustering approach to cluster components with means separated more than . In Section 6.7, we describe how to exhaustively search over a particular subspace to obtain a good estimate for the component means. In Section 6.8, we go over how to limit the set of hypotheses in order to satisfy the conditions of Lemma 2.23. For clarity of exposition, all of the above describe the algorithm assuming all are equal. In Section 6.9, we discuss the changes to algorithm which are required to handle unequal variances.
For conciseness, many of the proofs are deferred to Section C.
Recall the definition of from Section 4.1, which we will use extensively in this section. We will use the notation to denote the mean of the unknown GMM. Also, we define parameters and let . And for ease of notation, let
to denote the covariance of the unknown GMM. Our algorithm for learning spherical -GMMs will rely heavily on the following, non-standard norm:
i.e., it is the sum of the top- singular values of .
It is easily verified that has a dual characterization
where the maxima is taken over all with orthonormal columns. From this, it is easy to see that the Schatten top- norm is indeed a norm, as its name suggests:
is a norm on symmetric matrices.
2 Concentration Inequalities
In this section, we will establish some concentration inequalities that we will need for our algorithm for agnostically learning mixtures of spherical Gaussians. Recall the notation as described in Section 6.1. The following two concentration lemmata follow from the same proofs as for Lemmata 42 and 44 in [LS17].
Fix . If are independent samples from the GMM with PDF where for all , and then with probability at least ,
where is defined as in equation (29).
Fix . If are independent samples from the GMM with PDF where for all , and then with probability at least ,
From the same techniques as before, we get the same sort of union bounds as usual over the weight vectors:
Fix and . There is a such that if are independent samples from the GMM with PDF where for all , and , then
where is defined as in equation (29).
Fix and . There is a such that if are independent samples from the GMM with PDF where for all , and , then
3 Algorithm
Our algorithm is based on the following deterministic conditions:
(32) follows from basic Gaussian concentration, and (33) and (34) follow from the results in Section 6.2 for sufficiently large. Note that these trivially imply similar conditions for the Schatten top- norm, at the cost of an additional factor of on the right-hand side of the inequalities. For the rest of this section, let .
At this point, we are ready to apply our separation oracle framework. In particular, we will find a weight vector over the points such that
for some choice of . The set of such weights is convex, and concentration implies that the true weight vector will have this property. Furthermore, we can describe a separation oracle given any weight vector not contained in this set (as long as is not too small). At this point, we use classical convex programming methods to find a vector which satisfies these conditions. Further details are provided in Section 6.5.
After this procedure, Lemma 6.13 shows that the weighted empirical covariance is spectrally close to the true covariance matrix. We are now in the same regime as [SOAJ14], which obtains their results as a consequence of the empirical covariance concentrating about the true covariance matrix. Thus, we will appeal to their analysis, highlighting the differences between our approach and theirs. We note that [LS17] also follows a similar approach and the interested reader may also adapt their arguments instead.
First, if is sufficiently large (i.e., ), this implies a separation condition between some component mean and the mixture’s mean. This allows us to cluster the points further, using a spectral method. We take the top eigenvector of the weighted empirical covariance matrix and project in this direction, using the sign of the result as a classifier. In contrast to previous work, which requires that no points are misclassified, we can tolerate misclassifications, since our algorithms are agnostic. This crucially allows us to avoid a dependence on in our overall agnostic learning guarantee. Further details are provided in Section 6.6.
Finally, if is sufficiently small, we may perform an exhaustive search. The span of the means is in the span of the top eigenvectors of the true covariance matrix, which we can approximate with our weighted empirical covariance matrix. Since is small, by trying all points within a sufficiently tight mesh, we can guess a set of candidate means which are sufficiently close to the true means. Combining the approximations to the means with Corollary 2.13 and the triangle inequality, we can guarantee that at least one of our guesses is sufficiently close to the true distribution. Additional details are provided in Section 6.7.
To conclude our algorithm, we can apply Lemma 2.23. We note that this hypothesis selection problem has been studied before (see, e.g., [DL01, DK14]), but we must adapt it for our agnostic setting. This allows us to select a hypothesis which is sufficiently close to the true distribution, thus concluding the proof. We note that the statement of Lemma 2.23 requires the hypotheses to come from some fixed finite set, while there are an infinite number of Gaussian mixture models. In Section 6.8, we discuss how to limit the number of hypotheses based on the set of uncorrupted samples in order to satisfy the conditions of Lemma 2.23.
4 Naive Clustering
We give a very naive clustering algorithm, the generalization of NaivePrune, which recursively allows us to cluster components if they are extremely far away. The algorithm is very simple: for each , add all points within distance to a cluster . Let be the set of clusters which contain at least points, and let the final clustering be be formed by merging clusters in if they overlap, and stopping if no clusters overlap. We give the pseudocode in Algorithm 10.
We prove here that this process (which may throw away points) throws away only at most a fraction of good points, and moreover, the resulting clustering only misclassifies at most an -fraction of the good points, assuming (32).
Thus, we have that by applying this algorithm, given an -corrupted set of samples from , we may cluster them in a way which misclassifies at most an fraction of the samples from any component, and such that within each cluster, the means of the associated components differ by at most . Thus, each separate cluster is simply a -corrupted set of samples from the mixture restricted to the components within that cluster; moreover, the number of components in each cluster must be strictly smaller than . Therefore, we may simply recursively apply our algorithm on these clusters to agnostically learn the mixture for each cluster, since if , this is a single Gaussian, which we know how to learn agnostically.
Thus, for the remainder of this section, let us assume that for all , we have . Moreover, we may assume that there are no points (corrupted or uncorrupted), such that .
5 Estimating the Covariance Using Convex Programming
In this section, we will apply our separation oracle framework to estimate the covariance matrix. While in the non-agnostic case, the empirical covariance will approximate the actual covariance, this is not necessarily true in our case. As such, we will focus on determining a weight vector over the samples such that the weighted empirical covariance is a good estimate for the true covariance.
We first define the convex set for which we want an interior point:
In Section 6.5.1, we prove lemmata indicating important properties of this set. In Section 6.5.2, we give a separation oracle for this convex set. We conclude with Lemma 6.13, which shows that we have obtained an accurate estimate of the true covariance.
We start by proving the following lemma, which states that for any weight vector which is not in our set, the weighted empirical covariance matrix is noticeably larger than it should be (in Schatten top- norm).
Suppose that (33) holds, and . Then
We also require the following lemma, which shows that if a set of weights poorly approximates , then it is not in our convex set.
Suppose that (33) and (34) hold. Let and set and . Furthermore, suppose that . Then
By contraposition, if a set of weights is in our set, then it provides a good approximation for :
Suppose that (33) and (34) hold. Let for . Then
5.2 Separation Oracle
In this section, we provide a separation oracle for . In particular, we have the following theorem:
Fix , and let Suppose that (33) and (34) hold. Let denote the weights which are uniform on the uncorrupted points. Then there is a constant and an algorithm such that:
(Completeness) If , then it outputs “YES”.
These two facts imply that the ellipsoid method with this separation oracle will terminate in steps, and moreover, will with high probability output a such that for some . Moreover, it will do so in polynomially many iterations.
After running this procedure, we technically do not have a set of weights in . But by the same argument as in Section 4.3, because the maximum distance between two points within any cluster is bounded, and we have the guarantee that for all , we may assume we have a set of weights satisfying
We require the following lemma, describing the accuracy of the empirical covariance matrix with the obtained weights.
Let . After running the algorithm above, we have a vector such that
By triangle inequality and Corollary 6.11,
6 Spectral Clustering
Now that we have a good estimate of the true covariance matrix, we will perform spectral clustering while is sufficiently large. We will adapt Lemma 6 from [SOAJ14], giving the following lemma:
Given a weight vector as output by Algorithm 11, if , there exists an algorithm which produces a unit vector with the following guarantees:
There exists a non-trival partition of into and such that for all and for all ;
The probability of a sample being misclassified is at most , where a misclassification is defined as a sample generated from a component in having , or a sample generated from a component in having .
The algorithm will be as follows. Let be the top eigenvector of
For a sample , cluster it based on the sign of . After performing this clustering, recursively perform our algorithm from the start on the two clusters.
The proof is very similar to that of Lemma 6 in [SOAJ14]. Their main concentration lemma is Lemma 30, which states that they obtain a good estimate of the true covariance matrix, akin to our Lemma 6.13. Lemma 31 argues that the largest eigenvector of this estimate is highly correlated with the top eigenvector of the true covariance matrix. Since is large, this implies there is a large margin between the mean and the hyperplane. However, by standard Gaussian tail bounds, the probability of a sample landing on the opposite side of this hyperplane is small.
We highlight the main difference between our approach and theirs. For their clustering step, they require that no sample is misclustered with high probability. As such, they may perform spectral clustering while . We note that, in the next step of our algorithm, we will perform an exhaustive search. This will result in an approximation which depends on the value of at the start of the step, and as such, using the same approach as them would result in an overall approximation which depends logarithmically on the dimension.
We may avoid paying this cost by noting that our algorithm is agnostic. They require that no sample is misclustered with high probability, while our algorithm tolerates that a -fraction of points are misclustered. As such, we can continue spectral clustering until .
7 Exhaustive Search
The final stage of the algorithm is when we know that all ’s are sufficiently small. We can directly apply the following lemma:
Given a weight vector as output by Algorithm 11, then the projection of onto the space orthogonal to the span of the top eigenvectors of
By taking a -wise Cartesian product of this set, we are guaranteed to obtain a vector which has this guarantee simultaneously for all .
8 Applying the Tournament Lemma
In this section, we discuss details about how to apply our hypothesis selection algorithm. First, in Section 6.8.1, we describe how to guess the mixing weights and the variance of the components. Then in Section 6.8.2, we discuss how to ensure our hypotheses come from some fixed finite set, in order to deal with technicalities which arise when performing hypothesis selection with our adversary model.
The majority of our algorithm is focused on generating guesses for the means of the Gaussians. In this section, we guess the remaining parameters: the mixing weights and the variance. While most of these guessing arguments are standard, we emphasize that we reap an additional benefit because our algorithm is agnostic. In particular, most algorithms must deal with error incurred due to misspecification of the parameters. Since our algorithm is agnostic, we can pretend the misspecified parameter is the true one, at the cost of increasing the value of the agnostic parameter . If our misspecified parameters are accurate enough, the agnostic learning guarantee remains unchanged.
Guessing the mixing weights is fairly straightforward. For some sufficiently small, our algorithm generates a set of at most possible mixing weights by guessing the values for each . Note that we may assume each weight is at least , since components with weights less than this can be specified arbitrarily at a total cost of in total variation distance.
Next, we need to guess the variance of the components. To accomplish this, we will take samples (hoping to find only uncorrupted ones) and compute the minimum distance between any pair of them. Since we assume , we can repeatedly draw samples until we have the guarantee that at least one set is uncorrupted. If none of the samples are corrupted, then at least two of them came from the same component, and in our high-dimensional setting the distance between any pair of samples from the same component concentrates around . After rescaling this distance, we can then multiplicatively enumerate around this value with granularity to get an estimate for that is sufficiently good for our purposes. Applying Corollary 2.14 bounds the cost of this misspecification by . By rescaling the points, we may assume that .
8.2 Pruning Our Hypotheses
In this section, we describe how to prune our set of hypotheses in order to apply Lemma 2.23. Recall that this lemma requires our hypotheses to come from some fixed finite set, rather than the potentially infinite set of GMM hypotheses. We describe how to prune and discretize the set of hypotheses obtained during the rest of the algorithm to satisfy this condition. For the purposes of this section, a hypothesis will be a -tuple of -dimensional points, corresponding only to the means of the components. While the candidate mixing weights already come from a fixed finite set (so no further work is needed), the unknown variance must be handled similarly to the means. The details for handling the variance are similar to (and simpler than) those for handling the means, and are omitted.
More precisely, this section will describe a procedure to generate a set of hypotheses , which is exponentially large in and , efficiently searchable, and comes from a finite set of hypotheses which are fixed with respect to the true distribution. Then, given our set of hypotheses generated by the main algorithm (which is exponentially large in but polynomial in ), we iterate over this set, either replacing each hypothesis with a “close” hypothesis from (i.e., one which is within total variation distance), or discarding the hypothesis if none exists. Finally, we run the tournament procedure of Lemma 2.23 on the resulting set of hypotheses.
At a high level, the approach will be as follows. We will take a small set of samples, and remove any samples from this set which are clear outliers (due to having too few nearby neighbors). This will give us a set of points, each of which are within a reasonable distance from some component mean. Taking a union of balls around these samples will give us a space that is a subset of a union of (larger) balls centered at the component centers. We take a discrete mesh over this space to obtain a fixed finite set of possible means, and round each hypothesis such that its means are within this set.
We start by taking samples, which is sufficient to ensure that the number of (uncorrupted) samples from component will be for all with probability . Recall that we are assuming that for all , as all other components may be defined arbitrarily at the cost of in total variation distance. This implies that even after corruption, each component has generated at least uncorrupted samples.
By standard Gaussian concentration bounds, we know that if samples are taken from a Gaussian, the maximum distance between a sample and the Gaussian’s mean will be at most with probability . Assume this condition holds, and thus each component’s mean will have at least points within distance . We prune our set of samples by removing any point with fewer than other points at distance less than . This will not remove any uncorrupted points, by the above assumption, and triangle inequality. However, this will remove any corrupted points at distance at least from all component means, due to the fact that the adversary may only move an -fraction of the points, and reverse triangle inequality.
Now, we consider the union of the balls of radius centered at each of the remaining points. This set contains all of the component means, and is also a subset of the union of the balls of radius centered at the component means. We discretize this set by taking its intersection with a lattice of side-length . We note that any two points in this discretization are at distance at most . By a volume argument, the number of points in the intersection is at most . Each hypothesis will be described by the -wise Cartesian product of these points, giving us a set of at most hypotheses.
Given a set of hypotheses from the main algorithm, we prune it using as a reference. For each , we see if there exists some such that the means in are at distance at most from the corresponding means in .We observe that the complexity of this step is polynomial in and , not exponential, if one searches for the nearest lattice point in the sphere surrounding each unpruned sample, rather than performing a naive linear scan over the entire list. If such an exists, we replace with – otherwise, is simply removed. By Corollary 2.13 and the triangle inequality, this replacement incurs a cost of in total variation distance. At this point, the conditions of Lemma 2.23 are satisfied and we may run this procedure to select a sufficiently accurate hypothesis.
9 Handling Unequal Variances
In this section, we describe the changes required to allow the algorithm to handle different variances for the Gaussians. The main idea is to find the minimum variance of any component and perform clustering so we only have uncorrupted samples from Gaussians with variances within some known, polynomially-wide interval. This allows us to grid within this interval in order to guess the variances, and the rest of the algorithm proceeds with minor changes.
The first step is to locate the minimum variance of any component. Again using standard Gaussian concentration, in sufficiently high dimensions, if samples are taken from a Gaussian with variance , the distance between any two samples will be concentrated around . With this in hand, we use the following procedure to estimate the minimum variance. For each sample , record the distance to the st closest sample. We take the st smallest of these values, rescale it by , and similar to before, guess around it using a multiplicative grid, which will give us an estimate for the smallest variance. We note that discarding the smallest fraction of the points prevents this statistic from being grossly corrupted by the adversary. For the remainder of this section, assume that is known exactly.
At this point, we partition the points into those that come from components with small variance, and those with large variance. We will rely upon the following concentration inequality from [SOAJ14], which gives us the distance between samples from different components:
Given samples from a collection of Gaussian distributions, with probability , the following holds for every pair of samples :
where and .
This procedure works due to the following argument. When we compute , we are guaranteed that it will contain all samples from components with variance , by the upper bound in Lemma 6.16. However, it may also contain samples from other components – in particular, those with variance at most , for
After this clustering step, the algorithm follows similarly to before. The main difference is in the convex programming steps and concentration bounds. For instance, before, we considered the set
Now, to reflect the different expression for the covariance of the GMM, we replace with ; for example:
We note that since all variances in each cluster are off by a factor of at most , this will only affect our concentration and agnostic guarantees by a constant factor.
Agnostically Learning Binary Product Distributions, via Filters
where is the modified empirical covariance. Ultimately, we show that this yields an estimate that is close in total variation distance. See Section 7.2 for further details.
The main result of this section is the following theorem:
Let be a binary product distribution in dimensions and Let be a multiset of independent samples from and be a multiset obtained by arbitrarily changing an -fraction of the points in There exists a polynomial time algorithm that returns a product distribution such that, with probability at least , we have where and are the mean vectors of and respectively.
For we say that a binary product distribution is -balanced if the expectation of each coordinate is in
For -balanced binary product distributions, we have the following corollary of Lemma 2.17:
We start by defining a condition on the uncorrupted set of samples under which our algorithm will succeed.
The following simple lemma shows that a sufficiently large set of independent samples from is -good (with respect to ) with high probability.
Let be an arbitrary distribution on and If the multiset is obtained by taking independent samples from it is -good with respect to with probability at least
Recall (see Definition 5.4) that is the size of the symmetric difference of and divided by the cardinality of
Our agnostic learning algorithm establishing Theorem 7.1 is obtained by repeated application of the efficient procedure whose performance guarantee is given in the following proposition:
Let be a binary product distribution with mean vector and be sufficiently small. Let be -good with respect to , and be any multiset with There exists a polynomial time algorithm Filter-Balanced-Product that, given and returns one of the following:
A mean vector such that
A multiset such that
We start by showing how Theorem 7.1 follows easily from Proposition 7.7.
The proof of Theorem 7.1 is very similar to that of Theorem 5.1, however, we include it here for completeness. By the definition of since has been obtained from by corrupting an -fraction of the points in we have that By Lemma 7.6, the set of uncorrupted samples is -good with respect to with probability at least We henceforth condition on this event.
Our algorithm iteratively applies the Filter-Balanced-Product procedure of Proposition 7.7 until it terminates returning a mean vector with We claim that we need at most iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets such that Thus, if the algorithm does not terminate in the first iterations, we have and in the next iteration we output the sample mean of . ∎
In this section, we describe the efficient procedure establishing Proposition 7.7 followed by its proof of correctness. Our algorithm Filter-Balanced-Product is very simple: We consider the empirical distribution defined by the (corrupted) sample multiset We calculate its mean vector and covariance matrix . If the matrix has no large eigenvalues, we return Otherwise, we use the eigenvector corresponding to the maximum magnitude eigenvalue of and the mean vector to define a filter. We zero out the diagonal elements of the covariance matrix for the following reason: The diagonal elements could contribute up to to the spectral norm, even without noise. This would prevent us from obtaining the desired error of Our efficient filtering procedure is presented in detailed pseudocode below.
Tightness of our Analysis. We remark that the analysis of our filter-based algorithm is tight, and more generally our bound of is a bottleneck for filter-based approaches.
More specifically, we note that our algorithm will never successfully add points back to after they have been removed by the adversary. Therefore, if an -fraction of the points in are changed, our algorithm may be able to remove these outliers from but will not be able to replace them with their original values. These changed values can alter the sample mean by as much as
The rest of this section is devoted to the proof of correctness of algorithm Filter-Balanced-Product.
1.2 Setup and Basic Structural Lemmas
By definition, there exist disjoint multisets of points in , where such that With this notation, we can write Our assumption is equivalent to and the definition of directly implies that Throughout the proof, we assume that is a sufficiently small constant. Our analysis will make essential use of the following matrices:
Our first claim follows from the Chernoff bound and the definition of a good set:
The following sequence of lemmata bound from above the spectral norms of the associated matrices. Our first simple lemma says that the (diagonally reduced) empirical covariance matrix , where is the set of uncorrupted samples drawn from the binary product distribution is a good approximation to the matrix in spectral norm.
If is -good,
It suffices to show that for all . Then, we have that
A similar expression holds for except with probabilities for Since is -good with respect to , we have . This completes the proof. ∎
As a simple consequence of the above lemma, we obtain the following:
If is -good,
First note that we can write where and are obtained from and by zeroing out the diagonal. Observe that This follows from the assumption that and the definition of Now note that the matrices and are diagonal with entries at most and thus have spectral norm at most The claim now follows from Lemma 7.9. ∎
We have that
Note that for and otherwise. Therefore, is the difference of and the diagonal matrix with entries This in turn implies that
Note that both bounding matrices have spectral norm at most , hence so does ∎
The following lemma, bounding from above the spectral norm of is the main structural result of this section. This is the core result needed to establish that the subtractive error cannot change the sample mean by much:
We have that hence
Since , for any , we have that
The RHS is in turn bounded from above as follows:
where the second line follows from (35) and the third line follows from Claim 7.8. This establishes the first part of the lemma.
The bound follows from the previously established bound using the monotonicity of the function and the fact that The observation completes the proof of the second part of the lemma. ∎
Claim 7.10 combined with Lemmas 7.11 and 7.12 and the triangle inequality yield the following:
We have that
We are now ready to analyze the two cases of the algorithm Filter-Balanced-Product.
1.3 The Case of Small Spectral Norm
We start by analyzing the case where the mean vector is returned. This corresponds to the case that the spectral norm of is appropriately small, namely We start with the following simple claim:
Let be the mean vectors of and respectively. Then, and
We prove the first inequality, the proof of the second being identical. Note that is a symmetric matrix, so Moreover, for any vector we have that
Let and take We conclude that as desired. ∎
The following crucial lemma, bounding from above the distance as a function of and will be important for both this and the following subsections.
We have that
First we observe that the mean vector of the uncorrupted sample set is close to Since is -good, this follows from the fact that for any we have
Therefore, we get that
Consider and , the mean vectors of and , respectively. By definition, we have that
and thus by the triangle inequality we obtain
Therefore, we have the following sequence of inequalities:
where the first line follows from the triangle inequality, the second uses Claim 7.14, while the third uses Lemma 7.12 and Corollary 7.13. Finally, the last couple of lines assume that is sufficiently small. The proof of Lemma 7.15 is now complete. ∎
We can now deduce the correctness of Step 7 of the algorithm Filter-Balanced-Product, since for Lemma 7.15 directly implies that
1.4 The Case of Large Spectral Norm
We next show the correctness of the algorithm Filter-Balanced-Product if it returns a filter (rejecting an appropriate subset of ) in Step 8. This corresponds to the case that for a sufficiently large universal constant We will show that the multiset computed in Step 8 satisfies
We start by noting that, as a consequence of Lemma 7.15, we have the following:
We have that
By Lemma 7.15, we have that Recalling that if is sufficiently large, the term is at most ∎
By construction, is the unit eigenvector corresponding to the maximum magnitude eigenvalue of Thus, we have We thus obtain that
where the equality holds by definition, and the inequality follows from Corollary 7.13 and Claim 7.16 using the fact that is sufficiently small and the constant is sufficiently large (noting that the constant in the RHS of Corollary 7.13 does not depend on ).
We show that (36) implies the existence of a with the properties specified in Step 8 of the algorithm Filter-Balanced-Product. More specifically, we have the following crucial lemma:
If for a sufficiently large constant there exists a satisfying the property in Step 8 of the algorithm Filter-Balanced-Product, i.e., such that
Assume for the sake of contradiction that this is not the case, i.e., that for all we have that
Since for all , we have that This fact combined with (37) implies that for all
Using (36) and (38), we have the following sequence of inequalities:
This yields the desired contradiction recalling that the assumption and the definition of imply that for an appropriately large ∎
The following simple claim completes the proof of Proposition 7.7:
We have that
Recall that with and disjoint multisets such that We can similarly write with and Since
it suffices to show that Note that is the number of points rejected by the filter that lie in By Claim 7.8 and Claim 7.16, it follows that the fraction of elements that are removed to produce (i.e., satisfy ) is at most Hence, it holds that On the other hand, Step 8 of the algorithm ensures that the fraction of elements of that are rejected by the filter is at least Note that is the number of points rejected by the filter that lie in Therefore, we can write:
where the second line uses the fact that and the last line uses the fact that This completes the proof of the claim. ∎
2 Agnostically Learning Arbitrary Binary Product Distributions
In this subsection, we build on the approach of the previous subsection to show the following:
Similarly to the case of balanced product distributions, we will require a notion of a “good” set for our distribution. For technical reasons, the definition in this setting turns out to be more complicated. Roughly speaking, this is to allow us to ignore coordinates for which the small fraction of errors is sufficient to drastically change the sample mean.
We note that a sufficiently large set of samples from will satisfy the above properties with high probability:
If is obtained by taking independent samples from it is -good with respect to with probability at least
The proof of this lemma is deferred to Section D.
We will also require a notion of the number of coordinates on which non-trivially depends:
Similarly to the balanced case, our algorithm is obtained by repeated application of an efficient filter procedure, whose precise guarantee is described below.
Let be a binary product distribution in dimensions and Suppose that is an -good multiset with respect to with and be any multiset with There exists a polynomial time algorithm which, given and , returns one of the following:
A multiset of elements of such that there exists a product distribution with mean and a multiset that is -good with respect to such that
Our agnostic learning algorithm is then obtained by iterating this procedure. We can prove Theorem 7.19 given Proposition 7.23.
When , we will need to draw fresh -corrupted samples and repeat this procedure times, and then one of the resulting output distributions is within total variation distance with probability at least . Then we use the agnostic hypothesis selection procedure of Lemma 2.23. ∎
In this section, we describe and analyze the efficient routine establishing Proposition 7.23. Our efficient filtering procedure is presented in detailed pseudocode below.
This completes the description of the algorithm. We now proceed to prove correctness.
2.2 Chi-Squared Distance and Basic Reductions
As previously mentioned, our algorithm will use the -distance between the mean vectors as a proxy for the total variation distance between two binary product distributions. Since the mean vector of the target distribution is not known to us, we will not be able to use the symmetric definition of the -distance used in Lemma 2.17 We will instead require the following asymmetric version of the -distance:
The following fact follows directly from Lemma 2.17.
There are two problems with using the distance between the mean vectors as a proxy for the total variation distance. The first is that the -distance between the means is a very loose approximation of the total variational distance when the means are close to or in some coordinate. To circumvent this obstacle, we remove such coordinates via an appropriate pre-processing in Step 8. The second is that the above asymmetric notion of the -distance may be quite far from the symmetric definition. To overcome this issue, it suffices to have that and To ensure this condition is satisfied, we appropriately modify the target product distribution (that we aim to be close to). Next, we will show how we deal with these problems in detail.
The next reduction will be slightly more complicated and depends on the following idea: Suppose that there is a new product distribution with mean and an -good multiset for such that
Then, it suffices to show that our algorithm works for and instead of and (note that the input to the algorithm, and in the same in either case). This is because the conditions imposed by the output in this case would be strictly stronger. In particular, we may assume that for all :
There is a product distribution whose mean vector satisfies and for all , and a set that is -good for and satisfies
If all coordinates have and then we can take and
Suppose that the coordinate has Let be the product whose mean vector has and for Let be obtained by removing from all of the entries with in the -coordinate. Then, we claim that is -good for and has Note that here we have
First, we show that is -good for For any affine function and set with we need to show that
Let We may or may not have but, from the definition of ,
Moreover, note that and Thus, is -good for
Next, we show that We write We write for the subset of respectively, where the coordinate is Since is -good for we have that Recall that we are already assuming that Thus, Therefore, we have that On the other hand, we have that Thus, This means that However, and . This gives
Similarly, suppose that instead the -coordinate has Let be the product whose mean vector has and for Let be obtained by removing from all of the entries with in the -coordinate. Then, by a similar proof we have that is -good for and has Note that here we have
By an easy induction, we can set all coordinates with and to or respectively, giving an and such that is -good for and
In conclusion, throughout the rest of the proof we may and will assume that for all
and
2.3 Setup and Basic Structural Lemmas
As in the balanced case, we can write for disjoint multisets and Similarly, we define the following matrices:
Note that we no longer zero-out the diagonals of and . This will turn out to allow us to more naturally relate spectral properties of these matrices to the -distance between the means. We start with the following simple claim:
and and
where we used that and the assumption in the claim statement. For the second part of (i), note that
where the first inequality is Cauchy-Schwarz, and the second follows from the assumption in the claim statement that This proves (i).
To prove (ii), we note that Chebyshev’s inequality gives
where the second inequality follows from (i). To complete the proof note the inequality
where we used the triangle inequality and the second part of (i). ∎
Let denote the sample covariance matrix with respect to and denote the covariance matrix of We will need the following lemma:
and
For (i): Since is good, for any we have
Therefore, by the triangle inequality we get
where the second inequality uses the fact that
For (ii): Since is good, for any we have
Combined with the bound above, this gives
Note that Therefore,
Combining Claim 7.27 and Lemma 7.28 we obtain:
and and
where the second line uses Lemma 7.28 (ii), and the assumption and the third line holds for small enough Thus, using Claim 7.27 (i), we get that
By the Cauchy-Schwarz inequality and Lemma 7.28, we get
Part (ii) follows directly from Claim 7.27 (ii) and the assumption that is good for ∎
We have that .
We can show that for all , by expanding the LHS in terms of the differences of linear threshold functions on and in the same way as the proof of Lemma 7.28. Thus,
Finally note that , and so
We have that
This follows from Lemma 7.30 combined with the fact that and the observation ∎
We have that
The following crucial lemma bounds from above the contribution to the error from :
The spectral norm
where the first line uses the triangle inequality, the second line uses Claim 7.27 (i), the third line follows from the fact that , the fourth line uses Corollary 7.29 (i), and the last line uses the fact that is small enough. ∎
The above lemmata and the triangle inequality yield the following:
We have that
We are now ready to analyze the two cases of the algorithm Filter-Product.
2.4 The Case of Small Spectral Norm
We begin by bounding various -distances by the spectral norm of appropriate matrices.
Let be the mean vector of and , respectively. Then, and
We prove the first inequality, the proof of the second being very similar.
We can now prove that the output in Step 10 has the desired guarantee:
We have that
Since we have that Recalling that are disjoint, the latter implies that
First note that, by Lemma 7.28, Lemma 7.35 and Corollary 7.34 give that
For sufficiently small, we have that the terms satisfy
Recalling that , we now have:
Let . For some universal constant , if then Otherwise, we have
If is sufficiently large, when this is at most ∎
By Corollary 7.37, we have that Thus, by Corollary 7.25, the total variation distance between the product distributions with means and is ∎
2.5 The Case of Large Spectral Norm
We next need to show the correctness of the algorithm if it returns a filter, If we reach this step, then we have indeed and by Corollary 7.37, it follows that where
Since satisfies Thus, we can apply Corollary 7.29 to it.
Let . Firstly, we look at the expected number of samples we reject:
Next, we look at the expected number of false positive samples we reject. If we write for disjoint multisets and , then these are the elements of . We have:
Agnostically Learning Mixtures of Two Balanced Binary Products, via Filters
Unfortunately, bounds on the top absolute eigenvalue do not translate as well into bounds on the total variation distance of our estimate to the true distribution, as they did in all previous cases (e.g., if the top absolute eigenvalue is small in the case of learning the mean of a Gaussian with identity covariance, we can just use the empirical mean, etc). In fact, an eigenvalue could just mean that and differ by along the direction . However, we can proceed by zeroing out the diagonals. If has any large value along the diagonal, this operation can itself produce large eigenvalues. So, this strategy only works when is appropriately bounded. When is large, there is a separate strategy to deal with large entries in by guessing a coordinate whose value is large and conditioning on it, and once again setting up a modified eigenvalue problem. Our overall algorithm then follows from balancing all of these different cases, and we describe the technical components in more detail in the next subsection.
This section is devoted to the proof of the following theorem:
Recall that our overall approach is based on two strategies that succeed under different assumptions. Our first algorithm (Section 8.2) assumes that there exists a coordinate in which the means of the two component product distributions differ by a substantial amount. Under this assumption, we can use the empirical mean vectors conditioned on this coordinate being and . We show that the difference between these conditional mean vectors is almost parallel to the difference between the mean vectors of the product distributions. Considering eigenvectors perpendicular to this difference will prove a critical part of the analysis of this case. Our second algorithm (Section 8.3) succeeds under the assumption that the mean vectors of the two product distributions are close in all coordinates. This assumption allows us to zero out the diagonal of the covariance matrix without introducing too much error.
We note that together these algorithms will cover all possible cases. Our final algorithm runs all of these procedures in parallel, obtaining a polynomial number of candidate hypothesis distributions, such that at least one is sufficiently close to . We then run the tournament described by Lemma 2.23 in order to find a particular candidate that is sufficiently close to the target. To ensure that all the distributions returned are in some finite set , we round each of the probabilities of each of the products to the nearest multiple of , and similarly round the mixing weight to the nearest multiple of . This introduces at most additional error.
2 Mixtures of Products Whose Means Differ Significantly in One Coordinate
We will use the following notation. Let be a mixture of two -balanced binary product distributions. We will write as where are binary product distributions with mean vectors , and . In this subsection, we prove the following theorem:
For simplicity of the analysis, we will assume without loss of generality that unless otherwise specified. First, we determine some conditions under which our sample set will be sufficient. We start by recalling our condition of a good set for a balanced binary product distribution:
We will also need this to hold after conditioning on the last coordinate.
Finally, we define the notion of a good set for a mixture of two balanced products.
Let be a mixture of two binary product distributions. We say that a multiset of elements of is -good with respect to if we can write , where is -good with respect to , is -good with respect to , and .
We now show that taking random samples from produces such a set with high probability.
Let be a mixture of binary product distributions, where are binary product distributions with mean vectors . Let be a set obtained by taking independent samples from . Then, with probability at least , is -good with respect to for all .
The proof of this lemma is deferred to Section E.
We claim that given a good set with an -fraction of its entries corrupted, we can still determine from it. In particular, this is achieved by iterating the following proposition.
We note that iteratively applying this algorithm until it outputs a set of mixtures gives Theorem 8.2.
Notation. All vectors in this section should be assumed to be over the first coordinates only. We will write and for the first coordinates of and , but for other vectors we will use the similar notation to that used elsewhere to denote -dimensional vectors.
The algorithm, written in terms of instead of for generality, is as follows:
We now proceed to prove correctness. We note that given , we can write
where , and is disjoint from and . Thus, we have
We next need some basic lemmata relating the means of some of these distributions.
We next show that is approximately parallel to . Note that if we had and , then would be a multiple of . Since is -good, we can bound the error introduced by , and Lemma 8.8 allow us to bound the error in taking instead of . However, we still have terms in the conditional means of :
For some scalars , , , we have
where is the subset of with last entry , is the mean of with coordinate removed.
Let denote the subset of the appropriate set in which the last coordinate is . Let denote the means of , and with the last entry truncated, respectively.
Taking the means of the subsets of , we find that
Therefore, using this and Lemma 8.8, we have that
Since , we have:
Thus, the sum of the coefficients of the and terms in Equation (40) is , which is bounded in absolute value by . Meanwhile, the coefficient of Equation (40) has:
Noting that and , we can write Equation (40) as:
We write and so, dividing by and recalling that , we get
If , then we would be done. So, we must bound the error introduced by making this substitution. We can express as
Substituting this into Equation (41), gives the lemma. ∎
We now show that, for any vector perpendicular to , if the variance of in the -direction is small, then and are both approximately .
For any with , , we have that and for for a sufficiently large constant as defined in the algorithm.
Next, since , we have by Lemma 8.9 that
However, we have that , and so
Inserting our assumptions that , and gives
We can now show that if we return some distribution returned is close to . First, we show that there are points on close to and .
It is clear that even discretizing and , we can still find such a pair that satisfies this condition.
There are such that
Similarly, we show that there is a such that . ∎
Now, we are ready to analyze the second part of our algorithm. The basic idea will be to show that if is large, then a large fraction of the variance in the -direction is due to points in .
Since and are -good, we have that
Since is larger than a sufficiently large constant, this completes the proof. ∎
We next show that the threshold required by our algorithm exists.
If , there exists a such that
Assume for the sake of contradiction that this is not the case, i.e., that for all we have that
Using the fact that , this implies that for all
Finally, we show that is closer to than was.
If the algorithm returns then .
Since , we can write for , and , where has disjoint support from and . Thus, we need to show that
By Lemma 8.10, we have that and so
Since is -good, we have
We get the same bound for , and so
Since but any has , we have that
3 Mixtures of Products Whose Means Are Close in Every Coordinate
In this section, we prove the following theorem:
We will assume without loss of generality that We may also assume that since otherwise, we can make use of our algorithm for learning a single product distribution.
In this context, we require the following slightly different definition of a good set:
Let be a multiset in . We say that is -good for the mixture if there exists a partition such that and that and are -good for the component product distributions and , respectively.
If has mixing weights , with probability at least , a set of samples drawn from is good for .
Our theorem will follow from the following proposition:
Before we present the algorithm, we give one final piece of notation. For a set of points, we let be the sample covariance matrix of and be the sample covariance matrix with zeroed out diagonal. Our algorithm is presented in detailed pseudocode in Algorithm 16.
To analyze this algorithm, we begin with a few preliminaries. Firstly, we recall that . We can write , where , , and
Let and be the sample means of and , respectively.
We have that
We will require that the matrix is close to being PSD. The proof of this fact is rather technical and we defer it to the appendix.
Let be the multiset obtained from by replacing all points of with copies of and all points of with copies of . Then,
We are now prepared to show that the first return condition outputs a correct answer. We begin by showing that vectors with large inner products with or correspond to large eigenvectors of .
Using Lemma 8.22, we have . From the definition of it follows that
Next, we show that, if there is only one large eigenvalue of , the means in question are both close to a given line.
Next, we analyze the second case of the algorithm. We must show that Step 8 will find an and . First, we claim that there is a which makes nearly perpendicular to .
There exists a , with a multiple of , that has
Let . If , then suffices. Otherwise, we take Then, let be the nearest multiple of to . Note that and . Then, we have
We now need to show that for this , Step 8 will find a . For this , and are close. We need to show that contains many elements whose is far from these. We can express this in terms of :
Let be a unit vector in with Then, there is a such that
for sufficiently large and we also have that is positive.
Suppose for a contradiction that this lemma does not hold. Then, since , we have
Since we assumed that , this is a contradiction.
To get a similar result for , we first need to show that and are suitably concentrated about their means:
If , this is strictly less than
Since we have
Noting that , we have
for sufficiently small. If , this expression is . ∎
Now we can finally show that a exists for this , so Step 8 will succeed:
There is a such that
By Lemma 8.27, there exists a such that
Using the definition of , the points when or do not contribute to this probability so all points in that satisfy come from . Since and , we have
Noting that , all except a fraction of points have . So, . Therefore, .
Thus, by Lemma 8.28, we have . Again, using that , we have that
Consequently, if satisfies , then it satisfies . Substituting this condition into Equation (43) gives the lemma. ∎
Again we need to show that any filter does not remove too many points of . We need to show this for an arbitrary , not just one nearly parallel to .
For any unit vector and , we have
Every point with has for all with . Thus, for with , we have
When , we have
Thus, we have
But inequality (44) gives a bound on the number of in that do not satisfy this condition. That is,
Similarly, every point with has
Dividing by and noting that completes the proof. ∎
Now, we can show that the filter improves and such that the algorithm is correct in the filter case.
If we reach Step 9 and return , then
We can write , where has disjoint support from and . Note that, since we have , we can define these sets such that , and . We assume that we do. Now we have that
In Step 8, we found a vector and such that
Then in Step 9, we remove at least a fraction of points. That is,
The fact that allows us to use Lemma 8.30, with , yielding that:
since . ∎
Acknowledgements
J.L. would like to thank Michael B. Cohen and Samuel B. Hopkins for some very helpful discussions. We thank the reviewers for their detailed feedback, and Lili Su for pointing out an error in the proof of Lemma 5.3.
References
Appendix A Deferred Proofs from Section 4
This section contains deferred proofs of several concentration inequalities.
Therefore, by the triangle inequality, we have
Observe that the first term on the right hand side does not depend on the choice of . Let denote the event that
By Lemma 2.22, this happens with probability so long as
For any so that , let denote the event that
Fix any such . By multiplying both sides by , the event is equivalent to the event that
Let be as in Lemma 2.22. Observe that for sufficiently small. Then, by Lemma 2.22, we have that for any fixed ,
Let denote the binary entropy function. We now have
as claimed, where (a) follows by a union bound over all sets of size , (b) follows from the bound , (c) follows since as , (d) follows from our choice of , and (e) follows from our choice of . This completes the proof.
Proof of Theorem 4.12: We first recall Isserlis’ theorem, which we will require in this proof.
where the is over all matchings of .
Since is symmetric, it has a eigenvalue expansion , which immediately implies that . Let . We compute the quadratic form:
where the last line follows by invoking Isserlis’s theorem. We now manage both sums individually. We have
Proof of Corollary 4.9: Let denote the set of subsets of of size . The same Bernstein-style analysis as in the proof of Lemma 4.3 yields that there exist universal constants so that:
Thus, union bounding over all yields that
by the same manipulations as in the proof of Lemma 4.3.
This follows immediately from Lemmas 5.17 and 5.20.
Appendix B Deferred Proofs from Section 5
Proof of Lemma 5.3: Let be the number of samples drawn from . For (i), the probability that a coordinate of a sample is at least is at most by Fact 5.6. By a union bound, the probability that all coordinates of all samples are smaller than is at least . In this case, .
After translating by , we note that (iii) follows immediately from Lemma 2.21 and (iv) follows from Theorem 5.50 of [Ver10], as long as , with probability at least . It remains to show that, conditioned on (i), (ii) holds with probability at least .
To simplify some expressions, let and . We need to show that for all unit vectors and all that
Firstly, we show that for all unit vectors and
with probability at least . Since the VC-dimension of the set of all halfspaces is , this follows from the VC inequality [DL01], since we have more than samples. We thus only need to consider the case when .
For any fixed unit vector and , except with probability , we have that
Let be the event that . Since is sub-gaussian, Fact 5.6 yields that . Note that, thanks to our assumption on , we have that , and therefore .
where (a) follows from sub-gaussianity, (b) follows from our choice of , and (c) comes from the fact that for all .
Thus, if is a sufficiently small constant and is sufficiently large, this yields the desired bound. ∎
Now let be a -cover in Euclidean distance for the set of unit vectors of size . By a union bound, for all and a power of 2 between and , we have that
Then, by a union bound, (46) holds simultaneously for all unit vectors and all , with probability a least . This completes the proof.
B.2 Proof of Lemma 5.16
Let , where is orthogonal and is diagonal be an eigen-decomposition of the symmetric matrix . Then, . Let and . Then, and for independent Gaussians . Thus, follows a generalized -distribution.
Let , where are independent random variables distributed as . Let be the vector with coordinates . Then,
Noting that for , we have
Applying this for instead of and putting these together, we get
Substituting , and gives:
The final property is a consequence of the following anti-concentration inequality:
B.3 Proof of Lemma 5.17
Proof of Lemma 5.17: Firstly, we note that it suffices to prove this for the case , since for , is distributed as , and all the conditions transform to those for under this transformation.
Condition 1 follows by standard concentration bounds on . Condition 2 follows by estimating the entry-wise error between and . These two conditions hold by Lemma 5.3, since they follow from (i), (iii), and (iv) of goodness in the sense of Definition 5.2.
Condition 4 is substantially the most difficult of these conditions to prove. Naively, we would want to find a cover of all possible and all possible , and bound the probability that the desired condition fails. Unfortunately, the best a priori bound on are on the order of . As our cover would need to be of size or so, to make this work with , we would require on the order of samples in order to make this argument work.
However, we will note that this argument is sufficient to cover the case of .
Let be the set of even, mean-, degree- polynomials, such that the associated matrix satisfies:
Note that for that .
Importantly, any polynomial can be written in terms of these sets.
Let be the associated matrix to . Note that . Let be the matrix corresponding to the top eigenvalues of . We now let be the polynomial associated to , be associated to , be associated to , and so on. It is clear that . It is also clear that the matrix associated to has rank at most . If the matrix associated to had an eigenvalue more than , it would need to be the case that the largest eigenvalue of had size at least . This is impossible since the sum of the squares of the eigenvalues of is at most .
We will also need covers of each of these sets . We will assume that condition (1) holds, i.e., that , where . Under this condition, cannot be too large and this affects how small a variance polynomial we can ignore.
For each , there exists a set such that
For each there exists a such that .
We note that any such is associated to a matrix of the form , for and orthonormal. It suffices to let correspond to the matrix for with and for all . It is easy to let and range over covers of the interval and the sphere with appropriate errors. This gives a set of possible ’s of size as desired. Unfortunately, some of these will not be in as they will have eigenvalues that are too large. However, this is easily fixed by replacing each such by the closest element of . This completes our proof. ∎
We next will show that these covers are sufficient to express any polynomial.
Combining the above two lemmata we have that any such can be written as
where above is in and . Thus, has . This completes the proof. ∎
The key observation now is that if for , then writing as above, it must be the case that for some . Therefore, to prove our main result, it suffices to show that, with high probability over the choice of , for any and any for some , that . Equivalently, it suffices to show that for it holds . Note that this holds automatically for , as cannot possibly be that large for . Furthermore, note that losing a constant factor in the probability, it suffices to show this only for a power of .
Therefore, it suffices to show for every , every and every that with probability at least over the choice of we have that . However, by the Hanson-Wright inequality, we have that
Therefore, by Chernoff bounds, the probability that more than a -fraction of the elements of satisfy this property is at most
Appendix C Deferred Proofs from Section 6
Construct an auxiliary graph on vertices, where we put an edge between nodes and . By the above, there must be a path from to in this graph. Since this graph has nodes, there must be a path of length at most from to ; moreover, by the above, we know that this implies that .
Finally, the fourth property follows from the same argument as the proof of the third.
Proof of Lemma 6.9: Let . Let be the top eigenvector of
In particular, since , we must have
Let be an matrix with orthonormal columns, where the columns span the set of vectors . We note the rank of this set is at most due to the definition of .
Using the dual characterization of the Schatten top- norm, we have that
Observe that since , we have
Proof of Lemma 6.10: By Fact 4.2 and (34) we have . Now, by the triangle inequality, we have
Using the fact that variance is nonnegative we have
where in the last inequality we have used Fact 4.2 and (33). Hence altogether this implies that
Once more, let and expand the formula for :
Suppose that . Then .
are the weights that are uniform on the uncorrupted points. Because , we have that . Using (33), this implies that . By Corollary 6.11, .
Suppose that . Then .
We split into two cases. In the first case, . By Lemma 6.9, we have that
In the other case, . Recall that from (29). Write as follows:
Now taking the Schatten top- norm of , we have
The first inequality is the triangle inequality, the second is by (33) and Fact 4.2, the third is because the summed matrices are positive semidefinite, the fourth follows from Lemma 6.10, and the fifth holds for all sufficiently large. ∎
If , then
But this is true for sufficiently large, as . Therefore,
where the second inequality holds since .
On the other hand, consider when . By (47), we know that
The first and third terms are immediately dominated by , it remains to show that
Or equivalently, This follows since
Appendix D Deferred Proofs from Section 7
Proof of Lemma 7.21: By Lemma 7.6 applied with in place of since we have samples from with probability at least the set is such that for all affine functions it holds We henceforth condition on this event.
Let be the event that all coordinates in take their most common value. For a single coordinate the probability that it does not take its most common value, satisfies
Thus, by a union bound, we have that Let be the number of coordinates of in which do not have their most common value, and observe that is an affine function of Noting that for , we have that if and only if holds for it follows that Hence, we have that
Note that for , we have that if and only if and holds for Therefore, we can write
This completes the proof of Lemma 7.21.
Appendix E Deferred Proofs from Section 8
Proof of Lemma 8.6: Let be the set of samples drawn from and be the set of samples drawn from . Firstly, we note that by a Chernoff bound, with probability . Assuming this holds, it follows that . Similarly, . By Lemma 7.6 applied with in place of , since we have samples, with probability , the set is -good for , i.e., it satisfies that for all affine functions , We show that assuming is -good, it is good for each .
Note that and . Since , it follows that . For any affine function , define and . Then, we have the following:
So, we have that is good for for all with probability . Similarly, is good for for all with probability . Thus, we have that , is good for and is good for for all with probability . That is, is -good for for all with probability at least .
Proof of Lemma 8.19: Let be the set of samples drawn from and be the set of samples drawn from . Firstly, we note that by a Chernoff bound, with probability at least . Assuming this holds, . Similarly, .
By Lemma 7.6 applied with , since we have samples, with probability at least , the set is -good for . Similarly, with probability at least , the set is -good for . Thus, with probability , we have that and that and are -good for and respectively.
Noting that the mean of is and , we have:
Since and are product distributions, and can have large diagonal elements but small off-diagonal ones. On the other hand, we bound the elements on the diagonal of , but may still be large due to off-diagonal elements.
By the triangle inequality, and Equation (48) with zeroed diagonal, we have:
We will bound each of these terms separately.
Note that is a diagonal matrix and its non-zero entries are
Note that satisfies . Since we have Using that , , , we have
Since is -good for , it follows that and Also,
Finally, Thus, by the triangle inequality, we get
We have the following sequence of inequalities:
It remains to bound the terms in (49). To analyze the first of these terms, note that We have that
Note that since is good, the -th entry of has absolute value at most Thus,
Therefore, combining the above we have that
By the assumption on in Theorem 8.17, , and the proof is complete. ∎