Being Robust (in High Dimensions) Can Be Practical

Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, Alistair Stewart

Introduction

Robust statistics was founded in the seminal works of [Tuk60] and [Hub64]. The overarching motto is that any model (especially a parametric one) is only approximately valid, and that any estimator designed for a particular distribution that is to be used in practice must also be stable in the presence of model misspecification. The standard setup is to assume that the samples we are given come from a nice distribution, but that an adversary has the power to arbitrarily corrupt a constant fraction of the observed data. After several decades of work, the robust statistics community has discovered a myriad of estimators that are provably robust. An important feature of this line of work is that it can tolerate a constant fraction of corruptions independent of the dimension and that there are estimators for both the location (e.g., the mean) and scale (e.g., the covariance). See [HR09] and [HRRS86] for further background.

It turns out that there are vast gaps in our understanding of robustness, when computational considerations are taken into account. In one dimension, robustness and computational efficiency are in perfect harmony. The empirical mean and empirical variance are not robust, because a single corruption can arbitrarily bias these estimates, but alternatives such as the median and the interquartile range are straightforward to compute and are provably robust.

But in high dimensions, there is a striking tension between robustness and computational efficiency. Let us consider estimators for location. The Tukey median [Tuk60] is a natural generalization of the one-dimensional median to high-dimensions. It is known that it behaves well (i.e., it needs few samples) when estimating the mean for various symmetric distributions [DG92, CGR16]. However, it is hard to compute in general [JP78, AK95] and the many heuristics for computing it degrade badly in the quality of their approximation as the dimension scales [CEM+93, Cha04, MS10]. The same issues plague estimators for scale. The minimum volume ellipsoid [Rou85] is a natural generalization of the one-dimensional interquartile range and is provably robust in high-dimensions, but is also hard to compute. And once again, heuristics for computing it [VAR09, RS98] work poorly in high dimensions.

The fact that robustness in high dimensions seems to come at such a steep price has long been a point of consternation within robust statistics. In a 1997 retrospective on the development of robust statistics [Hub97], Huber laments:

“It is one thing to design a theoretical algorithm whose purpose is to prove [large fractions of corruptions can be tolerated] and quite another thing to design a practical version that can be used not merely on small, but also on medium sized regression problems, with a $2000$ by $50$ matrix or so. This last requirement would seem to exclude all of the recently proposed [techniques].”

The goal of this paper is to answer Huber’s call to action and design estimators for both the mean and covariance that are highly practical, provably robust, and work in high-dimensions. Such estimators make the promise of robust statistics – estimators that work in high-dimensions and guarantee that their output has not been heavily biased by some small set of noisy samples – much closer to a reality.

First, we make some remarks to dispel some common misconceptions. There has been a considerable amount of recent work on robust principal component analysis, much of it making use of semidefinite programming. Some of these works can tolerate a constant fraction of corruptions [CLMW11], however require that the locations of the corruptions are evenly spread throughout the dataset so that no individual sample is entirely corrupted. In contrast, the usual models in robust statistics are quite rigid in what they require and they do this for good reason. A common scenario that is used to motivate robust statistical methods is if two studies are mixed together, and one subpopulation does not fit the model. Then one wants estimators that work without assuming anything at all about these outliers.

There have also been semidefinite programming methods proposed for robust principal component analysis with outliers [XCS10]. These methods assume that the uncorrupted matrix is rank $r$ and that the fraction of outliers is at most $1/r$ , which again degrades badly as the rank of the matrix increases. Moreover, any method that uses semidefinite programming will have difficulty scaling to the sizes of the problems we consider here. For sake of comparison – even with state-of-the-art interior point methods – it is not currently feasible to solve the types of semidefinite programs that have been proposed when the matrices have dimension larger than a hundred.

Recent works in theoretical computer science have sought to circumvent the usual difficulties of designing efficient and robust algorithms by instead working in a generative model. The starting point for our paper is the work of [DKK+16] who gave an efficient algorithm for the problem of agnostically learning a Gaussian:

Total variation distance is the natural metric to use to measure closeness of the parameters, since a $(1-\varepsilon)$ -fraction of the observed samples came from a Gaussian. [DKK+16] gave an algorithm for the above problem (note that the guarantees are dimension independent), whose running time and sample complexity are polynomial in the dimension $d$ and $1/\varepsilon$ . [LRV16] independently gave an algorithm for the unknown mean case that achieves $d_{TV}(\mathcal{N},\mathcal{N}^{\prime})\leq\widetilde{O}(\varepsilon\sqrt{\log d})$ , and in the unknown covariance case achieves guarantees in a weaker metric that is not affine invariant. A crucial feature is that both algorithms work even when the moments of the underlying distribution satisfy certain conditions, and thus are not necessarily brittle to the modeling assumption that the inliers come from a Gaussian distribution.

A more conceptual way to view such work is as a proof-of-concept that the Tukey median and minimum volume ellipsoid can be computed efficiently in a natural family of distributional models. This follows because not only would these be good estimates for the mean and covariance in the above model, but in fact any estimates that are good must also be close to them. Thus, these works fit into the emerging research direction of circumventing worst-case lower bounds by going beyond worst-case analysis.

Since the dissemination of the aforementioned works [DKK+16, LRV16], there has been a flurry of research activity on computationally efficient robust estimation in a variety of high-dimensional settings [DKS16, DKS17, CSV17, DKK+17, Li17, DBS17, BDLS17, SCV18, DKK+18], including studying graphical distributional models [DKS16], understanding the computation-robustness tradeoff for statistical query algorithms [DKS17], tolerating much more noise by allowing the algorithm to output a list of candidate hypotheses [CSV17], and developing robust algorithms under sparsity assumptions [Li17, DBS17, BDLS17], where the number of samples is sublinear in the dimension.

2 Our Results

Our goal in this work is to show that high-dimensional robust estimation can be highly practical. However, there are two major obstacles to achieving this. First, the sample complexity and running time of the algorithms in [DKK+16] is prohibitively large for high-dimensional applications. We just would not be able to store as many samples as we would need, in order to compute accurate estimates, in high-dimensional applications.

Our first main contribution is to show essentially tight bounds on the sample complexity of the filtering based algorithm of [DKK+16]. Roughly speaking, we accomplish this with a new definition of the good set which plugs into the existing analysis in a straightforward manner and shows that it is possible to estimate the mean with $\widetilde{O}(d/\varepsilon^{2})$ samples (when the covariance is known) and the covariance with $\widetilde{O}(d^{2}/\varepsilon^{2})$ samples. Both of these bounds are information-theoretically optimal, up to logarithmic factors.

Our second main contribution is to vastly improve the fraction of adversarial corruptions that can be tolerated in applications. The fraction of errors that the algorithms of [DKK+16] can tolerate is indeed a constant that is independent of the dimension, but it is very small both in theory and in practice. This is due to the fact that many of the steps in the algorithm are overly conservative. In fact, we found that a naive implementation of the algorithm did not remove any outliers in many realistic scenarios. We combat this by giving new ways to empirically tune the threshold for where to remove points from the sample set. These optimizations dramatically improve the empirical performance.

Finally, we show that the same bounds on the error guarantee continue to work even when the underlying distribution is sub-Gaussian. This theoretically confirms that the robustness guarantees of such algorithms are in fact not overly brittle to the distributional assumptions. In fact, the filtering algorithm of [DKK+16] is easily shown to be robust under much weaker distributional assumptions, while retaining near-optimal sample and error guarantees. As an example, we show that it yields a near sample-optimal efficient estimator for robustly estimating the mean of a distribution, under the assumption that its covariance is bounded. Even in this regime, the filtering algorithm guarantees optimal error, up to a constant factor. Furthermore we empirically corroborate this finding by showing that the algorithm works well on real world data, as we describe below.

Now we come to the task of testing out our algorithms. To the best of our knowledge, there have been no experimental evaluations of the performance of the myriad of approaches to robust estimation. It remains mostly a mystery which ones perform well in high-dimensions, and which do not. To test out our algorithms, we design a synthetic experiment where a $(1-\varepsilon)$ -fraction of the samples come from a Gaussian and the rest are noise and sampled from another distribution (in many cases, Bernoulli). This gives us a baseline to compare how well various algorithms recover $\mu$ and $\Sigma$ , and how their performance degrades based on the dimension. Our plots show a predictable and yet striking phenomenon: All earlier approaches have error rates that scale polynomially with the dimension and ours is a constant that is almost indistinguishable from the error that comes from sample noise alone. Moreover, our algorithms are able to scale to hundreds of dimensions.

But are algorithms for agnostically learning a Gaussian unduly sensitive to the distributional assumptions they make? We are able to give an intriguing visual demonstration of our techniques on real data. The famous study of [NJB+08] showed that performing principal component analysis on a matrix of genetic data recovers a map of Europe. More precisely, the top two singular vectors define a projection into the plane and when the groups of individuals are color-coded with where they are from, we recover familiar country boundaries that corresponds to the map of Europe. The conclusion from their study was that genes mirror geography. Given that one of the most important applications of robust estimation ought to be in exploratory data analysis, we ask: To what extent can we recover the map of Europe in the presence of noise? We show that when a small number of corrupted samples are added to the dataset, the picture becomes entirely distorted (and this continues to hold even for many other methods that have been proposed). In contrast, when we run our algorithm, we are able to once again recover the map of Europe. Thus, even when some fraction of the data has been corrupted (e.g., medical studies were pooled together even though the subpopulations studied were different), it is still possible to perform principal component analysis and recover qualitatively similar conclusions as if there were no noise at all!

Formal Framework

Notation. For a vector $v$ , we will let $\|v\|_{2}$ denote its Euclidean norm. If $M$ is a matrix, we will let $\|M\|_{2}$ denote its spectral norm and $\|M\|_{F}$ denote its Frobenius norm. We will write $X\in_{u}S$ to denote that $X$ is drawn from the empirical distribution defined by $S$ .

Robust Estimation. We consider the following powerful model of robust estimation that generalizes many other existing models, including Huber’s contamination model:

Given $\varepsilon>0$ and a distribution family $\mathcal{D}$ , the adversary operates as follows: The algorithm specifies some number of samples $m$ . The adversary generates $m$ samples $X_{1},X_{2},\ldots,X_{m}$ from some (unknown) $D\in\mathcal{D}$ . It then draws $m^{\prime}$ from an appropriate distribution. This distribution is allowed to depend on $X_{1},X_{2},\ldots,X_{m}$ , but when marginalized over the $m$ samples satisfies $m^{\prime}\sim\mbox{Bin}(\varepsilon,m)$ . The adversary is allowed to inspect the samples, removes $m^{\prime}$ of them, and replaces them with arbitrary points. The set of $m$ points is then given to the algorithm.

In summary, the adversary is allowed to inspect the samples before corrupting them, both by adding corrupted points and deleting uncorrupted points. In contrast, in Huber’s model the adversary is oblivious to the samples and is only allowed to add corrupted points.

We remark that there are no computational restrictions on the adversary. The goal is to return the parameters of a distribution $\widehat{D}$ in $\mathcal{D}$ that are close to the true parameters in an appropriate metric. For the case of the mean, our metric will be the Euclidean distance. For the covariance, we will use the Mahalanobis distance, i.e., $\|\Sigma^{-1/2}{\widehat{\Sigma}}\Sigma^{-1/2}-I\|_{F}$ . This is a strong affine invariant distance that implies corresponding bounds in total variation distance.

We say that a set of samples is $\varepsilon$ -corrupted if it is generated by the process described in Definition 2.1.

Nearly Sample-Optimal Efficient Robust Learning

In this section, we present near sample-optimal efficient robust estimators for the mean and the covariance of high-dimensional distributions under various structural assumptions of varying strength. Our estimators rely on the filtering technique introduced in [DKK+16].

We note that [DKK+16] gave two algorithmic techniques: the first one was a spectral technique to iteratively remove outliers from the dataset (filtering), and the second one was a soft-outlier removal method relying on convex programming. The filtering technique seemed amenable to practical implementation (as it only uses simple eigenvalue computations), but the corresponding sample complexity bounds given in [DKK+16] are polynomially worse than the information-theoretic minimum. On the other hand, the convex programming technique of [DKK+16] achieved better sample complexity bounds (e.g., near sample-optimal for robust mean estimation), but relied on the ellipsoid method, which seemed to preclude a practically efficient implementation.

In this work, we achieve the best of both worlds: we provide a more careful analysis of the filter technique that yields sample-optimal bounds (up to logarithmic factors) for both the mean and the covariance. Moreover, we show that the filtering technique easily extends to much weaker distributional assumptions (e.g., under bounded second moments). Roughly speaking, the filtering technique follows a general iterative recipe: (1) via spectral methods, find some univariate test which is violated by the corrupted points, (2) find some concrete tail bound violated by the corrupted set of points, and (3) throw away all points which violate this tail bound.

[DKK+16] gave algorithms for robustly estimating the mean of a Gaussian distribution with known covariance and for robustly estimating the mean of a binary product distribution. The main motivation for considering these specific distribution families is that robustly estimating the mean within Euclidean distance immediately implies total variation distance bounds for these families. The above theorem establishes that these guarantees hold in a more general setting with near sample-optimal bounds. Under a bounded second moment assumption, we show:

A similar result on mean estimation under bounded second moments was concurrently shown in [SCV18]. The sample size above is optimal, up to a logarithmic factor, and the error guarantee is easily seen to be the best possible up to a constant factor. The main difference between the filtering algorithm establishing the above theorem and the filtering algorithm for the sub-gaussian case is how we choose the threshold for the filter. Instead of looking for a violation of a concentration inequality, here we will choose a threshold at random. In this case, randomly choosing a threshold weighted towards higher thresholds suffices to throw out more corrupted samples than uncorrupted samples in expectation. Although it is possible to reject many good samples this way, we show that the algorithm still only rejects a total of $O(\varepsilon)$ samples with high probability.

Finally, for robustly estimating the covariance of a Gaussian distribution, we have:

Let $G\sim\operatorname{\mathcal{N}}(0,\Sigma)$ be a Gaussian in $d$ dimensions, and let $\varepsilon>0$ . Let $S$ be an $\varepsilon$ -corrupted set of samples from $G$ of size $\Omega((d^{2}/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon))$ . There exists an efficient algorithm that, given $S$ and $\varepsilon$ , returns the parameters of a Gaussian distribution $G^{\prime}\sim\operatorname{\mathcal{N}}(0,{\widehat{\Sigma}})$ so that with probability at least $9/10$ , it holds $\|I-\Sigma^{-1/2}{\widehat{\Sigma}}\Sigma^{-1/2}\|_{F}=O(\varepsilon\log(1/\varepsilon)).$

We now provide a high-level description of the main ingredient which yields these improved sample complexity bounds. The initial analysis of [DKK+16] established sample complexity bounds which were sub-optimal by polynomial factors because it insisted that the set of good samples (i.e., before the corruption) satisfied very tight tail bounds. To some degree such bounds are necessary, as when we perform our filtering procedure, we need to ensure that not too many good samples are thrown away. However, the old analysis required that fairly strong tail bounds hold uniformly. The idea for the improvement is as follows: If the errors are sufficient to cause the variance of some polynomial $p$ (linear in the unknown mean case or quadratic in the unknown covariance case) to increase by more than $\varepsilon$ , it must be the case that for some $T$ , roughly an $\varepsilon/T^{2}$ fraction of samples are error points with $|p(x)|>T$ . As long as we can ensure that less than an $\varepsilon/T^{2}$ fraction of our good sample points have $|p(x)|>T$ , this will suffice for our filtering procedure to work. For small values of $T$ , these are much weaker tail bounds than were needed previously and can be achieved with a smaller number of samples. For large values of $T$ , these tail bounds are comparable to those used in previous work [DKK+16] , but in such cases we can take advantage of the fact that $|p(G)|>T$ only with very small probability, again allowing us to reduce the sample complexity. The details are deferred to Appendix A.

Filtering

We now describe the filtering technique more rigorously. We also describe some additional heuristics we found useful in practice.

We first consider mean estimation. The algorithms which achieve Theorems 3.1 and 3.2 both follow the general recipe in Algorithm 1. We must specify three parameter functions:

$\delta(\varepsilon,s)$ is a slack function, which we require for technical reasons.

where $\nu$ is the subgaussian parameter. See Section A.1 for details.

2 Robust Covariance Estimation

Our algorithm for robust covariance follows the exact recipe outlined above, with one key difference—we check for deviations in the empirical fourth moment tensor. Intuitively, just as in the robust mean setting, we used degree- $2$ information to detect outliers for the mean (the degree- $1$ moment), here we use degree- $4$ information to detect outliers for the covariance (the degree- $2$ moment).

3 Better Univariate Tests

In the algorithms described above for robust mean estimation, after projecting onto one dimension, we center the points at the empirical mean along this direction. This is theoretically sufficient, however, introduces additional constant factors since the empirical mean along this direction may be corrupted. Instead, one can use a robust estimate for the mean in one direction. Namely, it is well known that the median is a provably robust estimator for the mean for symmetric distributions [HR09, HRRS86], and under certain models it is in fact optimal in terms of its resilience to noise [DKW56, Mas90, Che98, DK14, DKK+17]. By centering the points at the median instead of the mean, we are able to achieve better error in practice.

4 Adaptive Tail Bounding

We found that depending on the setting, it was useful to change the constant $C_{2}$ . In particular, in low dimensions, we could be more stringent, and enforce a stronger tail bound (which corresponds to a higher $C_{2}$ ), but in higher dimensions, we must be more lax with the tail bound. To do this in a principled manner, we introduced a heuristic we call adaptive tail bounding. Our goal is to find a choice of $C_{2}$ which throws away roughly an $\varepsilon$ -fraction of points. The heuristic is fairly simple: we start with some initial guess for $C_{2}$ . We then run our filter with this $C_{2}$ . If we throw away too many data points, we increase our $C_{2}$ , and retry. If we throw away too few, then we decrease our $C_{2}$ and retry. Since increasing $C_{2}$ strictly decreases the number of points thrown away, and vice versa, we binary search over our choice of $C_{2}$ until we reach something close to our target accuracy. In our current implementation, we stop when the fraction of points we throw away is between $\varepsilon/2$ and $3\varepsilon/2$ , or if we’ve binary searched for too long. We found that this heuristic drastically improves our accuracy, and allows our algorithm to scale fairly smoothly from low to high dimension.

Experiments

We performed an empirical evaluation of the above algorithms on synthetic and real data sets with and without synthetic noise. All experiments were done on a laptop computer with a 2.7 GHz Intel Core i5 CPU and 8 GB of RAM. The focus of this evaluation was on statistical accuracy, not time efficiency. In this measure, our algorithm performs the best of all algorithms we tried. In all synthetic trials, our algorithm consistently had the smallest error. In fact, in some of the synthetic benchmarks, our error was orders of magnitude better than any other algorithms. In the semi-synthetic benchmark, our algorithm also (arguably) performs the best, though there is no way to tell for sure, since there is no ground truth. We also note that despite not optimizing our code for runtime, the runtime of our algorithm is always comparable, and in many cases, better than the alternatives which provided comparable error. Code of our implementation is available at https://github.com/hoonose/robust-filter.

Experiments with synthetic data allow us to verify the error guarantees and the sample complexity rates proven in Section 3 for unknown mean and unknown covariance. In both cases, the experiments validate the accuracy and usefulness of our algorithm, almost exactly matching the best rate without noise.

We tried various noise distributions, and found that the same qualitative pattern arose for all of them. In the reported experiment, our noise distribution was a mixture of two binary product distributions, where one had a couple of large coordinates (see Section B.1 for a detailed description). For all (nontrivial) error distributions we tried, we observed that indeed the empirical mean, pruning, geometric median, and RANSAC all have error which diverges as $d$ grows, as the theory predicts. On the other hand, both our algorithm and LRVMean have markedly smaller error as a function of dimension. Indeed, our algorithm’s error is almost identical to that of the empirical mean of the uncorrupted sample points.

On this corrupted data, we compared the performance of our Filter algorithm to that of (1) the empirical covariance of all the points, (2) a trivial pruning procedure, (3) a RANSAC-based minimal volume ellipsoid (MVE) algorithm, and (5) a recently proposed robust estimator for the covariance due to [LRV16], which we will call LRVCov. For (5), we again obtained the implementation from their Github repository.

We tried various choices of $\Sigma$ and noise distribution. Figure 2 shows two choices of $\Sigma$ and noise. Again, the x-axis indicates the dimension of the experiment and the y-axis indicates the estimator’s excess Mahalanobis error over the sampling error. In the left figure, we set $\Sigma=I$ , and our noise points are simply all located at the all-zeros vector. In the right figure, we set $\Sigma=I+10e_{1}e_{1}^{T}$ , where $e_{1}$ is the first basis vector, and our noise distribution is a somewhat more complicated distribution, which is similarly spiked, but in a different, random, direction. We formally define this distribution in Section B.1. For all choices of $\Sigma$ and noise we tried, the qualitative behavior of our algorithm and LRVCov was unchanged. Namely, we seem to match the empirical error without noise up to a very small slack, for all dimensions. On the other hand, the performance of empirical mean, pruning, and RANSAC varies widely with the noise distribution. The performance of all these algorithms degrades substantially with dimension, and their error gets worse as we increase the skew of the underlying data. The performance of LRVCov is the most similar to ours, but again is worse by a large constant factor. In particular, our excess risk was on the order of $10^{-4}$ for large $d$ , for both experiments, whereas the excess risk achieved by LRVCov was in all cases a constant between $0.1$ and $2$ .

These experiments demonstrate that our statistical guarantees are in fact quite strong. In particular, since our excess error is almost zero (and orders of magnitude smaller than other approaches), this suggests that our sample complexity is indeed close to optimal, since we match the rate without noise, and that the constants and logarithmic factors in the theoretical recovery guarantee are often small or non-existent.

2 Semi-synthetic Data

To demonstrate the efficacy of our method on real data, we revisit the famous study of [NJB+08]. In this study, the authors investigated data collected as part of the Population Reference Sample (POPRES) project. This dataset consists of the genotyping of thousands of individuals using the Affymetrix 500K single nucleotide polymorphism (SNP) chip. The authors pruned the dataset to obtain the genetic data of over 1387 European individuals, annotated by their country of origin. Using principal components analysis, they produce a two-dimensional summary of the genetic variation, which bears a striking resemblance to the map of Europe.

Our experimental setup is as follows. While the original dataset is very high dimensional, we use a 20 dimensional version of the dataset as found in the authors’ GitHubhttps://github.com/NovembreLab/Novembre_etal_2008_misc. We first randomly rotate the data, as then 20 dimensional data was diagonalized, and the high dimensional data does not follow such structure. We then add an additional $\frac{\varepsilon}{1-\varepsilon}$ fraction of points (so that they make up an $\varepsilon$ -fraction of the final points). These added points were discrete points, following a simple product distribution (see Section B.1 for full details). We used a number of methods to obtain a covariance matrix for this dataset, and we projected the data onto the top two singular vectors of this matrix. In Figure 3, we show the results when we compare our techniques to pruning. In particular, our output was able to more or less reproduce the map of Europe, whereas pruning fails to. In Section B.2, we also compare our result with a number of other techniques, including those we tested against in the unknown covariance experiments, and other robust PCA techniques. The only alternative algorithm which was able to produce meaningful output was LRVCov, which produced output that was similar to ours, but which produced a map which was somewhat more skewed. We believe that our algorithm produces the best picture.

In Figure 3, we also display the actual points which were output by our algorithm’s Filter. While it manages to remove most of the noise points, it also seems to remove some of the true data points, particularly those from Eastern Europe and Turkey. We attribute this to a lack of samples from these regions, and thus one could consider them as outliers to a dataset consisting of Western European individuals. For instance, Turkey had 4 data points, so it seems quite reasonable that any robust algorithm would naturally consider these points outliers.

We view our experiments as a proof of concept demonstration that our techniques can be useful in real world exploratory data analysis tasks, particularly those in high-dimensions. Our experiments reveal that a minimal amount of noise can completely disrupt a data analyst’s ability to notice an interesting phenomenon, thus limiting us to only very well-curated data sets. But with robust methods, this noise does not interfere with scientific discovery, and we can still recover interesting patterns which otherwise would have been obscured by noise.

Acknowledgments

We would like to thank Simon Du and Lili Su for helpful comments on a previous version of this work.

References

Appendix A Omitted Details from Section 3

In this section, we use our filter technique to give a near sample-optimal computationally efficient algorithm to robustly estimate the mean of a sub-gaussian density with a known covariance matrix, thus proving Theorem 3.1.

We emphasize that the algorithm and its analysis is essentially identical to the filtering algorithm given in Section 8.1 of [DKK+16] for the case of a Gaussian $\mathcal{N}(\mu,I)$ . The only difference is a weaker definition of the “good set of samples” (Definition A.4) and a simple concentration argument (Lemma A.5) showing that a random set of uncorrupted samples of the appropriate size is good with high probability. Given these, the analysis of this subsection follows straightforwardly from the analysis in Section 8.1 of [DKK+16] by plugging in the modified parameters. For the sake of completeness, we provide the details below.

We start by formally defining sub-gaussian distributions:

We will use the following simple fact about the concentration of sub-gaussian random variables:

The following theorem is a high probability version of Theorem 3.1:

Notation. We will denote $\mu^{S}=\frac{1}{|S|}\sum_{X\in S}X$ and $M_{S}=\frac{1}{|S|}\sum_{X\in S}(X-\mu^{G})(X-\mu^{G})^{T}$ for the sample mean and modified sample covariance matrix of the set $S$ .

We start by defining our modified notion of good sample, i.e, a set of conditions on the uncorrupted set of samples under which our algorithm will succeed.

For all $x\in S$ we have $\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log(|S|/\tau)})$ .

We have that $\|\mu^{S}-\mu^{G}\|_{2}\leq\varepsilon.$

We have that $\left\|M_{S}-I\right\|_{2}\leq{\varepsilon}.$

We show in the following subsection that a sufficiently large set of independent samples from $G$ is $(\varepsilon,\tau)$ -good (with respect to $G$ ) with high probability. Specifically, we prove:

Let $G$ be sub-gaussian distribution with parameter $\nu=\Theta(1)$ and with identity covariance, and $\varepsilon,\tau>0.$ If the multiset $S$ is obtained by taking $\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau))$ independent samples from $G,$ it is $(\varepsilon,\tau)$ -good with respect to $G$ with probability at least $1-\tau.$

We require the following definition that quantifies the extent to which a multiset has been corrupted:

Given finite multisets $S$ and $S^{\prime}$ we let $\Delta(S,S^{\prime})$ be the size of the symmetric difference of $S$ and $S^{\prime}$ divided by the cardinality of $S.$

The starting point of our algorithm will be a simple NaivePrune routine (Section 4.3.1 of [DKK+16]) that removes obvious outliers, i.e., points which are far from the mean. Then, we iterate the algorithm whose performance guarantee is given by the following:

A mean vector ${\widehat{\mu}}$ such that $\|{\widehat{\mu}}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$

We start by showing how Theorem A.3 follows easily from Proposition A.7.

By the definition of $\Delta(S,S^{\prime}),$ since $S^{\prime}$ has been obtained from $S$ by corrupting an $\varepsilon$ -fraction of the points in $S,$ we have that $\Delta(S,S^{\prime})\leq 2\varepsilon.$ By Lemma A.5, the set $S$ of uncorrupted samples is $(\varepsilon,\tau)$ -good with respect to $G$ with probability at least $1-\tau.$ We henceforth condition on this event.

Since $S$ is $(\varepsilon,\tau)$ -good, all $x\in S$ have $\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log|S|/\tau})$ . Thus, the NaivePrune procedure does not remove from $S^{\prime}$ any member of $S$ . Hence, its output, $S^{\prime\prime}$ , has $\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime})$ and for any $x\in S^{\prime\prime}$ , there is a $y\in S$ with $\|x-y\|_{2}\leq O(\sqrt{d\log|S|/\tau})$ . By the triangle inequality, for any $x,z\in S^{\prime\prime}$ , $\|x-z\|_{2}\leq O(\sqrt{d\log|S|/\tau})=O(\sqrt{d\log(d/\varepsilon\tau}))$ .

Then, we iteratively apply the Filter-Sub-Gaussian-Unknown-Mean procedure of Proposition A.7 until it terminates returning a mean vector $\mu$ with $\|{\widehat{\mu}}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).$ We claim that we need at most $O(\alpha)$ iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets $S_{i}^{\prime},$ so that $\Delta(S,S_{i}^{\prime})\leq\Delta(S,S^{\prime})-i\cdot\varepsilon/{\alpha}.$ Thus, if we do not output the empirical mean in the first $2\alpha$ iterations, in the next iteration there are no outliers left and the algorithm terminates outputting the sample mean of the remaining set. ∎

In this subsection, we describe the efficient algorithm establishing Proposition A.7 and prove its correctness. Our algorithm calculates the empirical mean vector $\mu^{S^{\prime}}$ and empirical covariance matrix $\Sigma$ . If the matrix $\Sigma$ has no large eigenvalues, it returns $\mu^{S^{\prime}}.$ Otherwise, it uses the eigenvector $v^{\ast}$ corresponding to the maximum magnitude eigenvalue of $\Sigma$ and the mean vector $\mu^{S^{\prime}}$ to define a filter. Our efficient filtering procedure is presented in detailed pseudocode below.

A.1.2 Proof of Correctness of Filter-Sub-Gaussian-Unknown-Mean

We define $\mu^{G},\mu^{S},\mu^{S^{\prime}},\mu^{L},$ and $\mu^{E}$ to be the means of $G,S,S^{\prime},L,$ and $E$ , respectively.

Our analysis will make essential use of the following matrices:

Our analysis will hinge on proving the important claim that $\Sigma-I$ is approximately $(|E|/|S^{\prime}|)M_{E}$ . This means two things for us. First, it means that if the positive errors align in some direction (causing $M_{E}$ to have a large eigenvalue), there will be a large eigenvalue in $\Sigma-I$ . Second, it says that any large eigenvalue of $\Sigma-I$ will correspond to an eigenvalue of $M_{E}$ , which will give an explicit direction in which many error points are far from the empirical mean.

Useful Structural Lemmas. We begin by noting that we have concentration bounds on $G$ and therefore, on $S$ due to its goodness.

The first line is Fact A.2, and the former follows from it using the goodness of $S$ . ∎

By using the above fact, we obtain the following simple claim:

This follows from Fact A.8 upon noting that $|w\cdot(X-\mu^{S^{\prime}})|>T+\|\mu^{S^{\prime}}-\mu^{G}\|_{2}$ only if $|w\cdot(X-\mu^{G})|>T$ . ∎

We can use the above facts to prove concentration bounds for $L$ . In particular, we have the following lemma:

We have that $\|M_{L}\|_{2}={O\left(\log(|S|/|L|)+\varepsilon|S|/|L|\right)}.$

For unit vectors $v$ , the RHS is bounded from above as follows:

where the second line follows from the fact that $\|v\|_{2}=1$ , $L\subset S$ , and $S$ satisfies condition (i) of Definition A.4; the third line follows from (1); and the fourth line follows from Fact A.8. ∎

As a corollary, we can relate the matrices $M_{S^{\prime}}$ and $M_{E}$ , in spectral norm:

We have that $M_{S^{\prime}}-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon))$ , where the $O(\varepsilon\log(1/\varepsilon))$ term denotes a matrix of spectral norm $O(\varepsilon\log(1/\varepsilon))$ .

By definition, we have that $|S^{\prime}|M_{S^{\prime}}=|S|M_{S}-|L|M_{L}+|E|M_{E}$ . Thus, we can write

where the second line uses the fact that $1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon$ , the goodness of $S$ (condition (iv) in Definition A.4), and Lemma A.10. Specifically, Lemma A.10 implies that $(|L|/|S^{\prime}|)\|M_{L}\|_{2}=O(\varepsilon\log(1/\varepsilon))$ . Therefore, we have that

We now establish a similarly useful bound on the difference between the mean vectors:

Since $S$ is a good set, by condition (iii) of Definition A.4, we have $\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon)$ . Since $1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon$ , it follows that $(|S|/|S^{\prime}|)\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon)$ . Using the valid inequality $\|M_{L}\|_{2}\geq\|\mu^{L}-\mu^{G}\|_{2}^{2}$ and Lemma A.10, we obtain that $\|\mu^{L}-\mu^{G}\|_{2}\leq O\left(\sqrt{\log(|S|/|L|)}+\sqrt{\varepsilon|S|/|L|}\right)$ . Therefore,

as desired. This completes the proof of the lemma. ∎

By combining the above, we can conclude that $\Sigma-I$ is approximately proportional to $M_{E}$ . More formally, we obtain the following corollary:

We have $\Sigma-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon))+O(|E|/|S^{\prime}|)^{2}\|M_{E}\|_{2}$ , where the additive terms denote matrices of appropriately bounded spectral norm.

By definition, we can write $\Sigma-I=M_{S^{\prime}}-I-(\mu^{S^{\prime}}-\mu^{G})(\mu^{S^{\prime}}-\mu^{G})^{T}.$ Using Corollary A.11 and Lemma A.12, we obtain:

where the second line follows from the valid inequality $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}$ . This completes the proof. ∎

On the other hand, since $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}$ , Lemma A.12 gives that

Case of Large Spectral Norm. We next show the correctness of the algorithm when it returns a filter in Step 7.

Moreover, using the inequality $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}$ and Lemma A.12 as above, we get that

Suppose for the sake of contradiction that for all $T>0$ we have that

Using (3), we obtain that for all $T>0$ we have that

We now have the following sequence of inequalities:

Combined with (2), we obtain $\lambda^{\ast}=O(\varepsilon\log(1/\varepsilon))$ , which is a contradiction if $C$ is sufficiently large. Therefore, it must be the case that for some value of $T$ the condition in Step 7 is satisfied.

Recall that $S^{\prime}=(S\setminus L)\cup E,$ with $E$ and $L$ disjoint multisets such that $L\subset S.$ We can similarly write $S^{\prime\prime}=(S\setminus L^{\prime})\cup E^{\prime},$ with $L^{\prime}\supseteq L$ and $E^{\prime}\subset E.$ Since

it suffices to show that $|E\setminus E^{\prime}|\geq|L^{\prime}\setminus L|+\varepsilon|S|/{\alpha}.$ Note that $|L^{\prime}\setminus L|$ is the number of points rejected by the filter that lie in $S\cap S^{\prime}.$ Note that the fraction of elements of $S$ that are removed to produce $S^{\prime\prime}$ (i.e., satisfy $|v^{\ast}\cdot(x-\mu^{S^{\prime}})|>T+\delta$ ) is at most $2\exp(-T^{2}/2\nu)+\varepsilon/{\alpha}$ . This follows from Claim A.9 and the fact that $T=O(\sqrt{d\log(d/\varepsilon\tau)})$ .

Hence, it holds that $|L^{\prime}\setminus L|\leq(2\exp(-T^{2}/2\nu)+\varepsilon/{\alpha})|S|.$ On the other hand, Step 7 of the algorithm ensures that the fraction of elements of $S^{\prime}$ that are rejected by the filter is at least $8\exp(-T^{2}/2\nu)+8\varepsilon/{\alpha})$ . Note that $|E\setminus E^{\prime}|$ is the number of points rejected by the filter that lie in $S^{\prime}\setminus S.$ Therefore, we can write:

where the second line uses the fact that $|S^{\prime}|\geq|S|/2$ and the last line uses the fact that $|L^{\prime}\setminus L|/|S|\leq 2\exp(-T^{2}/2\nu)+\varepsilon/{\alpha}.$ Noting that $\log(d/\varepsilon\tau)\geq 1$ , this completes the proof of the claim. ∎

A.1.3 Proof of Lemma A.5

Let $N=\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau))$ be the number of samples drawn from $G$ . For (i), the probability that a coordinate of a sample is at least $\sqrt{2\nu\log(Nd/3\tau)}$ is at most $\tau/3dN$ by Fact A.2. By a union bound, the probability that all coordinates of all samples are smaller than $\sqrt{2\nu\log(Nd/3\tau)}$ is at least $1-\tau/3$ . In this case, $\|x\|_{2}\leq\sqrt{2\nu d\log(Nd/3\tau)}=O(\sqrt{d\nu\log(N\nu/\tau)})$ .

After translating by $\mu^{G}$ , we note that (iii) follows immediately from Lemmas 4.3 of [DKK+16] and (iv) follows from Theorem 5.50 of [Ver10], as long as $N=\Omega(\nu^{4}d\log(1/\tau)/\varepsilon^{2})$ , with probability at least $1-\tau/3$ . It remains to show that, conditioned on (i), (ii) holds with probability at least $1-\tau/3$ .

To simplify some expressions, let $\delta:=\varepsilon/(\log(d\log d/\varepsilon\tau))$ and $R=C\sqrt{d\log(|S|/\tau)}$ . We need to show that for all unit vectors $v$ and all $0\leq T\leq R$ that

Firstly, we show that for all unit vectors $v$ and $T>0$

with probability at least $1-\tau/6$ . Since the VC-dimension of the set of all halfspaces is $d+1$ , this follows from the VC inequality [DL01], since we have more than $\Omega(d/(\delta/(10\nu\log(1/\delta))^{2})$ samples. We thus only need to consider the case when $T\geq\sqrt{10\nu\ln(1/\delta)}$ .

For any fixed unit vector $v$ and $T>\sqrt{10\nu\ln(1/\delta)}$ , except with probability $\exp(-N\delta/(6C\nu))$ , we have that

Let $E$ be the event that $|v\cdot(X-\mu^{G})|>T$ . Since $G$ is sub-gaussian, Fact A.2 yields that $\Pr_{G}[E]=\Pr_{Y\sim G}[|v\cdot(X-\mu^{G})|>T]\leq\exp(-T^{2}/(2\nu))$ . Note that, thanks to our assumption on $T$ , we have that $T\leq\exp(T^{2}/(4\nu))/2C$ , and therefore $T^{2}\Pr_{G}[E]\leq\exp(-T^{2}/(4\nu))/2C\leq\delta/2C$ .

where (a) follows from sub-gaussianity, (b) follows from our choice of $T$ , and (c) comes from the fact that $1+x\leq e^{x}$ for all $x$ .

Thus, if $\delta$ is a sufficiently small constant and $C$ is sufficiently large, this yields the desired bound. ∎

Now let $\mathcal{C}$ be a $1/2$ -cover in Euclidean distance for the set of unit vectors of size $2^{O(d)}$ . By a union bound, for all $v^{\prime}\in\mathcal{C}$ and $T^{\prime}$ a power of 2 between $\sqrt{4\nu\ln(1/\delta)}$ and $R$ , we have that

Then, by a union bound, (6) holds simultaneously for all unit vectors $v$ and all $0\leq T\leq R$ , with probability a least $1-\tau/3$ . This completes the proof. ∎

A.2 Robust Mean Estimation Under Second Moment Assumptions

In this section, we use our filtering technique to give a near sample-optimal computationally efficient algorithm to robustly estimate the mean of a density with a second moment assumption. We show:

Note that Theorem 3.2 follows straightforwardly from the above (divide every sample by $\sigma$ , run the algorithm of Theorem A.16, and multiply its output by $\sigma$ ).

We would like our good set of samples to have mean close to that of $P$ and bounded variance in all directions. This motivates the following definition:

We call a set $S$ $\varepsilon$ -good for a distribution $P$ with mean $\mu^{P}$ and covariance $\Sigma_{P}\preceq I$ if the mean $\mu^{S}$ and covariance $\Sigma^{S}$ of $S$ satisfy $\|\mu^{S}-\mu^{P}\|_{2}\leq\sqrt{\varepsilon}$ and $\|\Sigma^{S}\|_{2}\leq 2$ .

However, since we have no assumptions about higher moments, it may be be possible for outliers to affect our sample covariance too much. Fortunately, such outliers have small probability and do not contribute too much to the mean, so we will later reclassify them as errors.

Let $S$ be $N=\Theta((d/\varepsilon)\log d)$ samples drawn from $P$ . Then, with probability at least $9/10$ , a random $X\in_{u}S$ satisfies

$\Pr_{S}\left[\|X-\mu^{P}\|_{2}\geq 80\sqrt{d/\varepsilon}\right]\leq\varepsilon/160$ ,

for $Y\sim P$ . By Markov’s inequality, $\Pr[\|Y-\mu^{P}\|_{2}\geq 80\sqrt{d/\varepsilon}]\leq\varepsilon/160$ with probability at least $39/40$ .

where (a) follows from Cauchy-Schwarz, and (b) follows from the definition of the covariance, and (c) follows from the assumption that $\Sigma_{P}\preceq I$ and from Markov’s inequality.

For (iv), we require the following Matrix Chernoff bound:

We apply this lemma with $X_{k}=(x_{k}-\mu^{P})(x_{k}-\mu^{P})^{T}1_{\|x_{k}-\mu^{P}\|_{2}\leq 80\sqrt{d/\varepsilon}}$ for $\{x_{1},\dots,x_{N}\}=S$ . Note that $\|X_{k}\|_{2}\leq(80)^{2}d/\varepsilon=L$ and that $\mu^{\max}\leq N\|\Sigma_{P}\|_{2}\leq N$ .

Suppose that $\mu^{\max}\leq N/80$ . Then, taking $\theta=1$ , we have

By Markov’s inequality, except with probability $39/40$ , we have $\|\sum_{k}X_{k}\|_{2}\leq N+O(d\log(d)/\varepsilon)\leq 3N/2$ , for $N$ a sufficiently high multiple of $d\log(d)/\varepsilon$ .

Suppose that $\mu^{\max}\geq N/80$ , then we take $\delta=1/2$ and obtain

For $N$ a sufficiently high multiple of $d\log(d)/\varepsilon$ , we get that $\Pr[\left\|\sum_{k}X_{k}\right\|_{2}\geq 3\mu^{\max}/2]\leq 1/40$ . Since $\mu^{\max}\leq N$ , we have with probability at least $39/40$ , $\left\|\sum_{k}X_{k}\right\|_{2}\leq 3N/2$ .

Now we can get a $2\varepsilon$ -corrupted good set from an $\varepsilon$ -corrupted set of samples satisfying Lemma A.18, by reclassifying outliers as errors:

Let $S=R\cup E\setminus L$ , where $R$ is a set of $N=\Theta(d\log d/\varepsilon)$ samples drawn from $P$ and $E$ and $L$ are disjoint sets with $|E|,|L|\leq\varepsilon$ . Then, with probability $9/10$ , we can also write $S=G\cup E^{\prime}\setminus L^{\prime}$ , where $G\subseteq R$ is $\varepsilon$ -good, $L^{\prime}\subseteq L$ and $E^{\prime}\subseteq E^{\prime}$ has $|E^{\prime}|\leq 2\varepsilon|S|$ .

Let $G=\{x\in R:\|x\|_{2}\leq 80\sqrt{d/\varepsilon}\}$ . Condition on the event that $R$ satisfies Lemma A.18. By Lemma A.18, this occurs with probability at least $9/10$ .

Since $R$ satisfies (ii) of Lemma A.18, $|G|-|R|\leq\varepsilon|R|/160\leq\varepsilon|S|$ . Thus, $E^{\prime}=E\cup(R\setminus G)$ has $|E^{\prime}|\leq 3\varepsilon/2$ . Note that (iv) of Lemma A.18 for $R$ in terms of $G$ is exactly $|G|\|\Sigma^{G}\|_{2}/|R|\leq 3/2$ , and so $\|\Sigma^{G}\|_{2}\leq 3|R|/(2|G|)\leq 2$ .

It remains to check that $\|\mu^{G}-\mu^{P}\|_{2}\leq\sqrt{\varepsilon}$ . We have

where the last line follows from (iii) of Lemma A.18. Since we argued above that $|R|/|G|\geq 2/3$ , dividing this expression by $|G|$ yields the desired claim.

An iteration of FilterUnder2ndMoment may throw out more samples from $G$ than corrupted samples. However, in expectation, we throw out many more corrupted samples than from the good set:

If $G$ is an $\varepsilon$ -good set with $x\leq 40\sqrt{d/\varepsilon}$ for $x\in S\cup G$ , then $\|M_{G}\|_{2}\leq 2\|\mu^{G}-\mu^{S}\|_{2}^{2}+2\;.$

We have that $|L|\|M_{L}\|_{2}\leq 2|G|(1+\|\mu^{G}-\mu^{S}\|_{2}^{2})\;.$

Since $L\subseteq G$ , for any unit vector $v$ , we have

$\|\mu^{G}-\mu^{S}\|_{2}\leq\sqrt{2\varepsilon\|M_{S}\|_{2}}+12\sqrt{\varepsilon}$ .

We have that $|E|M_{E}\leq|S|M_{S}+|L|M_{L}$ and so

By Cauchy Schwarz, we have that $\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{S}\|_{2}^{2}$ , and so

By Cauchy-Schwarz and Lemma A.23, we have that

Since $|S|\mu^{S}=|G|\mu^{G}+|E|\mu^{E}-|L|\mu^{L}$ and $|S|=|G|+|E|-|L|$ , we get

Since for $x,y>0$ , $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ , we have

Since $||G|-|S||\leq\varepsilon|S|$ and $|E|\leq 2\varepsilon|S|,|L|\leq 9\varepsilon|S|$ , we have

Moving the $\|\mu^{G}-\mu^{S}\|_{2}$ terms to the LHS, using $6\sqrt{\varepsilon}\leq 1/2$ , gives

Since $\lambda^{\ast}=\|M_{S}\|_{2}$ , the correctness if we return the empirical mean is immediate.

If $\lambda^{\ast}\leq 9$ , we have that $\|\mu^{G}-\mu^{S}\|_{2}=O(\sqrt{\varepsilon})$ .

From now on, we assume $\lambda^{\ast}>9$ . In this case we have $\|\mu^{G}-\mu^{S}\|_{2}^{2}\leq O(\varepsilon\lambda^{\ast})$ . Using Lemma A.22, we have

for sufficiently small $\varepsilon$ . Thus, we have that

Now we can show that in expectation, we throw out many more corrupted points from $E$ than from $G\setminus L$ :

Let $a=\max_{x\in S}|v^{\ast}\cdot x-\mu^{S}|$ . Firstly, we look at the expected number of samples we reject:

Next, we look at the expected number of false positive samples we reject, i.e., those in $L^{\prime}\setminus L$ .

If $\lambda^{\ast}\leq 9$ , then we return the mean in Step 5, and by Corollary A.25, $\|\mu^{S}-\mu^{P}\|_{2}\leq O(\sqrt{\varepsilon})$ .

Our input is a set $S$ of $N=\Theta((d/\varepsilon)\log d)$ $\varepsilon$ -corrupted samples so that with probability $9/10$ , $S$ is a $2\varepsilon$ -corrupted set of $\varepsilon$ -good samples for $P$ by Lemmas A.18 and A.20. We have a set $S=G\cup E^{\prime}\setminus L$ , where $G^{\prime}$ is an $\varepsilon$ -good set, $|E|\leq 2\varepsilon$ , and $|L|\leq\varepsilon$ . Then, we iteratively apply FilterUnder2ndMoment until it outputs an approximation to the mean. Since each iteration removes a sample, this must happen within $N$ iterations. The algorithm takes at most $\operatorname{poly}(N,d)=\operatorname{poly}(d,1/\varepsilon)$ time.

As long as we can show that the conditions of Proposition A.21 hold in each iteration, it ensures that $\|\mu^{S}-\mu^{P}\|_{2}\leq O(\sqrt{\varepsilon})$ . However, the condition that $|L|\leq 9\varepsilon|S|$ need not hold in general. Although in expectation we reject many more samples in $E$ than $G$ , it is possible that we are unlucky and reject many samples in $G$ , which could make $L$ large in the next iteration. Thus, we need a bound on the probability that we ever have $|L|>9\varepsilon$ .

By a union bound, the probability that the uncorrupted samples satisfy Lemma A.18 and Proposition A.21 applies to every iteration is at least $9/10-1/6\geq 2/3$ . Thus, with at least $2/3$ probability, the algorithm outputs a vector ${\widehat{\mu}}$ with $\|{\widehat{\mu}}-\mu^{P}\|_{2}\leq O(\sqrt{\varepsilon})$ . ∎

A.3 Robust Covariance Estimation

In this subsection, we give a near sample-optimal efficient robust estimator for the covariance of a zero-mean Gaussian density, thus proving Theorem 3.3. Our algorithm is essentially identical to the filtering algorithm given in Section 8.2 of [DKK+16]. As in Section A.1 the only difference is a weaker definition of the “good set of samples” (Definition A.27) and a concentration argument (Lemma A.28) showing that a random set of uncorrupted samples of the appropriate size is good with high probability. Given these, the analysis of this subsection follows straightforwardly from the analysis in Section 8.2 of [DKK+16] by plugging in the modified parameters.

The algorithm Filter-Gaussian-Unknown-Covariance to robustly estimate the covariance of a mean Gaussian in [DKK+16] is as follows:

Firstly, we state the new, weaker, definition of a good set:

For all $x\in S$ , $x^{T}\Sigma^{-1}x<d+O(\sqrt{d}\log(d/\varepsilon))$ .

We have that $\|\Sigma^{-1/2}\mbox{Cov}(S)\Sigma^{-1/2}-I\|_{F}=O(\varepsilon)$ .

For all even degree- $2$ polynomials $p$ , we have that $\operatorname*{Var}(p(S))=\operatorname*{Var}(p(G))(1+O(\varepsilon))$ .

Let $N$ be a sufficiently large constant multiple of $d^{2}\log^{5}(d/\varepsilon)/\varepsilon^{2}$ . Then a set $S$ of $N$ independent samples from $G$ is $\varepsilon$ -good with respect to $G$ with high probability.

First, note that it suffices to prove this when $G=N(0,I)$ .

Condition 1 follows by standard concentration bounds on $\|x\|_{2}^{2}$ .

Condition 2 follows by estimating the entry-wise error between $\mbox{Cov}(S)$ and $I$ .

Condition 4 is substantially the most difficult of these conditions to prove. Naively, we would want to find a cover of all possible $p$ and all possible $T$ , and bound the probability that the desired condition fails. Unfortunately, the best a priori bound on $\Pr(|p(G)|>T)$ are on the order of $\exp(-T)$ . As our cover would need to be of size $2^{d^{2}}$ or so, to make this work with $T=d$ , we would require on the order of $d^{3}$ samples in order to make this argument work.

However, we will note that this argument is sufficient to cover the case of $T<10\log(1/\varepsilon)\log^{2}(d/\varepsilon)$ .

Let $\mathcal{P}_{k}$ be the set of even, mean-, degree- $2$ polynomials, so that the associated matrix $A$ satisfies:

Note that for $p\in\mathcal{P}_{k}$ that $|p(x)|\leq|x|^{2}/\sqrt{k}+\sqrt{k}$ .

Importantly, any polynomial can be written in terms of these sets.

Let $A$ be the associated matrix to $p$ . Note that $\|A\|_{F}=\operatorname*{Var}{p}=1$ . Let $A_{k}$ be the matrix corresponding to the top $k$ eigenvalues of $A$ . We now let $p_{1}$ be the polynomial associated to $A_{1}/2$ , $p_{2}$ be associated to $(A_{2}-A_{1})/2$ , $p_{4}$ be associated to $(A_{4}-A_{2})/2$ , and so on. It is clear that $p=2(p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d})$ . It is also clear that the matrix associated to $p_{k}$ has rank at most $k$ . If the matrix associated to $p_{k}$ had an eigenvalue more than $1/\sqrt{k}$ , it would need to be the case that the $k/2^{nd}$ largest eigenvalue of $A$ had size at least $2/\sqrt{k}$ . This is impossible since the sum of the squares of the eigenvalues of $A$ is at most $1$ .

We will also need covers of each of these sets $\mathcal{P}_{k}$ .

For each $k$ , there exists a set $\mathcal{C}_{k}\subset\mathcal{P}_{k}$ so that

For each $p\in\mathcal{P}_{k}$ there exists a $q\in\mathcal{C}_{k}$ so that $\|p(G)-q(G)\|_{2}\leq(\varepsilon/d)^{2}$ .

$|\mathcal{C}_{k}|=2^{O(dk\log(d/\varepsilon))}.$

We note that any such $p$ is associated to a matrix $A$ of the form $A=\sum_{i=1}^{k}\lambda_{i}v_{i}v_{i}^{T}$ , for $\lambda_{i}\in[0,1/\sqrt{k}]$ and $v_{i}$ orthonormal. It suffices to let $q$ correspond to the matrix $A^{\prime}=\sum_{i=1}^{k}\mu_{i}w_{i}w_{i}^{T}$ for with $|\lambda_{i}-\mu_{i}|<(\varepsilon/d)^{3}$ and $|v_{i}-w_{i}|<(\varepsilon/d)^{3}$ for all $i$ . It is easy to let $\mu_{i}$ and $w_{i}$ range over covers of the interval and the sphere with appropriate errors. This gives a set of possible $q$ ’s of size $2^{O(dk\log(d/\varepsilon))}$ as desired. Unfortunately, some of these $q$ will not be in $\mathcal{P}_{k}$ as they will have eigenvalues that are too large. However, this is easily fixed by replacing each such $q$ by the closest element of $\mathcal{P}_{k}$ . This completes our proof. ∎

We next will show that these covers are sufficient to express any polynomial.

Combining the above two lemmas we have that any such $p$ can be written as

where $q_{k}$ above is in $\mathcal{C}_{k}$ and $\|p_{k}(G)\|_{2}<(\varepsilon/d)^{2}$ . Thus, $p^{\prime}=p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d}$ has $\|p^{\prime}(G)\|_{2}\leq(\varepsilon/d)$ . This completes the proof. ∎

The key observation now is that if $|p(x)|\geq T$ for $\|x\|_{2}\leq\sqrt{d/\varepsilon}$ , then writing $p=q_{1}+q_{2}+q_{4}+\ldots+q_{d}+p^{\prime}$ as above, it must be the case that $|q_{k}(x)|>(T-1)/(2\log(d))$ for some $k$ . Therefore, to prove our main result, it suffices to show that, with high probability over the choice of $S$ , for any $T\geq 10\log(1/\varepsilon)\log^{2}(d/\varepsilon)$ and any $q\in\mathcal{C}_{k}$ for some $k$ , that $\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log(d))$ . Equivalently, it suffices to show that for $T\geq 10\log(1/\varepsilon)\log(d/\varepsilon)$ it holds $\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log^{2}(d))$ . Note that this holds automatically for $T>(d/\varepsilon)$ , as $p(x)$ cannot possibly be that large for $\|x\|_{2}\leq\sqrt{d/\varepsilon}$ . Furthermore, note that losing a constant factor in the probability, it suffices to show this only for $T$ a power of $2$ .

Therefore, it suffices to show for every $k\leq d$ , every $q\in\mathcal{C}_{k}$ and every $d/\sqrt{k\varepsilon}\gg T\gg\log(1/\varepsilon)\log(d/\varepsilon)$ that with probability at least $1-2^{-\Omega(dk\log(d/\varepsilon))}$ over the choice of $S$ we have that $\Pr_{x\in_{u}S}(|q(x)|>T)\ll\varepsilon/(T^{2}\log^{4}(d/\varepsilon))$ . However, by the Hanson-Wright inequality, we have that

Therefore, by Chernoff bounds, the probability that more than a $\varepsilon/(T^{2}\log^{4}(d/\varepsilon))$ -fraction of the elements of $S$ satisfy this property is at most

Appendix B Omitted Details from Section 5

Here we formally describe the distributions we used in our experiments. In all settings, our goal was to find noise distributions so that noise points were not “obvious” outliers, in the sense that there is no obvious pointwise pruning process which could throw away the noise points, which still gave the algorithms we tested the most difficulty. We again remark that while other algorithms had varying performances depending on the noise distribution, it seemed that the performance of ours was more or less unaffected by it.

Our uncorrupted points were generated by $\operatorname{\mathcal{N}}(\mu,I)$ , where $\mu$ is the all-ones vector. Our noise distribution is given as

where $\Pi_{1}$ is the product distribution over the hypercube where every coordinate is or $1$ with probability $1/2$ , and $\Pi_{2}$ is a product distribution where the first coordinate is ether or $12$ with equal probability, the second coordinate is $-2$ or with equal probability, and all remaining coordinates are zero.

For the isotropic synthetic covariance experiment, our uncorrupted points were generated by $\operatorname{\mathcal{N}}(0,I)$ , and the noise points were all zeros. For the skewed synthetic covariance experiment, our uncorrupted points were generated by $\operatorname{\mathcal{N}}(0,I+100e_{1}e_{1}^{T})$ , where $e_{1}$ is the first unit vector, and our noise points were generated as follows: we took a fixed random rotation of points of the form $Y_{i}\sim\Pi$ , where $\Pi$ is a product distribution whose first $d/2$ coordinates are each uniformly selected from $\{-0.5,0,0.5\}$ , and whose next $d/2-1$ coordinates are each $0.8\times A_{i}$ , where for each coordinate $i$ , $A_{i}$ is an independent random integer between $-2$ and $2$ , and whose last coordinate is a uniformly random integer between $$.

We took the 20 dimensional data from [NJB+08], which was diagonalized, and randomly rotated it. This was to simulate the higher dimensional case, since the singular vectors that [NJB+08] obtained did not seem to be sparse or analytically sparse. Our noise was distributed as $\Pi$ , where $\Pi$ is a product distribution whose first $d/2$ coordinates are each uniformly random integers between and $2$ and whose last $d/2$ coordinates are each uniformly randomly either $2$ or $3$ , all scaled by a factor of $1/24$ .

B.2 Comparison with other robust PCA methods on semi-synthetic data

In addition to comparing our results with simple pruning techniques, as we did in Figure 3 in the main text, we also compared our algorithm with implementations of other robust PCA techniques from the literature with accessible implementations. In particular, we compared our technique with RANSAC-based techniques, LRVCov, two SDPs ([CLMW11, XCS10]) for variants of robust PCA, and an algorithm proposed by [CLMW11] to speed up their SDP based on alternating descent. For the SDPs, since black box methods were too slow to run on the full data set (as [CLMW11] mentions, black-box solvers for the SDPs are impractical above perhaps 100 data points), we subsample the data, and run the SDP on the subsampled data. For each of these methods, we ran the algorithm on the true data points plus noise, where the noise was generated as described above. We then take the estimate of the covariance it outputs, and project the data points onto the top two singular values of this matrix, and plot the results in Figure 4.

Similar results occurred for most noise patterns we tried. We found that only our algorithm and LRVCov were able to reasonably reconstruct Europe, in the presence of this noise. It is hard to judge qualitatively which of the two maps generated is preferable, but it seems that ours stretches the picture somewhat less than LRVCov.