Being Robust (in High Dimensions) Can Be Practical

Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, Alistair Stewart

Introduction

Robust statistics was founded in the seminal works of [Tuk60] and [Hub64]. The overarching motto is that any model (especially a parametric one) is only approximately valid, and that any estimator designed for a particular distribution that is to be used in practice must also be stable in the presence of model misspecification. The standard setup is to assume that the samples we are given come from a nice distribution, but that an adversary has the power to arbitrarily corrupt a constant fraction of the observed data. After several decades of work, the robust statistics community has discovered a myriad of estimators that are provably robust. An important feature of this line of work is that it can tolerate a constant fraction of corruptions independent of the dimension and that there are estimators for both the location (e.g., the mean) and scale (e.g., the covariance). See [HR09] and [HRRS86] for further background.

It turns out that there are vast gaps in our understanding of robustness, when computational considerations are taken into account. In one dimension, robustness and computational efficiency are in perfect harmony. The empirical mean and empirical variance are not robust, because a single corruption can arbitrarily bias these estimates, but alternatives such as the median and the interquartile range are straightforward to compute and are provably robust.

But in high dimensions, there is a striking tension between robustness and computational efficiency. Let us consider estimators for location. The Tukey median [Tuk60] is a natural generalization of the one-dimensional median to high-dimensions. It is known that it behaves well (i.e., it needs few samples) when estimating the mean for various symmetric distributions [DG92, CGR16]. However, it is hard to compute in general [JP78, AK95] and the many heuristics for computing it degrade badly in the quality of their approximation as the dimension scales [CEM+93, Cha04, MS10]. The same issues plague estimators for scale. The minimum volume ellipsoid [Rou85] is a natural generalization of the one-dimensional interquartile range and is provably robust in high-dimensions, but is also hard to compute. And once again, heuristics for computing it [VAR09, RS98] work poorly in high dimensions.

The fact that robustness in high dimensions seems to come at such a steep price has long been a point of consternation within robust statistics. In a 1997 retrospective on the development of robust statistics [Hub97], Huber laments:

“It is one thing to design a theoretical algorithm whose purpose is to prove [large fractions of corruptions can be tolerated] and quite another thing to design a practical version that can be used not merely on small, but also on medium sized regression problems, with a 20002000 by 5050 matrix or so. This last requirement would seem to exclude all of the recently proposed [techniques].”

The goal of this paper is to answer Huber’s call to action and design estimators for both the mean and covariance that are highly practical, provably robust, and work in high-dimensions. Such estimators make the promise of robust statistics – estimators that work in high-dimensions and guarantee that their output has not been heavily biased by some small set of noisy samples – much closer to a reality.

First, we make some remarks to dispel some common misconceptions. There has been a considerable amount of recent work on robust principal component analysis, much of it making use of semidefinite programming. Some of these works can tolerate a constant fraction of corruptions [CLMW11], however require that the locations of the corruptions are evenly spread throughout the dataset so that no individual sample is entirely corrupted. In contrast, the usual models in robust statistics are quite rigid in what they require and they do this for good reason. A common scenario that is used to motivate robust statistical methods is if two studies are mixed together, and one subpopulation does not fit the model. Then one wants estimators that work without assuming anything at all about these outliers.

There have also been semidefinite programming methods proposed for robust principal component analysis with outliers [XCS10]. These methods assume that the uncorrupted matrix is rank rr and that the fraction of outliers is at most 1/r1/r, which again degrades badly as the rank of the matrix increases. Moreover, any method that uses semidefinite programming will have difficulty scaling to the sizes of the problems we consider here. For sake of comparison – even with state-of-the-art interior point methods – it is not currently feasible to solve the types of semidefinite programs that have been proposed when the matrices have dimension larger than a hundred.

Recent works in theoretical computer science have sought to circumvent the usual difficulties of designing efficient and robust algorithms by instead working in a generative model. The starting point for our paper is the work of [DKK+16] who gave an efficient algorithm for the problem of agnostically learning a Gaussian:

Total variation distance is the natural metric to use to measure closeness of the parameters, since a (1ε)(1-\varepsilon)-fraction of the observed samples came from a Gaussian. [DKK+16] gave an algorithm for the above problem (note that the guarantees are dimension independent), whose running time and sample complexity are polynomial in the dimension dd and 1/ε1/\varepsilon. [LRV16] independently gave an algorithm for the unknown mean case that achieves dTV(N,N)O~(εlogd)d_{TV}(\mathcal{N},\mathcal{N}^{\prime})\leq\widetilde{O}(\varepsilon\sqrt{\log d}), and in the unknown covariance case achieves guarantees in a weaker metric that is not affine invariant. A crucial feature is that both algorithms work even when the moments of the underlying distribution satisfy certain conditions, and thus are not necessarily brittle to the modeling assumption that the inliers come from a Gaussian distribution.

A more conceptual way to view such work is as a proof-of-concept that the Tukey median and minimum volume ellipsoid can be computed efficiently in a natural family of distributional models. This follows because not only would these be good estimates for the mean and covariance in the above model, but in fact any estimates that are good must also be close to them. Thus, these works fit into the emerging research direction of circumventing worst-case lower bounds by going beyond worst-case analysis.

Since the dissemination of the aforementioned works [DKK+16, LRV16], there has been a flurry of research activity on computationally efficient robust estimation in a variety of high-dimensional settings [DKS16, DKS17, CSV17, DKK+17, Li17, DBS17, BDLS17, SCV18, DKK+18], including studying graphical distributional models [DKS16], understanding the computation-robustness tradeoff for statistical query algorithms [DKS17], tolerating much more noise by allowing the algorithm to output a list of candidate hypotheses [CSV17], and developing robust algorithms under sparsity assumptions [Li17, DBS17, BDLS17], where the number of samples is sublinear in the dimension.

2 Our Results

Our goal in this work is to show that high-dimensional robust estimation can be highly practical. However, there are two major obstacles to achieving this. First, the sample complexity and running time of the algorithms in [DKK+16] is prohibitively large for high-dimensional applications. We just would not be able to store as many samples as we would need, in order to compute accurate estimates, in high-dimensional applications.

Our first main contribution is to show essentially tight bounds on the sample complexity of the filtering based algorithm of [DKK+16]. Roughly speaking, we accomplish this with a new definition of the good set which plugs into the existing analysis in a straightforward manner and shows that it is possible to estimate the mean with O~(d/ε2)\widetilde{O}(d/\varepsilon^{2}) samples (when the covariance is known) and the covariance with O~(d2/ε2)\widetilde{O}(d^{2}/\varepsilon^{2}) samples. Both of these bounds are information-theoretically optimal, up to logarithmic factors.

Our second main contribution is to vastly improve the fraction of adversarial corruptions that can be tolerated in applications. The fraction of errors that the algorithms of [DKK+16] can tolerate is indeed a constant that is independent of the dimension, but it is very small both in theory and in practice. This is due to the fact that many of the steps in the algorithm are overly conservative. In fact, we found that a naive implementation of the algorithm did not remove any outliers in many realistic scenarios. We combat this by giving new ways to empirically tune the threshold for where to remove points from the sample set. These optimizations dramatically improve the empirical performance.

Finally, we show that the same bounds on the error guarantee continue to work even when the underlying distribution is sub-Gaussian. This theoretically confirms that the robustness guarantees of such algorithms are in fact not overly brittle to the distributional assumptions. In fact, the filtering algorithm of [DKK+16] is easily shown to be robust under much weaker distributional assumptions, while retaining near-optimal sample and error guarantees. As an example, we show that it yields a near sample-optimal efficient estimator for robustly estimating the mean of a distribution, under the assumption that its covariance is bounded. Even in this regime, the filtering algorithm guarantees optimal error, up to a constant factor. Furthermore we empirically corroborate this finding by showing that the algorithm works well on real world data, as we describe below.

Now we come to the task of testing out our algorithms. To the best of our knowledge, there have been no experimental evaluations of the performance of the myriad of approaches to robust estimation. It remains mostly a mystery which ones perform well in high-dimensions, and which do not. To test out our algorithms, we design a synthetic experiment where a (1ε)(1-\varepsilon)-fraction of the samples come from a Gaussian and the rest are noise and sampled from another distribution (in many cases, Bernoulli). This gives us a baseline to compare how well various algorithms recover μ\mu and Σ\Sigma, and how their performance degrades based on the dimension. Our plots show a predictable and yet striking phenomenon: All earlier approaches have error rates that scale polynomially with the dimension and ours is a constant that is almost indistinguishable from the error that comes from sample noise alone. Moreover, our algorithms are able to scale to hundreds of dimensions.

But are algorithms for agnostically learning a Gaussian unduly sensitive to the distributional assumptions they make? We are able to give an intriguing visual demonstration of our techniques on real data. The famous study of [NJB+08] showed that performing principal component analysis on a matrix of genetic data recovers a map of Europe. More precisely, the top two singular vectors define a projection into the plane and when the groups of individuals are color-coded with where they are from, we recover familiar country boundaries that corresponds to the map of Europe. The conclusion from their study was that genes mirror geography. Given that one of the most important applications of robust estimation ought to be in exploratory data analysis, we ask: To what extent can we recover the map of Europe in the presence of noise? We show that when a small number of corrupted samples are added to the dataset, the picture becomes entirely distorted (and this continues to hold even for many other methods that have been proposed). In contrast, when we run our algorithm, we are able to once again recover the map of Europe. Thus, even when some fraction of the data has been corrupted (e.g., medical studies were pooled together even though the subpopulations studied were different), it is still possible to perform principal component analysis and recover qualitatively similar conclusions as if there were no noise at all!

Formal Framework

Notation. For a vector vv, we will let v2\|v\|_{2} denote its Euclidean norm. If MM is a matrix, we will let M2\|M\|_{2} denote its spectral norm and MF\|M\|_{F} denote its Frobenius norm. We will write XuSX\in_{u}S to denote that XX is drawn from the empirical distribution defined by SS.

Robust Estimation. We consider the following powerful model of robust estimation that generalizes many other existing models, including Huber’s contamination model:

Given ε>0\varepsilon>0 and a distribution family D\mathcal{D}, the adversary operates as follows: The algorithm specifies some number of samples mm. The adversary generates mm samples X1,X2,,XmX_{1},X_{2},\ldots,X_{m} from some (unknown) DDD\in\mathcal{D}. It then draws mm^{\prime} from an appropriate distribution. This distribution is allowed to depend on X1,X2,,XmX_{1},X_{2},\ldots,X_{m}, but when marginalized over the mm samples satisfies m\mboxBin(ε,m)m^{\prime}\sim\mbox{Bin}(\varepsilon,m). The adversary is allowed to inspect the samples, removes mm^{\prime} of them, and replaces them with arbitrary points. The set of mm points is then given to the algorithm.

In summary, the adversary is allowed to inspect the samples before corrupting them, both by adding corrupted points and deleting uncorrupted points. In contrast, in Huber’s model the adversary is oblivious to the samples and is only allowed to add corrupted points.

We remark that there are no computational restrictions on the adversary. The goal is to return the parameters of a distribution D^\widehat{D} in D\mathcal{D} that are close to the true parameters in an appropriate metric. For the case of the mean, our metric will be the Euclidean distance. For the covariance, we will use the Mahalanobis distance, i.e., Σ1/2Σ^Σ1/2IF\|\Sigma^{-1/2}{\widehat{\Sigma}}\Sigma^{-1/2}-I\|_{F}. This is a strong affine invariant distance that implies corresponding bounds in total variation distance.

We say that a set of samples is ε\varepsilon-corrupted if it is generated by the process described in Definition 2.1.

Nearly Sample-Optimal Efficient Robust Learning

In this section, we present near sample-optimal efficient robust estimators for the mean and the covariance of high-dimensional distributions under various structural assumptions of varying strength. Our estimators rely on the filtering technique introduced in [DKK+16].

We note that [DKK+16] gave two algorithmic techniques: the first one was a spectral technique to iteratively remove outliers from the dataset (filtering), and the second one was a soft-outlier removal method relying on convex programming. The filtering technique seemed amenable to practical implementation (as it only uses simple eigenvalue computations), but the corresponding sample complexity bounds given in [DKK+16] are polynomially worse than the information-theoretic minimum. On the other hand, the convex programming technique of [DKK+16] achieved better sample complexity bounds (e.g., near sample-optimal for robust mean estimation), but relied on the ellipsoid method, which seemed to preclude a practically efficient implementation.

In this work, we achieve the best of both worlds: we provide a more careful analysis of the filter technique that yields sample-optimal bounds (up to logarithmic factors) for both the mean and the covariance. Moreover, we show that the filtering technique easily extends to much weaker distributional assumptions (e.g., under bounded second moments). Roughly speaking, the filtering technique follows a general iterative recipe: (1) via spectral methods, find some univariate test which is violated by the corrupted points, (2) find some concrete tail bound violated by the corrupted set of points, and (3) throw away all points which violate this tail bound.

[DKK+16] gave algorithms for robustly estimating the mean of a Gaussian distribution with known covariance and for robustly estimating the mean of a binary product distribution. The main motivation for considering these specific distribution families is that robustly estimating the mean within Euclidean distance immediately implies total variation distance bounds for these families. The above theorem establishes that these guarantees hold in a more general setting with near sample-optimal bounds. Under a bounded second moment assumption, we show:

A similar result on mean estimation under bounded second moments was concurrently shown in [SCV18]. The sample size above is optimal, up to a logarithmic factor, and the error guarantee is easily seen to be the best possible up to a constant factor. The main difference between the filtering algorithm establishing the above theorem and the filtering algorithm for the sub-gaussian case is how we choose the threshold for the filter. Instead of looking for a violation of a concentration inequality, here we will choose a threshold at random. In this case, randomly choosing a threshold weighted towards higher thresholds suffices to throw out more corrupted samples than uncorrupted samples in expectation. Although it is possible to reject many good samples this way, we show that the algorithm still only rejects a total of O(ε)O(\varepsilon) samples with high probability.

Finally, for robustly estimating the covariance of a Gaussian distribution, we have:

Let GN(0,Σ)G\sim\operatorname{\mathcal{N}}(0,\Sigma) be a Gaussian in dd dimensions, and let ε>0\varepsilon>0. Let SS be an ε\varepsilon-corrupted set of samples from GG of size Ω((d2/ε2)polylog(d/ε))\Omega((d^{2}/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon)). There exists an efficient algorithm that, given SS and ε\varepsilon, returns the parameters of a Gaussian distribution GN(0,Σ^)G^{\prime}\sim\operatorname{\mathcal{N}}(0,{\widehat{\Sigma}}) so that with probability at least 9/109/10, it holds IΣ1/2Σ^Σ1/2F=O(εlog(1/ε)).\|I-\Sigma^{-1/2}{\widehat{\Sigma}}\Sigma^{-1/2}\|_{F}=O(\varepsilon\log(1/\varepsilon)).

We now provide a high-level description of the main ingredient which yields these improved sample complexity bounds. The initial analysis of [DKK+16] established sample complexity bounds which were sub-optimal by polynomial factors because it insisted that the set of good samples (i.e., before the corruption) satisfied very tight tail bounds. To some degree such bounds are necessary, as when we perform our filtering procedure, we need to ensure that not too many good samples are thrown away. However, the old analysis required that fairly strong tail bounds hold uniformly. The idea for the improvement is as follows: If the errors are sufficient to cause the variance of some polynomial pp (linear in the unknown mean case or quadratic in the unknown covariance case) to increase by more than ε\varepsilon, it must be the case that for some TT, roughly an ε/T2\varepsilon/T^{2} fraction of samples are error points with p(x)>T|p(x)|>T. As long as we can ensure that less than an ε/T2\varepsilon/T^{2} fraction of our good sample points have p(x)>T|p(x)|>T, this will suffice for our filtering procedure to work. For small values of TT, these are much weaker tail bounds than were needed previously and can be achieved with a smaller number of samples. For large values of TT, these tail bounds are comparable to those used in previous work [DKK+16] , but in such cases we can take advantage of the fact that p(G)>T|p(G)|>T only with very small probability, again allowing us to reduce the sample complexity. The details are deferred to Appendix A.

Filtering

We now describe the filtering technique more rigorously. We also describe some additional heuristics we found useful in practice.

We first consider mean estimation. The algorithms which achieve Theorems 3.1 and 3.2 both follow the general recipe in Algorithm 1. We must specify three parameter functions:

δ(ε,s)\delta(\varepsilon,s) is a slack function, which we require for technical reasons.

where ν\nu is the subgaussian parameter. See Section A.1 for details.

2 Robust Covariance Estimation

Our algorithm for robust covariance follows the exact recipe outlined above, with one key difference—we check for deviations in the empirical fourth moment tensor. Intuitively, just as in the robust mean setting, we used degree-22 information to detect outliers for the mean (the degree-11 moment), here we use degree-44 information to detect outliers for the covariance (the degree-22 moment).

3 Better Univariate Tests

In the algorithms described above for robust mean estimation, after projecting onto one dimension, we center the points at the empirical mean along this direction. This is theoretically sufficient, however, introduces additional constant factors since the empirical mean along this direction may be corrupted. Instead, one can use a robust estimate for the mean in one direction. Namely, it is well known that the median is a provably robust estimator for the mean for symmetric distributions [HR09, HRRS86], and under certain models it is in fact optimal in terms of its resilience to noise [DKW56, Mas90, Che98, DK14, DKK+17]. By centering the points at the median instead of the mean, we are able to achieve better error in practice.

4 Adaptive Tail Bounding

We found that depending on the setting, it was useful to change the constant C2C_{2}. In particular, in low dimensions, we could be more stringent, and enforce a stronger tail bound (which corresponds to a higher C2C_{2}), but in higher dimensions, we must be more lax with the tail bound. To do this in a principled manner, we introduced a heuristic we call adaptive tail bounding. Our goal is to find a choice of C2C_{2} which throws away roughly an ε\varepsilon-fraction of points. The heuristic is fairly simple: we start with some initial guess for C2C_{2}. We then run our filter with this C2C_{2}. If we throw away too many data points, we increase our C2C_{2}, and retry. If we throw away too few, then we decrease our C2C_{2} and retry. Since increasing C2C_{2} strictly decreases the number of points thrown away, and vice versa, we binary search over our choice of C2C_{2} until we reach something close to our target accuracy. In our current implementation, we stop when the fraction of points we throw away is between ε/2\varepsilon/2 and 3ε/23\varepsilon/2, or if we’ve binary searched for too long. We found that this heuristic drastically improves our accuracy, and allows our algorithm to scale fairly smoothly from low to high dimension.

Experiments

We performed an empirical evaluation of the above algorithms on synthetic and real data sets with and without synthetic noise. All experiments were done on a laptop computer with a 2.7 GHz Intel Core i5 CPU and 8 GB of RAM. The focus of this evaluation was on statistical accuracy, not time efficiency. In this measure, our algorithm performs the best of all algorithms we tried. In all synthetic trials, our algorithm consistently had the smallest error. In fact, in some of the synthetic benchmarks, our error was orders of magnitude better than any other algorithms. In the semi-synthetic benchmark, our algorithm also (arguably) performs the best, though there is no way to tell for sure, since there is no ground truth. We also note that despite not optimizing our code for runtime, the runtime of our algorithm is always comparable, and in many cases, better than the alternatives which provided comparable error. Code of our implementation is available at https://github.com/hoonose/robust-filter.

Experiments with synthetic data allow us to verify the error guarantees and the sample complexity rates proven in Section 3 for unknown mean and unknown covariance. In both cases, the experiments validate the accuracy and usefulness of our algorithm, almost exactly matching the best rate without noise.

We tried various noise distributions, and found that the same qualitative pattern arose for all of them. In the reported experiment, our noise distribution was a mixture of two binary product distributions, where one had a couple of large coordinates (see Section B.1 for a detailed description). For all (nontrivial) error distributions we tried, we observed that indeed the empirical mean, pruning, geometric median, and RANSAC all have error which diverges as dd grows, as the theory predicts. On the other hand, both our algorithm and LRVMean have markedly smaller error as a function of dimension. Indeed, our algorithm’s error is almost identical to that of the empirical mean of the uncorrupted sample points.

On this corrupted data, we compared the performance of our Filter algorithm to that of (1) the empirical covariance of all the points, (2) a trivial pruning procedure, (3) a RANSAC-based minimal volume ellipsoid (MVE) algorithm, and (5) a recently proposed robust estimator for the covariance due to [LRV16], which we will call LRVCov. For (5), we again obtained the implementation from their Github repository.

We tried various choices of Σ\Sigma and noise distribution. Figure 2 shows two choices of Σ\Sigma and noise. Again, the x-axis indicates the dimension of the experiment and the y-axis indicates the estimator’s excess Mahalanobis error over the sampling error. In the left figure, we set Σ=I\Sigma=I, and our noise points are simply all located at the all-zeros vector. In the right figure, we set Σ=I+10e1e1T\Sigma=I+10e_{1}e_{1}^{T}, where e1e_{1} is the first basis vector, and our noise distribution is a somewhat more complicated distribution, which is similarly spiked, but in a different, random, direction. We formally define this distribution in Section B.1. For all choices of Σ\Sigma and noise we tried, the qualitative behavior of our algorithm and LRVCov was unchanged. Namely, we seem to match the empirical error without noise up to a very small slack, for all dimensions. On the other hand, the performance of empirical mean, pruning, and RANSAC varies widely with the noise distribution. The performance of all these algorithms degrades substantially with dimension, and their error gets worse as we increase the skew of the underlying data. The performance of LRVCov is the most similar to ours, but again is worse by a large constant factor. In particular, our excess risk was on the order of 10410^{-4} for large dd, for both experiments, whereas the excess risk achieved by LRVCov was in all cases a constant between 0.10.1 and 22.

These experiments demonstrate that our statistical guarantees are in fact quite strong. In particular, since our excess error is almost zero (and orders of magnitude smaller than other approaches), this suggests that our sample complexity is indeed close to optimal, since we match the rate without noise, and that the constants and logarithmic factors in the theoretical recovery guarantee are often small or non-existent.

2 Semi-synthetic Data

To demonstrate the efficacy of our method on real data, we revisit the famous study of [NJB+08]. In this study, the authors investigated data collected as part of the Population Reference Sample (POPRES) project. This dataset consists of the genotyping of thousands of individuals using the Affymetrix 500K single nucleotide polymorphism (SNP) chip. The authors pruned the dataset to obtain the genetic data of over 1387 European individuals, annotated by their country of origin. Using principal components analysis, they produce a two-dimensional summary of the genetic variation, which bears a striking resemblance to the map of Europe.

Our experimental setup is as follows. While the original dataset is very high dimensional, we use a 20 dimensional version of the dataset as found in the authors’ GitHubhttps://github.com/NovembreLab/Novembre_etal_2008_misc. We first randomly rotate the data, as then 20 dimensional data was diagonalized, and the high dimensional data does not follow such structure. We then add an additional ε1ε\frac{\varepsilon}{1-\varepsilon} fraction of points (so that they make up an ε\varepsilon-fraction of the final points). These added points were discrete points, following a simple product distribution (see Section B.1 for full details). We used a number of methods to obtain a covariance matrix for this dataset, and we projected the data onto the top two singular vectors of this matrix. In Figure 3, we show the results when we compare our techniques to pruning. In particular, our output was able to more or less reproduce the map of Europe, whereas pruning fails to. In Section B.2, we also compare our result with a number of other techniques, including those we tested against in the unknown covariance experiments, and other robust PCA techniques. The only alternative algorithm which was able to produce meaningful output was LRVCov, which produced output that was similar to ours, but which produced a map which was somewhat more skewed. We believe that our algorithm produces the best picture.

In Figure 3, we also display the actual points which were output by our algorithm’s Filter. While it manages to remove most of the noise points, it also seems to remove some of the true data points, particularly those from Eastern Europe and Turkey. We attribute this to a lack of samples from these regions, and thus one could consider them as outliers to a dataset consisting of Western European individuals. For instance, Turkey had 4 data points, so it seems quite reasonable that any robust algorithm would naturally consider these points outliers.

We view our experiments as a proof of concept demonstration that our techniques can be useful in real world exploratory data analysis tasks, particularly those in high-dimensions. Our experiments reveal that a minimal amount of noise can completely disrupt a data analyst’s ability to notice an interesting phenomenon, thus limiting us to only very well-curated data sets. But with robust methods, this noise does not interfere with scientific discovery, and we can still recover interesting patterns which otherwise would have been obscured by noise.

Acknowledgments

We would like to thank Simon Du and Lili Su for helpful comments on a previous version of this work.

References

Appendix A Omitted Details from Section 3

In this section, we use our filter technique to give a near sample-optimal computationally efficient algorithm to robustly estimate the mean of a sub-gaussian density with a known covariance matrix, thus proving Theorem 3.1.

We emphasize that the algorithm and its analysis is essentially identical to the filtering algorithm given in Section 8.1 of [DKK+16] for the case of a Gaussian N(μ,I)\mathcal{N}(\mu,I). The only difference is a weaker definition of the “good set of samples” (Definition A.4) and a simple concentration argument (Lemma A.5) showing that a random set of uncorrupted samples of the appropriate size is good with high probability. Given these, the analysis of this subsection follows straightforwardly from the analysis in Section 8.1 of [DKK+16] by plugging in the modified parameters. For the sake of completeness, we provide the details below.

We start by formally defining sub-gaussian distributions:

We will use the following simple fact about the concentration of sub-gaussian random variables:

The following theorem is a high probability version of Theorem 3.1:

Notation. We will denote μS=1SXSX\mu^{S}=\frac{1}{|S|}\sum_{X\in S}X and MS=1SXS(XμG)(XμG)TM_{S}=\frac{1}{|S|}\sum_{X\in S}(X-\mu^{G})(X-\mu^{G})^{T} for the sample mean and modified sample covariance matrix of the set SS.

We start by defining our modified notion of good sample, i.e, a set of conditions on the uncorrupted set of samples under which our algorithm will succeed.

For all xSx\in S we have xμG2O(dlog(S/τ))\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log(|S|/\tau)}).

We have that μSμG2ε.\|\mu^{S}-\mu^{G}\|_{2}\leq\varepsilon.

We have that MSI2ε.\left\|M_{S}-I\right\|_{2}\leq{\varepsilon}.

We show in the following subsection that a sufficiently large set of independent samples from GG is (ε,τ)(\varepsilon,\tau)-good (with respect to GG) with high probability. Specifically, we prove:

Let GG be sub-gaussian distribution with parameter ν=Θ(1)\nu=\Theta(1) and with identity covariance, and ε,τ>0.\varepsilon,\tau>0. If the multiset SS is obtained by taking Ω((d/ε2)polylog(d/ετ))\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau)) independent samples from G,G, it is (ε,τ)(\varepsilon,\tau)-good with respect to GG with probability at least 1τ.1-\tau.

We require the following definition that quantifies the extent to which a multiset has been corrupted:

Given finite multisets SS and SS^{\prime} we let Δ(S,S)\Delta(S,S^{\prime}) be the size of the symmetric difference of SS and SS^{\prime} divided by the cardinality of S.S.

The starting point of our algorithm will be a simple NaivePrune routine (Section 4.3.1 of [DKK+16]) that removes obvious outliers, i.e., points which are far from the mean. Then, we iterate the algorithm whose performance guarantee is given by the following:

A mean vector μ^{\widehat{\mu}} such that μ^μG2=O(εlog(1/ε)).\|{\widehat{\mu}}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}).

We start by showing how Theorem A.3 follows easily from Proposition A.7.

By the definition of Δ(S,S),\Delta(S,S^{\prime}), since SS^{\prime} has been obtained from SS by corrupting an ε\varepsilon-fraction of the points in S,S, we have that Δ(S,S)2ε.\Delta(S,S^{\prime})\leq 2\varepsilon. By Lemma A.5, the set SS of uncorrupted samples is (ε,τ)(\varepsilon,\tau)-good with respect to GG with probability at least 1τ.1-\tau. We henceforth condition on this event.

Since SS is (ε,τ)(\varepsilon,\tau)-good, all xSx\in S have xμG2O(dlogS/τ)\|x-\mu^{G}\|_{2}\leq O(\sqrt{d\log|S|/\tau}). Thus, the NaivePrune procedure does not remove from SS^{\prime} any member of SS. Hence, its output, SS^{\prime\prime}, has Δ(S,S)Δ(S,S)\Delta(S,S^{\prime\prime})\leq\Delta(S,S^{\prime}) and for any xSx\in S^{\prime\prime}, there is a ySy\in S with xy2O(dlogS/τ)\|x-y\|_{2}\leq O(\sqrt{d\log|S|/\tau}). By the triangle inequality, for any x,zSx,z\in S^{\prime\prime}, xz2O(dlogS/τ)=O(dlog(d/ετ))\|x-z\|_{2}\leq O(\sqrt{d\log|S|/\tau})=O(\sqrt{d\log(d/\varepsilon\tau})).

Then, we iteratively apply the Filter-Sub-Gaussian-Unknown-Mean procedure of Proposition A.7 until it terminates returning a mean vector μ\mu with μ^μG2=O(εlog(1/ε)).\|{\widehat{\mu}}-\mu^{G}\|_{2}=O(\varepsilon\sqrt{\log(1/\varepsilon)}). We claim that we need at most O(α)O(\alpha) iterations for this to happen. Indeed, the sequence of iterations results in a sequence of sets Si,S_{i}^{\prime}, so that Δ(S,Si)Δ(S,S)iε/α.\Delta(S,S_{i}^{\prime})\leq\Delta(S,S^{\prime})-i\cdot\varepsilon/{\alpha}. Thus, if we do not output the empirical mean in the first 2α2\alpha iterations, in the next iteration there are no outliers left and the algorithm terminates outputting the sample mean of the remaining set. ∎

In this subsection, we describe the efficient algorithm establishing Proposition A.7 and prove its correctness. Our algorithm calculates the empirical mean vector μS\mu^{S^{\prime}} and empirical covariance matrix Σ\Sigma. If the matrix Σ\Sigma has no large eigenvalues, it returns μS.\mu^{S^{\prime}}. Otherwise, it uses the eigenvector vv^{\ast} corresponding to the maximum magnitude eigenvalue of Σ\Sigma and the mean vector μS\mu^{S^{\prime}} to define a filter. Our efficient filtering procedure is presented in detailed pseudocode below.

A.1.2 Proof of Correctness of Filter-Sub-Gaussian-Unknown-Mean

We define μG,μS,μS,μL,\mu^{G},\mu^{S},\mu^{S^{\prime}},\mu^{L}, and μE\mu^{E} to be the means of G,S,S,L,G,S,S^{\prime},L, and EE, respectively.

Our analysis will make essential use of the following matrices:

Our analysis will hinge on proving the important claim that ΣI\Sigma-I is approximately (E/S)ME(|E|/|S^{\prime}|)M_{E}. This means two things for us. First, it means that if the positive errors align in some direction (causing MEM_{E} to have a large eigenvalue), there will be a large eigenvalue in ΣI\Sigma-I. Second, it says that any large eigenvalue of ΣI\Sigma-I will correspond to an eigenvalue of MEM_{E}, which will give an explicit direction in which many error points are far from the empirical mean.

Useful Structural Lemmas. We begin by noting that we have concentration bounds on GG and therefore, on SS due to its goodness.

The first line is Fact A.2, and the former follows from it using the goodness of SS. ∎

By using the above fact, we obtain the following simple claim:

This follows from Fact A.8 upon noting that w(XμS)>T+μSμG2|w\cdot(X-\mu^{S^{\prime}})|>T+\|\mu^{S^{\prime}}-\mu^{G}\|_{2} only if w(XμG)>T|w\cdot(X-\mu^{G})|>T. ∎

We can use the above facts to prove concentration bounds for LL. In particular, we have the following lemma:

We have that ML2=O(log(S/L)+εS/L).\|M_{L}\|_{2}={O\left(\log(|S|/|L|)+\varepsilon|S|/|L|\right)}.

For unit vectors vv, the RHS is bounded from above as follows:

where the second line follows from the fact that v2=1\|v\|_{2}=1, LSL\subset S, and SS satisfies condition (i) of Definition A.4; the third line follows from (1); and the fourth line follows from Fact A.8. ∎

As a corollary, we can relate the matrices MSM_{S^{\prime}} and MEM_{E}, in spectral norm:

We have that MSI=(E/S)ME+O(εlog(1/ε))M_{S^{\prime}}-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon)), where the O(εlog(1/ε))O(\varepsilon\log(1/\varepsilon)) term denotes a matrix of spectral norm O(εlog(1/ε))O(\varepsilon\log(1/\varepsilon)).

By definition, we have that SMS=SMSLML+EME|S^{\prime}|M_{S^{\prime}}=|S|M_{S}-|L|M_{L}+|E|M_{E}. Thus, we can write

where the second line uses the fact that 12εS/S1+2ε1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon, the goodness of SS (condition (iv) in Definition A.4), and Lemma A.10. Specifically, Lemma A.10 implies that (L/S)ML2=O(εlog(1/ε))(|L|/|S^{\prime}|)\|M_{L}\|_{2}=O(\varepsilon\log(1/\varepsilon)). Therefore, we have that

We now establish a similarly useful bound on the difference between the mean vectors:

Since SS is a good set, by condition (iii) of Definition A.4, we have μSμG2=O(ε)\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon). Since 12εS/S1+2ε1-2\varepsilon\leq|S|/|S^{\prime}|\leq 1+2\varepsilon, it follows that (S/S)μSμG2=O(ε)(|S|/|S^{\prime}|)\|\mu^{S}-\mu^{G}\|_{2}=O(\varepsilon). Using the valid inequality ML2μLμG22\|M_{L}\|_{2}\geq\|\mu^{L}-\mu^{G}\|_{2}^{2} and Lemma A.10, we obtain that μLμG2O(log(S/L)+εS/L)\|\mu^{L}-\mu^{G}\|_{2}\leq O\left(\sqrt{\log(|S|/|L|)}+\sqrt{\varepsilon|S|/|L|}\right). Therefore,

as desired. This completes the proof of the lemma. ∎

By combining the above, we can conclude that ΣI\Sigma-I is approximately proportional to MEM_{E}. More formally, we obtain the following corollary:

We have ΣI=(E/S)ME+O(εlog(1/ε))+O(E/S)2ME2\Sigma-I=(|E|/|S^{\prime}|)M_{E}+O(\varepsilon\log(1/\varepsilon))+O(|E|/|S^{\prime}|)^{2}\|M_{E}\|_{2}, where the additive terms denote matrices of appropriately bounded spectral norm.

By definition, we can write ΣI=MSI(μSμG)(μSμG)T.\Sigma-I=M_{S^{\prime}}-I-(\mu^{S^{\prime}}-\mu^{G})(\mu^{S^{\prime}}-\mu^{G})^{T}. Using Corollary A.11 and Lemma A.12, we obtain:

where the second line follows from the valid inequality ME2μEμG22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}. This completes the proof. ∎

On the other hand, since ME2μEμG22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2}, Lemma A.12 gives that

Case of Large Spectral Norm. We next show the correctness of the algorithm when it returns a filter in Step 7.

Moreover, using the inequality ME2μEμG22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{G}\|_{2}^{2} and Lemma A.12 as above, we get that

Suppose for the sake of contradiction that for all T>0T>0 we have that

Using (3), we obtain that for all T>0T>0 we have that

We now have the following sequence of inequalities:

Combined with (2), we obtain λ=O(εlog(1/ε))\lambda^{\ast}=O(\varepsilon\log(1/\varepsilon)), which is a contradiction if CC is sufficiently large. Therefore, it must be the case that for some value of TT the condition in Step 7 is satisfied.

Recall that S=(SL)E,S^{\prime}=(S\setminus L)\cup E, with EE and LL disjoint multisets such that LS.L\subset S. We can similarly write S=(SL)E,S^{\prime\prime}=(S\setminus L^{\prime})\cup E^{\prime}, with LLL^{\prime}\supseteq L and EE.E^{\prime}\subset E. Since

it suffices to show that EELL+εS/α.|E\setminus E^{\prime}|\geq|L^{\prime}\setminus L|+\varepsilon|S|/{\alpha}. Note that LL|L^{\prime}\setminus L| is the number of points rejected by the filter that lie in SS.S\cap S^{\prime}. Note that the fraction of elements of SS that are removed to produce SS^{\prime\prime} (i.e., satisfy v(xμS)>T+δ|v^{\ast}\cdot(x-\mu^{S^{\prime}})|>T+\delta) is at most 2exp(T2/2ν)+ε/α2\exp(-T^{2}/2\nu)+\varepsilon/{\alpha}. This follows from Claim A.9 and the fact that T=O(dlog(d/ετ))T=O(\sqrt{d\log(d/\varepsilon\tau)}).

Hence, it holds that LL(2exp(T2/2ν)+ε/α)S.|L^{\prime}\setminus L|\leq(2\exp(-T^{2}/2\nu)+\varepsilon/{\alpha})|S|. On the other hand, Step 7 of the algorithm ensures that the fraction of elements of SS^{\prime} that are rejected by the filter is at least 8exp(T2/2ν)+8ε/α)8\exp(-T^{2}/2\nu)+8\varepsilon/{\alpha}). Note that EE|E\setminus E^{\prime}| is the number of points rejected by the filter that lie in SS.S^{\prime}\setminus S. Therefore, we can write:

where the second line uses the fact that SS/2|S^{\prime}|\geq|S|/2 and the last line uses the fact that LL/S2exp(T2/2ν)+ε/α.|L^{\prime}\setminus L|/|S|\leq 2\exp(-T^{2}/2\nu)+\varepsilon/{\alpha}. Noting that log(d/ετ)1\log(d/\varepsilon\tau)\geq 1, this completes the proof of the claim. ∎

A.1.3 Proof of Lemma A.5

Let N=Ω((d/ε2)polylog(d/ετ))N=\Omega((d/\varepsilon^{2})\operatorname{poly}\log(d/\varepsilon\tau)) be the number of samples drawn from GG. For (i), the probability that a coordinate of a sample is at least 2νlog(Nd/3τ)\sqrt{2\nu\log(Nd/3\tau)} is at most τ/3dN\tau/3dN by Fact A.2. By a union bound, the probability that all coordinates of all samples are smaller than 2νlog(Nd/3τ)\sqrt{2\nu\log(Nd/3\tau)} is at least 1τ/31-\tau/3. In this case, x22νdlog(Nd/3τ)=O(dνlog(Nν/τ))\|x\|_{2}\leq\sqrt{2\nu d\log(Nd/3\tau)}=O(\sqrt{d\nu\log(N\nu/\tau)}).

After translating by μG\mu^{G}, we note that (iii) follows immediately from Lemmas 4.3 of [DKK+16] and (iv) follows from Theorem 5.50 of [Ver10], as long as N=Ω(ν4dlog(1/τ)/ε2)N=\Omega(\nu^{4}d\log(1/\tau)/\varepsilon^{2}), with probability at least 1τ/31-\tau/3. It remains to show that, conditioned on (i), (ii) holds with probability at least 1τ/31-\tau/3.

To simplify some expressions, let δ:=ε/(log(dlogd/ετ))\delta:=\varepsilon/(\log(d\log d/\varepsilon\tau)) and R=Cdlog(S/τ)R=C\sqrt{d\log(|S|/\tau)}. We need to show that for all unit vectors vv and all 0TR0\leq T\leq R that

Firstly, we show that for all unit vectors vv and T>0T>0

with probability at least 1τ/61-\tau/6. Since the VC-dimension of the set of all halfspaces is d+1d+1, this follows from the VC inequality [DL01], since we have more than Ω(d/(δ/(10νlog(1/δ))2)\Omega(d/(\delta/(10\nu\log(1/\delta))^{2}) samples. We thus only need to consider the case when T10νln(1/δ)T\geq\sqrt{10\nu\ln(1/\delta)}.

For any fixed unit vector vv and T>10νln(1/δ)T>\sqrt{10\nu\ln(1/\delta)}, except with probability exp(Nδ/(6Cν))\exp(-N\delta/(6C\nu)), we have that

Let EE be the event that v(XμG)>T|v\cdot(X-\mu^{G})|>T. Since GG is sub-gaussian, Fact A.2 yields that PrG[E]=PrYG[v(XμG)>T]exp(T2/(2ν))\Pr_{G}[E]=\Pr_{Y\sim G}[|v\cdot(X-\mu^{G})|>T]\leq\exp(-T^{2}/(2\nu)). Note that, thanks to our assumption on TT, we have that Texp(T2/(4ν))/2CT\leq\exp(T^{2}/(4\nu))/2C, and therefore T2PrG[E]exp(T2/(4ν))/2Cδ/2CT^{2}\Pr_{G}[E]\leq\exp(-T^{2}/(4\nu))/2C\leq\delta/2C.

where (a) follows from sub-gaussianity, (b) follows from our choice of TT, and (c) comes from the fact that 1+xex1+x\leq e^{x} for all xx.

Thus, if δ\delta is a sufficiently small constant and CC is sufficiently large, this yields the desired bound. ∎

Now let C\mathcal{C} be a 1/21/2-cover in Euclidean distance for the set of unit vectors of size 2O(d)2^{O(d)}. By a union bound, for all vCv^{\prime}\in\mathcal{C} and TT^{\prime} a power of 2 between 4νln(1/δ)\sqrt{4\nu\ln(1/\delta)} and RR, we have that

Then, by a union bound, (6) holds simultaneously for all unit vectors vv and all 0TR0\leq T\leq R, with probability a least 1τ/31-\tau/3. This completes the proof. ∎

A.2 Robust Mean Estimation Under Second Moment Assumptions

In this section, we use our filtering technique to give a near sample-optimal computationally efficient algorithm to robustly estimate the mean of a density with a second moment assumption. We show:

Note that Theorem 3.2 follows straightforwardly from the above (divide every sample by σ\sigma, run the algorithm of Theorem A.16, and multiply its output by σ\sigma).

We would like our good set of samples to have mean close to that of PP and bounded variance in all directions. This motivates the following definition:

We call a set SS ε\varepsilon-good for a distribution PP with mean μP\mu^{P} and covariance ΣPI\Sigma_{P}\preceq I if the mean μS\mu^{S} and covariance ΣS\Sigma^{S} of SS satisfy μSμP2ε\|\mu^{S}-\mu^{P}\|_{2}\leq\sqrt{\varepsilon} and ΣS22\|\Sigma^{S}\|_{2}\leq 2.

However, since we have no assumptions about higher moments, it may be be possible for outliers to affect our sample covariance too much. Fortunately, such outliers have small probability and do not contribute too much to the mean, so we will later reclassify them as errors.

Let SS be N=Θ((d/ε)logd)N=\Theta((d/\varepsilon)\log d) samples drawn from PP. Then, with probability at least 9/109/10, a random XuSX\in_{u}S satisfies

PrS[XμP280d/ε]ε/160\Pr_{S}\left[\|X-\mu^{P}\|_{2}\geq 80\sqrt{d/\varepsilon}\right]\leq\varepsilon/160,

for YPY\sim P. By Markov’s inequality, Pr[YμP280d/ε]ε/160\Pr[\|Y-\mu^{P}\|_{2}\geq 80\sqrt{d/\varepsilon}]\leq\varepsilon/160 with probability at least 39/4039/40.

where (a) follows from Cauchy-Schwarz, and (b) follows from the definition of the covariance, and (c) follows from the assumption that ΣPI\Sigma_{P}\preceq I and from Markov’s inequality.

For (iv), we require the following Matrix Chernoff bound:

We apply this lemma with Xk=(xkμP)(xkμP)T1xkμP280d/εX_{k}=(x_{k}-\mu^{P})(x_{k}-\mu^{P})^{T}1_{\|x_{k}-\mu^{P}\|_{2}\leq 80\sqrt{d/\varepsilon}} for {x1,,xN}=S\{x_{1},\dots,x_{N}\}=S. Note that Xk2(80)2d/ε=L\|X_{k}\|_{2}\leq(80)^{2}d/\varepsilon=L and that μmaxNΣP2N\mu^{\max}\leq N\|\Sigma_{P}\|_{2}\leq N.

Suppose that μmaxN/80\mu^{\max}\leq N/80. Then, taking θ=1\theta=1, we have

By Markov’s inequality, except with probability 39/4039/40, we have kXk2N+O(dlog(d)/ε)3N/2\|\sum_{k}X_{k}\|_{2}\leq N+O(d\log(d)/\varepsilon)\leq 3N/2, for NN a sufficiently high multiple of dlog(d)/εd\log(d)/\varepsilon.

Suppose that μmaxN/80\mu^{\max}\geq N/80, then we take δ=1/2\delta=1/2 and obtain

For NN a sufficiently high multiple of dlog(d)/εd\log(d)/\varepsilon, we get that Pr[kXk23μmax/2]1/40\Pr[\left\|\sum_{k}X_{k}\right\|_{2}\geq 3\mu^{\max}/2]\leq 1/40. Since μmaxN\mu^{\max}\leq N, we have with probability at least 39/4039/40, kXk23N/2\left\|\sum_{k}X_{k}\right\|_{2}\leq 3N/2.

Now we can get a 2ε2\varepsilon-corrupted good set from an ε\varepsilon-corrupted set of samples satisfying Lemma A.18, by reclassifying outliers as errors:

Let S=RELS=R\cup E\setminus L, where RR is a set of N=Θ(dlogd/ε)N=\Theta(d\log d/\varepsilon) samples drawn from PP and EE and LL are disjoint sets with E,Lε|E|,|L|\leq\varepsilon. Then, with probability 9/109/10, we can also write S=GELS=G\cup E^{\prime}\setminus L^{\prime}, where GRG\subseteq R is ε\varepsilon-good, LLL^{\prime}\subseteq L and EEE^{\prime}\subseteq E^{\prime} has E2εS|E^{\prime}|\leq 2\varepsilon|S|.

Let G={xR:x280d/ε}G=\{x\in R:\|x\|_{2}\leq 80\sqrt{d/\varepsilon}\}. Condition on the event that RR satisfies Lemma A.18. By Lemma A.18, this occurs with probability at least 9/109/10.

Since RR satisfies (ii) of Lemma A.18, GRεR/160εS|G|-|R|\leq\varepsilon|R|/160\leq\varepsilon|S|. Thus, E=E(RG)E^{\prime}=E\cup(R\setminus G) has E3ε/2|E^{\prime}|\leq 3\varepsilon/2. Note that (iv) of Lemma A.18 for RR in terms of GG is exactly GΣG2/R3/2|G|\|\Sigma^{G}\|_{2}/|R|\leq 3/2, and so ΣG23R/(2G)2\|\Sigma^{G}\|_{2}\leq 3|R|/(2|G|)\leq 2.

It remains to check that μGμP2ε\|\mu^{G}-\mu^{P}\|_{2}\leq\sqrt{\varepsilon}. We have

where the last line follows from (iii) of Lemma A.18. Since we argued above that R/G2/3|R|/|G|\geq 2/3, dividing this expression by G|G| yields the desired claim.

An iteration of FilterUnder2ndMoment may throw out more samples from GG than corrupted samples. However, in expectation, we throw out many more corrupted samples than from the good set:

If GG is an ε\varepsilon-good set with x40d/εx\leq 40\sqrt{d/\varepsilon} for xSGx\in S\cup G, then MG22μGμS22+2  .\|M_{G}\|_{2}\leq 2\|\mu^{G}-\mu^{S}\|_{2}^{2}+2\;.

We have that LML22G(1+μGμS22)  .|L|\|M_{L}\|_{2}\leq 2|G|(1+\|\mu^{G}-\mu^{S}\|_{2}^{2})\;.

Since LGL\subseteq G, for any unit vector vv, we have

μGμS22εMS2+12ε\|\mu^{G}-\mu^{S}\|_{2}\leq\sqrt{2\varepsilon\|M_{S}\|_{2}}+12\sqrt{\varepsilon}.

We have that EMESMS+LML|E|M_{E}\leq|S|M_{S}+|L|M_{L} and so

By Cauchy Schwarz, we have that ME2μEμS22\|M_{E}\|_{2}\geq\|\mu^{E}-\mu^{S}\|_{2}^{2}, and so

By Cauchy-Schwarz and Lemma A.23, we have that

Since SμS=GμG+EμELμL|S|\mu^{S}=|G|\mu^{G}+|E|\mu^{E}-|L|\mu^{L} and S=G+EL|S|=|G|+|E|-|L|, we get

Since for x,y>0x,y>0, x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}, we have

Since GSεS||G|-|S||\leq\varepsilon|S| and E2εS,L9εS|E|\leq 2\varepsilon|S|,|L|\leq 9\varepsilon|S|, we have

Moving the μGμS2\|\mu^{G}-\mu^{S}\|_{2} terms to the LHS, using 6ε1/26\sqrt{\varepsilon}\leq 1/2, gives

Since λ=MS2\lambda^{\ast}=\|M_{S}\|_{2}, the correctness if we return the empirical mean is immediate.

If λ9\lambda^{\ast}\leq 9, we have that μGμS2=O(ε)\|\mu^{G}-\mu^{S}\|_{2}=O(\sqrt{\varepsilon}).

From now on, we assume λ>9\lambda^{\ast}>9. In this case we have μGμS22O(ελ)\|\mu^{G}-\mu^{S}\|_{2}^{2}\leq O(\varepsilon\lambda^{\ast}). Using Lemma A.22, we have

for sufficiently small ε\varepsilon. Thus, we have that

Now we can show that in expectation, we throw out many more corrupted points from EE than from GLG\setminus L:

Let a=maxxSvxμSa=\max_{x\in S}|v^{\ast}\cdot x-\mu^{S}|. Firstly, we look at the expected number of samples we reject:

Next, we look at the expected number of false positive samples we reject, i.e., those in LLL^{\prime}\setminus L.

If λ9\lambda^{\ast}\leq 9, then we return the mean in Step 5, and by Corollary A.25, μSμP2O(ε)\|\mu^{S}-\mu^{P}\|_{2}\leq O(\sqrt{\varepsilon}).

Our input is a set SS of N=Θ((d/ε)logd)N=\Theta((d/\varepsilon)\log d) ε\varepsilon-corrupted samples so that with probability 9/109/10, SS is a 2ε2\varepsilon-corrupted set of ε\varepsilon-good samples for PP by Lemmas A.18 and A.20. We have a set S=GELS=G\cup E^{\prime}\setminus L, where GG^{\prime} is an ε\varepsilon-good set, E2ε|E|\leq 2\varepsilon, and Lε|L|\leq\varepsilon. Then, we iteratively apply FilterUnder2ndMoment until it outputs an approximation to the mean. Since each iteration removes a sample, this must happen within NN iterations. The algorithm takes at most poly(N,d)=poly(d,1/ε)\operatorname{poly}(N,d)=\operatorname{poly}(d,1/\varepsilon) time.

As long as we can show that the conditions of Proposition A.21 hold in each iteration, it ensures that μSμP2O(ε)\|\mu^{S}-\mu^{P}\|_{2}\leq O(\sqrt{\varepsilon}). However, the condition that L9εS|L|\leq 9\varepsilon|S| need not hold in general. Although in expectation we reject many more samples in EE than GG, it is possible that we are unlucky and reject many samples in GG, which could make LL large in the next iteration. Thus, we need a bound on the probability that we ever have L>9ε|L|>9\varepsilon.

By a union bound, the probability that the uncorrupted samples satisfy Lemma A.18 and Proposition A.21 applies to every iteration is at least 9/101/62/39/10-1/6\geq 2/3. Thus, with at least 2/32/3 probability, the algorithm outputs a vector μ^{\widehat{\mu}} with μ^μP2O(ε)\|{\widehat{\mu}}-\mu^{P}\|_{2}\leq O(\sqrt{\varepsilon}). ∎

A.3 Robust Covariance Estimation

In this subsection, we give a near sample-optimal efficient robust estimator for the covariance of a zero-mean Gaussian density, thus proving Theorem 3.3. Our algorithm is essentially identical to the filtering algorithm given in Section 8.2 of [DKK+16]. As in Section A.1 the only difference is a weaker definition of the “good set of samples” (Definition A.27) and a concentration argument (Lemma A.28) showing that a random set of uncorrupted samples of the appropriate size is good with high probability. Given these, the analysis of this subsection follows straightforwardly from the analysis in Section 8.2 of [DKK+16] by plugging in the modified parameters.

The algorithm Filter-Gaussian-Unknown-Covariance to robustly estimate the covariance of a mean Gaussian in [DKK+16] is as follows:

Firstly, we state the new, weaker, definition of a good set:

For all xSx\in S, xTΣ1x<d+O(dlog(d/ε))x^{T}\Sigma^{-1}x<d+O(\sqrt{d}\log(d/\varepsilon)).

We have that Σ1/2\mboxCov(S)Σ1/2IF=O(ε)\|\Sigma^{-1/2}\mbox{Cov}(S)\Sigma^{-1/2}-I\|_{F}=O(\varepsilon).

For all even degree-22 polynomials pp, we have that Var(p(S))=Var(p(G))(1+O(ε))\operatorname*{Var}(p(S))=\operatorname*{Var}(p(G))(1+O(\varepsilon)).

Let NN be a sufficiently large constant multiple of d2log5(d/ε)/ε2d^{2}\log^{5}(d/\varepsilon)/\varepsilon^{2}. Then a set SS of NN independent samples from GG is ε\varepsilon-good with respect to GG with high probability.

First, note that it suffices to prove this when G=N(0,I)G=N(0,I).

Condition 1 follows by standard concentration bounds on x22\|x\|_{2}^{2}.

Condition 2 follows by estimating the entry-wise error between \mboxCov(S)\mbox{Cov}(S) and II.

Condition 4 is substantially the most difficult of these conditions to prove. Naively, we would want to find a cover of all possible pp and all possible TT, and bound the probability that the desired condition fails. Unfortunately, the best a priori bound on Pr(p(G)>T)\Pr(|p(G)|>T) are on the order of exp(T)\exp(-T). As our cover would need to be of size 2d22^{d^{2}} or so, to make this work with T=dT=d, we would require on the order of d3d^{3} samples in order to make this argument work.

However, we will note that this argument is sufficient to cover the case of T<10log(1/ε)log2(d/ε)T<10\log(1/\varepsilon)\log^{2}(d/\varepsilon).

Let Pk\mathcal{P}_{k} be the set of even, mean-, degree-22 polynomials, so that the associated matrix AA satisfies:

Note that for pPkp\in\mathcal{P}_{k} that p(x)x2/k+k|p(x)|\leq|x|^{2}/\sqrt{k}+\sqrt{k}.

Importantly, any polynomial can be written in terms of these sets.

Let AA be the associated matrix to pp. Note that AF=Varp=1\|A\|_{F}=\operatorname*{Var}{p}=1. Let AkA_{k} be the matrix corresponding to the top kk eigenvalues of AA. We now let p1p_{1} be the polynomial associated to A1/2A_{1}/2, p2p_{2} be associated to (A2A1)/2(A_{2}-A_{1})/2, p4p_{4} be associated to (A4A2)/2(A_{4}-A_{2})/2, and so on. It is clear that p=2(p1+p2++p2t+pd)p=2(p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d}). It is also clear that the matrix associated to pkp_{k} has rank at most kk. If the matrix associated to pkp_{k} had an eigenvalue more than 1/k1/\sqrt{k}, it would need to be the case that the k/2ndk/2^{nd} largest eigenvalue of AA had size at least 2/k2/\sqrt{k}. This is impossible since the sum of the squares of the eigenvalues of AA is at most 11.

We will also need covers of each of these sets Pk\mathcal{P}_{k}.

For each kk, there exists a set CkPk\mathcal{C}_{k}\subset\mathcal{P}_{k} so that

For each pPkp\in\mathcal{P}_{k} there exists a qCkq\in\mathcal{C}_{k} so that p(G)q(G)2(ε/d)2\|p(G)-q(G)\|_{2}\leq(\varepsilon/d)^{2}.

Ck=2O(dklog(d/ε)).|\mathcal{C}_{k}|=2^{O(dk\log(d/\varepsilon))}.

We note that any such pp is associated to a matrix AA of the form A=i=1kλiviviTA=\sum_{i=1}^{k}\lambda_{i}v_{i}v_{i}^{T}, for λi[0,1/k]\lambda_{i}\in[0,1/\sqrt{k}] and viv_{i} orthonormal. It suffices to let qq correspond to the matrix A=i=1kμiwiwiTA^{\prime}=\sum_{i=1}^{k}\mu_{i}w_{i}w_{i}^{T} for with λiμi<(ε/d)3|\lambda_{i}-\mu_{i}|<(\varepsilon/d)^{3} and viwi<(ε/d)3|v_{i}-w_{i}|<(\varepsilon/d)^{3} for all ii. It is easy to let μi\mu_{i} and wiw_{i} range over covers of the interval and the sphere with appropriate errors. This gives a set of possible qq’s of size 2O(dklog(d/ε))2^{O(dk\log(d/\varepsilon))} as desired. Unfortunately, some of these qq will not be in Pk\mathcal{P}_{k} as they will have eigenvalues that are too large. However, this is easily fixed by replacing each such qq by the closest element of Pk\mathcal{P}_{k}. This completes our proof. ∎

We next will show that these covers are sufficient to express any polynomial.

Combining the above two lemmas we have that any such pp can be written as

where qkq_{k} above is in Ck\mathcal{C}_{k} and pk(G)2<(ε/d)2\|p_{k}(G)\|_{2}<(\varepsilon/d)^{2}. Thus, p=p1+p2++p2t+pdp^{\prime}=p_{1}+p_{2}+\ldots+p_{2^{t}}+p_{d} has p(G)2(ε/d)\|p^{\prime}(G)\|_{2}\leq(\varepsilon/d). This completes the proof. ∎

The key observation now is that if p(x)T|p(x)|\geq T for x2d/ε\|x\|_{2}\leq\sqrt{d/\varepsilon}, then writing p=q1+q2+q4++qd+pp=q_{1}+q_{2}+q_{4}+\ldots+q_{d}+p^{\prime} as above, it must be the case that qk(x)>(T1)/(2log(d))|q_{k}(x)|>(T-1)/(2\log(d)) for some kk. Therefore, to prove our main result, it suffices to show that, with high probability over the choice of SS, for any T10log(1/ε)log2(d/ε)T\geq 10\log(1/\varepsilon)\log^{2}(d/\varepsilon) and any qCkq\in\mathcal{C}_{k} for some kk, that PrxuS(q(x)>T/(2log(d)))<ε/(2T2log2(T)log(d))\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log(d)). Equivalently, it suffices to show that for T10log(1/ε)log(d/ε)T\geq 10\log(1/\varepsilon)\log(d/\varepsilon) it holds PrxuS(q(x)>T/(2log(d)))<ε/(2T2log2(T)log2(d))\Pr_{x\in_{u}S}(|q(x)|>T/(2\log(d)))<\varepsilon/(2T^{2}\log^{2}(T)\log^{2}(d)). Note that this holds automatically for T>(d/ε)T>(d/\varepsilon), as p(x)p(x) cannot possibly be that large for x2d/ε\|x\|_{2}\leq\sqrt{d/\varepsilon}. Furthermore, note that losing a constant factor in the probability, it suffices to show this only for TT a power of 22.

Therefore, it suffices to show for every kdk\leq d, every qCkq\in\mathcal{C}_{k} and every d/kεTlog(1/ε)log(d/ε)d/\sqrt{k\varepsilon}\gg T\gg\log(1/\varepsilon)\log(d/\varepsilon) that with probability at least 12Ω(dklog(d/ε))1-2^{-\Omega(dk\log(d/\varepsilon))} over the choice of SS we have that PrxuS(q(x)>T)ε/(T2log4(d/ε))\Pr_{x\in_{u}S}(|q(x)|>T)\ll\varepsilon/(T^{2}\log^{4}(d/\varepsilon)). However, by the Hanson-Wright inequality, we have that

Therefore, by Chernoff bounds, the probability that more than a ε/(T2log4(d/ε))\varepsilon/(T^{2}\log^{4}(d/\varepsilon))-fraction of the elements of SS satisfy this property is at most

Appendix B Omitted Details from Section 5

Here we formally describe the distributions we used in our experiments. In all settings, our goal was to find noise distributions so that noise points were not “obvious” outliers, in the sense that there is no obvious pointwise pruning process which could throw away the noise points, which still gave the algorithms we tested the most difficulty. We again remark that while other algorithms had varying performances depending on the noise distribution, it seemed that the performance of ours was more or less unaffected by it.

Our uncorrupted points were generated by N(μ,I)\operatorname{\mathcal{N}}(\mu,I), where μ\mu is the all-ones vector. Our noise distribution is given as

where Π1\Pi_{1} is the product distribution over the hypercube where every coordinate is or 11 with probability 1/21/2, and Π2\Pi_{2} is a product distribution where the first coordinate is ether or 1212 with equal probability, the second coordinate is 2-2 or with equal probability, and all remaining coordinates are zero.

For the isotropic synthetic covariance experiment, our uncorrupted points were generated by N(0,I)\operatorname{\mathcal{N}}(0,I), and the noise points were all zeros. For the skewed synthetic covariance experiment, our uncorrupted points were generated by N(0,I+100e1e1T)\operatorname{\mathcal{N}}(0,I+100e_{1}e_{1}^{T}), where e1e_{1} is the first unit vector, and our noise points were generated as follows: we took a fixed random rotation of points of the form YiΠY_{i}\sim\Pi, where Π\Pi is a product distribution whose first d/2d/2 coordinates are each uniformly selected from {0.5,0,0.5}\{-0.5,0,0.5\}, and whose next d/21d/2-1 coordinates are each 0.8×Ai0.8\times A_{i}, where for each coordinate ii, AiA_{i} is an independent random integer between 2-2 and 22, and whose last coordinate is a uniformly random integer between $$.

We took the 20 dimensional data from [NJB+08], which was diagonalized, and randomly rotated it. This was to simulate the higher dimensional case, since the singular vectors that [NJB+08] obtained did not seem to be sparse or analytically sparse. Our noise was distributed as Π\Pi, where Π\Pi is a product distribution whose first d/2d/2 coordinates are each uniformly random integers between and 22 and whose last d/2d/2 coordinates are each uniformly randomly either 22 or 33, all scaled by a factor of 1/241/24.

B.2 Comparison with other robust PCA methods on semi-synthetic data

In addition to comparing our results with simple pruning techniques, as we did in Figure 3 in the main text, we also compared our algorithm with implementations of other robust PCA techniques from the literature with accessible implementations. In particular, we compared our technique with RANSAC-based techniques, LRVCov, two SDPs ([CLMW11, XCS10]) for variants of robust PCA, and an algorithm proposed by [CLMW11] to speed up their SDP based on alternating descent. For the SDPs, since black box methods were too slow to run on the full data set (as [CLMW11] mentions, black-box solvers for the SDPs are impractical above perhaps 100 data points), we subsample the data, and run the SDP on the subsampled data. For each of these methods, we ran the algorithm on the true data points plus noise, where the noise was generated as described above. We then take the estimate of the covariance it outputs, and project the data points onto the top two singular values of this matrix, and plot the results in Figure 4.

Similar results occurred for most noise patterns we tried. We found that only our algorithm and LRVCov were able to reasonably reconstruct Europe, in the presence of this noise. It is hard to judge qualitatively which of the two maps generated is preferable, but it seems that ours stretches the picture somewhat less than LRVCov.