Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers
Jacob Steinhardt, Moses Charikar, Gregory Valiant
Introduction
What are the fundamental properties that allow one to robustly learn from a dataset, even if some fraction of that dataset consists of arbitrarily corrupted data? While much work has been done in the setting of noisy data, or for restricted families of outliers, it is only recently that provable algorithms for learning in the presence of a large fraction of arbitrary (and potentially adversarial) data have been formulated in high-dimensional settings (klivans2009learning; xu2010principal; diakonikolas2016robust; lai2016agnostic; steinhardt2016avoiding; charikar2017learning). In this work, we formulate a conceptually simple criterion that a dataset can satisfy–resilience–which guarantees that properties such as the mean of that dataset can be robustly estimated even if a large fraction of additional arbitrary data is inserted.
The above game models mean estimation in the presence of arbitrary outliers; one can easily consider other problems as well (e.g. regression) but we will focus on mean estimation in this paper.
The resilience condition is essentially that the mean of every large subset of must be close to the mean of all of . More formally, for a norm , our criterion is as follows:
In the definition above, need not equal the mean of ; this distinction is useful in statistical settings where the sample mean of a finite set of points differs slightly from the true mean. However, \eqrefeq:resilience-intro implies that differs from the mean of by at most .
In the remainder of this section, we will outline our main results, starting with information-theoretic results and then moving on to algorithmic results. In Section 1.1, we show that resilience is indeed information-theoretically sufficient for robust mean estimation. In Section 1.2, we then provide finite-sample bounds showing that resilience holds with high probability for i.i.d. samples from a distribution.
In Section 1.3, we turn our attention to algorithmic bounds. We identity a property–bounded variance in the dual norm–under which efficient algorithms exist. We then show that, as long as the norm is strongly convex, every resilient set has a large subset with bounded variance, thus enabling efficient algorithms. This connection between resilience and bounded variance is the most technically non-trivial component of our results, and may be of independent interest.
Both our information-theoretic and algorithmic bounds yield new results in concrete settings, which we discuss in the corresponding subsections. In Section 1.4, we also discuss an extension of resilience to low-rank matrix approximation, which enables us to derive new bounds in that setting as well. In Section 1.5 we outline the rest of the paper and point to technical highlights, and in Section 1.6 we discuss related work.
First, we show that resilience is indeed information-theoretically sufficient for robust recovery of the mean . In what follows, we use to denote the smallest such that is -resilient.
More generally, if (even if ), it is possible to output a (random) such that with probability at least .
The first part says that robustness to an fraction of outliers depends on resilience to a fraction of deletions. Thus, Bob has a good strategy as long as is small.
The fact that estimation is possible even when was first established by steinhardt2016avoiding in a crowdsourcing setting, and later by charikar2017learning in a number of settings including mean estimation. Apart from being interesting due to its unexpectedness, estimation in this regime has immediate implications for robust estimation of mixtures of distributions (by considering each mixture component in turn as the “good” set ) or of planted substructures in random graphs. We refer the reader to charikar2017learning for a full elaboration of this point.
The proof of Proposition 1.2, given in detail in Section 2, is a pigeonhole argument. For the case, we simply search for any large resilient set and output its mean; then and must have large overlap, and by resilience their means must both be close to the mean of their intersection, and hence to each other.
2 Finite-Sample Concentration
While Proposition 1.2 provides a deterministic condition under which robust mean estimation is possible, we would also like a way of checking that resilience holds with high probability given samples from a distribution . First, we provide an alternate characterization of resilience which says that a distribution is resilient if it has thin tails in every direction:
In other words, if we project onto any unit vector in the dual norm, the -tail of must have mean at most . Thus, for instance, a distribution with variance at most along every unit vector would have . Note that Lemma 1.3 requires to be the mean, rather than an arbitrary vector as before.
We next provide a meta-result establishing that resilience of a population distribution very generically transfers to a finite set of samples from that distribution. The number of samples necessary depends on two quantities and that will be defined in detail later; for now we note that they are ways of measuring the effective dimension of the space.
Then, given samples , with probability there is a subset of of the such that is -resilient with \sigma^{\prime}=\mathcal{O}\Big{(}\sigma\cdot\Big{(}1+\sqrt{\frac{\log(M/\delta)}{\epsilon^{2}n}}+\frac{(B/\sigma)\log(M/\delta)}{n}\Big{)}\Big{)}.
Note that Proposition 1.4 only guarantees resilience on a -element subset of the , rather than all of . From the perspective of robust estimation, this is sufficient, as we can simply regard the remaining points as part of the “bad” points controlled by Alice. This weaker requirement seems to be actually necessary to achieve Proposition 1.4, and was also exploited in charikar2017learning to yield improved bounds for a graph partitioning problem. There has been a great deal of recent interest in showing how to “prune” samples to achieve faster rates in random matrix settings (guedon2014community; le2015concentration; rebrova2015coverings; rebrova2016norms), and we think the general investigation of such pruning results is likely to be fruitful.
We remark that the sample complexity in Proposition 1.4 is suboptimal in many cases, requiring roughly samples when samples would suffice. At the end of the next subsection we discuss a tighter but more specialized bound based on spectral graph sparsification.
The term in the sample complexity is likely loose, and we believe the true dependence on is at most . This looseness comes from Proposition 1.4, which uses a naïve covering argument and could potentially be improved with more sophisticated tools. Nevertheless, it is interesting that resilience holds long before the empirical th moments concentrate, which would require samples.
Stochastic block models. Finally, we consider the semi-random stochastic block model studied in charikar2017learning (described in detail in Section LABEL:sec:sbm-app). For a graph on vertices, this model posits a subset of “good” vertices, which are connected to each other with probability and to the other (“bad”) vertices with probability (where ); the connections among the bad vertices can be arbitrary. The goal is to recover the set .
In particular, we get non-trivial guarantees as long as . charikar2017learning derive a weaker (but computationally efficient) bound when , and remark on the similarity to the famous Kesten-Stigum threshold , which is the conjectured threshold for computationally efficient recovery in the classical stochastic block model (see decelle2011asymptotic for the conjecture, and mossel2013proof; massoulie2014community for a proof in the two-block case). Our information-theoretic upper bound matches the Kesten-Stigum threshold up to a factor. We conjecture that this upper bound is tight; some evidence for this is given in steinhardt2017clique, which provides a nearly matching information-theoretic lower bound when , .
3 Strong Convexity, Second Moments, and Efficient Algorithms
Most existing algorithmic results on robust mean estimation rely on analyzing the empirical covariance of the data in some way (see, e.g., lai2016agnostic; diakonikolas2016robust; balakrishnan2017sparse). In this section we establish connections between bounded covariance and resilience, and show that in a very general sense, bounded covariance is indeed sufficient to enable robust mean estimation.
Given a norm , we say that a set of points has variance bounded by in that norm if (recall denotes the dual norm). Since this implies a tail bound along every direction, it is easy to see (c.f. Lemma 1.3) that a set with variance bounded by is -resilient around its mean for all . Therefore, bounded variance implies resilience.
If is -resilient in a -strongly convex norm , then contains a set of size at least with bounded variance: for all .
Using Lemma 1.3, we can show that -resilience is equivalent to having bounded st moments in every direction; Theorem 1.5 can thus be interpreted as saying that any set with bounded st moments can be pruned to have bounded nd moments.
Given points with bounded variance, we establish algorithmic results assuming that one can solve the “generalized eigenvalue” problem up to some multiplicative accuracy . Specifically, we make the following assumption:
There is a convex set of PSD matrices such that
for every PSD matrix . Moreover, it is possible to optimize linear functions over in polynomial time.
Our main algorithmic result is the following:
Suppose that contains a subset of size whose variance around its mean is bounded by in the norm . Also suppose that Assumption 1.6 holds for the dual norm . Then, if , there is a polynomial-time algorithm whose output satisfies \|\hat{\mu}-\mu\|=\mathcal{O}\big{(}\sigma_{0}\sqrt{\kappa\epsilon}\big{)}.
If, in addition, is -strongly convex, then even if only has size there is a polynomial-time algorithm such that \|\hat{\mu}-\mu\|=\mathcal{O}\big{(}\frac{\sqrt{\kappa}\sigma_{0}}{\sqrt{\gamma}\alpha}\big{)} with probability .
Applications.
4 Low-Rank Recovery
As before, we start by formulating an appropriate resilience criterion:
Rank-resilience says that the variation in should be sufficiently spread out: there should not be a direction of variation that is concentrated in only a -fraction of the points. Under rank-resilience, we can perform efficient rank- recovery even in the presence of a -fraction of arbitrary data:
Let . If a set of points contains a set of size that is -rank-resilient, then it is possible to efficiently recover a matrix of rank at most such that .
The power of Theorem 1.9 comes from the fact that the error depends on rather than e.g. , which is what previous results yielded. This distinction is crucial in practice, since most data have a few (but more than one) large singular values followed by many small singular values. Note that in contrast to Theorem 1.7, Theorem 1.9 only holds when is relatively large: at least in size.
5 Summary, Technical Highlights, and Roadmap
Beyond the results themselves, the following technical aspects of our work may be particularly interesting: The proof of Proposition 1.2 (establishing that resilience is indeed sufficient for robust estimation), while simple, is a nice pigeonhole argument that we found to be conceptually illuminating.
In addition, the proof of Theorem 1.5, on pruning resilient sets to obtain sets with bounded variance, exploits strong convexity in a non-trivial way in conjunction with minimax duality; we think it reveals a fairly non-obvious geometric structure in resilient sets, and also shows how the ability to prune points can yield sets with meaningfully stronger properties.
6 Related Work
A number of authors have recently studied robust estimation and learning in high-dimensional settings: lai2016agnostic study mean and covariance estimation, while diakonikolas2016robust focus on estimating Gaussian and binary product distributions, as well as mixtures thereof; note that this implies mean/covariance estimation of the corresponding distributions. charikar2017learning recently showed that robust estimation is possible even when the fraction of “good” data is less than . We refer to these papers for an overview of the broader robust estimation literature; since those papers, a number of additional results have also been published: diakonikolas2017practical provide a case study of various robust estimation methods in a genomic setting, balakrishnan2017sparse study sparse mean estimation, and others have studied problems including regression, Bayes nets, planted clique, and several other settings (diakonikolas2016bayes; diakonikolas2016statistical; diakonikolas2017robustly; diakonikolas2017learning; kane2017robust; meister2017data).
Low rank estimation was studied by lai2016agnostic, but their bounds depend on the maximum eigenvalue of the covariance matrix, while our bound provides robust recovery guarantees in terms of lower singular values of . (Some work, such as diakonikolas2016robust, shows how to estimate all of in e.g. Frobenius norm, but appears to require the samples to be drawn from a Gaussian.)
Resilience and Robustness: Information-Theoretic Sufficiency
Recall the definition of resilience: is -resilient if whenever and . Here we establish Proposition 1.2 showing that, if we ignore computational efficiency, resilience leads directly to an algorithm for robust mean estimation. {prooff}Proposition 1.2 We prove Proposition 1.2 via a constructive (albeit exponential-time) algorithm. To prove the first part, suppose that the true set is -resilient around , and let be any set of size that is -resilient (around some potentially different vector ). We claim that is sufficiently close to .
Indeed, let , which by the pigeonhole principle has size at least . Therefore, by the definition of resilience,
But by the same argument, as well. By the triangle inequality, , which completes the first part of the proposition.
For the second part, we need the following simple lemma relating -resilience to -resilience:
For any , a distribution/set is -resilient around its mean if and only if it is -resilient. More generally, even if is not the mean then the distribution/set is -resilient. In other words, if for all sets of size at least , then for all sets of size at least .
Given Lemma 2.1, the second part of Proposition 1.2 is similar to the first part, but requires us to consider multiple resilient sets rather than a single . Suppose is -resilient around –and thus also -resilient by Lemma 2.1–and let be a maximal collection of subsets of such that:
for all .
is -resilient around some point .
for all .
Clearly . We claim that at least one of the is close to . By maximality of the collection , it must be that cannot be added to the collection. First suppose that . Then is -resilient (because any subset of points in is a subset of at least points in ). But this contradicts the maximality of , so we must have .
Now, this implies that , so by pigeonhole we must have for some . Letting as before, we find that and hence by resilience of and we have . If we output one of the at random, we are then within the desired distance of with probability .
Powering up Resilience: Finding a Core with Bounded Variance
In this section we prove Theorem 1.5, which says that for strongly convex norms, every resilient set contains a core with bounded variance. Recall that this is important for enabling algorithmic applications that depend on a bounded variance condition.
First recall the definition of resilience (Definition 1.1): a set is -resilient if for every set of size , we have . For , we observe that resilience in a norm is equivalent to having bounded first moments in the dual norm:
Conversely, if has 1st moments bounded by , it is -resilient.
The proof is routine and can be found in Section LABEL:sec:1st-moment-lp-proof. Supposing a set has bounded st moments, we will show that it has a large core with bounded second moments. This next result is not routine:
Proposition 3.2 Without loss of generality take and suppose that . We can pose the problem of finding a resilient core as an integer program:
Here the variable indicates whether the point lies in the core . By taking a continuous relaxation and applying a standard duality argument, we obtain the following:
Suppose that for all and all vectors satisfying , we have
Then the value of \eqrefeq:minimax-0 is at most .
The proof is straightforward and deferred to Section LABEL:sec:minimax-proof. Now, to bound \eqrefeq:dual, let be i.i.d. random sign variables. We have
Here (i) is Khintchine’s inequality (haagerup1981best) and (ii) is the assumed first moment bound. It remains to bound \eqrefeq:q-norm. The key is the following inequality asserting that the dual norm is strongly smooth whenever is strongly convex (c.f. Lemma 17 of shalev07online):
If is -strongly convex, then is -strongly smooth: .