Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers

Jacob Steinhardt, Moses Charikar, Gregory Valiant

Introduction

What are the fundamental properties that allow one to robustly learn from a dataset, even if some fraction of that dataset consists of arbitrarily corrupted data? While much work has been done in the setting of noisy data, or for restricted families of outliers, it is only recently that provable algorithms for learning in the presence of a large fraction of arbitrary (and potentially adversarial) data have been formulated in high-dimensional settings (klivans2009learning; xu2010principal; diakonikolas2016robust; lai2016agnostic; steinhardt2016avoiding; charikar2017learning). In this work, we formulate a conceptually simple criterion that a dataset can satisfy, called resilience, which guarantees that properties such as the mean of that dataset can be robustly estimated even if a large fraction of additional arbitrary data is inserted.

The above game models mean estimation in the presence of arbitrary outliers; one can easily consider other problems as well (e.g. regression) but we will focus on mean estimation in this paper.

The resilience condition is essentially that the mean of every large subset of $S$ must be close to the mean of all of $S$ . More formally, for a norm $\|\cdot\|$ , our criterion is as follows:

In the definition above, $\mu$ need not equal the mean of $S$; this distinction is useful in statistical settings where the sample mean of a finite set of points differs slightly from the true mean. However, \eqref{eq:resilience-intro} (applied with $T=S$) implies that $\mu$ differs from the mean of $S$ by at most $\sigma$.

In the remainder of this section, we will outline our main results, starting with information-theoretic results and then moving on to algorithmic results. In Section 1.1, we show that resilience is indeed information-theoretically sufficient for robust mean estimation. In Section 1.2, we then provide finite-sample bounds showing that resilience holds with high probability for i.i.d. samples from a distribution.

In Section 1.3, we turn our attention to algorithmic bounds. We identify a property (bounded variance in the dual norm) under which efficient algorithms exist. We then show that, as long as the norm is strongly convex, every resilient set has a large subset with bounded variance, thus enabling efficient algorithms. This connection between resilience and bounded variance is the most technically non-trivial component of our results, and may be of independent interest.

Both our information-theoretic and algorithmic bounds yield new results in concrete settings, which we discuss in the corresponding subsections. In Section 1.4, we also discuss an extension of resilience to low-rank matrix approximation, which enables us to derive new bounds in that setting as well. In Section 1.5 we outline the rest of the paper and point to technical highlights, and in Section 1.6 we discuss related work.

First, we show that resilience is indeed information-theoretically sufficient for robust recovery of the mean $\mu$ . In what follows, we use $\sigma_{*}(\epsilon)$ to denote the smallest $\sigma$ such that $S$ is $(\sigma,\epsilon)$ -resilient.

More generally, if $|S|\geq\alpha n$ (even if $\alpha<\frac{1}{2}$ ), it is possible to output a (random) $\hat{\mu}$ such that $\|\hat{\mu}-\mu\|\leq\frac{16}{\alpha}\sigma_{*}(\frac{\alpha}{4})$ with probability at least $\frac{\alpha}{2}$ .

The first part says that robustness to an $\epsilon$ fraction of outliers depends on resilience to a $\frac{\epsilon}{1-\epsilon}$ fraction of deletions. Thus, Bob has a good strategy as long as $\sigma_{*}(\frac{\epsilon}{1-\epsilon})$ is small.

The fact that estimation is possible even when $\alpha<\frac{1}{2}$ was first established by steinhardt2016avoiding in a crowdsourcing setting, and later by charikar2017learning in a number of settings including mean estimation. Apart from being interesting due to its unexpectedness, estimation in this regime has immediate implications for robust estimation of mixtures of distributions (by considering each mixture component in turn as the “good” set $S$ ) or of planted substructures in random graphs. We refer the reader to charikar2017learning for a full elaboration of this point.

The proof of Proposition 1.2, given in detail in Section 2, is a pigeonhole argument. For the $\epsilon<\frac{1}{2}$ case, we simply search for any large resilient set $S^{\prime}$ and output its mean; then $S$ and $S^{\prime}$ must have large overlap, and by resilience their means must both be close to the mean of their intersection, and hence to each other.

1.2 Finite-Sample Concentration

While Proposition 1.2 provides a deterministic condition under which robust mean estimation is possible, we would also like a way of checking that resilience holds with high probability given samples $x_{1},\ldots,x_{n}$ from a distribution $p$ . First, we provide an alternate characterization of resilience which says that a distribution is resilient if it has thin tails in every direction:

In other words, if we project onto any unit vector $v$ in the dual norm, the $\epsilon$ -tail of $x-\mu$ must have mean at most $\frac{1-\epsilon}{\epsilon}\sigma$ . Thus, for instance, a distribution with variance at most $\sigma_{0}^{2}$ along every unit vector would have $\sigma=\mathcal{O}(\sigma_{0}\sqrt{\epsilon})$ . Note that Lemma 1.3 requires $\mu$ to be the mean, rather than an arbitrary vector as before.

We next provide a meta-result establishing that resilience of a population distribution $p$ very generically transfers to a finite set of samples from that distribution. The number of samples necessary depends on two quantities $B$ and $\log M$ that will be defined in detail later; for now we note that they are ways of measuring the effective dimension of the space.

Then, given $n$ samples $x_{1},\ldots,x_{n}\sim p$ , with probability $1-\delta-\exp(-\epsilon n/6)$ there is a subset $T$ of $(1-\epsilon)n$ of the $x_{i}$ such that $T$ is $(\sigma^{\prime},\epsilon)$ -resilient with $\sigma^{\prime}=\mathcal{O}\Big{(}\sigma\cdot\Big{(}1+\sqrt{\frac{\log(M/\delta)}{\epsilon^{2}n}}+\frac{(B/\sigma)\log(M/\delta)}{n}\Big{)}\Big{)}$ .
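As a rough numerical reading of this bound, the sketch below evaluates the factor that multiplies $\sigma$ for a few sample sizes. It is an illustration only: the constant hidden in the $\mathcal{O}(\cdot)$ is ignored, the function name is hypothetical, and the values chosen for $M$, $B$, $\delta$ and $\epsilon$ are made up to show the trend in $n$.

```python
import math


def resilience_inflation(n, eps, delta, M, B, sigma):
    """Bracketed factor in the bound on sigma', ignoring the hidden constant:
    1 + sqrt(log(M/delta) / (eps^2 n)) + (B/sigma) * log(M/delta) / n."""
    log_term = math.log(M / delta)
    return (1.0
            + math.sqrt(log_term / (eps ** 2 * n))
            + (B / sigma) * log_term / n)


# Hypothetical parameter values, chosen only to show the dependence on n.
factors = [
    resilience_inflation(n, eps=0.1, delta=0.01, M=1e6, B=10.0, sigma=1.0)
    for n in (10**3, 10**5, 10**7)
]
```

The factor approaches 1 once $n$ is large compared with both $\log(M/\delta)/\epsilon^{2}$ and $(B/\sigma)\log(M/\delta)$.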

Note that Proposition 1.4 only guarantees resilience on a $(1-\epsilon)n$ -element subset of the $x_{i}$ , rather than all of $x_{1},\ldots,x_{n}$ . From the perspective of robust estimation, this is sufficient, as we can simply regard the remaining $\epsilon n$ points as part of the “bad” points controlled by Alice. This weaker requirement seems to be actually necessary to achieve Proposition 1.4, and was also exploited in charikar2017learning to yield improved bounds for a graph partitioning problem. There has been a great deal of recent interest in showing how to “prune” samples to achieve faster rates in random matrix settings (guedon2014community; le2015concentration; rebrova2015coverings; rebrova2016norms), and we think the general investigation of such pruning results is likely to be fruitful.

We remark that the sample complexity in Proposition 1.4 is suboptimal in many cases, requiring roughly $d^{1.5}$ samples when $d$ samples would suffice. At the end of the next subsection we discuss a tighter but more specialized bound based on spectral graph sparsification.

The $d^{1.5}/\epsilon$ term in the sample complexity is likely loose, and we believe the true dependence on $d$ is at most $d\log(d)$ . This looseness comes from Proposition 1.4, which uses a naïve covering argument and could potentially be improved with more sophisticated tools. Nevertheless, it is interesting that resilience holds long before the empirical $k$ th moments concentrate, which would require $d^{k/2}$ samples.

Stochastic block models. Finally, we consider the semi-random stochastic block model studied in charikar2017learning (described in detail in Section LABEL:sec:sbm-app). For a graph on $n$ vertices, this model posits a subset $S$ of $\alpha n$ “good” vertices, which are connected to each other with probability $\frac{a}{n}$ and to the other (“bad”) vertices with probability $\frac{b}{n}$ (where $b<a$ ); the connections among the bad vertices can be arbitrary. The goal is to recover the set $S$ .

In particular, we get non-trivial guarantees as long as $\frac{(a-b)^{2}}{a}\gg\frac{\log(2/\alpha)}{\alpha^{2}}$ . charikar2017learning derive a weaker (but computationally efficient) bound when $\frac{(a-b)^{2}}{a}\gg\frac{\log(2/\alpha)}{\alpha^{3}}$ , and remark on the similarity to the famous Kesten-Stigum threshold $\frac{(a-b)^{2}}{a}\gg\frac{1}{\alpha^{2}}$ , which is the conjectured threshold for computationally efficient recovery in the classical stochastic block model (see decelle2011asymptotic for the conjecture, and mossel2013proof; massoulie2014community for a proof in the two-block case). Our information-theoretic upper bound matches the Kesten-Stigum threshold up to a $\log(2/\alpha)$ factor. We conjecture that this upper bound is tight; some evidence for this is given in steinhardt2017clique, which provides a nearly matching information-theoretic lower bound when $a=1$ , $b=\frac{1}{2}$ .

1.3 Strong Convexity, Second Moments, and Efficient Algorithms

Most existing algorithmic results on robust mean estimation rely on analyzing the empirical covariance of the data in some way (see, e.g., lai2016agnostic; diakonikolas2016robust; balakrishnan2017sparse). In this section we establish connections between bounded covariance and resilience, and show that in a very general sense, bounded covariance is indeed sufficient to enable robust mean estimation.

Given a norm $\|\cdot\|$, we say that a set of points $x_{1},\ldots,x_{n}$ with mean $\mu$ has variance bounded by $\sigma_{0}^{2}$ in that norm if $\frac{1}{n}\sum_{i=1}^{n}\langle x_{i}-\mu,v\rangle^{2}\leq\sigma_{0}^{2}\|v\|_{*}^{2}$ for all $v$ (recall $\|\cdot\|_{*}$ denotes the dual norm). Since this implies a tail bound along every direction, it is easy to see (cf. Lemma 1.3) that a set with variance bounded by $\sigma_{0}^{2}$ is $(\mathcal{O}(\sigma_{0}\sqrt{\epsilon}),\epsilon)$-resilient around its mean for all $\epsilon<\frac{1}{2}$. Therefore, bounded variance implies resilience.
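The implication can be checked concretely in one dimension. For a finite set with empirical mean $\mu$ and empirical variance $\sigma_{0}^{2}$, the Cauchy-Schwarz inequality shows that deleting at most an $\epsilon$ fraction of the points moves the mean by at most $\sigma_{0}\sqrt{\epsilon}/(1-\epsilon)$, which is at most $2\sigma_{0}\sqrt{\epsilon}$ for $\epsilon\leq\frac{1}{2}$. The sketch below, with hypothetical helper names and synthetic data, computes the exact worst-case shift and the bound side by side.

```python
import math
import random


def worst_mean_shift(points, eps):
    """Exact 1-D resilience around the empirical mean: the largest shift of
    the mean when at most an eps fraction of the points is deleted."""
    xs = sorted(points)
    n = len(xs)
    mu = sum(xs) / n
    worst = 0.0
    for k in range(int(math.floor(eps * n)) + 1):
        # Dropping the k largest or the k smallest points is the worst case.
        for kept in (xs[: n - k], xs[k:]):
            worst = max(worst, abs(sum(kept) / len(kept) - mu))
    return worst


def variance_bound(points, eps):
    """Cauchy-Schwarz bound sigma_0 * sqrt(eps) / (1 - eps)."""
    n = len(points)
    mu = sum(points) / n
    sigma0 = math.sqrt(sum((x - mu) ** 2 for x in points) / n)
    return sigma0 * math.sqrt(eps) / (1.0 - eps)


rng = random.Random(0)  # fixed seed; the data are purely synthetic
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
eps_values = (0.05, 0.1, 0.25)
shifts = {eps: worst_mean_shift(sample, eps) for eps in eps_values}
bounds = {eps: variance_bound(sample, eps) for eps in eps_values}
```

Comparing `shifts[eps]` with `bounds[eps]` for each value of `eps` shows how much slack the variance-based bound leaves on a particular dataset.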

If $S$ is $(\sigma,\frac{1}{2})$ -resilient in a $\gamma$ -strongly convex norm $\|\cdot\|$ , then $S$ contains a set $S_{0}$ of size at least $\frac{1}{2}|S|$ with bounded variance: $\frac{1}{|S_{0}|}\sum_{i\in S_{0}}\langle x_{i}-\mu,v\rangle^{2}\leq\frac{288\sigma^{2}}{\gamma}\|v\|_{*}^{2}$ for all $v$ .

Using Lemma 1.3, we can show that $(\sigma,\frac{1}{2})$-resilience is equivalent to having bounded 1st moments in every direction; Theorem 1.5 can thus be interpreted as saying that any set with bounded 1st moments can be pruned to have bounded 2nd moments.

Given points with bounded variance, we establish algorithmic results assuming that one can solve the “generalized eigenvalue” problem $\max_{\|v\|_{*}\leq 1}v^{\top}Av$ up to some multiplicative accuracy $\kappa$ . Specifically, we make the following assumption:

There is a convex set $\mathcal{P}$ of PSD matrices such that

for every PSD matrix $A$ . Moreover, it is possible to optimize linear functions over $\mathcal{P}$ in polynomial time.

Our main algorithmic result is the following:

Suppose that $x_{1},\ldots,x_{n}$ contains a subset $S$ of size $(1-\epsilon)n$ whose variance around its mean $\mu$ is bounded by $\sigma_{0}^{2}$ in the norm $\|\cdot\|$ . Also suppose that Assumption 1.6 holds for the dual norm $\|\cdot\|_{*}$ . Then, if $\epsilon\leq\frac{1}{4}$ , there is a polynomial-time algorithm whose output satisfies $\|\hat{\mu}-\mu\|=\mathcal{O}\big{(}\sigma_{0}\sqrt{\kappa\epsilon}\big{)}$ .

If, in addition, $\|\cdot\|$ is $\gamma$ -strongly convex, then even if $S$ only has size $\alpha n$ there is a polynomial-time algorithm such that $\|\hat{\mu}-\mu\|=\mathcal{O}\big{(}\frac{\sqrt{\kappa}\sigma_{0}}{\sqrt{\gamma}\alpha}\big{)}$ with probability $\Omega(\alpha)$ .

Applications.

1.4 Low-Rank Recovery

As before, we start by formulating an appropriate resilience criterion:

Rank-resilience says that the variation in $X$ should be sufficiently spread out: there should not be a direction of variation that is concentrated in only a $\delta$ -fraction of the points. Under rank-resilience, we can perform efficient rank- $k$ recovery even in the presence of a $\delta$ -fraction of arbitrary data:

Let $\delta\leq\frac{1}{3}$ . If a set of $n$ points contains a set $S$ of size $(1-\delta)n$ that is $\delta$ -rank-resilient, then it is possible to efficiently recover a matrix $P$ of rank at most $15k$ such that $\|(I-P)X_{S}\|_{2}=\mathcal{O}(\sigma_{k+1}(X_{S}))$ .

The power of Theorem 1.9 comes from the fact that the error depends on $\sigma_{k+1}$ rather than e.g. $\sigma_{2}$ , which is what previous results yielded. This distinction is crucial in practice, since most data have a few (but more than one) large singular values followed by many small singular values. Note that in contrast to Theorem 1.7, Theorem 1.9 only holds when $S$ is relatively large: at least $(1-\delta)n\geq\frac{2}{3}n$ in size.

1.5 Summary, Technical Highlights, and Roadmap

Beyond the results themselves, the following technical aspects of our work may be particularly interesting: The proof of Proposition 1.2 (establishing that resilience is indeed sufficient for robust estimation), while simple, is a nice pigeonhole argument that we found to be conceptually illuminating.

In addition, the proof of Theorem 1.5, on pruning resilient sets to obtain sets with bounded variance, exploits strong convexity in a non-trivial way in conjunction with minimax duality; we think it reveals a fairly non-obvious geometric structure in resilient sets, and also shows how the ability to prune points can yield sets with meaningfully stronger properties.

1.6 Related Work

A number of authors have recently studied robust estimation and learning in high-dimensional settings: lai2016agnostic study mean and covariance estimation, while diakonikolas2016robust focus on estimating Gaussian and binary product distributions, as well as mixtures thereof; note that this implies mean/covariance estimation of the corresponding distributions. charikar2017learning recently showed that robust estimation is possible even when the fraction $\alpha$ of “good” data is less than $\frac{1}{2}$ . We refer to these papers for an overview of the broader robust estimation literature; since those papers, a number of additional results have also been published: diakonikolas2017practical provide a case study of various robust estimation methods in a genomic setting, balakrishnan2017sparse study sparse mean estimation, and others have studied problems including regression, Bayes nets, planted clique, and several other settings (diakonikolas2016bayes; diakonikolas2016statistical; diakonikolas2017robustly; diakonikolas2017learning; kane2017robust; meister2017data).

Low rank estimation was studied by lai2016agnostic, but their bounds depend on the maximum eigenvalue $\|\Sigma\|_{2}$ of the covariance matrix, while our bound provides robust recovery guarantees in terms of lower singular values of $\Sigma$ . (Some work, such as diakonikolas2016robust, shows how to estimate all of $\Sigma$ in e.g. Frobenius norm, but appears to require the samples to be drawn from a Gaussian.)

Resilience and Robustness: Information-Theoretic Sufficiency

Recall the definition of resilience: $S$ is $(\sigma,\epsilon)$-resilient if $\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma$ whenever $T\subseteq S$ and $|T|\geq(1-\epsilon)|S|$. Here we establish Proposition 1.2 showing that, if we ignore computational efficiency, resilience leads directly to an algorithm for robust mean estimation.

Proof of Proposition 1.2. We prove Proposition 1.2 via a constructive (albeit exponential-time) algorithm. To prove the first part, suppose that the true set $S$ is $(\sigma,\frac{\epsilon}{1-\epsilon})$-resilient around $\mu$, and let $S^{\prime}$ be any set of size $(1-\epsilon)n$ that is $(\sigma,\frac{\epsilon}{1-\epsilon})$-resilient (around some potentially different vector $\mu^{\prime}$). We claim that $\mu^{\prime}$ is sufficiently close to $\mu$.

Indeed, let $T=S\cap S^{\prime}$, which by the pigeonhole principle has size at least $(1-2\epsilon)n=\frac{1-2\epsilon}{1-\epsilon}|S|=(1-\frac{\epsilon}{1-\epsilon})|S|$. Therefore, by the definition of resilience, $\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma$.

But by the same argument, $\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu^{\prime})\|\leq\sigma$ as well. By the triangle inequality, $\|\mu-\mu^{\prime}\|\leq 2\sigma$ , which completes the first part of the proposition.
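Spelled out, the triangle-inequality step uses the fact that $\mu$ and $\mu^{\prime}$ are constants, so averaging over $T$ leaves them unchanged and the difference of the two averages is exactly $\mu-\mu^{\prime}$:

```latex
\begin{align*}
\|\mu-\mu'\|
  &= \Big\|\frac{1}{|T|}\sum_{i\in T}(x_i-\mu') - \frac{1}{|T|}\sum_{i\in T}(x_i-\mu)\Big\| \\
  &\leq \Big\|\frac{1}{|T|}\sum_{i\in T}(x_i-\mu')\Big\|
      + \Big\|\frac{1}{|T|}\sum_{i\in T}(x_i-\mu)\Big\|
   \leq \sigma+\sigma = 2\sigma .
\end{align*}
```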

For the second part, we need the following simple lemma relating $\epsilon$ -resilience to $(1-\epsilon)$ -resilience:

For any $0<\epsilon<1$ , a distribution/set is $(\sigma,\epsilon)$ -resilient around its mean $\mu$ if and only if it is $(\frac{1-\epsilon}{\epsilon}\sigma,1-\epsilon)$ -resilient. More generally, even if $\mu$ is not the mean then the distribution/set is $(\frac{2-\epsilon}{\epsilon}\sigma,1-\epsilon)$ -resilient. In other words, if $\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma$ for all sets $T$ of size at least $(1-\epsilon)n$ , then $\|\frac{1}{|T^{\prime}|}\sum_{i\in T^{\prime}}(x_{i}-\mu)\|\leq\frac{2-\epsilon}{\epsilon}\sigma$ for all sets $T^{\prime}$ of size at least $\epsilon n$ .
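For intuition, the first claim can be checked directly in the special case where $T^{\prime}$ has exactly $\epsilon n$ points and $\mu$ is the mean of all $n$ points (a sketch of this one case only, not the full proof). Write $T$ for the complement of $T^{\prime}$, which has $(1-\epsilon)n$ points. The deviations from the mean sum to zero over $T\cup T^{\prime}$, so:

```latex
\begin{align*}
\sum_{i\in T'}(x_i-\mu) &= -\sum_{i\in T}(x_i-\mu), \\
\Big\|\frac{1}{|T'|}\sum_{i\in T'}(x_i-\mu)\Big\|
  &= \frac{|T|}{|T'|}\,\Big\|\frac{1}{|T|}\sum_{i\in T}(x_i-\mu)\Big\|
  \leq \frac{1-\epsilon}{\epsilon}\,\sigma .
\end{align*}
```

For the general constant, substituting $\epsilon=\frac{\alpha}{4}$ gives $\frac{2-\epsilon}{\epsilon}=\frac{8-\alpha}{\alpha}\leq\frac{8}{\alpha}$.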

Given Lemma 2.1, the second part of Proposition 1.2 is similar to the first part, but requires us to consider multiple resilient sets $S_{i}$ rather than a single $S^{\prime}$. Suppose $S$ is $(\sigma,\frac{\alpha}{4})$-resilient around $\mu$ (and thus also $(\frac{8}{\alpha}\sigma,1-\frac{\alpha}{4})$-resilient by Lemma 2.1), and let $S_{1},\ldots,S_{m}$ be a maximal collection of subsets of $[n]$ such that:

$|S_{j}|\geq\frac{\alpha}{2}n$ for all $j$ .

$S_{j}$ is $(\frac{8}{\alpha}\sigma,1-\frac{\alpha}{2})$ -resilient around some point $\mu_{j}$ .

$S_{j}\cap S_{j^{\prime}}=\emptyset$ for all $j\neq j^{\prime}$ .

Clearly $m\leq\frac{2}{\alpha}$ . We claim that at least one of the $\mu_{j}$ is close to $\mu$ . By maximality of the collection $\{S_{j}\}_{j=1}^{m}$ , it must be that $S_{0}=S\backslash(S_{1}\cup\cdots\cup S_{m})$ cannot be added to the collection. First suppose that $|S_{0}|\geq\frac{\alpha}{2}n$ . Then $S_{0}$ is $(\frac{8}{\alpha}\sigma,1-\frac{\alpha}{2})$ -resilient (because any subset of $\frac{\alpha}{2}|S_{0}|$ points in $S_{0}$ is a subset of at least $\frac{\alpha}{4}|S|$ points in $S$ ). But this contradicts the maximality of $\{S_{j}\}_{j=1}^{m}$ , so we must have $|S_{0}|<\frac{\alpha}{2}n$ .

Now, this implies that $|S\cap(S_{1}\cup\cdots\cup S_{m})|\geq\frac{\alpha}{2}n$ , so by pigeonhole we must have $|S\cap S_{j}|\geq\frac{\alpha}{2}|S_{j}|$ for some $j$ . Letting $T=S\cap S_{j}$ as before, we find that $|T|\geq\frac{\alpha}{2}|S_{j}|\geq\frac{\alpha}{4}|S|$ and hence by resilience of $S_{j}$ and $S$ we have $\|\mu-\mu_{j}\|\leq 2\cdot(\frac{8}{\alpha}\sigma)=\frac{16}{\alpha}\sigma$ . If we output one of the $\mu_{j}$ at random, we are then within the desired distance of $\mu$ with probability $\frac{1}{m}\geq\frac{\alpha}{2}$ .
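The two counting steps can be made explicit, assuming $|S|\geq\alpha n$ as in the statement of the proposition. Disjoint subsets of $[n]$, each of size at least $\frac{\alpha}{2}n$, satisfy $m\cdot\frac{\alpha}{2}n\leq n$, which gives $m\leq\frac{2}{\alpha}$. For the overlap, if every $S_{j}$ met $S$ in fewer than $\frac{\alpha}{2}|S_{j}|$ points, the total overlap would be too small:

```latex
\begin{gather*}
|S\cap(S_1\cup\cdots\cup S_m)| = |S|-|S_0| > \alpha n-\tfrac{\alpha}{2}n = \tfrac{\alpha}{2}n, \\
|S\cap S_j| < \tfrac{\alpha}{2}|S_j| \ \text{for all } j
  \quad\Longrightarrow\quad
  \sum_{j=1}^{m}|S\cap S_j| < \tfrac{\alpha}{2}\sum_{j=1}^{m}|S_j| \leq \tfrac{\alpha}{2}n .
\end{gather*}
```

Since the $S_{j}$ are disjoint, the sum in the second line equals the left-hand side of the first, so the two lines contradict each other and some $j$ must satisfy $|S\cap S_{j}|\geq\frac{\alpha}{2}|S_{j}|$.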

Powering up Resilience: Finding a Core with Bounded Variance

In this section we prove Theorem 1.5, which says that for strongly convex norms, every resilient set contains a core with bounded variance. Recall that this is important for enabling algorithmic applications that depend on a bounded variance condition.

First recall the definition of resilience (Definition 1.1): a set $S$ is $(\sigma,\epsilon)$-resilient if for every set $T\subseteq S$ of size at least $(1-\epsilon)|S|$, we have $\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma$. For $\epsilon=\frac{1}{2}$, we observe that resilience in a norm is equivalent to having bounded first moments in the dual norm:

Conversely, if $S$ has 1st moments bounded by $\sigma$ , it is $(2\sigma,\frac{1}{2})$ -resilient.

The proof is routine and can be found in Section LABEL:sec:1st-moment-lp-proof. Supposing a set has bounded $1$ st moments, we will show that it has a large core with bounded second moments. This next result is not routine:

Proof of Proposition 3.2. Without loss of generality take $\mu=0$ and suppose that $S=[n]$. We can pose the problem of finding a resilient core as an integer program:

Here the variable $c_{i}$ indicates whether the point $i$ lies in the core $S_{0}$ . By taking a continuous relaxation and applying a standard duality argument, we obtain the following:

Suppose that for all $m$ and all vectors $v_{1},\ldots,v_{m}$ satisfying $\sum_{j=1}^{m}\|v_{j}\|_{*}^{2}\leq 1$ , we have

Then the value of \eqref{eq:minimax-0} is at most $8B^{2}$.

The proof is straightforward and deferred to Section LABEL:sec:minimax-proof. Now, to bound \eqref{eq:dual}, let $s_{1},\ldots,s_{m}\in\{-1,+1\}$ be i.i.d. random sign variables. We have

Here (i) is Khintchine’s inequality (haagerup1981best) and (ii) is the assumed first moment bound. It remains to bound \eqref{eq:q-norm}. The key is the following inequality asserting that the dual norm $\|\cdot\|_{*}$ is strongly smooth whenever $\|\cdot\|$ is strongly convex (cf. Lemma 17 of shalev07online):

If $\|\cdot\|$ is $\gamma$ -strongly convex, then $\|\cdot\|_{*}$ is $(1/\gamma)$ -strongly smooth: $\frac{1}{2}(\|v+w\|_{*}^{2}+\|v-w\|_{*}^{2})\leq\|v\|_{*}^{2}+(1/\gamma)\|w\|_{*}^{2}$ .
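As a sanity check in a familiar case (assuming the $\ell_{2}$ norm, which is its own dual), expanding the squares gives the parallelogram law, so the smoothness inequality holds with equality for $1/\gamma=1$:

```latex
\begin{align*}
\tfrac{1}{2}\big(\|v+w\|_2^2+\|v-w\|_2^2\big)
  &= \tfrac{1}{2}\big(2\|v\|_2^2+2\|w\|_2^2
      +2\langle v,w\rangle-2\langle v,w\rangle\big) \\
  &= \|v\|_2^2+\|w\|_2^2 .
\end{align*}
```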