Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers

Jacob Steinhardt, Moses Charikar, Gregory Valiant

Introduction

What are the fundamental properties that allow one to robustly learn from a dataset, even if some fraction of that dataset consists of arbitrarily corrupted data? While much work has been done in the setting of noisy data, or for restricted families of outliers, it is only recently that provable algorithms for learning in the presence of a large fraction of arbitrary (and potentially adversarial) data have been formulated in high-dimensional settings (klivans2009learning; xu2010principal; diakonikolas2016robust; lai2016agnostic; steinhardt2016avoiding; charikar2017learning). In this work, we formulate a conceptually simple criterion that a dataset can satisfy–resilience–which guarantees that properties such as the mean of that dataset can be robustly estimated even if a large fraction of additional arbitrary data is inserted.

The above game models mean estimation in the presence of arbitrary outliers; one can easily consider other problems as well (e.g. regression) but we will focus on mean estimation in this paper.

The resilience condition is essentially that the mean of every large subset of SS must be close to the mean of all of SS. More formally, for a norm \|\cdot\|, our criterion is as follows:

In the definition above, μ\mu need not equal the mean of SS; this distinction is useful in statistical settings where the sample mean of a finite set of points differs slightly from the true mean. However, \eqrefeq:resilience-intro implies that μ\mu differs from the mean of SS by at most σ\sigma.

In the remainder of this section, we will outline our main results, starting with information-theoretic results and then moving on to algorithmic results. In Section 1.1, we show that resilience is indeed information-theoretically sufficient for robust mean estimation. In Section 1.2, we then provide finite-sample bounds showing that resilience holds with high probability for i.i.d. samples from a distribution.

In Section 1.3, we turn our attention to algorithmic bounds. We identity a property–bounded variance in the dual norm–under which efficient algorithms exist. We then show that, as long as the norm is strongly convex, every resilient set has a large subset with bounded variance, thus enabling efficient algorithms. This connection between resilience and bounded variance is the most technically non-trivial component of our results, and may be of independent interest.

Both our information-theoretic and algorithmic bounds yield new results in concrete settings, which we discuss in the corresponding subsections. In Section 1.4, we also discuss an extension of resilience to low-rank matrix approximation, which enables us to derive new bounds in that setting as well. In Section 1.5 we outline the rest of the paper and point to technical highlights, and in Section 1.6 we discuss related work.

First, we show that resilience is indeed information-theoretically sufficient for robust recovery of the mean μ\mu. In what follows, we use σ(ϵ)\sigma_{*}(\epsilon) to denote the smallest σ\sigma such that SS is (σ,ϵ)(\sigma,\epsilon)-resilient.

More generally, if Sαn|S|\geq\alpha n (even if α<12\alpha<\frac{1}{2}), it is possible to output a (random) μ^\hat{\mu} such that μ^μ16ασ(α4)\|\hat{\mu}-\mu\|\leq\frac{16}{\alpha}\sigma_{*}(\frac{\alpha}{4}) with probability at least α2\frac{\alpha}{2}.

The first part says that robustness to an ϵ\epsilon fraction of outliers depends on resilience to a ϵ1ϵ\frac{\epsilon}{1-\epsilon} fraction of deletions. Thus, Bob has a good strategy as long as σ(ϵ1ϵ)\sigma_{*}(\frac{\epsilon}{1-\epsilon}) is small.

The fact that estimation is possible even when α<12\alpha<\frac{1}{2} was first established by steinhardt2016avoiding in a crowdsourcing setting, and later by charikar2017learning in a number of settings including mean estimation. Apart from being interesting due to its unexpectedness, estimation in this regime has immediate implications for robust estimation of mixtures of distributions (by considering each mixture component in turn as the “good” set SS) or of planted substructures in random graphs. We refer the reader to charikar2017learning for a full elaboration of this point.

The proof of Proposition 1.2, given in detail in Section 2, is a pigeonhole argument. For the ϵ<12\epsilon<\frac{1}{2} case, we simply search for any large resilient set SS^{\prime} and output its mean; then SS and SS^{\prime} must have large overlap, and by resilience their means must both be close to the mean of their intersection, and hence to each other.

2 Finite-Sample Concentration

While Proposition 1.2 provides a deterministic condition under which robust mean estimation is possible, we would also like a way of checking that resilience holds with high probability given samples x1,,xnx_{1},\ldots,x_{n} from a distribution pp. First, we provide an alternate characterization of resilience which says that a distribution is resilient if it has thin tails in every direction:

In other words, if we project onto any unit vector vv in the dual norm, the ϵ\epsilon-tail of xμx-\mu must have mean at most 1ϵϵσ\frac{1-\epsilon}{\epsilon}\sigma. Thus, for instance, a distribution with variance at most σ02\sigma_{0}^{2} along every unit vector would have σ=O(σ0ϵ)\sigma=\mathcal{O}(\sigma_{0}\sqrt{\epsilon}). Note that Lemma 1.3 requires μ\mu to be the mean, rather than an arbitrary vector as before.

We next provide a meta-result establishing that resilience of a population distribution pp very generically transfers to a finite set of samples from that distribution. The number of samples necessary depends on two quantities BB and logM\log M that will be defined in detail later; for now we note that they are ways of measuring the effective dimension of the space.

Then, given nn samples x1,,xnpx_{1},\ldots,x_{n}\sim p, with probability 1δexp(ϵn/6)1-\delta-\exp(-\epsilon n/6) there is a subset TT of (1ϵ)n(1-\epsilon)n of the xix_{i} such that TT is (σ,ϵ)(\sigma^{\prime},\epsilon)-resilient with \sigma^{\prime}=\mathcal{O}\Big{(}\sigma\cdot\Big{(}1+\sqrt{\frac{\log(M/\delta)}{\epsilon^{2}n}}+\frac{(B/\sigma)\log(M/\delta)}{n}\Big{)}\Big{)}.

Note that Proposition 1.4 only guarantees resilience on a (1ϵ)n(1-\epsilon)n-element subset of the xix_{i}, rather than all of x1,,xnx_{1},\ldots,x_{n}. From the perspective of robust estimation, this is sufficient, as we can simply regard the remaining ϵn\epsilon n points as part of the “bad” points controlled by Alice. This weaker requirement seems to be actually necessary to achieve Proposition 1.4, and was also exploited in charikar2017learning to yield improved bounds for a graph partitioning problem. There has been a great deal of recent interest in showing how to “prune” samples to achieve faster rates in random matrix settings (guedon2014community; le2015concentration; rebrova2015coverings; rebrova2016norms), and we think the general investigation of such pruning results is likely to be fruitful.

We remark that the sample complexity in Proposition 1.4 is suboptimal in many cases, requiring roughly d1.5d^{1.5} samples when dd samples would suffice. At the end of the next subsection we discuss a tighter but more specialized bound based on spectral graph sparsification.

The d1.5/ϵd^{1.5}/\epsilon term in the sample complexity is likely loose, and we believe the true dependence on dd is at most dlog(d)d\log(d). This looseness comes from Proposition 1.4, which uses a naïve covering argument and could potentially be improved with more sophisticated tools. Nevertheless, it is interesting that resilience holds long before the empirical kkth moments concentrate, which would require dk/2d^{k/2} samples.

Stochastic block models. Finally, we consider the semi-random stochastic block model studied in charikar2017learning (described in detail in Section LABEL:sec:sbm-app). For a graph on nn vertices, this model posits a subset SS of αn\alpha n “good” vertices, which are connected to each other with probability an\frac{a}{n} and to the other (“bad”) vertices with probability bn\frac{b}{n} (where b<ab<a); the connections among the bad vertices can be arbitrary. The goal is to recover the set SS.

In particular, we get non-trivial guarantees as long as (ab)2alog(2/α)α2\frac{(a-b)^{2}}{a}\gg\frac{\log(2/\alpha)}{\alpha^{2}}. charikar2017learning derive a weaker (but computationally efficient) bound when (ab)2alog(2/α)α3\frac{(a-b)^{2}}{a}\gg\frac{\log(2/\alpha)}{\alpha^{3}}, and remark on the similarity to the famous Kesten-Stigum threshold (ab)2a1α2\frac{(a-b)^{2}}{a}\gg\frac{1}{\alpha^{2}}, which is the conjectured threshold for computationally efficient recovery in the classical stochastic block model (see decelle2011asymptotic for the conjecture, and mossel2013proof; massoulie2014community for a proof in the two-block case). Our information-theoretic upper bound matches the Kesten-Stigum threshold up to a log(2/α)\log(2/\alpha) factor. We conjecture that this upper bound is tight; some evidence for this is given in steinhardt2017clique, which provides a nearly matching information-theoretic lower bound when a=1a=1, b=12b=\frac{1}{2}.

3 Strong Convexity, Second Moments, and Efficient Algorithms

Most existing algorithmic results on robust mean estimation rely on analyzing the empirical covariance of the data in some way (see, e.g., lai2016agnostic; diakonikolas2016robust; balakrishnan2017sparse). In this section we establish connections between bounded covariance and resilience, and show that in a very general sense, bounded covariance is indeed sufficient to enable robust mean estimation.

Given a norm \|\cdot\|, we say that a set of points x1,,xnx_{1},\ldots,x_{n} has variance bounded by σ02\sigma_{0}^{2} in that norm if 1ni=1nxiμ,v2σ02v2\frac{1}{n}\sum_{i=1}^{n}\langle x_{i}-\mu,v\rangle^{2}\leq\sigma_{0}^{2}\|v\|_{*}^{2} (recall \|\cdot\|_{*} denotes the dual norm). Since this implies a tail bound along every direction, it is easy to see (c.f. Lemma 1.3) that a set with variance bounded by σ02\sigma_{0}^{2} is (O(σ0ϵ),ϵ)(\mathcal{O}(\sigma_{0}\sqrt{\epsilon}),\epsilon)-resilient around its mean for all ϵ<12\epsilon<\frac{1}{2}. Therefore, bounded variance implies resilience.

If SS is (σ,12)(\sigma,\frac{1}{2})-resilient in a γ\gamma-strongly convex norm \|\cdot\|, then SS contains a set S0S_{0} of size at least 12S\frac{1}{2}|S| with bounded variance: 1S0iS0xiμ,v2288σ2γv2\frac{1}{|S_{0}|}\sum_{i\in S_{0}}\langle x_{i}-\mu,v\rangle^{2}\leq\frac{288\sigma^{2}}{\gamma}\|v\|_{*}^{2} for all vv.

Using Lemma 1.3, we can show that (σ,12)(\sigma,\frac{1}{2})-resilience is equivalent to having bounded 11st moments in every direction; Theorem 1.5 can thus be interpreted as saying that any set with bounded 11st moments can be pruned to have bounded 22nd moments.

Given points with bounded variance, we establish algorithmic results assuming that one can solve the “generalized eigenvalue” problem maxv1vAv\max_{\|v\|_{*}\leq 1}v^{\top}Av up to some multiplicative accuracy κ\kappa. Specifically, we make the following assumption:

There is a convex set P\mathcal{P} of PSD matrices such that

for every PSD matrix AA. Moreover, it is possible to optimize linear functions over P\mathcal{P} in polynomial time.

Our main algorithmic result is the following:

Suppose that x1,,xnx_{1},\ldots,x_{n} contains a subset SS of size (1ϵ)n(1-\epsilon)n whose variance around its mean μ\mu is bounded by σ02\sigma_{0}^{2} in the norm \|\cdot\|. Also suppose that Assumption 1.6 holds for the dual norm \|\cdot\|_{*}. Then, if ϵ14\epsilon\leq\frac{1}{4}, there is a polynomial-time algorithm whose output satisfies \|\hat{\mu}-\mu\|=\mathcal{O}\big{(}\sigma_{0}\sqrt{\kappa\epsilon}\big{)}.

If, in addition, \|\cdot\| is γ\gamma-strongly convex, then even if SS only has size αn\alpha n there is a polynomial-time algorithm such that \|\hat{\mu}-\mu\|=\mathcal{O}\big{(}\frac{\sqrt{\kappa}\sigma_{0}}{\sqrt{\gamma}\alpha}\big{)} with probability Ω(α)\Omega(\alpha).

Applications.

4 Low-Rank Recovery

As before, we start by formulating an appropriate resilience criterion:

Rank-resilience says that the variation in XX should be sufficiently spread out: there should not be a direction of variation that is concentrated in only a δ\delta-fraction of the points. Under rank-resilience, we can perform efficient rank-kk recovery even in the presence of a δ\delta-fraction of arbitrary data:

Let δ13\delta\leq\frac{1}{3}. If a set of nn points contains a set SS of size (1δ)n(1-\delta)n that is δ\delta-rank-resilient, then it is possible to efficiently recover a matrix PP of rank at most 15k15k such that (IP)XS2=O(σk+1(XS))\|(I-P)X_{S}\|_{2}=\mathcal{O}(\sigma_{k+1}(X_{S})).

The power of Theorem 1.9 comes from the fact that the error depends on σk+1\sigma_{k+1} rather than e.g. σ2\sigma_{2}, which is what previous results yielded. This distinction is crucial in practice, since most data have a few (but more than one) large singular values followed by many small singular values. Note that in contrast to Theorem 1.7, Theorem 1.9 only holds when SS is relatively large: at least (1δ)n23n(1-\delta)n\geq\frac{2}{3}n in size.

5 Summary, Technical Highlights, and Roadmap

Beyond the results themselves, the following technical aspects of our work may be particularly interesting: The proof of Proposition 1.2 (establishing that resilience is indeed sufficient for robust estimation), while simple, is a nice pigeonhole argument that we found to be conceptually illuminating.

In addition, the proof of Theorem 1.5, on pruning resilient sets to obtain sets with bounded variance, exploits strong convexity in a non-trivial way in conjunction with minimax duality; we think it reveals a fairly non-obvious geometric structure in resilient sets, and also shows how the ability to prune points can yield sets with meaningfully stronger properties.

6 Related Work

A number of authors have recently studied robust estimation and learning in high-dimensional settings: lai2016agnostic study mean and covariance estimation, while diakonikolas2016robust focus on estimating Gaussian and binary product distributions, as well as mixtures thereof; note that this implies mean/covariance estimation of the corresponding distributions. charikar2017learning recently showed that robust estimation is possible even when the fraction α\alpha of “good” data is less than 12\frac{1}{2}. We refer to these papers for an overview of the broader robust estimation literature; since those papers, a number of additional results have also been published: diakonikolas2017practical provide a case study of various robust estimation methods in a genomic setting, balakrishnan2017sparse study sparse mean estimation, and others have studied problems including regression, Bayes nets, planted clique, and several other settings (diakonikolas2016bayes; diakonikolas2016statistical; diakonikolas2017robustly; diakonikolas2017learning; kane2017robust; meister2017data).

Low rank estimation was studied by lai2016agnostic, but their bounds depend on the maximum eigenvalue Σ2\|\Sigma\|_{2} of the covariance matrix, while our bound provides robust recovery guarantees in terms of lower singular values of Σ\Sigma. (Some work, such as diakonikolas2016robust, shows how to estimate all of Σ\Sigma in e.g. Frobenius norm, but appears to require the samples to be drawn from a Gaussian.)

Resilience and Robustness: Information-Theoretic Sufficiency

Recall the definition of resilience: SS is (σ,ϵ)(\sigma,\epsilon)-resilient if 1TiT(xiμ)σ\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma whenever TST\subseteq S and T(1ϵ)S|T|\geq(1-\epsilon)|S|. Here we establish Proposition 1.2 showing that, if we ignore computational efficiency, resilience leads directly to an algorithm for robust mean estimation. {prooff}Proposition 1.2 We prove Proposition 1.2 via a constructive (albeit exponential-time) algorithm. To prove the first part, suppose that the true set SS is (σ,ϵ1ϵ)(\sigma,\frac{\epsilon}{1-\epsilon})-resilient around μ\mu, and let SS^{\prime} be any set of size (1ϵ)n(1-\epsilon)n that is (σ,ϵ1ϵ)(\sigma,\frac{\epsilon}{1-\epsilon})-resilient (around some potentially different vector μ\mu^{\prime}). We claim that μ\mu^{\prime} is sufficiently close to μ\mu.

Indeed, let T=SST=S\cap S^{\prime}, which by the pigeonhole principle has size at least (12ϵ)n=12ϵ1ϵS=(1ϵ1ϵ)S(1-2\epsilon)n=\frac{1-2\epsilon}{1-\epsilon}|S|=(1-\frac{\epsilon}{1-\epsilon})|S|. Therefore, by the definition of resilience,

But by the same argument, 1TiT(xiμ)σ\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu^{\prime})\|\leq\sigma as well. By the triangle inequality, μμ2σ\|\mu-\mu^{\prime}\|\leq 2\sigma, which completes the first part of the proposition.

For the second part, we need the following simple lemma relating ϵ\epsilon-resilience to (1ϵ)(1-\epsilon)-resilience:

For any 0<ϵ<10<\epsilon<1, a distribution/set is (σ,ϵ)(\sigma,\epsilon)-resilient around its mean μ\mu if and only if it is (1ϵϵσ,1ϵ)(\frac{1-\epsilon}{\epsilon}\sigma,1-\epsilon)-resilient. More generally, even if μ\mu is not the mean then the distribution/set is (2ϵϵσ,1ϵ)(\frac{2-\epsilon}{\epsilon}\sigma,1-\epsilon)-resilient. In other words, if 1TiT(xiμ)σ\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma for all sets TT of size at least (1ϵ)n(1-\epsilon)n, then 1TiT(xiμ)2ϵϵσ\|\frac{1}{|T^{\prime}|}\sum_{i\in T^{\prime}}(x_{i}-\mu)\|\leq\frac{2-\epsilon}{\epsilon}\sigma for all sets TT^{\prime} of size at least ϵn\epsilon n.

Given Lemma 2.1, the second part of Proposition 1.2 is similar to the first part, but requires us to consider multiple resilient sets SiS_{i} rather than a single SS^{\prime}. Suppose SS is (σ,α4)(\sigma,\frac{\alpha}{4})-resilient around μ\mu–and thus also (8ασ,1α4)(\frac{8}{\alpha}\sigma,1-\frac{\alpha}{4})-resilient by Lemma 2.1–and let S1,,SmS_{1},\ldots,S_{m} be a maximal collection of subsets of [n][n] such that:

Sjα2n|S_{j}|\geq\frac{\alpha}{2}n for all jj.

SjS_{j} is (8ασ,1α2)(\frac{8}{\alpha}\sigma,1-\frac{\alpha}{2})-resilient around some point μj\mu_{j}.

SjSj=S_{j}\cap S_{j^{\prime}}=\emptyset for all jjj\neq j^{\prime}.

Clearly m2αm\leq\frac{2}{\alpha}. We claim that at least one of the μj\mu_{j} is close to μ\mu. By maximality of the collection {Sj}j=1m\{S_{j}\}_{j=1}^{m}, it must be that S0=S\(S1Sm)S_{0}=S\backslash(S_{1}\cup\cdots\cup S_{m}) cannot be added to the collection. First suppose that S0α2n|S_{0}|\geq\frac{\alpha}{2}n. Then S0S_{0} is (8ασ,1α2)(\frac{8}{\alpha}\sigma,1-\frac{\alpha}{2})-resilient (because any subset of α2S0\frac{\alpha}{2}|S_{0}| points in S0S_{0} is a subset of at least α4S\frac{\alpha}{4}|S| points in SS). But this contradicts the maximality of {Sj}j=1m\{S_{j}\}_{j=1}^{m}, so we must have S0<α2n|S_{0}|<\frac{\alpha}{2}n.

Now, this implies that S(S1Sm)α2n|S\cap(S_{1}\cup\cdots\cup S_{m})|\geq\frac{\alpha}{2}n, so by pigeonhole we must have SSjα2Sj|S\cap S_{j}|\geq\frac{\alpha}{2}|S_{j}| for some jj. Letting T=SSjT=S\cap S_{j} as before, we find that Tα2Sjα4S|T|\geq\frac{\alpha}{2}|S_{j}|\geq\frac{\alpha}{4}|S| and hence by resilience of SjS_{j} and SS we have μμj2(8ασ)=16ασ\|\mu-\mu_{j}\|\leq 2\cdot(\frac{8}{\alpha}\sigma)=\frac{16}{\alpha}\sigma. If we output one of the μj\mu_{j} at random, we are then within the desired distance of μ\mu with probability 1mα2\frac{1}{m}\geq\frac{\alpha}{2}.

Powering up Resilience: Finding a Core with Bounded Variance

In this section we prove Theorem 1.5, which says that for strongly convex norms, every resilient set contains a core with bounded variance. Recall that this is important for enabling algorithmic applications that depend on a bounded variance condition.

First recall the definition of resilience (Definition 1.1): a set SS is (σ,ϵ)(\sigma,\epsilon)-resilient if for every set TST\subseteq S of size (1ϵ)S(1-\epsilon)|S|, we have 1TiT(xiμ)σ\|\frac{1}{|T|}\sum_{i\in T}(x_{i}-\mu)\|\leq\sigma. For ϵ=12\epsilon=\frac{1}{2}, we observe that resilience in a norm is equivalent to having bounded first moments in the dual norm:

Conversely, if SS has 1st moments bounded by σ\sigma, it is (2σ,12)(2\sigma,\frac{1}{2})-resilient.

The proof is routine and can be found in Section LABEL:sec:1st-moment-lp-proof. Supposing a set has bounded 11st moments, we will show that it has a large core with bounded second moments. This next result is not routine:

Proposition 3.2 Without loss of generality take μ=0\mu=0 and suppose that S=[n]S=[n]. We can pose the problem of finding a resilient core as an integer program:

Here the variable cic_{i} indicates whether the point ii lies in the core S0S_{0}. By taking a continuous relaxation and applying a standard duality argument, we obtain the following:

Suppose that for all mm and all vectors v1,,vmv_{1},\ldots,v_{m} satisfying j=1mvj21\sum_{j=1}^{m}\|v_{j}\|_{*}^{2}\leq 1, we have

Then the value of \eqrefeq:minimax-0 is at most 8B28B^{2}.

The proof is straightforward and deferred to Section LABEL:sec:minimax-proof. Now, to bound \eqrefeq:dual, let s1,,sm{1,+1}s_{1},\ldots,s_{m}\in\{-1,+1\} be i.i.d. random sign variables. We have

Here (i) is Khintchine’s inequality (haagerup1981best) and (ii) is the assumed first moment bound. It remains to bound \eqrefeq:q-norm. The key is the following inequality asserting that the dual norm \|\cdot\|_{*} is strongly smooth whenever \|\cdot\| is strongly convex (c.f. Lemma 17 of shalev07online):

If \|\cdot\| is γ\gamma-strongly convex, then \|\cdot\|_{*} is (1/γ)(1/\gamma)-strongly smooth: 12(v+w2+vw2)v2+(1/γ)w2\frac{1}{2}(\|v+w\|_{*}^{2}+\|v-w\|_{*}^{2})\leq\|v\|_{*}^{2}+(1/\gamma)\|w\|_{*}^{2}.