Learning from Untrusted Data

Moses Charikar, Jacob Steinhardt, Gregory Valiant

Introduction

What can be learned from data that is only partially trusted? In this paper, we study this question by considering the following setting: we observe nn data points, of which αn\alpha n are drawn independently from a distribution of interest, pp^{*}, and we make no assumptions about the remaining (1α)n(1-\alpha)n points—they could be very biased, arbitrary, or chosen by an adversary who is trying to obscure pp^{*}. Our goal is to accurately recover a parameter of interest of pp^{*} (such as the mean), despite the presence of significant amounts of untrusted data. Perhaps surprisingly, we will show that in high dimensions, accurate estimation and learning is often possible, even when the fraction of real data is small (i.e., α1\alpha\ll 1). To do this, we consider two notions of successful learning–the list decodable model and the semi-verified model–and provide strong positive results for both notions. Our results have implications in a variety of domains, including building secure machine learning systems, performing robust statistics in the presence of outliers, and agnostically learning mixture models.

The goal of accurate robust estimation appears at first glance to be impossible if the fraction α\alpha of real data is less than one half. Indeed, in the α=12\alpha=\frac{1}{2} case, it is possible that the real and fake data are distributed identically, except that the mean of the fake data is shifted by some large amount; in such a case, it is clearly impossible to differentiate which of these two distributions is “right”. Perhaps, however, such symmetries are the only real problem that can occur. It might then be possible to output a short list of possible parameter sets–if α=12\alpha=\frac{1}{2}, perhaps a list of two parameter sets–such that at least one is accurate. To this end, we consider a notion of successful learning called list decodable learning, first introduced by Balcan et al. (2008). In analogy with list decodable coding theory, the goal is for the learning algorithm to output a short list of possible hypotheses.

We say that a learning, estimation, or optimization problem is (m,ϵ)(m,\epsilon) list decodably solvable if an efficient algorithm can output a set of at most mm hypotheses/estimates/answers, with the guarantee that at least one is accurate to within error ϵ\epsilon.

A central question in this paper concerns which learning problems can be robustly solved in the above sense:

To what extent are learning problems robustly solvable in the list decodable sense? If the dataset consists of only an α\alpha-fraction of real data, in what settings is it possible to efficiently output a list of at most 1α\frac{1}{\alpha} or poly(1α)\operatorname{poly}(\frac{1}{\alpha}) parameter sets or estimates with the guarantee that at least one closely approximates the solution that could be obtained if one were given only honest data?

The intuition for why strong positive results are obtainable in the list decodable setting is the following. Given a dataset with an α\alpha fraction of trusted data, the remaining data might do one of two things: either it can be fairly similar to the good data, in which case it can bias the overall answers by only a small amount, or the adversarial data may be very different from the trusted data. The key is that if a portion of the untrusted data tries too hard to bias the final result, then it will end up looking quite different, and can be clustered out. Given this viewpoint, the question of robust learning in the list decodable model becomes a question of the extent to which an adversary can completely obscure the inherent structure possessed by a dataset drawn from a well-behaved distribution.

Our investigation of robust learning has three motivations. First, from a theoretical perspective, it is natural to ask what guarantees are possible in the setting in which a majority of data is untrusted (α<12\alpha<\frac{1}{2}). Is it the case that learning really becomes impossible (as is often stated), or can one at least narrow down the possible answers to a small set? Second, in many practical settings, there is a trade-off between the amount of data one can collect, and the quality of the data. For a fixed price, one might be able to collect either a small and accurate/trusted dataset, or a large but less trusted dataset. It is worth understanding how the quality of models derived from such datasets varies, across this entire range of dataset quality/quantity. Finally, robust learning with α1\alpha\ll 1 provides a new perspective on learning mixtures of distributions—by treating a single mixture component as the real data, and the remaining components as fake data, we can ask to what extent a mixture component can be learned, independently of the structure of the other components. While this perspective may seem to give up too much, we will show, somewhat surprisingly, that it is possible to learn mixtures almost as well under these adversarial assumptions as under stochastic assumptions.

When α12\alpha\leq\frac{1}{2}, the list decodable model handles symmetries by allowing the learner to output multiple possible answers; an alternative is to break these symmetries with a small amount of side information. In particular, in many practical settings it is possible to obtain a (sometimes extremely small) verified set of data that has been carefully checked, which could be used to determine which of multiple alternative answers is correct. This motivates us to introduce the following new notion of learnability:

In the semi-verified model, we observe nn data points, of which an unknown αn\alpha n are “real” data reflecting an underlying distribution pp^{*}, and the remaining (1α)n(1-\alpha)n points are arbitrary. Furthermore, we observe kk “verified” data points that are guaranteed to be drawn from pp^{*}.

The definition of the semi-verified model is inspired by the semi-supervised model of learning (see e.g. Chapelle et al. (2006)). In semi-supervised learning, one is concerned with a prediction/labeling task, and has access to a large amount of unlabeled data together with a small amount of labeled data; the central question is whether the presence of the unlabeled data can reduce the amount of labeled data required to learn. Analogously, in our robust learning setting, we are asking whether the presence of a large amount of untrusted data can reduce the amount of trusted data required for learning. Clearly the answer is “no” if we make no assumptions on the untrusted data. Nevertheless, the assumption that a significant fraction of that data is drawn from pp^{*} seems plausible, and may be sufficient to achieve strong positive results. We therefore ask:

To what extent can the availability of a modest amount of “verified” data facilitate (either computationally or information theoretically) the extraction of the information contained in the larger but untrusted dataset? What learning tasks can be performed in the above semi-verified setting given knk\ll n verified data points? How does the amount kk of verified data that is needed vary with the setting, the fraction α\alpha of honest data, etc.?

The above definition and associated questions reflect challenges faced in a number of practical settings, particularly those involving large crowdsourced datasets, or datasets obtained from unreliable sensors or devices. In such settings, despite the unreliability of the data, it is often possible to obtain a small verified dataset that has been carefully checked. Given its pervasiveness, it is somewhat surprising that neither the theory nor the machine learning communities have formalized this model; we think it is important to develop an understanding of the algorithmic possibilities in this domain. Obtaining theoretical guarantees seems especially important for designing provably secure learning systems that are guaranteed to perform well even if an adversary obtains control over some of the training data used by the algorithm.

The semi-verified and list decodable models can be reduced to each other. Informally, given mm candidate outputs from a list decodable algorithm, we expect to be able to distinguish between them with O(log(m))\mathcal{O}\left(\log(m)\right) verified data points. Conversely, if a model is learnable with kk verified points then we can output O((1α)k)\mathcal{O}\left(({\frac{1}{\alpha}})^{k}\right) candidate parameters in the list decodable setting (since if we sample that many kk-tuples from the untrusted data, at least one is likely to contain only honest data). For simplicity we state most results in the list decodable model.

From our general results (discussed in detail in the next section), we immediately obtain corollaries in specific settings, starting with mean estimation:

Since our results hold for any stochastic optimization problem, we can also study density estimation, by taking fif_{i} to be the negative log-likelihood:

Robust density estimation: Given an exponential family model pθ(x)exp(θϕ(x))p_{\theta}(x)\propto\exp\left(\theta^{\top}\phi(x)\right), we can find a distribution pθp_{{\theta}} with KL(pθ  pθ)O(σrα)\operatorname{KL}\left(p_{\theta^{*}}\ \|\ p_{{\theta}}\right)\leq\mathcal{O}({\frac{\sigma r}{\sqrt{\alpha}}}), where Covpθ[ϕ(x)]σ2I\operatorname{Cov}_{p_{\theta^{*}}}[\phi(x)]\preceq\sigma^{2}I and r=θ2r=\|\theta^{*}\|_{2}.

While density estimation could be reduced to mean estimation (via estimating the sufficient statistics), our analysis applies directly, to an algorithm that can be interpreted as approximately maximizing the log likelihood while removing outliers.

In the list decodable setting, our results also yield bounds for learning mixtures:

It is fairly surprising that, despite making no assumptions on the structure of the data outside of a mixture component, we nearly match the best computationally efficient results that fully leverage this structure. This suggests that there may be a connection between robustness and computation: perhaps the computational threshold for recovering a planted structure in random data (such as a geometric cluster or a high-density subgraph) matches the robustness threshold for recovering that structure in the presence of an adversary.

Beyond our main results, we develop certain technical machinery that may be of broader interest. Perhaps the most relevant is a novel matrix concentration inequality, based on ideas from spectral graph sparsification (Batson et al., 2012), which holds assuming only bounded second moments:Since the writing of this paper, we were able to prove a stronger version of Proposition 1.1 that requires only bounded first moments.

This result is strong in the following sense: if one instead uses all nn samples xix_{i}, the classical result of Rudelson (1999) only bounds the maximum eigenvalue by roughly σ2log(n){\sigma^{2}\log(n)}, and even then only in expectation. Even under stronger assumptions, one often needs either at least dlog(d)d\log(d) samples, or incurs a log(d)\log(d) factor in the bound on λmax\lambda_{\max}. In the planted partition model, this log factor causes natural spectral approaches to fail on sparse graphs, and avoiding the log factor has been a topic of recent interest (Guédon and Vershynin, 2014; Le et al., 2015; Rebrova and Tikhomirov, 2015; Rebrova and Vershynin, 2016).

Proposition 1.1 says that the undesirable log factor only arises due to a manageable fraction of bad samples, which when removed give us sharper concentration. Our framework allows us to exploit this by defining the good data to be the (unknown) set II for which the conclusion of Proposition 1.1 holds. One consequence is that we are able to recover planted partitions in sparse graphs essentially “for free”.

Separately, we introduce a novel regularizer based on minimum trace ellipsoids. This regularizer allows us to control the spectral properties of the model parameters at multiple scales simultaneously, and yields tighter bounds than standard trace norm regularization. We define the regularizer in Section 3, and prove a local Hölder’s inequality (Lemma 5.1), which exploits this regularization to achieve concentration bounds solely from deterministic spectral information.

We also employ padded decompositions, a space partitioning technique from the metric embedding literature (Fakcharoenphol et al., 2003). Their use is the following: when the loss functions are strongly convex, we can improve our bounds by identifying clusters in the data, and re-running our main algorithm on each cluster. Padded decompositions help us because they can identify clusters even if the remaining data has arbitrary structure. Our clustering scheme is described in Section 6.

The work closest to ours is Lai et al. (2016) and Diakonikolas et al. (2016), who study high-dimensional estimation in the presence of adversarial corruptions. They focus on the regime α1\alpha\approx 1, while our work focuses on α1\alpha\ll 1. In the overlap of these regimes (e.g. for α=34\alpha=\frac{3}{4}) our results improve upon these existing results. (The existing bounds are better as α1\alpha\to 1, but do not hold at all if α12\alpha\leq\frac{1}{2}.) The popular robust PCA algorithm (Candès et al., 2011; Chandrasekaran et al., 2011) allows for a constant fraction of the entries to be arbitrarily corrupted, but assumes the locations of these entries are sufficiently evenly distributed. However, Xu et al. (2010) give a version of PCA that is robust to arbitrary adversaries if α>12\alpha>\frac{1}{2}. Bhatia et al. (2015) study linear regression in the presence of adversaries, and obtain bounds for sufficiently large α\alpha (say α6465\alpha\geq\frac{64}{65}) when the design matrix is subset strong convex. Klivans et al. (2009) and Awasthi et al. (2014) provide strong bounds for robust classification in high dimensions for isotropic log-concave distributions.

The only works we are aware of that achieve general adversarial guarantees when α12\alpha\leq\frac{1}{2} are Hardt and Moitra (2013), who study robust subspace recovery in the presence of a large fraction of outliers, and Steinhardt et al. (2016), which is an early version of this work that focuses on community detection.

Balcan et al. (2008) introduce the list-decodable learning model, which was later studied by others, e.g. Balcan et al. (2009) and Kushagra et al. (2016). That work provides bounds for clustering in the presence of some adversarial data, but has two limitations relative to our results (apart from being in a somewhat different setting): the fraction of adversaries tolerated is small (O(1k)\mathcal{O}(\frac{1}{k})), and the bounds are not meaningful in high dimensions (e.g. Balcan et al. (2008) output a list of kO(k/γ2)k^{\mathcal{O}(k/\gamma^{2})} hypotheses, where γ\gamma typically scales as 1/d1/\sqrt{d}).

Kumar and Kannan (2010) and the follow-up work of Awasthi and Sheffet (2012) find deterministic conditions under which efficient kk-means clustering is possible, even in high dimensions. While the goal is different from ours, their condition is similar to the quantities appearing in our error bounds, and there is some overlap in techniques. They also obtain bounds in the presence of adversaries, but only if the fraction of adversaries is smaller than 1k\frac{1}{k} as in Balcan et al. (2008). Our corollaries for learning mixtures can be thought of as extending this line of work, by providing deterministic conditions under which clustering is possible even in the presence of a large fraction of adversarial data.

Separately, there has been considerable interest in semi-random graph models (Blum and Spencer, 1995; Feige and Krauthgamer, 2000; Feige and Kilian, 2001; Coja-Oghlan, 2004; Krivelevich and Vilenchik, 2006; Coja-Oghlan, 2007; Makarychev et al., 2012; Chen et al., 2014b; Guédon and Vershynin, 2014; Moitra et al., 2015; Agarwal et al., 2015) and robust community detection (Kumar and Kannan, 2010; Moitra et al., 2015; Makarychev et al., 2015; Cai and Li, 2015). In these models, a random graph is generated with a planted structure (such as a planted clique or partition) and adversaries are then allowed to modify some parts of this structure. Typically, the adversary is constrained to only modify o(n)o(n) nodes or to only modify the graph in restricted ways, though some of the above work considers substantially stronger adversaries as well.

Finally, there is a large literature on learning in the presence of errors, spanning multiple communities including learning theory (Kearns and Li, 1993) and statistics (Tukey, 1960). We refer the reader to Huber and Ronchetti (2009) and Hampel et al. (2011) for some recent surveys.

We pause to explain how our techniques relate to those of other recent robust learning papers by Diakonikolas et al. (2016) and Lai et al. (2016). At a high level, our algorithm works by solving a convex optimization problem whose objective value will be low if all of the data come from pp^{*}; then, if the objective is high, by looking at the dual we can identify which points are responsible for the high objective value and remove them as outliers.

In contrast, Diakonikolas et al. (2016) solve a convex feasibility problem, where the feasible set depends on the true distribution pp^{*} and hence is not observable. Nevertheless, they show that given a point that is far from feasible, it is possible to provide a separating hyperplane demonstrating infeasibility. Roughly speaking, then, we solve a “tainted” optimization problem and clean up errors after the fact, while they solve a “clean” (but unobserved) optimization problem and show that it is possible to make progress if one is far from the optimum. The construction of the separation oracle in Diakonikolas et al. (2016) is similar to the outlier removal step we present here, and it would be interesting to further understand the relationship between these approaches.

Diakonikolas et al. (2016) also propose another algorithm based on filtering. In the case of mean estimation, the basic idea is to compute the maximum eigenvector of the empirical covariance of the data — if this eigenvector is too large, then we can find a collection of points that are responsible for it being large, and remove them as outliers. Though it is not phrased this way, it can be thought of–similarly to our approach–as solving a tainted optimization problem (top eigenvalue on the noisy data) and then cleaning up outliers afterwards. Their outlier removal step seems tighter than ours, and it would be interesting to find an approach that obtains such tight bounds for a general class of optimization problems.

Finally, Lai et al. (2016) pursue an approach based on iteratively finding the top n/2n/2 eigenvectors (rather than just the top) and projecting out the remaining directions of variation, as well as removing outliers if the eigenvalues are too large. This seems similar in spirit to the filtering approach described above.

Our paper is organized as follows. In Section 2 we present our main results and some of their implications in specific settings. In Section 3 we explain our algorithm and provide some intuition for why it should work. In Section 4 we provide a proof outline for our main results. In Sections 5 and 6, we sharpen our results, first showing how to obtain concentration inequalities on the errors, and then showing how to obtain tighter bounds and stronger guarantees for strongly convex losses. In Section 7 we present lower bounds showing that our results are optimal in some settings. In Section 8 we present some intuition for our bounds, and in Section 9 we prove our main corollaries. The remaining sections are dedicated to deferred proofs.

We thank the anonymous reviewers who made many helpful comments that improved this paper. MC was supported by NSF grants CCF-1565581, CCF-1617577, CCF-1302518 and a Simons Investigator Award. JS was supported by a Fannie & John Hertz Foundation Fellowship, a NSF Graduate Research Fellowship, and a Future of Life Institute grant. GV was supported by NSF CAREER award CCF-1351108, a Sloan Foundation Research Fellowship, and a research grant from the Okawa Foundation.

Main Results and Implications

This stochastic optimization setting captures most concrete settings of interest — for instance, mean estimation corresponds to fi(w)=wxi22f_{i}(w)=\|w-x_{i}\|_{2}^{2}, linear regression to fi(w)=(yiw,xi)2f_{i}(w)=(y_{i}-\langle w,x_{i}\rangle)^{2}, and logistic regression to fi(w)=log(1+exp(yiw,xi))f_{i}(w)=\log(1+\exp(-y_{i}\langle w,x_{i}\rangle)).

The fact that fipf_{i}\sim p^{*} is irrelevant to our results—all that matters is the quantity SS, which exists even for a deterministic set of functions f1,,fnf_{1},\ldots,f_{n}. Furthermore, SS only depends on the good data and is independent of the adversary.

The definition (1) is a bit complex, so we go over some examples for intuition. We will see later that for the first two examples below (estimating means and product distributions), our implied error bounds are “good”, while for the final example (linear classification), our bounds are “bad”.

Product distributions: Suppose that xix_{i} is drawn from a product distribution on {0,1}d\{0,1\}^{d}, where the jjth coordinate is 11 with probability pjp_{j}. Let fi(w)=j=1dxijlog(wj)+(1xij)log(1wj)f_{i}(w)=\sum_{j=1}^{d}x_{ij}\log(w_{j})+(1-x_{ij})\log(1-w_{j}). In this case fˉ(w)=j=1dpjlog(wj)+(1pj)log(1wj)\bar{f}(w)=\sum_{j=1}^{d}p_{j}\log(w_{j})+(1-p_{j})\log(1-w_{j}), and wj=pjw_{j}^{*}=p_{j}, so that fˉ(w)fˉ(w)\bar{f}(w)-\bar{f}(w^{*}) is the KL divergence between pp and ww.

Linear classification: Suppose that xiN(0,I)x_{i}\sim\mathcal{N}(0,I) and that yi=sign(uxi)y_{i}=\operatorname{sign}(u^{\top}x_{i}) for some unknown vector uu. Our loss function is the logistic loss fi(w)=log(1+exp(yiw,xi))f_{i}(w)=\log(1+\exp(-y_{i}\langle w,x_{i}\rangle)). In this case fi(w)=yi1+exp(yiw,xi)xi\nabla f_{i}(w)=-\frac{y_{i}}{1+\exp(y_{i}\langle w,x_{i}\rangle)}x_{i}. It is less obvious how to compute SS, but Lemma 2.1 below implies that it is O(1)\mathcal{O}(1).

If fi(w)fˉ(w)\nabla f_{i}(w)-\nabla\bar{f}(w) is σ\sigma-sub-Gaussian and LL-Lipschitz for fipf_{i}\sim p^{*}, and αn\alpha n is at least dmax(1,log(rLσ))+log(1/δ)d\max(1,\log(\tfrac{rL}{\sigma}))+\log(1/\delta), then S=O(σ)S=\mathcal{O}\left(\sigma\right) with probability 1δ1-\delta.

1 Main Results

We can now state our main results. Our first result is that, just using the untrusted data, we can output a small ellipse which contains a parameter attaining small error under fˉ\bar{f}. This meta-result leads directly to our more specific results in the list decoding and semi-verified settings.

By applying our algorithm multiple times we can improve the bound wμ22=O(σμ2/α)\|w-\mu\|_{2}^{2}=\mathcal{O}\left(\sigma\|\mu\|_{2}/\sqrt{\alpha}\right) to wμ22=O(σ2/α)\|w-\mu\|_{2}^{2}=\mathcal{O}\left(\sigma^{2}/\alpha\right), so that our bounds are meaningful for large dd independent of μ2\|\mu\|_{2}. We discuss this in more detail in Section 6.

For an example where Theorem 2.2 is less meaningful, consider the linear classification example from above. In that case S=O(1)S=\mathcal{O}(1), and rr is likely also O(1)\mathcal{O}(1), so we obtain the bound fˉ(w)fˉ(w)=O(1/α)\bar{f}(w)-\bar{f}(w^{*})=\mathcal{O}(1/\sqrt{\alpha}). However, fˉ(0)=log(2)\bar{f}(0)=\log(2) while fˉ(w)0\bar{f}(w^{*})\geq 0, so this bound is essentially vacuous.

Using Theorem 2.2 as a starting point we can derive bounds for both models defined in Section 1, starting with the list decodable model. Here, we must make the further assumption that the fif_{i} are κ\kappa-strongly convex, meaning that fi(w)fi(w)(ww)fi(w)+κ2ww22f_{i}(w^{\prime})-f_{i}(w)\geq(w^{\prime}-w)^{\top}\nabla f_{i}(w)+\frac{\kappa}{2}\|w^{\prime}-w\|_{2}^{2}. The strong convexity allows us to show that for the good fif_{i}, the parameters w^i=arg minwEYfi(w)\hat{w}_{i}=\operatornamewithlimits{arg\,min}_{w\in\mathcal{E}_{Y}}f_{i}(w) concentrate around ww^{*}, with radius rrr^{\prime}\ll r. By clustering the w^i\hat{w}_{i} and iteratively re-running our algorithm on each cluster, we can obtain bounds that do not depend on rr, and output a single candidate parameter w^j\hat{w}_{j} for each cluster. We can thereby show:

In Section 6 we state and prove a stronger version of this result. A key tool in establishing Theorem 2.3 is padded decompositions (Fakcharoenphol et al., 2003), which identify clusters in data while making minimal assumptions on the geometry of points outside of a cluster, and are thus useful in our adversarial setting.

If the fif_{i} are not strongly convex then we cannot employ the clustering ideas above. However, because we have reduced H\mathcal{H} to the much smaller set EY\mathcal{E}_{Y}, we can nevertheless often approximate ww^{*} with only a small amount of verified data. In fact, in some settings we only need a single verified data point:

The particular functional form for fif_{i} was needed to obtain a concrete bound, but analogs of Lemma 2.4 should be possible in any setting where we can leverage the low complexity of EY\mathcal{E}_{Y} into a bound on ffˉf-\bar{f}. Note that if we replace EY\mathcal{E}_{Y} with H\mathcal{H} in Lemma 2.4, then the rα\frac{r}{\sqrt{\alpha}} dependence becomes rdr\sqrt{d}, which is usually vacuous.

The dependence on SS, rr and κ\kappa in the results above seems essentially necessary, though the optimal dependence on α\alpha is less clear. In Section 7 we show lower bounds for robust mean estimation even if pp^{*} is known to be Gaussian. These bounds roughly translate to a lower bound of \Omega\big{(}{\frac{S}{\kappa}{\sqrt{\log{({1}/{\alpha})}}}}\big{)} for strongly convex fif_{i}, and \Omega\big{(}{Sr\sqrt{\log(\smash{{1}/{\alpha}})}}\big{)} for linear fif_{i}, and hold in both the list decodable and semi-verified settings. It is unclear whether the optimal dependence on α\alpha is 1/α\sqrt{{1}/{\alpha}} or log(1/α)\sqrt{\log({1}/{\alpha})} or somewhere between. We do note that any dependence better than 1/α\sqrt{{1}/{\alpha}} would improve the best known results for efficiently solving kk-means for well-separated clusters, which may suggest at least a computational barrier to achieving log(1/α)\sqrt{\log({1}/{\alpha})}.

2 Implications

We now go over some implications of our general results in some more specific settings. All of the results below follow as corollaries of our main theorems, and are proved in Section 9.

Suppose that pp^{*} has bounded covariance: Covp[x]σ2I\operatorname{Cov}_{p^{*}}[x]\preceq\sigma^{2}I. Then, for ndαn\geq\frac{d}{\alpha}, with probability 1exp(Ω(αn))1-\exp\left(-\Omega(\alpha n)\right) it is possible to output mO(1α)m\leq\mathcal{O}\left(\frac{1}{\alpha}\right) candidate means μ^1,,μ^m\hat{\mu}_{1},\ldots,\hat{\mu}_{m} such that \min_{j=1}^{m}\|\mu-\hat{\mu}_{j}\|_{2}\leq\mathcal{O}\Big{(}{\sigma\sqrt{\frac{\log(2/\alpha)}{\alpha}}}\Big{)}. Moreover, if α0.51\alpha\geq 0.51 then we can take m=1m=1.

We can compare to the results of Lai et al. (2016) and Diakonikolas et al. (2016), who study mean estimation when α>12\alpha>\frac{1}{2} and one is required to output a single parameter (i.e., m=1m=1). For simplicity take α=34\alpha=\frac{3}{4}. Roughly, Lai et al. (2016) obtain error O(σlog(d))\mathcal{O}({\sigma\sqrt{\log(d)}}) with sample complexity n=O(d)n=\mathcal{O}\left(d\right), while requiring a bound on the fourth moments of pp^{*}; Diakonikolas et al. (2016) obtain error O(σ)\mathcal{O}\left(\sigma\right) with sample complexity n=O(d3)n=\mathcal{O}\left(d^{3}\right), and require pp^{*} to be Gaussian. Corollary 2.5 improves both of these by yielding error O(σ)\mathcal{O}\left(\sigma\right) with sample complexity n=O(d)n=\mathcal{O}\left(d\right), and only requires pp^{*} to have bounded second moments. Some fine print: Diakonikolas et al. also estimate the covariance matrix, and their recovery results are stronger if Σ=Cov[x]\Sigma=\operatorname{Cov}[x] is highly skewed; they roughly show μ^μΣ1=O(1)\|\hat{\mu}-\mu\|_{\Sigma^{-1}}=\mathcal{O}(1). The adversary model considered in both of these other papers is also slightly more general than ours: first nn points are drawn from pp^{*} and then an adversary is allowed to corrupt (1α)n(1-\alpha)n of the points. However, it is straightforward to show (by monotonicity of the operator norm) that our bounds will be worse by at most a 1/α1/\sqrt{\alpha} factor in this stricter setting, which is a constant if α=34\alpha=\frac{3}{4}. We note that in contrast to our results, these other results obtain error that vanishes as α1\alpha\to 1 (at a rate of (1α)1/2(1-\alpha)^{1/2} in the first case and O~(1α)\widetilde{\mathcal{O}}(1-\alpha) in the second case). We thus appear to incur some looseness when α1\alpha\approx 1, in exchange for obtaining results in the previously unexplored setting α12\alpha\leq\frac{1}{2}. It would be interesting to obtain a single algorithm that both applies when α1\alpha\ll 1 and achieves vanishing error as α1\alpha\to 1.

Suppose we are given nn samples, where each sample either comes from one of kk distributions p1,,pkp_{1}^{*},\ldots,p_{k}^{*} (with Covpi[x]σ2I\operatorname{Cov}_{p_{i}^{*}}[x]\preceq\sigma^{2}I for all ii), or is arbitrary. Let μi\mu_{i} be the mean of pip_{i}^{*}, let αi\alpha_{i} be the fraction of points from pip_{i}^{*}, and let α=mini=1kαi\alpha=\min_{i=1}^{k}\alpha_{i}. Then if ndαn\geq\frac{d}{\alpha}, with probability 1kexp(Ω(αε2n))1-k\exp(-\Omega(\alpha\varepsilon^{2}n)) we can obtain a partition T1,,TmT_{1},\ldots,T_{m} of [n][n] and corresponding candidate means μ^1,,μ^m\hat{\mu}_{1},\ldots,\hat{\mu}_{m} such that: for all but εαin\varepsilon\alpha_{i}n of the points drawn from pip_{i}^{*}, the point is assigned a set TjT_{j} and candidate mean μ^j\hat{\mu}_{j} with μiμ^j2O(σεlog(2α)α)\|\mu_{i}-\hat{\mu}_{j}\|_{2}\leq\mathcal{O}\left(\frac{\sigma}{\varepsilon}\sqrt{\frac{\log(\frac{2}{\alpha})}{\alpha}}\right). Moreover, mO(1α)m\leq\mathcal{O}\left(\frac{1}{\alpha}\right).

The 1ε\frac{1}{\varepsilon} dependence can be replaced with log(n)/ε\sqrt{\log(n)/\varepsilon}, or with log(2/ε)\sqrt{\log(2/\varepsilon)} if the xix_{i} are sub-Gaussian. The only difference is in which matrix concentration bound we apply to the xix_{i}. Corollary 2.6 says that we can partition the points into O(1α)\mathcal{O}\left(\frac{1}{\alpha}\right) sets, such that two points from well-separated clusters are unlikely to end up in the same set. Note however that one cluster might be partitioned into multiple sets. In the adversarial setting, this seems unavoidable, since an adversary could create a fake cluster very close to a real cluster, and one would be forced to either combine the clusters or risk splitting the real cluster in two.

We next consider implications of our results in a version of the planted partition model (McSherry, 2001). In this model we observe a random directed graph, represented as a matrix A{0,1}n×nA\in\{0,1\}^{n\times n}. For disjoint subsets I1,,IkI_{1},\ldots,I_{k} of [n][n], we generate edges as follows: (i) If u,vIiu,v\in I_{i}, then p(Auv=1)=anp(A_{uv}=1)=\frac{a}{n}. (ii) If uIi,v∉Iiu\in I_{i},v\not\in I_{i}, then p(Auv=1)=bnp(A_{uv}=1)=\frac{b}{n}. (iii) If u∉i=1kIiu\not\in\cup_{i=1}^{k}I_{i}, the edges emanating from uu can be arbitrary. In contrast to the typical planted partition model, we allow some number of corrupted vertices not belonging to any of the IiI_{i}. In general aa and bb could depend on the partition indices ii, jj, but we omit this for simplicity.

Note that the distribution over the row AuA_{u} is the same for all uIiu\in I_{i}. By taking this distribution to be the distribution pp^{*}, Theorem 2.3 yields the following result:

For the planted partition model above, let α=mini=1kIin\alpha=\min_{i=1}^{k}\frac{|I_{i}|}{n}. Then, with probability 1exp(Ω(αn))1-\exp(-\Omega(\alpha n)), we can obtain sets T1,,Tm[n]T_{1},\ldots,T_{m}\subseteq[n], with mO(1α)m\leq\mathcal{O}\left(\frac{1}{\alpha}\right), such that for all i[k]i\in[k], there is a j[m]j\in[m] with IiTjO(alog(2α)α2(ab)2)n|I_{i}\triangle T_{j}|\leq\mathcal{O}\left(\frac{a\log(\frac{2}{\alpha})}{\alpha^{2}(a-b)^{2}}\right)n, where \triangle denotes symmetric difference.

This shows that we can approximately recover the planted partition, even in the presence of arbitrary corruptions, provided (ab)2alog(2/α)α3\frac{(a-b)^{2}}{a}\gg\frac{\log(2/\alpha)}{\alpha^{3}} (since the bound on IiTj|I_{i}\triangle T_{j}| needs to be less than αn\alpha n to be meaningful). In contrast, the best efficient methods (assuming no corruptions) roughly require (ab)2a+(k1)bk\frac{(a-b)^{2}}{a+(k-1)b}\gg k in the case of kk equal-sized communities (Abbe and Sandon, 2015a; b). In the simplifying setting where b=12ab=\frac{1}{2}a, our bounds require ak3log(k)a\gg k^{3}\log(k) while existing bounds require ak2a\gg k^{2}. The case of unequal size communities is more complex, but roughly, our bounds require alog(2/α)α3a\gg\frac{\log(2/\alpha)}{\alpha^{3}} in contrast to a1α2a\gg\frac{1}{\alpha^{2}}.

Algorithm

In this section we present our algorithm, which consists of an SDP coupled with an outlier removal step. At a high level, our algorithm works as follows: first, we give each function fif_{i} its own parameter vector wiw_{i}, and minimize i=1nfi(wi)\sum_{i=1}^{n}f_{i}(w_{i}) subject to regularization which ensures the wiw_{i} remain close to each other; formally, we bound the wiw_{i} to lie within a small ellipse. The reason for doing this is that the different wiw_{i} are now only coupled via this regularization, and so the influence of adversarial data on the good parameters can only come from its effect on the shape of the ellipse. We will show that whenever the adversaries affect the shape of the ellipse more than a small amount, they are necessarily outliers that can be identified and removed. In the remainder of this section, we elaborate on these two steps of regularization and outlier removal, and provide pseudocode.

If the functions f1,,fnf_{1},\ldots,f_{n} were all drawn from pp^{*} (i.e., there are no adversaries), then a natural approach would be to let w^\hat{w} be the minimizer of i=1nfi(w)\sum_{i=1}^{n}f_{i}(w), which will approximately minimize fˉ(w)\bar{f}(w) by standard concentration results.

The problem with using this approach in the adversarial setting is that even a single adversarially chosen function fif_{i} could substantially affect the value of w^\hat{w}. To minimize this influence, we give each fif_{i} its own parameter wiw_{i}, and minimize i=1nfi(wi)\sum_{i=1}^{n}f_{i}(w_{i}), subject to a regularizer which encourages the wiw_{i} to be close together. The adversary now has no influence on the good wiw_{i} except via the regularizer, so the key challenge is to find a regularizer which sufficiently controls statistical error while also bounding the influence of the adversary.

It turns out that the right choice of regularizer in this case is to constrain the wiw_{i} to lie within an ellipse with small trace. Formally, the centerpiece of our algorithm is the following convex optimization problem:

Here the coefficients cic_{i} are non-negative weights which will eventually be used to downweight outliers (for now it is fine to imagine that ci=1c_{i}=1). Note that wiwiYw_{i}w_{i}^{\top}\preceq Y is equivalent to the semidefinite constraint \left[\begin{array}[]{cc}Y&w_{i}\\ w_{i}^{\top}&1\end{array}\right]\succeq 0. The problem (4) can be solved in polynomial time in nn and dd assuming oracle access to the gradients fi\nabla f_{i}.

Note that the semidefinite constraint wiwiYw_{i}w_{i}^{\top}\preceq Y is equivalent to wiY1wi1w_{i}^{\top}Y^{-1}w_{i}\leq 1, which says that wiw_{i} lies within the ellipse centered at defined by YY. The regularizer is thus the trace of the minimum ellipse containing the wiw_{i}; penalizing this trace will tend to push the wiw_{i} closer together, but is there any intuition behind its geometry? The following lemma shows that tr(Y)\operatorname{tr}(Y) is related to the trace norm of [w1  wn]\left[w_{1}\ \cdots\ w_{n}\right]:

The appearance of the trace norm makes sense in light of the intuition that we should be clustering the functions fif_{i}; indeed, trace norm regularization is a key ingredient in spectral algorithms for clustering (see e.g. Zha et al. (2001); Chen et al. (2014a; b)). Lemma 3.1 says that tr(Y)\operatorname{tr}(Y) simultaneously bounds the trace norm on every subset TT of [n][n], which ends up yielding better results than are obtained by simply penalizing the overall trace norm; we believe that this local trace norm regularization likely leads to better results even in non-adversarial spectral learning settings. The most important property of tr(Y)\operatorname{tr}(Y) is that it admits a certain type of local Hölder’s inequality which we will explain in Section 5.

Pseudocode for our algorithm is given in Algorithms 1 and 2.

Approach and Proof Outline

We now provide an outline of the proof of Theorem 2.2, by analyzing the output of Algorithm 1. The structure of our proof has analogies to classical uniform convergence arguments, so we will start by reviewing that case.

In uniform convergence arguments, we assume that all of f1,,fnf_{1},\ldots,f_{n} are drawn from pp^{*}, which brings us into the realm of classical learning theory. The analogue to the optimization (3) in Algorithm 1 is regularized empirical risk minimization:

where h(w)h(w) is a non-negative regularizer. Uniform convergence arguments involve two parts:

Bound the optimization error: Use the definition of w^\hat{w} to conclude that i=1nfi(w^)i=1nfi(w)+λh(w)\sum_{i=1}^{n}f_{i}(\hat{w})\leq\sum_{i=1}^{n}f_{i}(w^{*})+\lambda h(w^{*}) (since w^\hat{w} minimizes (7)). This step shows that w^\hat{w} does almost as well as ww^{*} at minimizing the empirical risk i=1nfi(w)\sum_{i=1}^{n}f_{i}(w).

Bound the statistical error: Show, via an appropriate concentration inequality, that 1ni=1nfi(w)\frac{1}{n}\sum_{i=1}^{n}f_{i}(w) is close to fˉ(w)\bar{f}(w) for all wHw\in\mathcal{H}. Therefore, w^\hat{w} is nearly as good as ww^{*} in terms of the true risk fˉ\bar{f}.

We will see next that the proof of Theorem 2.2 contains steps similar to these, though bounding the statistical error in the presence of adversaries requires an additional step of removing outliers.

Proof Overview

We will establish a stronger version of Theorem 2.2, which exhibits an explicit wEY^w\in\mathcal{E}_{\hat{Y}} with small error:

To prove Theorem 4.1, recall that Algorithm 1 has at its core the following optimization problem:

Throughout the argument, we will make use of the optimality of (w^1:n,Y^)(\hat{w}_{1:n},\hat{Y}) for (8), which implies that

The solution w^1:n\hat{w}_{1:n} to (8) satisfies

We next consider the statistical error. We cannot bound this error via standard uniform convergence techniques, because each fif_{i} has a different argument w^i\hat{w}_{i}. However, it turns out that the operator norm bound SS, together with a bound on tr(Y^)\operatorname{tr}(\hat{Y}), yield concentration of the fif_{i} to fˉ\bar{f}. In particular, we have:

Moreover, the above supposition holds if λ=8αnSr\lambda=\frac{\sqrt{8\alpha}nS}{r} and tr(Y^)>6r2α\operatorname{tr}(\hat{Y})>\frac{6r^{2}}{\alpha}.

Lemma 4.5 ensures that eventually we have tr(Y^)O(r2α)\operatorname{tr}(\hat{Y})\leq\mathcal{O}\left(\frac{r^{2}}{\alpha}\right), which allows us to bound the overall statistical error (using Lemma 4.3) by O(αnrS)\mathcal{O}\left(\sqrt{\alpha}nrS\right). In addition, since λ=O(αnS/r)\lambda=\mathcal{O}\left(\sqrt{\alpha}nS/r\right), the optimization error is bounded (via Lemma 4.2) by O(αnrS)\mathcal{O}\left(\sqrt{\alpha}nrS\right), as well. Combining the various bounds, we end up obtaining

In the remainder of this section, we will prove Lemmas 4.2 through 4.5, then show more formally how Theorem 4.1 follows from these lemmas.

Proof of Lemma 4.2

Proof of Lemma 4.3

For any w0w_{0} and any w1:nw_{1:n} satisfying wiwiYw_{i}w_{i}^{\top}\preceq Y for all ii, we have

To establish (11) from Lemma 4.6, note that

where (i) is by convexity of fif_{i}, (ii) is because

Proof of Lemma 4.4

Proof of Lemma 4.5

Now, remember that ci=cizmaxzizmaxc_{i}^{\prime}=c_{i}\cdot\frac{z_{\max}-z_{i}}{z_{\max}}. Then for any S[n]S\subseteq[n], we have

Proof of Theorem 4.1

Concentration of Errors: A Local Hölder’s Inequality

If the good data is sub-Gaussian, then can we obtain sub-Gaussian concentration of the errors, or can the adversaries force the error to only be small in expectation? What properties of the good data affect concentration of errors in the presence of an adversary?

We call this a local Hölder’s inequality because it is a sharpening of Lemma 4.6, which established via Hölder’s inequality that

Moreover, for any w,wHw,w^{\prime}\in\mathcal{H}, we have

Our outlier removal step can be modified based on SεS_{\varepsilon} so that almost none of the good points are removed. This is not strictly necessary for any of our later results, but is an intuitively appealing property for our algorithm to have. That we can preserve the good points is unsurprising in light of Corollary 5.3, which says that the good points concentrate, and hence should be cleanly separable from any outliers. The modified outlier removal step is given as Algorithm 3.

Suppose that λ=8αnSr\lambda=\frac{\sqrt{8\alpha}nS}{r} and tr(Y^)>35r2α\operatorname{tr}(\hat{Y})>\frac{35r^{2}}{\alpha}. Then, the update step in Algorithm 3 satisfies

We end this section by giving some intuition for the typical scale of SεS_{\varepsilon}. Recall that Lemma 2.1 shows that, when the gradients of fif_{i} are sub-Gaussian with parameter σ\sigma, then SO(σ)S\leq\mathcal{O}({\sigma}) assuming nd/αn\gg d/\alpha. A similar bound holds for SεS_{\varepsilon}, with an additional factor of log(2/ε)\sqrt{\log(2/\varepsilon)}:

If fi(w)fˉ(w)\nabla f_{i}(w)-\nabla\bar{f}(w) is σ\sigma-sub-Gaussian and LL-Lipschitz, then with probability 1δ1-\delta we have

In particular, if n1εα(dmax(1,log(rLσ))+log(1/δ))n\geq\frac{1}{\varepsilon\alpha}\left(d\max\left(1,\log(\tfrac{rL}{\sigma})\right)+\log(1/\delta)\right), then Sε=O(σlog(2/ε))S_{\varepsilon}=\mathcal{O}\left(\sigma\sqrt{\log(2/\varepsilon)}\right) with probability 1δ1-\delta, where O()\mathcal{O}(\cdot) masks only absolute constants.

What happens if the function errors are not sub-Gaussian, but we still have a bound on S=S1S=S_{1}? We can then bound SεS_{\varepsilon} in terms of SS by exploiting the monotonicity of the operator norm.

For any ε1ε2\varepsilon_{1}\leq\varepsilon_{2}, Sε1ε2αnε1αnSε22ε2ε1Sε2S_{\varepsilon_{1}}\leq\sqrt{\frac{\lfloor\varepsilon_{2}\alpha n\rfloor}{\lfloor\varepsilon_{1}\alpha n\rfloor}}S_{\varepsilon_{2}}\leq 2\sqrt{\frac{\varepsilon_{2}}{\varepsilon_{1}}}S_{\varepsilon_{2}}.

Proof of Lemma 5.1

Throughout this section we will for convenience use Δi\Delta_{i} to denote fi(w0)fˉ(w0)\nabla f_{i}(w_{0})-\nabla\bar{f}(w_{0}). We start with a helper lemma that translates information about SεS_{\varepsilon} into a more useful form:

Letting DT=[Δi]iTD_{T}=[\Delta_{i}]_{i\in T}, we show this as follows:

as was to be shown. Here (i) is Cauchy-Schwarz and (ii) uses the condition wiwiYw_{i}w_{i}^{\top}\preceq Y. ∎

Proof of Lemma 5.2

In the final step we invoke (46) twice, identically to Lemma 4.3.

For (50), we follow an identical argument to (12) from Lemma 4.3. Defining w(t)=tw+(1t)ww(t)=tw+(1-t)w^{\prime}, we have

This yields (50) and completes the lemma.

Proof of Corollary 5.3

On the other hand, by Lemmas 4.3 and 4.2, we also have

Proof of Lemma 5.4

Proof of Lemma 5.6

Bounds for Strongly Convex Losses

We now turn our attention to the special case that the functions fif_{i} are strongly convex in ww, in the sense that for all w,wHw,w^{\prime}\in\mathcal{H},

In this case, we will obtain stronger bounds by iteratively clustering the output w^1:n\hat{w}_{1:n} of Algorithm 1 and re-running the algorithm on each cluster. The main theorem in this section is a recovery result in the list decoding model, for an algorithm (Algorithm 4) that formalizes this clustering intuition:

Note, interestingly, that the bound does not depend on the radius rr. Since the list decoding model can be reduced to the semi-verified model, Theorem 6.1 also yields strengthened results in the semi-verified model when the functions are strongly convex (we omit these for brevity).

By strong convexity of fif_{i}, we then have

Let x1,,xnx_{1},\ldots,x_{n} be points in a metric space. A (ρ,τ,δ\rho,\tau,\delta)-padded decomposition is a (random) partition P\mathcal{P} of {x1,,xn}\{x_{1},\ldots,x_{n}\} such that: (i) each element of P\mathcal{P} has diameter at most ρ\rho, and (ii) for each xix_{i}, with probability 1δ1-\delta all points within distance τ\tau of xix_{i} lie in a single element of P\mathcal{P}.

Fakcharoenphol et al. (2003) show that for any τ\tau and δ\delta, a (ρ,τ,δ)(\rho,\tau,\delta)-padded decomposition exists with ρ=O(τlog(n)δ)\rho=\mathcal{O}\left(\frac{\tau\log(n)}{\delta}\right). Moreover, the same proof shows that, if every xix_{i} is within distance τ\tau of at least Ω(αn)\Omega(\alpha n) other xix_{i}, then we can actually take ρ=O(τlog(2/α)δ)\rho=\mathcal{O}\left(\frac{\tau\log(2/\alpha)}{\delta}\right). In particular, in our case we can obtain a (O(slog(2/α)),2s,7/8)(\mathcal{O}\left(s\log(2/\alpha)\right),2s,7/8)-padded decomposition of the w^i\hat{w}_{i} output by Algorithm 1; we show this formally in Lemma A.1. This probabilistic notion of clustering turns out to be sufficient for our purposes.

We are now prepared to analyze Algorithm 4. In each iteration, Algorithm 4 independently samples l=112log(t(t+1)δ)l=112\log\left(\frac{t(t+1)}{\delta}\right) padded decompositions of the w^i\hat{w}_{i}. For each decomposition Ph\mathcal{P}_{h}, it then runs Algorithm 1 on each component of the resulting partition, and thereby obtains candidate values wˉ1(h),,wˉn(h)\bar{w}_{1}(h),\ldots,\bar{w}_{n}(h). Finally, it updates w^i\hat{w}_{i} by finding a point close to at least 12\frac{1}{2} of the candidate values wˉi(h)\bar{w}_{i}(h), across h=1,,lh=1,\ldots,l.

If we formalize this argument, then we obtain Lemma 6.5, which controls the behavior of the update w^1:n(t)w^1:n(t+1)\hat{w}_{1:n}^{(t)}\to\hat{w}_{1:n}^{(t+1)} on each iteration of the loop:

Essentially, Lemma 6.5 shows that if almost all of the good points are within r(t)r^{(t)} of ww^{*} at the beginning of a loop iteration, then almost all of the good points are within 12r(t)\frac{1}{2}r^{(t)} of ww^{*} at the end of that loop iteration, provided r(t)r^{(t)} is large enough. Using Lemma 6.5, we can establish Theorem 6.1.

Proof of Theorem 6.1

Proof of Proposition 6.2

Proof of Corollary 6.4

where the first inequality is convexity of 22\|\cdot\|_{2}^{2}, and the second is because we only added positive terms to the sum.

Now, we will apply Lemma 6.3 at the value bi=12(bi+BCci)b_{i}^{\prime}=\frac{1}{2}\left(b_{i}+\frac{B}{C}c_{i}\right). This satisfies bib_{i}^{\prime}\in (since jcjjbj\sum_{j}c_{j}\geq\sum_{j}b_{j}), and also ibi=ibiεαn\sum_{i}b_{i}^{\prime}=\sum_{i}b_{i}\geq\varepsilon\alpha n. We can therefore invoke Lemma 6.3 to obtain

Proof of Lemma 6.5

Now if ii is not bad, then wˉi(h)w216r(t)\|\bar{w}_{i}(h)-w^{*}\|_{2}\leq\frac{1}{6}r^{(t)} for at least 5l8\frac{5l}{8} values of hh (because it is small for all but l8\frac{l}{8} of the at least 3l4\frac{3l}{4} good hh). Now, consider any wˉi(h0)\bar{w}_{i}(h_{0}) that is within distance 13r(t)\frac{1}{3}r^{(t)} of at least l2\frac{l}{2} of the wˉi(h)\bar{w}_{i}(h). By the pigeonhole principle, wˉi(h0)\bar{w}_{i}(h_{0}) is therefore close to at least one of these good wˉi(h)\bar{w}_{i}(h), and so satisfies wˉi(h0)w2wˉi(h0)wˉi(h)2+wˉi(h)w212r(t)\|\bar{w}_{i}(h_{0})-w^{*}\|_{2}\leq\|\bar{w}_{i}(h_{0})-\bar{w}_{i}(h)\|_{2}+\|\bar{w}_{i}(h)-w^{*}\|_{2}\leq\frac{1}{2}r^{(t)}. Moreover, such a h0h_{0} exists since any of the good wˉi(h)\bar{w}_{i}(h) be within distance 13r(t)\frac{1}{3}r^{(t)} of the other good wˉi(h)\bar{w}_{i}(h^{\prime}).

Lower Bounds

In this section we prove lower bounds showing that the dependence of our bounds on SS is necessary even if pp^{*} is a multivariate Gaussian. For α\alpha, we are only able to show a necessary dependence of log(1α)\sqrt{\log(\frac{1}{\alpha})}, rather than the 1/α\sqrt{{1}/{\alpha}} appearing in our bounds. Determining the true worst-case dependence on α\alpha is an interesting open problem.

One natural question is whether SS, which typically depends on the maximum singular value of the covariance matrix, is really the right dependence, or whether we could achieve bounds based on e.g. the average singular value instead. Lemma 7.1 rules this out, showing that it would require Ω(2k)\Omega(2^{k}) candidates in the list-decodable setting to achieve dependence on even the kkth singular value of the covariance matrix.

Suppose that pp^{*} is known to be a multivariate Gaussian N(μ,Σ)\mathcal{N}(\mu,\Sigma) with covariance ΣΣ0\Sigma\preceq\Sigma_{0}, where μ\mu and Σ\Sigma are otherwise unknown. Also let σk2\sigma_{k}^{2} denote the kkth largest singular value of Σ0\Sigma_{0}. Then, given any amount of data, an α\alpha-fraction of which is drawn from pp^{*}, any procedure for outputting at most m=2k1m=2^{k-1} candidate means μ^1,,μ^m\hat{\mu}_{1},\ldots,\hat{\mu}_{m} must have, with probability at least 12\frac{1}{2}, minj=1mμ^jμ2σk4log(1α)1+log(1α)\min_{j=1}^{m}\|\hat{\mu}_{j}-\mu\|_{2}\geq\frac{\sigma_{k}}{4}\cdot\frac{\log(\frac{1}{\alpha})}{\sqrt{1+\log(\frac{1}{\alpha})}}.

By the reduction from the semi-verified to the list-decodable model, we obtain a lower bound in the semi-verified setting as well:

Any algorithm that has μ^μ2<σk4log(1α)1+log(1α)\|\hat{\mu}-\mu\|_{2}<\frac{\sigma_{k}}{4}\cdot\frac{\log(\frac{1}{\alpha})}{\sqrt{1+\log(\frac{1}{\alpha})}} with probability at least 12\frac{1}{2} in the semi-verified model requires at least k2log2(1α)\frac{k-2}{\log_{2}(\frac{1}{\alpha})} verified samples.

Let us suppose that the unverified data has distribution p^=N(0,Σ0)\hat{p}=\mathcal{N}(0,\Sigma_{0}), which is possible iff p^αp\hat{p}\geq\alpha p^{*}. We start with a lemma characterizing when it is possible that the true distribution pp^{*} has mean μ\mu:

Let p^=N(0,Σ0)\hat{p}=\mathcal{N}(0,\Sigma_{0}) and p=N(μ,Σ0λμμ)p^{*}=\mathcal{N}(\mu,\Sigma_{0}-\lambda\mu\mu^{\top}). Then p^αp\hat{p}\geq\alpha p^{*} provided that μΣ01μlog2(1α)1+log(1α)\mu^{\top}\Sigma_{0}^{-1}\mu\leq\frac{\log^{2}(\frac{1}{\alpha})}{1+\log(\frac{1}{\alpha})} and λ=1log(1α)\lambda=\frac{1}{\log(\frac{1}{\alpha})}.

As a consequence, if μΣ01μt2\mu^{\top}\Sigma_{0}^{-1}\mu\leq t^{2} for t=log(1α)1+log(1α)t=\frac{\log(\frac{1}{\alpha})}{\sqrt{1+\log(\frac{1}{\alpha})}}, then pp^{*} could have mean μ\mu. Now, consider the space spanned by the kk largest eigenvectors of Σ0\Sigma_{0}, and let BkB_{k} be the ball of radius tσkt\sigma_{k} in this space. Also let P\mathcal{P} be a maximal packing of BkB_{k} of radius t4σk\frac{t}{4}\sigma_{k}. A simple volume argument shows that Bk2k|B_{k}|\geq 2^{k}. On the other hand, every element μ\mu of BkB_{k} is a potential mean because it satisfies μΣ01μμ22σk2t2\mu^{\top}\Sigma_{0}^{-1}\mu\leq\frac{\|\mu\|_{2}^{2}}{\sigma_{k}^{2}}\leq t^{2}. Therefore, if the true mean μ\mu is drawn uniformly from BkB_{k}, any 2k12^{k-1} candidate means μ^j\hat{\mu}_{j} must miss at least half of the elements of BkB_{k} (in the sense of being distance at least t4σk\frac{t}{4}\sigma_{k} away) and so with probability at least 12\frac{1}{2}, μ^jμ2\|\hat{\mu}_{j}-\mu\|_{2} is at least t4σk\frac{t}{4}\sigma_{k} for all jj, as was to be shown.

Proof of Corollary 7.2

Suppose that we can obtain μ^\hat{\mu} satisfying μ^μ2<σk4log(1α)1+log(1α)\|\hat{\mu}-\mu\|_{2}<\frac{\sigma_{k}}{4}\cdot\frac{\log(\frac{1}{\alpha})}{\sqrt{1+\log(\frac{1}{\alpha})}} with mk2log2(1α)m\leq\frac{k-2}{\log_{2}(\frac{1}{\alpha})} samples, with probability at least 12\frac{1}{2}. If we sample mm elements uniformly from the unverified samples, with probability αm\alpha^{m} we will obtain only samples from pp^{*}, and therefore with probability 12αm\frac{1}{2}\alpha^{m} we can obtain (by assumption) a candidate mean μ^\hat{\mu} with μ^μ2<σk4log(1α)1+log(1α)\|\hat{\mu}-\mu\|_{2}<\frac{\sigma_{k}}{4}\cdot\frac{\log(\frac{1}{\alpha})}{\sqrt{1+\log(\frac{1}{\alpha})}}. If we repeat this 2αm\frac{2}{\alpha^{m}} times, then with probability 1(112αm)2αm11e>121-\left(1-\frac{1}{2}\alpha^{m}\right)^{\frac{2}{\alpha^{m}}}\geq 1-\frac{1}{e}>\frac{1}{2}, at least one of the candidate means will be close to μ\mu. But by Lemma 7.1, this implies that 2αm2k1\frac{2}{\alpha^{m}}\geq 2^{k-1}, and so mk2log2(1α)m\geq\frac{k-2}{\log_{2}(\frac{1}{\alpha})}, as claimed.

Intuition: Stability Under Subsets

In this section, we establish a sort of “duality for robustness” that provides intuition underlying our results. Essentially, we will show the following:

If a statistic of a dataset is approximately preserved across every large subset of the data, then it can be robustly recovered even in the presence of a large amount of additional arbitrary/adversarial data.

In such a case, can we approximate the mean of II, even if II is not known and the points in [n]\I[n]\backslash I are arbitrary?

If we do not care about computation, the answer is yes, via a simple exponential-time algorithm. Call a set of points JJ α\alpha-stable if it satisfies the following properties: Jαn|J|\geq\alpha n, and (102) holds with II replaced by JJ (i.e., the mean moves by at most ε\varepsilon when taking subsets of JJ of size at least 12α2n\frac{1}{2}\alpha^{2}n). Then, we can run the following (exponential-time) algorithm:

Find an α\alpha-stable set JJ which has overlap at most 12α2n\frac{1}{2}\alpha^{2}n with all elements of U\mathcal{U}.

Append JJ to U\mathcal{U} and continue until no more such sets exist.

A simple inclusion-exclusion argument Specifically, kk sets of size αn\alpha n with pairwise overlap 12α2n\frac{1}{2}\alpha^{2}n must have a union of size at least kαn12(k2)α2nk\alpha n-\frac{1}{2}\binom{k}{2}\alpha^{2}n, which implies k2αk\leq\frac{2}{\alpha}. shows that k2αk\leq\frac{2}{\alpha}. Therefore, the above algorithm terminates with U2α|\mathcal{U}|\leq\frac{2}{\alpha}. On the other hand, by construction, II must have overlap at least 12α2n\frac{1}{2}\alpha^{2}n with at least one element of U\mathcal{U} (as otherwise we could have also added II to U\mathcal{U}). Therefore, for some JUJ\in\mathcal{U}, IJ12α2n|I\cap J|\geq\frac{1}{2}\alpha^{2}n. But then, letting T=IJT=I\cap J, we have μIμJ2μIμT2+μTμJ22ε\|\mu_{I}-\mu_{J}\|_{2}\leq\|\mu_{I}-\mu_{T}\|_{2}+\|\mu_{T}-\mu_{J}\|_{2}\leq 2\varepsilon. Therefore, the mean over II is within distance 2ε2\varepsilon of at least one of the μJ\mu_{J}, for JUJ\in\mathcal{U}.

We have therefore established our stated duality property for robustness: if a statistic is stable under taking subsets of data, then it can also be recovered if the data is itself a subset of some larger set of data.

Can we make the above algorithm computationally efficient? First, can we even check if a set JJ is α\alpha-stable? This involves checking the following constraint:

It is already unclear how to check this constraint efficiently. However, defining the matrix Cij=cicjJ×JC_{ij}=c_{i}c_{j}\in^{J\times J}, we can take a semi-definite relaxation of CC, resulting in the constraints CijC_{ij}\in and tr(C)12α2n\operatorname{tr}(C)\geq\frac{1}{2}\alpha^{2}n. Letting Aij=(xiμJ)(xjμJ)A_{ij}=(x_{i}-\mu_{J})^{\top}(x_{j}-\mu_{J}), this results in the following sufficient condition for α\alpha-stability:

As an aside, we note that the above discussion, together with the bound (Lemma B.3) on the operator norm of samples from a sub-Gaussian distribution, implies that a set of samples from a sub-Gaussian distribution is α\alpha-stable with high probability, and in particular the mean of any large subset of samples is close to the mean of the overall set. Moreover, even if a distribution has only bounded second moments, Proposition B.1 shows that a suitable subset of samples from the distribution will be α\alpha-stable.

Applications

In this section we present several applications of our main results.

For robustly learning the mean of a distribution with bounded variance, we have the following result:

Define fi(w)=wxi22f_{i}(w)=\|w-x_{i}\|_{2}^{2}, which is 11-strongly convex and has w=μw^{*}=\mu. In this case, fi(w)fˉ(w)=xiμ\nabla f_{i}(w)-\nabla\bar{f}(w)=x_{i}-\mu, and so Proposition B.1 implies that with probability 1exp(Ω(ε2αn))1-\exp(-\Omega(\varepsilon^{2}\alpha n)) there is a subset of (1ε/2)αn(1-\varepsilon/2)\alpha n of the good points satisfying S=O(σ/ε)S=\mathcal{O}\left(\sigma/\sqrt{\varepsilon}\right), and hence (by Lemma 5.6) Sε/2=O(σ/ε)S_{\varepsilon/2}=\mathcal{O}\left(\sigma/\varepsilon\right). Theorem 6.1 then says that we can output parameters w^1,,w^m\hat{w}_{1},\ldots,\hat{w}_{m}, where m1(1ε/2)2α1(1ε)αm\leq\lfloor\frac{1}{(1-\varepsilon/2)^{2}\alpha}\rfloor\leq\lfloor\frac{1}{(1-\varepsilon)\alpha}\rfloor, such that minj=1mw^jμ2O(σεlog(2/α)α)\min_{j=1}^{m}\|\hat{w}_{j}-\mu\|_{2}\leq\mathcal{O}\left(\frac{\sigma}{\varepsilon}\sqrt{\frac{\log(2/\alpha)}{\alpha}}\right), as was to be shown. ∎

For robustly learning a mixture of distributions, we have the following result:

Suppose we are given nn samples, where each sample either comes from one of kk sub-Gaussian distributions p1,,pkp_{1}^{*},\ldots,p_{k}^{*} (with Covpi[x]σ2I\operatorname{Cov}_{p_{i}^{*}}[x]\preceq\sigma^{2}I), or is arbitrary. Let IiI_{i}, αi\alpha_{i}, and μi\mu_{i} denote respectively the set of points drawn from pip_{i}^{*}, the fraction of points drawn from pip_{i}^{*}, and the mean of pip_{i}^{*}, and let α=mini=1kαi\alpha=\min_{i=1}^{k}\alpha_{i}. Suppose that ndαn\geq\frac{d}{\alpha}. Then for any ε12\varepsilon\leq\frac{1}{2}, with probability 1kexp(Ω(ε2αn))1-k\exp(-\Omega(\varepsilon^{2}\alpha n)), we can output a partition T1,,TmT_{1},\ldots,T_{m} of [n][n] and candidate means μ^1,,μ^m\hat{\mu}_{1},\ldots,\hat{\mu}_{m} such that:

m1(1ε)αm\leq\lfloor\frac{1}{(1-\varepsilon)\alpha}\rfloor.

For all but εαin\varepsilon\alpha_{i}n of the elements of hh of IiI_{i}, hh is assigned a partition TjT_{j} such that \|\mu_{i}-\hat{\mu}_{j}\|_{2}\leq\mathcal{O}\Big{(}{\frac{\sigma}{\varepsilon}\sqrt{\frac{\log(2/\alpha)}{\alpha}}}\Big{)}, where O()\mathcal{O}(\cdot) masks only absolute constants.

For the planted partition model, we have the following result:

Let I1,,IkI_{1},\ldots,I_{k} be disjoint subsets of [n][n], and consider the following random directed graph with vertex set [n][n]:

If u,vIiu,v\in I_{i}, the probability of an edge from uu to vv is an\frac{a}{n}.

If uIi,v∉Iiu\in I_{i},v\not\in I_{i}, the probability of an edge from uu to vv is bn\frac{b}{n}.

If u∉i=1kIiu\not\in\cup_{i=1}^{k}I_{i}, the edges emanating from uu can be arbitrary.

Let α=mini=1kIin\alpha=\min_{i=1}^{k}\frac{|I_{i}|}{n}. Then with probability 1kexp(Ω(αn))1-k\exp(-\Omega(\alpha n)), we can output sets T1,,Tm[n]T_{1},\ldots,T_{m}\subseteq[n], with m4αm\leq\frac{4}{\alpha}, such that for all i[k]i\in[k], there is a j[m]j\in[m] with IiTjO(alog(2/α)α2(ab)2n)|I_{i}\triangle T_{j}|\leq\mathcal{O}\left(\frac{a\log(2/\alpha)}{\alpha^{2}(a-b)^{2}}n\right), where \triangle denotes the symmetric difference of two sets.

It turns out that this problem reduces to mean estimation as well, though we will need to be more careful in how we apply our matrix concentration bound.

In this case we have d=nd=n, so Proposition B.1 implies that we can find a subset of IiI_{i} of size at least Ii/2α2n|I_{i}|/2\geq\frac{\alpha}{2}n such that S=O(σ1+dαn)=O(aαn)S=\mathcal{O}\left(\sigma\sqrt{1+\frac{d}{\alpha n}}\right)=\mathcal{O}\left(\sqrt{\frac{a}{\alpha n}}\right). Invoking Theorem 6.1 with ε=12\varepsilon=\frac{1}{2}, we obtain candidate means μ^1,,μ^m\hat{\mu}_{1},\ldots,\hat{\mu}_{m}, with m4αm\leq\frac{4}{\alpha}, such that μiμ^j22O(S2log(2/α)α)=O(alog(2/α)α2n)\|\mu_{i}-\hat{\mu}_{j}\|_{2}^{2}\leq\mathcal{O}\left(\frac{S^{2}\log(2/\alpha)}{\alpha}\right)=\mathcal{O}\left(\frac{a\log(2/\alpha)}{\alpha^{2}n}\right).

Now, we define the set TjT_{j} such that uTju\in T_{j} iff (μ^j)ua+b2n(\hat{\mu}_{j})_{u}\geq\frac{a+b}{2n}. Suppose for some uu, uIiu\in I_{i} but u∉Tju\not\in T_{j}. Then we must have (μ^j)u<a+b2n(\hat{\mu}_{j})_{u}<\frac{a+b}{2n}, but (μi)u=an(\mu_{i})_{u}=\frac{a}{n}. Therefore, any uIiu\in I_{i} with u∉Tju\not\in T_{j} contributes at least (ab)24n2\frac{(a-b)^{2}}{4n^{2}} to μiμ^j22\|\mu_{i}-\hat{\mu}_{j}\|_{2}^{2}. Similarly, any uTju\in T_{j} with u∉Iiu\not\in I_{i} similarly contributed at least (ab)24n2\frac{(a-b)^{2}}{4n^{2}}. We can conclude that μiμ^j22(ab)24n2IiTj\|\mu_{i}-\hat{\mu}_{j}\|_{2}^{2}\geq\frac{(a-b)^{2}}{4n^{2}}|I_{i}\triangle T_{j}|. In particular, if μiμ^j22=O(alog(2/α)α2n)\|\mu_{i}-\hat{\mu}_{j}\|_{2}^{2}=\mathcal{O}\left(\frac{a\log(2/\alpha)}{\alpha^{2}n}\right), then IiTj=O(alog(2/α)α2(ab)2n)|I_{i}\triangle T_{j}|=\mathcal{O}\left(\frac{a\log(2/\alpha)}{\alpha^{2}(a-b)^{2}}n\right), as was to be shown. ∎

Finally, we have the following corollary for density modeling:

Now take a single verified sample xpx\sim p^{*}, and define θ^=arg minθEYA(θ)θϕ(x)\hat{\theta}=\operatornamewithlimits{arg\,min}_{\theta\in\mathcal{E}_{Y}}A(\theta)-\theta^{\top}\phi(x). Similarly to Lemma 2.4, we have

where (i) is because θ^\hat{\theta} minimizes the right-hand term in (106). On the other hand, we have

The above gives us the desired result in the semi-verified model. But then, we can reduce the semi-verified model to the list decodable model by sampling random points from the untrusted data, which gives us the list decodable result as well. ∎

Appendix A Padded Decompositions

The following lemma is essentially from Fakcharoenphol et al. (2003).

Let x1,,xnx_{1},\ldots,x_{n} be points in a metric space, and let I[n]I\subseteq[n] be such that d(xi,xj)rd(x_{i},x_{j})\leq r for all i,jIi,j\in I. Consider the following procedure:

Initialize X={1,,n}X=\{1,\ldots,n\}, P=\mathcal{P}=\emptyset, and sample kk uniformly from 22 to kmaxk_{\max}, where kmax1+1δlog(nI)k_{\max}\geq 1+\frac{1}{\delta}\log(\tfrac{n}{|I|}).

Sample ii uniformly from {1,,n}\{1,\ldots,n\}

Let TT be the set of points in XX within distance krkr of xix_{i}.

Update XX\TX\leftarrow X\backslash T, P=P{T}\mathcal{P}=\mathcal{P}\cup\{T\}.

Then, with probability at least 1δ1-\delta, all elements of II are in a single element of P\mathcal{P}.

It suffices to show the following: if TT contains any element of II, then with probability at least 34\frac{3}{4} it contains all elements of II.

as was to be shown. Here (i) is the good old AM-GM inequality. ∎

Appendix B Matrix Concentration Bounds

The following is proved using techniques from Batson et al. (2012). However, the bound itself appears to be novel.

Without loss of generality we will take μ=0\mu=0. Our proof relies on the following claim:

For a matrix MM, suppose that McIM\prec cI and tr((cIM)1)14σ2\operatorname{tr}\left((cI-M)^{-1}\right)\leq\frac{1}{4\sigma^{2}}. Then, for xpx\sim p, with probability at least 1ε21-\frac{\varepsilon}{2} we have M+εxx(c+4σ2)IM+\varepsilon xx^{\top}\prec(c+4\sigma^{2})I and tr(((c+4σ2)I(M+εxx))1)14σ2\operatorname{tr}\left(((c+4\sigma^{2})I-(M+\varepsilon xx^{\top}))^{-1}\right)\leq\frac{1}{4\sigma^{2}}.

We prove this claim below, first showing how Proposition B.1 follows from the claim. We will go through the stream of nn samples, and keep a running set JJ and matrix MM. Initially J=J=\emptyset and M=0M=0. For each index ii in the stream, we will add ii to JJ (and update MM to M+εxixiM+\varepsilon x_{i}x_{i}^{\top}) if the conclusion of Lemma B.2 holds for c=4σ2(d+J)c=4\sigma^{2}(d+|J|). Note that then, by induction, the precondition of Lemma B.2 always holds for c=4σ2(d+J)c=4\sigma^{2}(d+|J|), since it holds when J=J=\emptyset and the condition for adding ii to JJ ensures that it continues to hold.

At the end of the stream, we therefore obtain JJ and MM such that M4σ2(d+J)IM\preceq 4\sigma^{2}(d+|J|)I, which implies that λmax(1JiJxixi)=1εJλmax(M)4σ2ε(1+dJ)\lambda_{\max}(\frac{1}{|J|}\sum_{i\in J}x_{i}x_{i}^{\top})=\frac{1}{\varepsilon|J|}\lambda_{\max}(M)\leq\frac{4\sigma^{2}}{\varepsilon}\left(1+\frac{d}{|J|}\right). It remains to show that J(1ε)n|J|\geq(1-\varepsilon)n with high probability.

To see this, note that for each element ii, J|J| increases by 11 with probability at least 1ε21-\frac{\varepsilon}{2}. The distribution of J|J| is therefore lower-bounded by the sum of nn independent Bernoulli variables with bias 1ε21-\frac{\varepsilon}{2}, which by the Chernoff bound is greater than (1ε)n(1-\varepsilon)n with probability at least 1exp(ε2n16)1-\exp\left(-\frac{\varepsilon^{2}n}{16}\right). ∎

This is essentially Lemma 3.3 of Batson et al. (2012). Instead of M+εxxM+\varepsilon xx^{\top}, it will be helpful later to consider consider M+txxM+txx^{\top} for arbitrary t[0,ε]t\in[0,\varepsilon]. By the Sherman-Morrison matrix inversion formula, we have

Therefore, letting Φc(M)\Phi_{c}(M) denote tr((cIM)1)\operatorname{tr}((cI-M)^{-1}), we have

We want to have Φc+4σ2I(M+txx)Φc(M)\Phi_{c+4\sigma^{2}I}(M+txx^{\top})\leq\Phi_{c}(M), which in light of the above is equivalent to the condition

where the second step is in light of the inequality

Since Φc+4σ2(M)Φc(M)14σ2\Phi_{c+4\sigma^{2}}(M)\leq\Phi_{c}(M)\leq\frac{1}{4\sigma^{2}}, the right-hand-side of (118) is at most 12\frac{1}{2} in expectation. Therefore, with probability at least 1ε21-\frac{\varepsilon}{2}, it is at most 1ε\frac{1}{\varepsilon}, in which case (118) holds for all t[0,ε]t\in[0,\varepsilon].

This certainly implies that Φc+4σ2(M+εxx)=tr(((c+4σ2)I(M+εxx))1)14σ2\Phi_{c+4\sigma^{2}}(M+\varepsilon xx^{\top})=\operatorname{tr}\left(((c+4\sigma^{2})I-(M+\varepsilon xx^{\top}))^{-1}\right)\leq\frac{1}{4\sigma^{2}}. Moreover, we also have that Φc+4σ2(M+txx)14σ2<\Phi_{c+4\sigma^{2}}(M+txx^{\top})\leq\frac{1}{4\sigma^{2}}<\infty for all t[0,ε]t\in[0,\varepsilon]. Since the eigenvalues of M+txxM+txx^{\top} are a continuous function of tt, we therefore know that the maximum eigenvalue of M+txxM+txx^{\top} never crosses c+4σ2c+4\sigma^{2} for t[0,ε]t\in[0,\varepsilon], and so M+εxx(c+4σ2)IM+\varepsilon xx^{\top}\prec(c+4\sigma^{2})I as well (since McI(c+4σ2)IM\prec cI\prec(c+4\sigma^{2})I by assumption). ∎

B.2 Concentration for sub-Gaussians

In this section we provide a matrix concentration result for sub-Gaussian variables, which underlies Lemma 5.5. The proofs closely mirrors Theorem 5.39 of Vershynin (2010).

where Q=18σ2(d+log(1/δ))Q=18\sigma^{2}\left(d+\log(1/\delta)\right).

We need the right-hand-side to be smaller than δ9d(ms)\frac{\delta}{9^{d}\binom{m}{s}}, for which we can take

Overall, we have that with probability 1δ1-\delta,

for all TT of size ss and all uNu\in\mathcal{N}. In particular, (123) can be bounded by 18σ2log(e/β)18\sigma^{2}\log(e/\beta), which completes the proof. ∎

Appendix C Deferred Proofs

Suppose that za+bz+c2z\leq a+\sqrt{bz+c^{2}} where a,b,c,0a,b,c,\geq 0. Then, z2a+b+cz\leq 2a+b+c.

Re-arranging to remove the square-root, we have z2(2a+b)z+(a2c2)0z^{2}-(2a+b)z+(a^{2}-c^{2})\leq 0. Now suppose that z>2a+b+cz>2a+b+c. Then, z2(2a+b)z+(a2c2)=z(z2ab)+(a2c2)>(2a+b+c)c+(a2c2)=a2+(2a+b)c>0z^{2}-(2a+b)z+(a^{2}-c^{2})=z(z-2a-b)+(a^{2}-c^{2})>(2a+b+c)c+(a^{2}-c^{2})=a^{2}+(2a+b)c>0, which is a contradiction. ∎

C.2 Proof of Lemma 2.1

We omit this as it is a special case of Lemma 5.5, proved below.

C.3 Proof of Lemma 2.4

where (i) and (iv) are applications of the power mean inequality, (ii) is because ϕ\phi is 11-Lipschitz, and (iii) is because wwYww^{\top}\preceq Y.

On the other hand, we have Σσq2I\Sigma\preceq\sigma_{q}^{2}I by the qqth moment condition (since q>2q>2), so tr(YΣ)q/2tr(Y)q/2σqq\operatorname{tr}(Y^{\top}\Sigma)^{q/2}\leq\operatorname{tr}(Y)^{q/2}\sigma_{q}^{q}. Also, letting Y=iλiviviY=\sum_{i}\lambda_{i}v_{i}v_{i}^{\top} be the eigendecomposition of YY, we have

C.4 Proof of Lemma 3.1

By duality of operator and trace norm, we have (letting ziz_{i} be the iith row of the matrix ZZ)

as was to be shown. Here (i) is Cauchy-Schwarz and (ii) uses the fact that wiwiYw_{i}w_{i}\preceq Y.

C.5 Proof of Lemma 4.6

C.6 Proof of Lemma 5.5

By Lemma B.3 below, at any fixed ww it is the case that

On the other hand, because fi(w)fˉ(w)\nabla f_{i}(w)-\nabla\bar{f}(w) is LL-Lipschitz in ww, one can check that Sε(w)S_{\varepsilon}(w) is also LL-Lipschitz in ww (see Lemma C.2 below). In particular, around any fixed ww, it holds that Sε(w)σ(1+6log(e/ε)+d+log(1/δ)εαn)S_{\varepsilon}(w)\leq\sigma\left(1+6\sqrt{\log(e/\varepsilon)+\frac{d+\log(1/\delta)}{\varepsilon\alpha n}}\right) for all ww^{\prime} with ww2σL\|w^{\prime}-w\|_{2}\leq\frac{\sigma}{L}. But we can cover H\mathcal{H} with (1+2rLσ)d\left(1+\frac{2rL}{\sigma}\right)^{d} balls of radius σL\frac{\sigma}{L}. Therefore, a union bound argument yields that, with probability 1δ1-\delta, for all wHw\in\mathcal{H} we have

If fi(w)fˉ(w)\nabla f_{i}(w)-\nabla\bar{f}(w) is LL-Lipschitz in ww, then Sε(w)S_{\varepsilon}(w) is also LL-Lipschitz in ww.

as was to be shown. Here (i) is the triangle inequality for the operator norm, (ii) is because Frobenius norm is larger than operator norm, and (iii) invokes the assumed Lipschitz property for Δi(w)\Delta_{i}(w). ∎

C.7 Proof of Lemma 7.3

Recall that p=N(μ,Σ0λμμ)p^{*}=\mathcal{N}(\mu,\Sigma_{0}-\lambda\mu\mu^{\top}) and p^=N(0,Σ0)\hat{p}=\mathcal{N}(0,\Sigma_{0}). We want to show that αpp^\alpha p^{*}\leq\hat{p}, which is equivalent to (by expanding the density formula for a Gaussian distribution)

Let us simplify the above quadratic in xx. Letting s=xΣ01μs=x^{\top}\Sigma_{0}^{-1}\mu and t=μΣ01μt=\mu^{\top}\Sigma_{0}^{-1}\mu, we have

where (i) is the Woodbury matrix inversion lemma, and (ii) is expanding the quadratic. The minimizing value of ss is 1λ\frac{1}{\lambda}, so that we obtain a lower bound (over all xx) of

Plugging back into (157), we obtain the condition

Simplifying further and again letting t=μΣ01μt=\mu^{\top}\Sigma_{0}^{-1}\mu, we have

so it suffices to have 1λ+λt1λt2log(1α)\frac{1}{\lambda}+\frac{\lambda t}{1-\lambda t}\leq 2\log(\frac{1}{\alpha}). Now, we will take λ=1log(1α)\lambda=\frac{1}{\log(\frac{1}{\alpha})}, in which case we need tlog(1α)tlog(1α)\frac{t}{\log(\frac{1}{\alpha})-t}\leq\log(\frac{1}{\alpha}), or equivalently tlog2(1α)1+log(1α)t\leq\frac{\log^{2}(\frac{1}{\alpha})}{1+\log(\frac{1}{\alpha})}, as claimed.

C.8 Bounding the square of a sub-Gaussian

for λ14σ2|\lambda|\leq\frac{1}{4\sigma^{2}}.

We begin by bounding the moments of ZZ. We have, for any p0p\geq 0 and uu,

Here (i) uses the inequality log(z)log(a)1a(za)\log(z)-\log(a)\leq\frac{1}{a}\left(z-a\right) (by concavity of log\log), while (iii) simply invokes the sub-Gaussian assumption at λ=pa\lambda=\frac{p}{a}. The inequality (ii) is used to move from Z|Z| to ZZ.

We will take the particular value a=pσa=\sqrt{p}\sigma, which yields

Clearly, when λ14σ2|\lambda|\leq\frac{1}{4\sigma^{2}}, the above is bounded by 1+16λ2σ4exp(16σ4λ2)1+16\lambda^{2}\sigma^{4}\leq\exp\left(16\sigma^{4}\lambda^{2}\right), as claimed. ∎

References