Sever: A Robust Meta-Algorithm for Stochastic Optimization

Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, Alistair Stewart

Introduction

Learning in the presence of outliers is a ubiquitous challenge in machine learning; nevertheless, most machine learning methods are very sensitive to outliers in high dimensions. The focus of this work is on designing algorithms that are outlier robust while remaining competitive in terms of accuracy and running time.

We highlight two motivating applications. The first is biological data (such as gene expression data), where mislabeling or measurement errors can create systematic outliers [RPW+02, LAT+08] that require painstaking manual effort to remove [PLJD10]. Detecting outliers in such settings is often important either because the outlier observations are of interest themselves or because they might contaminate the downstream statistical analysis. The second motivation is machine learning security, where outliers can be introduced through data poisoning attacks [BNJT10] in which an adversary inserts fake data into the training set (e.g., by creating a fake user account). Recent work has shown that for high-dimensional datasets, even a small fraction of outliers can substantially degrade the learned model [BNL12, NPXNR14, KL17, SKL17, KSL18].

Crucially, in both the biological and security settings above, the outliers are not “random” but are instead highly correlated, and could have a complex internal structure that is difficult to model. This leads us to the following conceptual question underlying the present work: Can we design training algorithms that are robust to the presence of an ε\varepsilon-fraction of arbitrary (and potentially adversarial) outliers?

Estimation in the presence of outliers is a prototypical goal in robust statistics and has been systematically studied since the pioneering work of Tukey [Tuk60]. Popular methods include RANSAC [FB81], minimum covariance determinant [RD99], removal based on kk-nearest neighbors [BKNS00], and Huberizing the loss [Owe07] (see [HA04] for a comprehensive survey). However, these classical methods either break down in high dimensions, or only handle “benign” outliers that are obviously different from the rest of the data (see Section 1.1 for futher discussion of these points).

Motivated by this, recent work in theoretical computer science has developed efficient robust estimators for classical problems such as linear classification [KLS09, ABL14], mean and covariance estimation [DKK+16, LRV16], clustering [CSV17], and regression [BJK15, BJKK17, BDLS17]. Nevertheless, the promise of practical high-dimensional robust estimation is yet to be realized; indeed, the aforementioned results generally suffer from one of two shortcomings–either they use sophisticated convex optimization algorithms that do not scale to large datasets, or they are tailored to specific problems of interest or specific distributional assumptions on the data, and hence do not have good accuracy on real data.

In this work, we address these shortcomings. We propose an algorithm, Sever, that is:

Robust: it can handle arbitrary outliers with only a small increase in error, even in high dimensions.

General: it can be applied to most common learning problems including regression and classification, and handles non-convex models such as neural networks.

Practical: the algorithm can be implemented with standard machine learning libraries.

At a high level, our algorithm (depicted in Figure 1 and described in detail in Section 2.2) is a simple “plug-in” outlier detector–first, run whatever learning procedure would be run normally (e.g., least squares in the case of linear regression). Then, consider the matrix of gradients at the optimal parameters, and compute the top singular vector of this matrix. Finally, remove any points whose projection onto this singular vector is too large (and re-train if necessary).

Despite its simplicity, our algorithm possesses strong theoretical guarantees: As long as the real (non-outlying) data is not too heavy-tailed, Sever is provably robust to outliers–see Section 2 for detailed statements of the theory. At the same time, we show that our algorithm works very well in practice and outperforms a number of natural baseline outlier detectors. In line with our original motivating biological and security applications, we implement our method on two tasks–a linear regression task for predicting protein activity levels [OSB+18], and a spam classification task based on emails from the Enron corporation [MAP06]. Even with a small fraction of outliers, baseline methods perform poorly on these datasets; for instance, on the Enron spam dataset with a 1%1\% fraction of outliers, baseline errors range from 13.4%13.4\% to 20.5%20.5\%, while Sever incurs only 7.3%7.3\% error (in comparison, the error is 3%3\% in the absence of outliers). Similarly, on the drug design dataset, with 10%10\% corruptions, Sever achieved 1.421.42 mean-squared error test error, compared to 1.511.51-2.332.33 for the baselines, and 1.231.23 error on the uncorrupted dataset.

As mentioned above, the myriad classical approaches to robust estimation perform poorly in high dimensions or in the presence of worst-case outliers. For instance, RANSAC [FB81] works by removing enough points at random that no outliers remain with decent probability; since we need at least dd points to fit a dd-dimensional model, this requires the number of outliers to be O(1/d)O(1/d). kk-nearest neighbors [BKNS00] similarly suffers from the curse of dimensionality when dd is large. The minimum covariance determinant estimator [RD99] only applies when the number of data points nn exceeds 2d2d, which does not hold for the datasets we consider (it also has other issues such as computational intractability). A final natural approach is to limit the effect of points with large loss (via e.g. Huberization [Owe07]), but as [KSL18] show (and we confirm in our experiments), correlated outliers often have lower loss than the real data under the learned model.

These issues have motivated work on high-dimensional robust statistics going back to Tukey [Tuk75]. However, it was not until much later that efficient algorithms with favorable properties were first proposed. [KLS09] gave the first efficient algorithms for robustly classification under the assumption that the distribution of the good data is isotropic and log-concave. Subsequently, [ABL14] obtained an improved and nearly optimal robust algorithm for this problem. Two concurrent works [DKK+16, LRV16] gave the first efficient robust estimators for several other tasks including mean and covariance estimation. There has since been considerable study of algorithmic robust estimation in high dimensions, including learning graphical models [DKS16], understanding computation-robustness tradeoffs [DKS17c, DKK+18], establishing connections to PAC learning [DKS17a], tolerating more noise by outputting a list of hypotheses [CSV17, MV18, DKS17b], robust estimation of discrete structures [Ste17, QV18, SCV18], and robust estimation via sum-of-squares [KS17b, HL17, KS17a].

Despite this progress, these recent theoretical papers typically focus on designing specialized algorithms for specific settings (such as mean estimation or linear classification for specific families of distributions) rather than on designing general algorithms. The only exception is [CSV17], which provides a robust meta-algorithm for stochastic convex optimization in a similar setting to ours. However, that algorithm (i) requires solving a large semidefinite program and (ii) incurs a significant loss in performance relative to standard training even in the absence of outliers. On the other hand, [DKK+17] provide a practical implementation of the robust mean and covariance estimation algorithms of [DKK+16], but do not consider more general learning tasks.

A number of papers [NTN11, NT13, BJK15, BJKK17] have proposed efficient algorithms for a type of robust linear regression. However, these works consider a restrictive corruption model that only allows adversarial corruptions to the responses (but not the covariates). On the other hand, [BDLS17] studies (sparse) linear regression and, more broadly, generalized linear models (GLMs) under a robustness model very similar to the one considered here. The main issues with this algorithm are that (i) it requires running the ellipsoid method (hence does not scale) and (ii) it crucially assumes Gaussianity of the covariates, which is unlikely to hold in practice.

In a related direction, [SKL17] provide a method for analyzing outlier detectors in the context of linear classification, either certifying robustness or generating an attack if the learner is not robust. The outlier detector they analyze is brittle in high dimensions, motivating the need for the robust algorithms presented in the current work. Later work by the same authors showed how to bypass a number of common outlier detection methods [KSL18]. We use these recent strong attacks as part of our evaluation and show that our algorithm is more robust.

[PSBR18] independently obtained a robust algorithm for stochastic convex optimization by combining gradient descent with robust mean estimation. This algorithm is similar to the one we present in Appendix D, and in that section we discuss in more detail the comparison between these two techniques. For the case of linear regression, [DKS19] provide efficient robust algorithms with near-optimal error guarantees under various distributional assumptions and establish matching computational-robustness tradeoffs.

Framework and Algorithm

In this section, we describe our formal framework as well as the Sever algorithm.

To help us learn the parameter vector ww^{\ast}, we have access to a training set of nn functions f1:n=def{f1,,fn}f_{1:n}\stackrel{{\scriptstyle\text{def}}}{{=}}\{f_{1},\ldots,f_{n}\}. (For linear regression, we would have fi(w)=12(wxiyi)2f_{i}(w)=\frac{1}{2}(w\cdot x_{i}-y_{i})^{2}, where (xi,yi)(x_{i},y_{i}) is an observed data point.) However, unlike the classical (uncorrupted) setting where we assume that f1,,fnpf_{1},\ldots,f_{n}\sim p^{\ast}, we allow for an ε\varepsilon-fraction of the points to be arbitrary outliers:

In the ε\varepsilon-contamination model, the adversary is allowed to both add and remove points. Our theoretical results hold in this strong robustness model. Our experimental evaluation uses corrupted instances in which the adversary is only allowed to add corrupted points. Additive corruptions essentially correspond to Huber’s contamination model [Hub64] in robust statistics.

Finally, we will often assume access to a black-box learner, which we denote by L\mathcal{L}, which takes in functions f1,,fnf_{1},\ldots,f_{n} and outputs a parameter vector wHw\in\mathcal{H}. We want to stipulate that L\mathcal{L} approximately minimizes 1ni=1nfi(w)\frac{1}{n}\sum_{i=1}^{n}f_{i}(w). For this purpose, we introduce the following definition:

We are now ready to define the notion of a γ\gamma-approximate learner:

In other words, L\mathcal{L} always finds an approximate critical point of the empirical learning objective. We note that most common learning algorithms (such as stochastic gradient descent) satisfy the γ\gamma-approximate learner property. For our example of linear regression, gradient descent could be performed using the gradient 1ni=1nxi(wxiyi)\frac{1}{n}\sum_{i=1}^{n}x_{i}(w\cdot x_{i}-y_{i}). However, in some cases, a more efficient and direct method is to set the gradient equal to 0 and solve for ww. In our linear regression example, this gives us a closed form solution for the optimal parameter vector.

2 Algorithm and Theory

As outlined in Figure 1, our algorithm works by post-processing the gradients of a black-box learning algorithm. The basic intuition is as follows: we want to ensure that the outliers do not have a large effect on the learned parameters. Intuitively, for the outliers to have such an effect, their corresponding gradients should be (i) large in magnitude and (ii) systematically pointing in a specific direction. We can detect this via singular value decomposition–if both (i) and (ii) hold then the outliers should be responsible for a large singular value in the matrix of gradients, which allows us to detect and remove them.

This is shown more formally via the pseudocode in Algorithm 1.

For concreteness, we describe how the algorithm would work for our running example of linear regression. First, we would solve for the optimal parameter vector on the dataset, disregarding issues of robustness. Specifically, we let w^\hat{w} be the solution to i=1nxi(w^xiyi)=0\sum_{i=1}^{n}x_{i}(\hat{w}\cdot x_{i}-y_{i})=0: setting the gradient equal to will give us a critical point, as desired. We compute the average gradient, 1ni=1nxi(w^xiyi),\frac{1}{n}\sum_{i=1}^{n}x_{i}(\hat{w}\cdot x_{i}-y_{i}), and use this to compute the matrix of centered gradients GG. That is, the jjth row of GG, GjG_{j}, is the vector xj(w^xjyj)1ni=1nxi(w^xiyi)x_{j}(\hat{w}\cdot x_{j}-y_{j})-\frac{1}{n}\sum_{i=1}^{n}x_{i}(\hat{w}\cdot x_{i}-y_{i}). We compute the top right singular vector of GG, project the data into this direction, and square the resulting magnitudes to derive a score for each point: τj=(Gjv)2\tau_{j}=(G_{j}\cdot v)^{2}. With these scores in place, we run Algorithm 2, to (randomly) remove some of the points with the largest scores. We re-run the entire procedure on this subset of points, until Algorithm 2 does not remove any points, at which point we terminate.

Our first theoretical result says that as long as the data is not too heavy-tailed, Sever will find an approximate critical point of the true function f\overline{f}, even in the presence of outliers.

Then our algorithm Sever applied to f1,,fn,σf_{1},\ldots,f_{n},\sigma returns a point wHw\in\mathcal{H} that, with probability at least 9/109/10, is a (γ+O(σε))(\gamma+O(\sigma\sqrt{\varepsilon}))-approximate critical point of f\overline{f}.

The key take-away from Theorem 2.1 is that the error guarantee has no dependence on the underlying dimension dd. In contrast, most natural algorithms incur an error that grows with dd, and hence have poor robustness in high dimensions.

We show that under some niceness assumptions on pp^{\ast}, the deterministic regularity conditions are satisfied with high probability with polynomially many samples:

The reader is referred to Proposition B.5 in the appendix for a detailed formal statement.

While Theorem 2.1 is very general and holds even for non-convex loss functions, we might in general hope for more than an approximate critical point. In particular, for convex problems, we can guarantee that we find an approximate global minimum. This follows as a corollary of Theorem 2.1:

If f\overline{f} is convex, the algorithm finds a wHw\in\mathcal{H} such that f(w)f(w)=O((σε+γ)r)\overline{f}(w)-\overline{f}(w^{*})=O((\sigma\sqrt{\varepsilon}+\gamma)r).

If f\overline{f} is ξ\xi-strongly convex, the algorithm finds a wHw\in\mathcal{H} such that f(w)f(w)=O((εσ2+γ2)/ξ)\overline{f}(w)-\overline{f}(w^{*})=O\left((\varepsilon\sigma^{2}+\gamma^{2})/{\xi}\right).

For our theory to hold, we need to use the randomized filtering algorithm shown in Algorithm 2 (which is essentially the robust mean estimation algorithm of [DKK+17]), and filter until the stopping condition in line 11 of Algorithm 1 is satisfied. However, in practice we found that the following simpler algorithm worked well: in each iteration simply remove the top pp fraction of outliers according to the scores τi\tau_{i}, and instead of using a specific stopping condition, simply repeat the filter for rr iterations in total. This is the version of Sever that we use in our experiments in Section 3.

Another application allows us to use the least-squares loss function (L(w,(X,Y))=(YwX)2L(w,(X,Y))=(Y-w\cdot X)^{2}) to perform linear regression under somewhat more restrictive assumptions (see Theorem E.2 for the full statement):

3 Overview of Sever and its Analysis

In the robust setting, stochastic optimization becomes quite challenging: Even for the most basic special cases of this problem (e.g., mean estimation, linear regression) a single adversarially corrupted sample can substantially change the location of the minimum for f^\hat{f}. Moreover, naive outlier removal methods can only tolerate a negligible fraction ε\varepsilon of corruptions (corresponding to ε=O(d1/2)\varepsilon=O(d^{-1/2})).

A first idea to get around this obstacle is the following: We consider the standard (projected) gradient descent method used to find the minimum of f^\hat{f}. This algorithm would proceed by repeatedly computing the gradient of f^\hat{f} at appropriate points and using it to update the current location. The issue is that adversarial corruptions can completely compromise this algorithm’s behavior, since they can substantially change the gradient of f^\hat{f} at the chosen points. The key observation is that approximating the gradient of f\overline{f} at a given point, given access to an ε\varepsilon-corrupted set of samples, can be viewed as a robust mean estimation problem. We can thus use the robust mean estimation algorithm of [DKK+17], which succeeds under fairly mild assumptions about the good samples. Assuming that the covariance matrix of f(w)\nabla f(w), fpf\sim p^{\ast}, is bounded, we can thus “simulate” gradient descent and compute an approximate minimum for f\overline{f}.

In summary, the first algorithmic idea is to use a robust mean estimation routine as a black-box in order to robustly estimate the gradient at each iteration of (projected) gradient descent. This yields a simple robust method for stochastic optimization with polynomial sample complexity and running time in a very general setting (See Appendix D for details.)

We are now ready to describe Sever (Algorithm 1) and the main insight behind it. Roughly speaking, Sever only calls our robust mean estimation routine (which is essentially the filtering method of [DKK+17] for outlier removal) each time the algorithm reaches an approximate critical point of f^\hat{f}. There are two main motivations for this approach: First, we empirically observed that if we iteratively filter samples, keeping the subset with the samples removed, then few iterations of the filter remove points. Second, an iteration of the filter subroutine (Algorithm 2) is more expensive than an iteration of gradient descent. Therefore, it is advantageous to run many steps of gradient descent on the current set of corrupted samples between consecutive filtering steps. This idea is further improved by using stochastic gradient descent, rather than computing the average at each step.

An important feature of our analysis is that Sever does not use a robust mean estimation routine as a black box. In contrast, we take advantage of the performance guarantees of our filtering algorithm. The main idea for the analysis is as follows: Suppose that we have reached an approximate critical point ww of f^\hat{f} and at this step we apply our filtering algorithm. By the performance guarantees of the latter algorithm we are in one of two cases: either the filtering algorithm removes a set of corrupted functions or it certifies that the gradient of f^\hat{f} is “close” to the gradient of f\overline{f} at ww. In the first case, we make progress as we produce a “cleaner” set of functions. In the second case, our certification implies that the point ww is also an approximate critical point of f\overline{f} and we are done.

Experiments

In this section we apply Sever to regression and classification problems. Our code is available at https://github.com/hoonose/sever. As our base learners, we used ridge regression and an SVM, respectively. We implemented the latter as a quadratic program, using Gurobi [Gur16] as a backend solver and YALMIP [Löf04] as the modeling language.

In both cases, we ran the base learner and then extracted gradients for each data point at the learned parameters. We then centered the gradients and ran MATLAB’s svds method to compute the top singular vector vv, and removed the top pp fraction of points ii with the largest outlier score τi\tau_{i}, computed as the squared magnitude of the projection onto vv (see Algorithm 1). We repeated this for rr iterations in total. For classification, we centered the gradients separately (and removed points separately) for each class, which improved performance.

We compared our method to six baseline methods. All but one of these all have the same high-level form as Sever (run the base learner then filter top pp fraction of points with the largest score), but use a different definition of the score τi\tau_{i} for deciding which points to filter:

loss: remove points with large loss (measured at the parameters output by the base learner).

RANSAC: repeatedly subsample points uniformly at random, and find the best fit with the subsample. Then, choose the best fit amongst this set of learners. Note that this method is not “filter-based”.In practice, heuristics must often be applied to choose the best fit. In our experiments, we “cheat” slightly by in fact choosing the best fit post-hoc by reporting the best error achieved by any learner in this way. Despite strengthening RANSAC in this way, we observe that it still has poor performance.

Both ridge regression and SVM have a single hyperparameter (the regularization coefficient). We optimized this based on the uncorrupted data and then kept it fixed throughout our experiments. In addition, since the data do not already have outliers, we added varying amounts of outliers (ranging from 0.5%0.5\% to 10%10\% of the clean data); this process is described in more detail below.

We found that centering the data points decreased error noticeably on the drug discovery dataset, while scaling each coordinate to have variance 11 decreased error by a small amount on the synthetic data. To center in the presence of outliers, we used the robust mean estimation algorithm from [DKK+17].

We devised a method of generating outliers that fools all of the baselines while still inducing high test error. At a high level, the outliers cause ridge regression to output w=0w=0 (so the model always predicts y=0y=0).

In our experiments we found that this method generated successful attacks as long as the fraction of outliers was at least roughly 2%2\% for synthetic data, and roughly 5%5\% for the drug discovery data.

In Figure 2 we compare the test error of our defense against the baselines as we increase the fraction ε\varepsilon of added outliers. To avoid cluttering the figure, we only show the performance of l2, loss, gradientCentered, RANSAC, and Sever; the performance of the remaining baselines is qualitatively similar to the baselines in Figure 2.

For both the baselines and our algorithms, we iterate the defense r=4r=4 times, each time removing the p=ε/2p=\varepsilon/2 fraction of points with largest score. For consistency of results, for each defense and each value of ε\varepsilon we ran the defense 3 times on fresh attack points and display the median of the 3 test errors.

When the attack parameters α\alpha and β\beta are tuned to defeat the baselines (Figure 2 left and center), our defense substantially outperforms the baselines as soon as we cross ε1.5%\varepsilon\approx 1.5\% for synthetic data, and ε5.5%\varepsilon\approx 5.5\% for the drug discovery data. In fact, most of the baselines do worse than not removing any outliers at all (this is because they end up mostly removing good data points, which causes the outliers to have a larger effect). Even when α\alpha and β\beta are instead tuned to defeat Sever, its resulting error remains small (Figure 2 right).

To understand why the baselines fail to detect the outliers, in Figure 3 we show a representative sample of the histograms of scores of the uncorrupted points overlaid with the scores of the outliers, for both synthetic data and the drug discovery dataset with ε=0.1\varepsilon=0.1, after one run of the base learner. The scores of the outliers lie well within the distribution of scores of the uncorrupted points. Thus, it would be impossible for the baselines to remove them without also removing a large fraction of uncorrupted points.

Interestingly, for small ε\varepsilon all of the methods improve upon the uncorrupted test error for the drug discovery data; this appears to be due to the presence of a small number of natural outliers in the data that all of the methods successfully remove.

2 Support Vector Machines

In contrast to ridge regression, we did not perform centering and rescaling for these datasets as it did not seem to have a large effect on results.

In all experiments for this section, each method removed the top p=n+n+min{n+,n}εrp=\frac{n_{-}+n_{+}}{\min\{n_{+},n_{-}\}}\cdot\frac{\varepsilon}{r} of highest-scoring points for each of r=2r=2 iterations, where n+n_{+} and nn_{-} are the number of positive and negative training points respectively. This expression for pp is chosen in order to account for class imbalance, which is extreme in the case of the Enron dataset – if the attacker plants all the outliers in the smaller class, then a smaller value of pp would remove too few points, even with a perfect detection method.

We considered fractions of outliers ranging from ε=0.005\varepsilon=0.005 to ε=0.03\varepsilon=0.03. By performing a sweep across hyperparameters of the attack, we generated 5656 distinct sets of attacks for each value of ε\varepsilon. In Figure 4, we show results for the attack where the loss baselines does the worst, as well as for the attack where our method does the worst. When attacks are most effective against loss, Sever substantially outperforms it, nearly matching the test accuracy of 5.8%5.8\% on the uncorrupted data, while loss performs worse than 30%30\% error at just a 1.5%1.5\% fraction of injected outliers. Even when attacks are most effective against Sever, it still outperforms loss, achieving a test error of at most 9.05%9.05\%. We note that other baselines behaved qualitatively similarly to loss, and the results are displayed in Section F.

For results on Enron, we used the same values of ε\varepsilon, and considered 9696 distinct hyperparameters for the attack. There was not a single attack that simultaneously defeated all of the baselines, so in Figure 5 we show two attacks that do well against different sets of baselines, as well as the attack that performs best against our method.

At ε=0.01\varepsilon=0.01, the worst performance of our method against all attacks was 7.34%7.34\%, in contrast to 13.43%20.48%13.43\%-20.48\% for the baselines (note that the accuracy is 3%3\% in the absence of outliers). However, at ε=0.03\varepsilon=0.03, while we still outperform the baselines, our error is relatively large—13.53%13.53\%.

To investigate this further, we looked at all 4848 attacks and found that while on 4242 out of 4848 attacks our error never exceeded 7%7\%, on 66 of the attacks (including the attack in Figure 5) the error was substantially higher. Figure 6 shows what is happening. The leftmost figure displays the scores assigned by Sever after the first iteration, where red bars indicate outliers. While some outliers are assigned extremely large scores and thus detected, several outliers are correctly classified and thus have 0 gradient. However, once we remove the first set of outliers, some outliers which were previously correctly classified now have large score, as displayed in the middle figure. Another iteration of this process produces the rightmost figure, where almost all the remaining outliers have large score and will thus be removed by Sever. This demonstrates that some outliers may be hidden until other outliers are removed, necessitating multiple iterations.

Motivated by this, we re-ran our method against the 66 attacks using r=3r=3 iterations instead of 22 (and decreasing pp as per the expression above). After this change, all 66 of the attacks had error at most 7.4%7.4\%.

Discussion

In this paper we have presented an algorithm, Sever, that has both strong theoretical robustness properties in the presence of outliers, and performs well on real datasets. Sever is based on the idea that learning can often be cast as the problem of finding an approximate stationary point of the loss, which can in turn be cast as a robust mean estimation problem, allowing us to leverage existing techniques for efficient robust mean estimation.

There are a number of directions along which Sever could be improved: first, it could be extended to handle more general assumptions on the data; second, it could be strengthened to achieve better error bounds in terms of the fraction of outliers; finally, one could imagine automatically learning a feature representation in which Sever performs well. We discuss each of these ideas in detail below.

The main underlying assumption on which Sever rests is that the top singular value of the gradients of the data is small. While this appeared to hold true on the datasets we considered, a common occurence in practice is for there to be a few large singular values, together with many small singular values. It would therefore be desirable to design a version of Sever that can take advantage of such phenomena. In addition, it would be worthwhile to do a more detailed empirical analysis across a wide variety of datasets investigating properties that can enable robust estimation (the notion of resilience in [SCV18] could provide a template for finding such properties).

In theory, Sever has a O(ε)O(\sqrt{\varepsilon}) dependence in error on the fraction ε\varepsilon of outliers (see Theorem 2.1). While without stronger assumptions this is likely not possible to improve, in practice we would prefer to have a dependence closer to O(ε)O(\varepsilon). Therefore, it would also be useful to improve Sever to have such an O(ε)O(\varepsilon)-dependence under stronger but realistic assumptions. Unfortunately, all existing algorithms for robust mean estimation that achieve error better than O(ε)O(\sqrt{\varepsilon}) either rely on strong distributional assumptions such as Gaussianity [DKK+16, LRV16], or else require expensive computation involving e.g. sum-of-squares optimization [HL17, KS17a, KS17b]. Improving the robustness of Sever thus requires improvements on the robust mean estimation algorithm that Sever uses as a primitive.

Finally, we note that Sever performs best when the features have small covariance and strong predictive power. One situation in particular where this holds is when there are many approximately independent features that are predictive of the true signal.

It would be interesting to try to learn a representation with such a property. This could be done, for instance, by training a neural network with some cost function that encourages independent features (some ideas along these general lines are discussed in [Ben17]). An issue is how to learn such a representation robustly; one idea is learn a representation on a dataset that is known to be free of outliers, and hope that the representation is useful on other datasets in the same application domain. Beyond these specific questions, we view the general investigation of robust methods (both empirically and theoretically) as an important step as machine learning moves forwards. Indeed, as machine learning is applied in increasingly many situations and in increasingly automated ways, it is important to attend to robustness considerations so that machine learning systems behave reliably and avoid costly errors. While the bulk of recent work has highlighted the vulnerabilities of machine learning (e.g. [SZS+14, LWSV16, SKL17, EEF+18, CLL+17]), we are optimistic that practical algorithms backed by principled theory can finally patch these vulnerabilities and lead to truly reliable systems.

References

Appendix

The appendix is organized as follows. In Section A, we describe additional preliminaries required for our technical arguments. In Section B, we analyze our main algorithm, Sever. In Section C, we specialize our analysis to the important case of Generalized Linear Models (GLMs). In Section D, we describe a variant of our algorithm which performs robust filtering on each iteration of projected gradient descent, and works under more general assumptions. In Section E, we describe concrete applications of Sever – in particular, how it can be used to robustly optimize in the settings of linear regression, logistic regression, and support vector machines. Finally, in Section F, we provide additional plots from our experimental evaluations.

Appendix A Preliminaries

In this section, we formally introduce our setting for robust stochastic optimization.

In addition, some of our bounds will make use of the following quantities:

The strong convexity parameter ξ\xi of f\overline{f}, if it exists. This is the maximal ξ\xi such that f(w)f(w0)+ww0,f(w0)+ξ2ww022\overline{f}(w)\geq\overline{f}(w_{0})+\langle w-w_{0},\nabla\overline{f}(w_{0})\rangle+\frac{\xi}{2}\|w-w_{0}\|_{2}^{2} for all w,w0Hw,w_{0}\in\mathcal{H}.

The strong smoothness parameter β\beta of f\overline{f}, if it exists. This is the minimal β\beta such that f(w)f(w0)+ww0,f(w0)+β2ww022\overline{f}(w)\leq\overline{f}(w_{0})+\langle w-w_{0},\nabla\overline{f}(w_{0})\rangle+\frac{\beta}{2}\|w-w_{0}\|_{2}^{2} for all w,w0Hw,w_{0}\in\mathcal{H}.

The Lipschitz constant LL of f\overline{f}, if it exists. This is the minimal LL such that f(w)f(w0)Lww02\overline{f}(w)-\overline{f}(w_{0})\leq L\|w-w_{0}\|_{2} for all w,w0Hw,w_{0}\in\mathcal{H}.

Appendix B General Analysis of Sever

This section is dedicated to the analysis of Algorithm 1, where we do not make convexity assumptions about the underlying functions f1,,fnf_{1},\ldots,f_{n}. In this case, we can show that our algorithm finds an approximate critical point of f\overline{f}. When we specialize to convex functions, this immediately implies that we find an approximate minimal point of f\overline{f}.

Our proof proceeds in two parts. First, we define a set of deterministic conditions under which our algorithm finds an approximate minimal point of f\overline{f}. We then show that, under mild assumptions on our functions, this set of deterministic conditions holds with high probability after polynomially many samples.

For completeness, we recall the definitions of a γ\gamma-approximate critical point and a γ\gamma-approximate learner:

We first explicitly demonstrate a set of deterministic conditions on the (uncorrupted) data points. Our deterministic regularity conditions are as follows:

In Section B.1, we prove the following theorem, which shows that under Assumption B.1 our algorithm succeeds:

Observe that the above theorem holds quite generally; in particular, it holds for non-convex functions. As a corollary of this theorem, in Section B.2 we show that this immediately implies that Sever robustly minimizes convex functions, if Assumption B.1 holds:

If f\overline{f} is convex, the algorithm finds a wHw\in\mathcal{H} such that f(w)f(w)=O((σ0r+σ1r2)ε+γr)\overline{f}(w)-\overline{f}(w^{*})=O((\sigma_{0}r+\sigma_{1}r^{2})\sqrt{\varepsilon}+\gamma r).

If f\overline{f} is ξ\xi-strongly convex, the algorithm finds a wHw\in\mathcal{H} such that

In the strongly convex case and when σ1>0\sigma_{1}>0, we can remove the dependence on σ1\sigma_{1} and rr in the above by repeatedly applying Sever with decreasing rr:

using at most O(log(rξ/(γ+σ0ε)))O(\log(r\xi/(\gamma+\sigma_{0}\sqrt{\varepsilon}))) calls to Sever.

To concretely use Theorem B.2, Corollary B.3, and Corollary B.4, in Section B.4 we show that the Assumption B.1 is satisfied with high probability under mild conditions on the distribution over the functions, after drawing polynomially many samples:

an ε\varepsilon-corrupted set of points f1,,fnf_{1},\ldots,f_{n} with high probability satisfy Assumption B.1.

The remaining subsections are dedicated to the proofs of Theorem B.2, Corollary B.3, Corollary B.4, and Proposition B.5.

B.1 Proof of Theorem B.2

If the samples satisfy (1) of Assumption B.1, and if S2n/3|S|\geq 2n/3 then if SS^{\prime} is the output of \textscFilter(S,τ,σ)\textsc{Filter}(S,\tau,\sigma), we have that

If the samples satisfy Assumption B.1, \textscFilter(S,τ,σ)=S\textsc{Filter}(S,\tau,\sigma)=S, and nS11εnn-|S|\leq 11\varepsilon n, then

Before we prove these lemmata, we show how together they imply Theorem B.2.

First, we note that the algorithm must terminate in at most nn iterations. This is easy to see as each iteration of the main loop except for the last must decrease the size of SS by at least 11.

Thus it suffices to prove these two lemmata. We first prove Lemma B.6:

This is the supremum over unit vectors vv of

and therefore, δ=O(σε)\delta=O(\sigma\sqrt{\varepsilon}) as desired. ∎

B.2 Proof of Corollary B.3

In this section, we show that the Sever algorithm finds an approximate global optimum for convex optimization in various settings, under Assumption B.1. We do so by simply applying the guarantees of Theorem B.2 in a fairly black box manner.

Before we proceed with the proof of Corollary B.3, we record a simple lemma that allows us to translate an approximate critical point guarantee to an approximate global optimum guarantee:

If ff is ξ\xi-strongly convex, then f(x)f(y)2δ2/ξ|f(x)-f(y)|\leq 2\delta^{2}/\xi and xy22δ/ξ\|x-y\|_{2}\leq 2\delta/\xi.

Let r=xy2>0r=\|x-y\|_{2}>0 and g(t)=f(x+tv)g(t)=f(x+tv). We have that g(0)=f(x),g(r)=f(y)g(0)=f(x),g(r)=f(y) and that gg is convex (or ξ\xi-strongly convex) with g(0)δg^{\prime}(0)\geq-\delta and g(r)δg^{\prime}(r)\leq\delta. By convexity, the derivative of gg is increasing on [0,r][0,r] and therefore g(t)δ|g^{\prime}(t)|\leq\delta for all t[0,r]t\in[0,r]. This implies that

To show the second part of the lemma, we note that if gg is ξ\xi-strongly convex that g(t)ξg^{\prime\prime}(t)\geq\xi for all tt. This implies that g(r)>g(0)+ξrg^{\prime}(r)>g^{\prime}(0)+\xi r. Since g(r)g(0)2δg^{\prime}(r)-g^{\prime}(0)\leq 2\delta, we obtain that r2δ/ξr\leq 2\delta/\xi, from which the second statement follows. ∎

By applying the algorithm of Theorem B.2, we can find a point ww that is a γ=def(γ+O(σε))\gamma^{\prime}\stackrel{{\scriptstyle\text{def}}}{{=}}(\gamma+O(\sigma\sqrt{\varepsilon}))-approximate critical point of f\overline{f}, where σ=defσ0+σ1ww2\sigma\stackrel{{\scriptstyle\text{def}}}{{=}}\sigma_{0}+\sigma_{1}\|w^{\ast}-w\|_{2}. That is, for any unit vector vv pointing towards the interior of H\mathcal{H}, we have that vf(w)γ.v\cdot\nabla\overline{f}(w)\geq-\gamma^{\prime}.

To prove (i), we apply Lemma B.8 to f\overline{f} at ww which gives that

To prove (ii), we apply Lemma B.8 to f\overline{f} at ww which gives that

Plugging in parameters appropriately then immediately gives the desired bound. ∎

B.3 Proof of Corollary B.4

We apply Sever iteratively starting with a domain H1=H\mathcal{H}_{1}=\mathcal{H} and radius r1=rr_{1}=r. After each iteration, we know the resulting point is close to ww^{\ast} will be able to reduce the search radius.

At step ii, we have a domain of radius rir_{i}. As in the proof of Corollary B.3 above, we apply algorithm of Theorem B.2, we can find a point wiw_{i} that is a γi=def(γ+O(σiε))\gamma^{\prime}_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}(\gamma+O(\sigma^{\prime}_{i}\sqrt{\varepsilon}))-approximate critical point of f\overline{f}, where σi=defσ0+σ1ri\sigma^{\prime}_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}\sigma_{0}+\sigma_{1}r_{i}. Then using Lemma B.8, we obtain that wiw22γi/ξ\|w_{i}-w^{\ast}\|_{2}\leq 2\gamma^{\prime}_{i}/\xi.

Now we can define Hi+1\mathcal{H}_{i+1} as the intersection of H\mathcal{H} and the ball of radius ri+1=2γi/ξr_{i+1}=2\gamma^{\prime}_{i}/\xi around wiw_{i} and repeat using this domain. We have that ri+1=2γi/ξ=2γ/ξ+O(σ0ε/ξ+σ1εri/ξ)r_{i+1}=2\gamma^{\prime}_{i}/\xi=2\gamma/\xi+O(\sigma_{0}\sqrt{\varepsilon}/\xi+\sigma_{1}\sqrt{\varepsilon}r_{i}/\xi). Now if we choose the constant CC such that the constant in this O()O() is C/4C/4, then using our assumption that ξ2σ1ε\xi\geq 2\sigma_{1}\sqrt{\varepsilon}, we obtain that

Now if ri8γ/ξ+2Cσ0ε/ξr_{i}\geq 8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi, then we have ri+1ri/2r_{i+1}\leq r_{i}/2 and if ri8γ/ξ+2Cσ0ε/ξr_{i}\leq 8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi then we also have ri+18γ/ξ+2Cσ0ε/ξr_{i+1}\leq 8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi . When rir_{i} is smaller than this we stop and output wiw_{i}. Thus we stop in at most O(log(r)log(8γ/ξ+2Cσ0ε/ξ))=O(log(rξ/(γ+σ0ε))O(\log(r)-\log(8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi))=O(\log(r\xi/(\gamma+\sigma_{0}\sqrt{\varepsilon})) iterations and have ri=O(γ/ξ+Cσ0ε)r_{i}=O(\gamma/\xi+C\sigma_{0}\sqrt{\varepsilon}). But then γi=γ+O(σiε))γ+C(σ0+σ1ri)ε/8=O(γ+σ0ε).\gamma^{\prime}_{i}=\gamma+O(\sigma^{\prime}_{i}\sqrt{\varepsilon}))\leq\gamma+C(\sigma_{0}+\sigma_{1}r^{\prime}_{i})\sqrt{\varepsilon}/8=O(\gamma+\sigma_{0}\sqrt{\varepsilon}). Using Lemma B.8 we obtain that

as required. The bound on w^w2\|\widehat{w}-w^{*}\|_{2} follows similarly.

While we don’t give explicit bounds on the number of calls to the approximate learner needed by Sever, such bounds can be straightforwardly obtained under appropriate assumptions on the fif_{i} (see, e.g., the following subsection). Two remarks are in order. First, in this case we cannot take advantage of assumptions that only hold at f\overline{f} but might not on the corrupted average ff. Second, our algorithm can take advantage of a closed form for the minimum. For example, for the case of linear regression considered in Section E, fif_{i} is not Lipschitz with a small constant if xix_{i} is far from the mean, but there is a simple closed form for the minimum of the least squares loss.

B.4 Proof of Proposition B.5

We will proceed by a cover argument. First we claim that for each wHw\in\mathcal{H} that (3) and (4) hold with high probability. For Equation (3), it suffices to show that for each unit vector vv in a cover N\mathcal{N} of size 2O(d)2^{O(d)} of the sphere that

Since v(f(w)f)|v\cdot(\nabla f(w)-\overline{f})| is always bounded by LL, Equation (5) holds for each v,wv,w with probability at least 1exp(Ω(nσ2/L2))1-\exp(-\Omega(n\sigma^{2}/L^{2})) by a Chernoff bound (noting that the removal of an ε\varepsilon-fraction of points cannot increase this by much). Similarly, to show Equation 4, it suffices to show that for each such vv that

A Chernoff bound implies that with probability 1exp(Ω(nσ2ε/L2))1-\exp(-\Omega(n\sigma^{2}\varepsilon/L^{2})) that the average over our original set of ff’s of (v(f(w)f))(v\cdot(\nabla f(w)-\overline{f})) is O(σε)O(\sigma\sqrt{\varepsilon}). Assuming that Equation (5) holds, removing an ε\varepsilon-fraction of these ff’s cannot change this value by more than O(σε)O(\sigma\sqrt{\varepsilon}). By union bounding over N\mathcal{N} and standard net arguments, this implies that Equations (3) and (4) hold with probability 1exp(Ω(dnσ2ε/L2))1-\exp(\Omega(d-n\sigma^{2}\varepsilon/L^{2})) for any given ww.

To show that our conditions hold for all wHw\in\mathcal{H}, we note that by β\beta-smoothness, if Equation (4) holds for some ww, it holds for all other ww^{\prime} in a ball of radius σ2ε/β\sqrt{\sigma^{2}\varepsilon}/\beta (up to a constant multiplicative loss). Similarly, if Equation (3) holds at some ww, it holds with bound σ2I\sigma^{2}I for all ww^{\prime} in a ball of radius σ2/(2Lβ)\sigma^{2}/(2L\beta). Therefore, if Equations (3) and (4) hold for all ww in a min(σ2ε/β,σ/(2Lβ))\min(\sqrt{\sigma^{2}\varepsilon}/\beta,\sigma/(2L\beta))-cover of H\mathcal{H}, the assumptions of Theorem B.2 will hold everywhere. Since we have such covers of size exp(O(dlog(rβL/(σ2ε))))\exp(O(d\log(r\beta L/(\sigma^{2}\varepsilon)))), by a union bound, this holds with high probability if

Appendix C Analysis of Sever for GLMs

A case of particular interest is that of Generalized Linear Models (GLMs):

A sample from this GLM is given by fX,Y(w)f_{X,Y}(w) where (X,Y)Dxy(X,Y)\sim D_{xy}.

Our goal, as usual, is to approximately minimize f\overline{f} given ε\varepsilon-corrupted samples from DxyD_{xy}. Throughout this section we assume that H\mathcal{H} is contained in the ball of radius rr around , i.e. HB(0,r)\mathcal{H}\subseteq B(0,r). Moreover, we will let w=arg minwHf(w)w^{*}=\operatorname*{arg\,min}_{w\in\mathcal{H}}\overline{f}(w) be a minimizer of f\overline{f} in H\mathcal{H}.

This case covers a number of interesting applications, including SVMs and logistic regression. Unfortunately, the tools developed in Appendix B do not seem to be able to cover this case in a simple manner. In particular, it is unclear how to demonstrate that Assumption B.1 holds after taking polynomially many samples from a GLM. To rectify this, in this section, we demonstrate a different deterministic regularity condition under which we show Sever succeeds, and we show that this condition holds after polynomially many samples from a GLM. Specifically, we will show that Sever succeeds under the following deterministic condition:

Equation (1) holds with σ1=0\sigma_{1}=0 and the same σ0\sigma_{0}, and

In this section, we will show the following two statements. The first demonstrates that Assumption C.1 implies that Sever succeeds, and the second shows that Assumption C.1 holds after polynomially many samples from a GLM. Formally:

If the link functions are ξ\xi-strongly convex, the algorithm finds a wHw\in\mathcal{H} such that

Suppose moreover that the following conditions all hold:

σY(0)1|\sigma_{Y}(0)|\leq 1 for all YYY\in\mathcal{Y}.

Then with probability at least 9/109/10 over the original set of samples, there is a set of (1ε)n(1-\varepsilon)n of the fif_{i} that satisfy Assumption C.1 on H\mathcal{H} with σ0=2\sigma_{0}=2, σ1=0\sigma_{1}=0 and σ2=1+r\sigma_{2}=1+r.and σ2=1+r\sigma_{2}=1+r.

As before, since Sever either terminates or throws away at least one sample, clearly it cannot run for more than nn iterations. Thus the runtime bound is simple, and it suffices to show correctness.

Let f1,,fnf_{1},\ldots,f_{n} satisfy Assumption C.1. Then with probability at least 9/109/10, Sever applied to f1,,fn,σ0f_{1},\ldots,f_{n},\sigma_{0} returns a point wHw\in\mathcal{H} which is a (γ+O(σ0ε))(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))-approximate critical point of f^\hat{f}.

so they satisfy (1), since the RHS is bounded by Assumption C.1. Thus this lemma follows from an application of Theorem B.2. ∎

With this critical lemma in place, we can now prove Theorem C.2:

Condition on the event that Lemma C.4 holds, and let wHw\in\mathcal{H} be the output of Sever. By Assumption C.1, we know that f^(w)f(w)σ2ε\hat{f}(w^{*})\geq\overline{f}(w^{*})-\sigma_{2}\sqrt{\varepsilon}, and moreover, ww^{*} is a γ+σ0ε\gamma+\sigma_{0}\sqrt{\varepsilon}-approximate critical point of f^\hat{f}.

Since each link function is convex, so is f^\hat{f}. Hence, by Lemma B.8, since ww is a (γ+O(σ0ε))(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))-approximate critical point of f^\hat{f}, we have f^(w)f^(w)r(γ+O(σ0ε))\hat{f}(w)-\hat{f}(w^{*})\leq r(\gamma+O(\sigma_{0}\sqrt{\varepsilon})). By Assumption B.1, this immediately implies that f(w)f(w)r(γ+O(σ0ε))+O(σ2ε)\overline{f}(w)-\overline{f}(w^{*})\leq r(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))+O(\sigma_{2}\sqrt{\varepsilon}), as claimed.

The bound for strongly convex functions follows from the exact argument, except using the statement in Lemma B.8 pertaining to strongly convex functions. ∎

C.2 Proof of Proposition C.3

We first note that fX,Y(w)=XσY(wX).\nabla f_{X,Y}(w)=X\sigma_{Y}^{\prime}(w\cdot X). Thus, under Assumption C.1, we have for any vv that

In particular, since this last expression is independent of ww, we only need to check this single matrix bound.

Thus, for nn at least a sufficiently large multiple of dlog(dr/ε)/εd\log(dr/\varepsilon)/\varepsilon, this holds for all ww in our cover of H\mathcal{H} with high probability. This completes the proof. ∎

Appendix D An Alternative Algorithm: Robust Filtering in Each Iteration

In this section, we describe another algorithm for robust stochastic optimization. This algorithm uses standard robust mean estimation techniques to compute approximate gradients pointwise, which it then feeds into a standard projective gradient descent algorithm. This algorithm in practice turns out to be somewhat slower than the one employed in the rest of this paper, because it employs a filtering algorithm at every step of the projective gradient descent, and does not remember which points were filtered between iterations. On the other hand, we present this algorithm for two reasons. Firstly, because it is a conceptually simpler interpretation of the main ideas of this paper, and secondly, because the algorithm works under somewhat more general assumptions. In particular, this algorithm only requires that for each wHw\in\mathcal{H} that there is a corresponding good set of functions, rather than that there exists a single good set that works simultaneously for all ww.

In particular, we can make do with the following somewhat weaker assumption:

We make essential use of the following result, which appears in both [DKK+17, SCV18]:

Our general robust algorithm for stochastic optimization will make calls to Algorithm A{\cal A} in a black-box manner, as well as to the projection operator onto H\mathcal{H}. We will measure the cost of our algorithm by the total number of such calls.

While it is not needed for the theoretical results established in this subsection, we note that the robust mean estimation algorithm of [DKK+17] relies on an iterative outlier removal method only requiring basic eigenvalue computations (SVD), while the [SCV18] algorithm employs semidefinite programming. In our experiments, we use the algorithm in [DKK+17] and variants thereof.

Using the above black-box, together with known results on convex optimization with errors, we obtain the following meta-theorem:

We note that by applying Algorithm A\mathcal{A} on {fi(w)}\{\nabla f_{i}(w)\}, we can find an approximation to f(w)\nabla\overline{f}(w) with error O(σε)O(\sigma\sqrt{\varepsilon}). We note that standard projective gradient descent algorithms can be made to run efficiently even if the gradients given are only approximate, and this can be used to find our O(σε)O(\sigma\sqrt{\varepsilon})-approximate critical point. ∎

In this section we give a brief comparison of this algorithm to Sever. The algorithm presented in this section is much simpler to state, and also requires weaker conditions on the data. However, because the algorithms work in somewhat different settings, the comparison is a bit delicate, so we explain in more detail below.

There is a major conceptual difference between these two algorithms: namely, Sever works with a black-box non-robust learner, and requires the filter algorithm for robust mean estimation. In contrast, the algorithm in this section works with a black-box robust mean estimation algorithm, and plugs it into a specific non-robust learning algorithm, specifically (approximate) stochastic gradient descent. When instantiated with the same primitives, these algorithms have similar theoretical runtime guarantees.

However, in the practical implementation, we prefer Sever for a couple of reasons. First, in practice we find that in practice, Sever often only requires a constant number of runs of the base black-box learner, and so incurs only a constant factor overhead. In contrast, the algorithm presented in this section requires at least linear time per iteration of SGD, since it needs to run a robust mean estimation algorithm on the entire dataset (and the total number of iterations needed is comparable). In contrast, SGD typically runs in constant time per iteration, so this presents a major bottleneck for scalability.

Second, we find it is much more useful from a practical point of view to allow for black-box non-robust learners, than to allow for black-box robust mean estimation algorithms. This is simply because the former allows Sever to be much more problem-specific, and allow for optimizations for each individual learning problem. For instance, it is what allows us to use optimized libraries in our experiments for linear regression and SVM. In contrast, there is relatively little reward to allow for black-box robust mean estimation algorithms, as not only are there relatively few options, but also this does not allow us really to specialize the algorithm to the problem at hand.

Appendix E Applications of the General Algorithm

In this section, we present three concrete applications of our general robust algorithm. In particular, we describe how to robustly optimize models for linear regression, support vector machines, and logistic regression, in Sections E.1, E.2, E.3, respectively.

Given the model for linear regression described above, assume the following conditions for DeD_{e} and DxD_{x}:

VareDe[e]ξ\operatorname{Var}_{e\sim D_{e}}[e]\leq\xi;

Our main result for linear regression is the following:

Let ε>0\varepsilon>0, and let DxyD_{xy} be a distribution over pairs (X,Y)(X,Y) which satisfies the conditions of Assumption E.1. Suppose we are given O(d5ε2)O\left(\frac{d^{5}}{\varepsilon^{2}}\right) ε\varepsilon-noisy samples from DxyD_{xy}. Then in either of the following two cases, there exists an algorithm that, with probability at least 9/109/10, produces a w^\widehat{w} with the following guarantees:

If w2r\|w^{*}\|_{2}\leq r, then f(w^)f(w)+O(((ξ+ε)r+Cr2)ε)\overline{f}(\widehat{w})\leq\overline{f}(w^{*})+O(((\sqrt{\xi}+\sqrt{\varepsilon})r+\sqrt{C}r^{2})\sqrt{\varepsilon}).

The proof will follow from two lemmas (proved in Section E.1.1 and E.1.2, respectively). First, we will bound the covariance of the gradient, in Lemma E.3:

With this in hand, we can prove Lemma E.4, giving us a polynomial sample complexity which is sufficient to satisfy the conditions of Assumption B.1.

Suppose DxyD_{xy} satisfies the conditions of Assumption E.1. Given O(d5/ε2)O(d^{5}/\varepsilon^{2}) ε\varepsilon-noisy samples from DxyD_{xy}, then with probability at last 9/109/10, they satisfy Assumption B.1 with parameters σ0=30ξ+ε\sigma_{0}=30\sqrt{\xi}+\sqrt{\varepsilon} and σ1=18C+1\sigma_{1}=18\sqrt{C+1}.

The proof concludes by applying Corollary B.4 or case (i) of Corollary B.3 for the first and second cases respectively.

Note that for this setting we have that f(w,z)=f(w,x,y)=(yw,x)2f(w,z)=f(w,x,y)=(y-\langle w,x\rangle)^{2}. We then have that wf(w,z)=2(ww,x+e)x\nabla_{w}f(w,z)=-2(\langle w^{\ast}-w,x\rangle+e)x. Our main claim is the following:

where we again used the fact that the noise ee is independent of xx and its expectation is zero.

Given the above claim, we can bound from above the spectral norm of the covariance matrix of the gradients as follows: Specifically, for a unit vector vv, the quantity vTCov[wf(w,z)]vv^{T}\operatorname{Cov}[\nabla_{w}f(w,z)]v is bounded from above by a constant times the following quantities:

The second term is at most the upper bound of the variance of the noise ξ\xi times σ2\sigma^{2}.

The third term is at most vTΣ(ww)(ww)TΣvv^{T}\Sigma(w^{\ast}-w)(w^{\ast}-w)^{T}\Sigma v, which by our bounded covariance assumption is at most σ4ww22.\sigma^{4}\|w^{\ast}-w\|_{2}^{2}.

This gives the parameters in the meta-theorem.

E.1.2 Proof of Lemma E.4

We will repeatedly make use of the following, which bounds how much removing points or probability mass affects an expectation in terms of its variance:

We can thus take σ0=30ξ+ε\sigma_{0}=30\sqrt{\xi}+\sqrt{\varepsilon} and σ1=18C+116C\sigma_{1}=18\sqrt{C+1}\geq 16\sqrt{C} to get (2).

To get both (2) and (1) hold with σ0=30ξ+ε\sigma_{0}=30\sqrt{\xi}+\sqrt{\varepsilon} and σ1=18C+1)\sigma_{1}=18\sqrt{C+1}). This happens with probability at least 9/109/10 by a union bound on the probabilistic assumptions above.

E.2 Support Vector Machines

In this section, we demonstrate how our results apply to learning support vector machines (i.e., halfspaces under hinge loss). In particular, we describe how SVMs fit into the GLM framework described in Section C.

One technical point is that fif_{i} does not have a gradient everywhere – instead, we will be concerned with the sub-gradients of the fif_{i}’s. All our results which operate on the gradients also work for sub-gradients. To be precise, we will take the sub-gradient to be when the gradient is undefined:

Let fi\nabla f_{i} be the sub-gradient of fi(w)f_{i}(w) with respect to ww, where fi=yixi\nabla f_{i}=-y_{i}x_{i} if yi(wxi)<1y_{i}(w\cdot x_{i})<1, and otherwise.

To get a bound on the error of hinge loss, we will need to assume the marginal distribution DxD_{x} is anti-concentrated.

A distribution is δ\delta-anticoncentrated if at most an O(δ)O(\delta)-fraction of its probability mass is within Euclidean distance δ\delta of any hyperplane.

Given the model for SVMs as described above, assume the following conditions for the marginal distribution DxD_{x}:

DxD_{x} is ε1/4\varepsilon^{1/4}-anticoncentrated.

Our main result on SVMs is the following:

Let ε>0\varepsilon>0, and let DxyD_{xy} be a distribution over pairs (X,Y)(X,Y), where the marginal distribution DxD_{x} satisfies the conditions of Assumption E.7. Then there exists an algorithm that with probability 9/109/10, given O(dlog(d/ε)/ε)O(d\log(d/\varepsilon)/\varepsilon) ε\varepsilon-noisy samples from DxyD_{xy}, returns a w^\widehat{w} such that for any ww^{\ast},

Our approach will be to fit this problem into the GLM framework developed in Section C. First, we will restrict our search over ww to H\mathcal{H}, a ball of radius r=ε1/4r=\varepsilon^{-1/4}. As we argue in Lemma E.9, this restriction comes at a cost of at most O(ε1/4)O(\varepsilon^{1/4}) in our algorithm’s loss. With this restriction, we will argue that the problem satisfies the conditions of Proposition C.3. This allows us to argue that, with a polynomial number of samples, we can obtain a set of fif_{i}’s satisfying the conditions of Assumption C.1. This will allow us to apply Theorem C.2, concluding the proof.

We start by showing that, due to anticoncentration of DD, there is a wHw^{\prime}\in\mathcal{H} with loss close to ww^{*}:

Otherwise, we break into case analysis, based on the value of (x,y)(x,y):

wx>1|w^{\prime}\cdot x|>1: If y(wx)>1y(w^{\prime}\cdot x)>1, then L(w,(x,y))=L(w,(x,y))=0L(w^{\prime},(x,y))=L(w^{*},(x,y))=0. If y(wx)<1y(w^{\prime}\cdot x)<-1, then L(w,(x,y))=1y(wx)1y(wx)=L(w,(x,y))L(w^{\prime},(x,y))=1-y(w^{\prime}\cdot x)\leq 1-y(w^{*}\cdot x)=L(w^{*},(x,y)). Both cases use the fact that w2<w2\|w^{\prime}\|_{2}<\|w^{\ast}\|_{2}.

wx1|w^{\prime}\cdot x|\leq 1: In this case, we have that L(w,(x,y))2L(w^{\prime},(x,y))\leq 2. Since L(w,(x,y))0L(w^{*},(x,y))\geq 0, we have that L(w,(x,y))L(w,(x,y))+2L(w^{\prime},(x,y))\leq L(w^{*},(x,y))+2.

We first show that this problem fits into the GLM framework, in particular, satisfying the conditions of Proposition C.3. The link function is σy(t)=max{0,1yt}\sigma_{y}(t)=\max\{0,1-yt\}, giving us the loss function L(w,(x,y))=σy(wx)L(w,(x,y))=\sigma_{y}(w\cdot x). We let H\mathcal{H} be the set w2ε1/4\|w\|_{2}\leq\varepsilon^{-1/4}, giving us the parameter r=ε1/4r=\varepsilon^{-1/4}. Condition 1 is satisfied by Assumption E.7. For y{1,1}y\in\{-1,1\}, σy(t)=0\sigma^{\prime}_{y}(t)=0 for yt1yt\geq 1 and σy(t)=y\sigma^{\prime}_{y}(t)=-y for yt<1yt<1. Thus we have that σ1(t)1|\sigma^{\prime}_{1}(t)|\leq 1 for all tt and yy, satisfying Condition 2. Finally, one can observe that σy(0)=1\sigma_{y}(0)=1 for all yy, satisfying Condition 3. Thus we can apply Proposition C.3: if we take O(dlog(dr/ε)/ε)O(d\log(dr/\varepsilon)/\varepsilon) ε\varepsilon-corrupted samples, then they satisfy Assumption C.1 on H\mathcal{H} with σ0=2\sigma_{0}=2, σ1=0\sigma_{1}=0 and σ2=1+ε1/4\sigma_{2}=1+\varepsilon^{-1/4}, with probability 9/109/10.

Now we can apply the algorithm of Theorem C.2. Since the loss is convex, we get a vector w^\widehat{w} with f(w^)f(w)=O((σ0r+σ1r2+σ2)ε)=O((2ε1/4+ε1/4)ε)=O(ε1/4)\overline{f}(\widehat{w})-\overline{f}({w^{*}}^{\prime})=O((\sigma_{0}r+\sigma_{1}r^{2}+\sigma_{2})\sqrt{\varepsilon})=O((2\varepsilon^{-1/4}+\varepsilon^{-1/4})\sqrt{\varepsilon})=O(\varepsilon^{1/4}) where w{w^{*}}^{\prime} is the minimizer of f\overline{f} on H\mathcal{H}.

We thus have that f(w^)f(w)+O(ε1/4)f(w)+O(ε1/4)f(w)+O(ε1/4)\overline{f}(\hat{w})\leq\overline{f}({w^{*}}^{\prime})+O(\varepsilon^{1/4})\leq\overline{f}(w^{\prime})+O(\varepsilon^{1/4})\leq\overline{f}(w^{*})+O(\varepsilon^{1/4}). The second inequality follows because w{w^{*}}^{\prime} is the minimizer of f\overline{f} on H\mathcal{H}, and the third inequality follows from Lemma E.9. ∎

E.3 Logistic Regression

In this section, we demonstrate how our results apply to logistic regression. In particular, we describe how logistic regression fits into the GLM framework described in Section C.

The gradient of this function is L(w,(x,y))=12(ϕ(wx)ϕ(wx)y)x\nabla L(w,(x,y))=\frac{1}{2}(\phi(w\cdot x)-\phi(-w\cdot x)-y)x. The goal is to find a w^\widehat{w} approximately minimizing the objective function

Given the model for logistic regression as described above, assume the following conditions for the marginal distribution DxD_{x}:

DxD_{x} is ε1/4log(1/ε)\varepsilon^{1/4}\sqrt{\log(1/\varepsilon)}-anticoncentrated.

We can get a similar result to that for hinge loss for logistic regression:

Let ε>0\varepsilon>0, and let DxyD_{xy} be a distribution over pairs (X,Y)(X,Y), where the marginal distribution DxD_{x} satisfies the conditions of Assumption E.10. Then there exists an algorithm that with probability 9/109/10, given O(dlog(d/ε)/ε)O(d\log(d/\varepsilon)/\varepsilon) ε\varepsilon-noisy samples from DxyD_{xy}, returns a w^\widehat{w} such that for any ww^{\ast},

The approach is very similar to that of Theorem E.8, which we repeat here for clarity. First, we will restrict our search over ww to H\mathcal{H}, a ball of radius r=ε1/4log(1/ε)r=\varepsilon^{-1/4}\sqrt{\log(1/\varepsilon)}. As we argue in Lemma E.12, this restriction comes at a cost of at most O(ε1/4log(1/ε))O(\varepsilon^{1/4}\sqrt{\log(1/\varepsilon)}) in our algorithm’s loss. With this restriction, we will argue that the problem satisfies the conditions of Proposition C.3. This allows us to argue that, with a polynomial number of samples, we can obtain a set of fif_{i}’s satisfying the conditions of Assumption C.1. This will allow us to apply Theorem C.2, concluding the proof.

We start by showing that, due to anticoncentration of DD, there is a wHw^{\prime}\in\mathcal{H} with loss close to ww^{*}:

Recalling that ϕ=1/(1+exp(t))\phi=1/(1+\exp(-t)), we have that ln(ϕ(t)ϕ(t))=ln(exp(t)+exp(t)+2)-\ln(\phi(t)\phi(-t))=\ln(\exp(t)+\exp(-t)+2). Since exp(t)+exp(t)+2exp(t)\exp(t)+\exp(-t)+2\geq\exp(|t|), we have tln(ϕ(t)ϕ(t))|t|\leq-\ln(\phi(t)\phi(-t)). On the other hand, ln(exp(t)+exp(t)+2)=t+ln(1+2exp(t)+exp(2t))t+ln(1+3exp(t))t+3exp(t)\ln(\exp(t)+\exp(-t)+2)=|t|+\ln(1+2\exp(-|t|)+\exp(-2|t|))\leq|t|+\ln(1+3\exp(-|t|))\leq|t|+3\exp(-|t|). ∎

With this in hand, we can conclude with the proof of Theorem E.11.

We first show that this problem fits into the GLM framework, in particular, satisfying the conditions of Proposition C.3. The link function is σy(t)=12(ln(ϕ(t)ϕ(t))yt)\sigma_{y}(t)=\frac{1}{2}(-\ln(\phi(t)\phi(-t))-yt), giving us the loss function L(w,(x,y))=σy(wx)L(w,(x,y))=\sigma_{y}(w\cdot x). We let H\mathcal{H} be the set w2ε1/4ln(1/ε)\|w\|_{2}\leq\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}, giving us the parameter r=ε1/4ln(1/ε)r=\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}. Condition 1 is satisfied by Assumption E.10. For y{1,1}y\in\{-1,1\}, σy(t)=12(ϕ(t)ϕ(t)y)\sigma^{\prime}_{y}(t)=\frac{1}{2}(\phi(t)-\phi(-t)-y), which gives that σy(t)1|\sigma^{\prime}_{y}(t)|\leq 1 for all tt and yy, satisfying Condition 2. Finally, σy(0)=ln2<1\sigma_{y}(0)=\ln 2<1 for all yy, satisfying Condition 3. Thus we can apply Proposition C.3: if we take O(dlog(dr/ε)/ε)O(d\log(dr/\varepsilon)/\varepsilon) ε\varepsilon-corrupted samples, then they satisfy Assumption C.1 on H\mathcal{H} with σ0=2\sigma_{0}=2, σ1=0\sigma_{1}=0 and σ2=1+ε1/4ln(1/ε)\sigma_{2}=1+\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}, with probability 9/109/10.

Now we can apply the algorithm of Theorem C.2. Since the loss is convex, we get a vector w^\widehat{w} with f(w^)f(w)=O((σ0r+σ1r2+σ2)ε)=O((2ε1/4ln(1/ε)+ε1/4ln(1/ε))ε)=O(ε1/4ln(1/ε))\overline{f}(\widehat{w})-\overline{f}({w^{*}}^{\prime})=O((\sigma_{0}r+\sigma_{1}r^{2}+\sigma_{2})\sqrt{\varepsilon})=O((2\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}+\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)})\sqrt{\varepsilon})=O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)}) where w{w^{*}}^{\prime} is the minimizer of f\overline{f} on H\mathcal{H}.

We thus have that f(w^)f(w)+O(ε1/4ln(1/ε))f(w)+O(ε1/4ln(1/ε))f(w)+O(ε1/4ln(1/ε))\overline{f}(\hat{w})\leq\overline{f}({w^{*}}^{\prime})+O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)})\leq\overline{f}(w^{\prime})+O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)})\leq\overline{f}(w^{*})+O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)}). The second inequality follows because w{w^{*}}^{\prime} is the minimizer of f\overline{f} on H\mathcal{H}, and the third inequality follows from Lemma E.12. ∎

Appendix F Additional Experimental Results

In this section, we provide additional plots of our experimental results, comparing with all baselines considered.