Sever: A Robust Meta-Algorithm for Stochastic Optimization

Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, Alistair Stewart

Introduction

Learning in the presence of outliers is a ubiquitous challenge in machine learning; nevertheless, most machine learning methods are very sensitive to outliers in high dimensions. The focus of this work is on designing algorithms that are outlier robust while remaining competitive in terms of accuracy and running time.

We highlight two motivating applications. The first is biological data (such as gene expression data), where mislabeling or measurement errors can create systematic outliers [RPW+02, LAT+08] that require painstaking manual effort to remove [PLJD10]. Detecting outliers in such settings is often important either because the outlier observations are of interest themselves or because they might contaminate the downstream statistical analysis. The second motivation is machine learning security, where outliers can be introduced through data poisoning attacks [BNJT10] in which an adversary inserts fake data into the training set (e.g., by creating a fake user account). Recent work has shown that for high-dimensional datasets, even a small fraction of outliers can substantially degrade the learned model [BNL12, NPXNR14, KL17, SKL17, KSL18].

Crucially, in both the biological and security settings above, the outliers are not “random” but are instead highly correlated, and could have a complex internal structure that is difficult to model. This leads us to the following conceptual question underlying the present work: Can we design training algorithms that are robust to the presence of an $\varepsilon$ -fraction of arbitrary (and potentially adversarial) outliers?

Estimation in the presence of outliers is a prototypical goal in robust statistics and has been systematically studied since the pioneering work of Tukey [Tuk60]. Popular methods include RANSAC [FB81], minimum covariance determinant [RD99], removal based on $k$ -nearest neighbors [BKNS00], and Huberizing the loss [Owe07] (see [HA04] for a comprehensive survey). However, these classical methods either break down in high dimensions, or only handle “benign” outliers that are obviously different from the rest of the data (see Section 1.1 for futher discussion of these points).

Motivated by this, recent work in theoretical computer science has developed efficient robust estimators for classical problems such as linear classification [KLS09, ABL14], mean and covariance estimation [DKK+16, LRV16], clustering [CSV17], and regression [BJK15, BJKK17, BDLS17]. Nevertheless, the promise of practical high-dimensional robust estimation is yet to be realized; indeed, the aforementioned results generally suffer from one of two shortcomings–either they use sophisticated convex optimization algorithms that do not scale to large datasets, or they are tailored to specific problems of interest or specific distributional assumptions on the data, and hence do not have good accuracy on real data.

In this work, we address these shortcomings. We propose an algorithm, Sever, that is:

Robust: it can handle arbitrary outliers with only a small increase in error, even in high dimensions.

General: it can be applied to most common learning problems including regression and classification, and handles non-convex models such as neural networks.

Practical: the algorithm can be implemented with standard machine learning libraries.

At a high level, our algorithm (depicted in Figure 1 and described in detail in Section 2.2) is a simple “plug-in” outlier detector–first, run whatever learning procedure would be run normally (e.g., least squares in the case of linear regression). Then, consider the matrix of gradients at the optimal parameters, and compute the top singular vector of this matrix. Finally, remove any points whose projection onto this singular vector is too large (and re-train if necessary).

Despite its simplicity, our algorithm possesses strong theoretical guarantees: As long as the real (non-outlying) data is not too heavy-tailed, Sever is provably robust to outliers–see Section 2 for detailed statements of the theory. At the same time, we show that our algorithm works very well in practice and outperforms a number of natural baseline outlier detectors. In line with our original motivating biological and security applications, we implement our method on two tasks–a linear regression task for predicting protein activity levels [OSB+18], and a spam classification task based on emails from the Enron corporation [MAP06]. Even with a small fraction of outliers, baseline methods perform poorly on these datasets; for instance, on the Enron spam dataset with a $1\%$ fraction of outliers, baseline errors range from $13.4\%$ to $20.5\%$ , while Sever incurs only $7.3\%$ error (in comparison, the error is $3\%$ in the absence of outliers). Similarly, on the drug design dataset, with $10\%$ corruptions, Sever achieved $1.42$ mean-squared error test error, compared to $1.51$ - $2.33$ for the baselines, and $1.23$ error on the uncorrupted dataset.

As mentioned above, the myriad classical approaches to robust estimation perform poorly in high dimensions or in the presence of worst-case outliers. For instance, RANSAC [FB81] works by removing enough points at random that no outliers remain with decent probability; since we need at least $d$ points to fit a $d$ -dimensional model, this requires the number of outliers to be $O(1/d)$ . $k$ -nearest neighbors [BKNS00] similarly suffers from the curse of dimensionality when $d$ is large. The minimum covariance determinant estimator [RD99] only applies when the number of data points $n$ exceeds $2d$ , which does not hold for the datasets we consider (it also has other issues such as computational intractability). A final natural approach is to limit the effect of points with large loss (via e.g. Huberization [Owe07]), but as [KSL18] show (and we confirm in our experiments), correlated outliers often have lower loss than the real data under the learned model.

These issues have motivated work on high-dimensional robust statistics going back to Tukey [Tuk75]. However, it was not until much later that efficient algorithms with favorable properties were first proposed. [KLS09] gave the first efficient algorithms for robustly classification under the assumption that the distribution of the good data is isotropic and log-concave. Subsequently, [ABL14] obtained an improved and nearly optimal robust algorithm for this problem. Two concurrent works [DKK+16, LRV16] gave the first efficient robust estimators for several other tasks including mean and covariance estimation. There has since been considerable study of algorithmic robust estimation in high dimensions, including learning graphical models [DKS16], understanding computation-robustness tradeoffs [DKS17c, DKK+18], establishing connections to PAC learning [DKS17a], tolerating more noise by outputting a list of hypotheses [CSV17, MV18, DKS17b], robust estimation of discrete structures [Ste17, QV18, SCV18], and robust estimation via sum-of-squares [KS17b, HL17, KS17a].

Despite this progress, these recent theoretical papers typically focus on designing specialized algorithms for specific settings (such as mean estimation or linear classification for specific families of distributions) rather than on designing general algorithms. The only exception is [CSV17], which provides a robust meta-algorithm for stochastic convex optimization in a similar setting to ours. However, that algorithm (i) requires solving a large semidefinite program and (ii) incurs a significant loss in performance relative to standard training even in the absence of outliers. On the other hand, [DKK+17] provide a practical implementation of the robust mean and covariance estimation algorithms of [DKK+16], but do not consider more general learning tasks.

A number of papers [NTN11, NT13, BJK15, BJKK17] have proposed efficient algorithms for a type of robust linear regression. However, these works consider a restrictive corruption model that only allows adversarial corruptions to the responses (but not the covariates). On the other hand, [BDLS17] studies (sparse) linear regression and, more broadly, generalized linear models (GLMs) under a robustness model very similar to the one considered here. The main issues with this algorithm are that (i) it requires running the ellipsoid method (hence does not scale) and (ii) it crucially assumes Gaussianity of the covariates, which is unlikely to hold in practice.

In a related direction, [SKL17] provide a method for analyzing outlier detectors in the context of linear classification, either certifying robustness or generating an attack if the learner is not robust. The outlier detector they analyze is brittle in high dimensions, motivating the need for the robust algorithms presented in the current work. Later work by the same authors showed how to bypass a number of common outlier detection methods [KSL18]. We use these recent strong attacks as part of our evaluation and show that our algorithm is more robust.

[PSBR18] independently obtained a robust algorithm for stochastic convex optimization by combining gradient descent with robust mean estimation. This algorithm is similar to the one we present in Appendix D, and in that section we discuss in more detail the comparison between these two techniques. For the case of linear regression, [DKS19] provide efficient robust algorithms with near-optimal error guarantees under various distributional assumptions and establish matching computational-robustness tradeoffs.

Framework and Algorithm

In this section, we describe our formal framework as well as the Sever algorithm.

To help us learn the parameter vector $w^{\ast}$ , we have access to a training set of $n$ functions $f_{1:n}\stackrel{{\scriptstyle\text{def}}}{{=}}\{f_{1},\ldots,f_{n}\}$ . (For linear regression, we would have $f_{i}(w)=\frac{1}{2}(w\cdot x_{i}-y_{i})^{2}$ , where $(x_{i},y_{i})$ is an observed data point.) However, unlike the classical (uncorrupted) setting where we assume that $f_{1},\ldots,f_{n}\sim p^{\ast}$ , we allow for an $\varepsilon$ -fraction of the points to be arbitrary outliers:

In the $\varepsilon$ -contamination model, the adversary is allowed to both add and remove points. Our theoretical results hold in this strong robustness model. Our experimental evaluation uses corrupted instances in which the adversary is only allowed to add corrupted points. Additive corruptions essentially correspond to Huber’s contamination model [Hub64] in robust statistics.

Finally, we will often assume access to a black-box learner, which we denote by $\mathcal{L}$ , which takes in functions $f_{1},\ldots,f_{n}$ and outputs a parameter vector $w\in\mathcal{H}$ . We want to stipulate that $\mathcal{L}$ approximately minimizes $\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$ . For this purpose, we introduce the following definition:

We are now ready to define the notion of a $\gamma$ -approximate learner:

In other words, $\mathcal{L}$ always finds an approximate critical point of the empirical learning objective. We note that most common learning algorithms (such as stochastic gradient descent) satisfy the $\gamma$ -approximate learner property. For our example of linear regression, gradient descent could be performed using the gradient $\frac{1}{n}\sum_{i=1}^{n}x_{i}(w\cdot x_{i}-y_{i})$ . However, in some cases, a more efficient and direct method is to set the gradient equal to 0 and solve for $w$ . In our linear regression example, this gives us a closed form solution for the optimal parameter vector.

2 Algorithm and Theory

As outlined in Figure 1, our algorithm works by post-processing the gradients of a black-box learning algorithm. The basic intuition is as follows: we want to ensure that the outliers do not have a large effect on the learned parameters. Intuitively, for the outliers to have such an effect, their corresponding gradients should be (i) large in magnitude and (ii) systematically pointing in a specific direction. We can detect this via singular value decomposition–if both (i) and (ii) hold then the outliers should be responsible for a large singular value in the matrix of gradients, which allows us to detect and remove them.

This is shown more formally via the pseudocode in Algorithm 1.

For concreteness, we describe how the algorithm would work for our running example of linear regression. First, we would solve for the optimal parameter vector on the dataset, disregarding issues of robustness. Specifically, we let $\hat{w}$ be the solution to $\sum_{i=1}^{n}x_{i}(\hat{w}\cdot x_{i}-y_{i})=0$ : setting the gradient equal to will give us a critical point, as desired. We compute the average gradient, $\frac{1}{n}\sum_{i=1}^{n}x_{i}(\hat{w}\cdot x_{i}-y_{i}),$ and use this to compute the matrix of centered gradients $G$ . That is, the $j$ th row of $G$ , $G_{j}$ , is the vector $x_{j}(\hat{w}\cdot x_{j}-y_{j})-\frac{1}{n}\sum_{i=1}^{n}x_{i}(\hat{w}\cdot x_{i}-y_{i})$ . We compute the top right singular vector of $G$ , project the data into this direction, and square the resulting magnitudes to derive a score for each point: $\tau_{j}=(G_{j}\cdot v)^{2}$ . With these scores in place, we run Algorithm 2, to (randomly) remove some of the points with the largest scores. We re-run the entire procedure on this subset of points, until Algorithm 2 does not remove any points, at which point we terminate.

Our first theoretical result says that as long as the data is not too heavy-tailed, Sever will find an approximate critical point of the true function $\overline{f}$ , even in the presence of outliers.

Then our algorithm Sever applied to $f_{1},\ldots,f_{n},\sigma$ returns a point $w\in\mathcal{H}$ that, with probability at least $9/10$ , is a $(\gamma+O(\sigma\sqrt{\varepsilon}))$ -approximate critical point of $\overline{f}$ .

The key take-away from Theorem 2.1 is that the error guarantee has no dependence on the underlying dimension $d$ . In contrast, most natural algorithms incur an error that grows with $d$ , and hence have poor robustness in high dimensions.

We show that under some niceness assumptions on $p^{\ast}$ , the deterministic regularity conditions are satisfied with high probability with polynomially many samples:

The reader is referred to Proposition B.5 in the appendix for a detailed formal statement.

While Theorem 2.1 is very general and holds even for non-convex loss functions, we might in general hope for more than an approximate critical point. In particular, for convex problems, we can guarantee that we find an approximate global minimum. This follows as a corollary of Theorem 2.1:

If $\overline{f}$ is convex, the algorithm finds a $w\in\mathcal{H}$ such that $\overline{f}(w)-\overline{f}(w^{*})=O((\sigma\sqrt{\varepsilon}+\gamma)r)$ .

If $\overline{f}$ is $\xi$ -strongly convex, the algorithm finds a $w\in\mathcal{H}$ such that $\overline{f}(w)-\overline{f}(w^{*})=O\left((\varepsilon\sigma^{2}+\gamma^{2})/{\xi}\right)$ .

For our theory to hold, we need to use the randomized filtering algorithm shown in Algorithm 2 (which is essentially the robust mean estimation algorithm of [DKK+17]), and filter until the stopping condition in line 11 of Algorithm 1 is satisfied. However, in practice we found that the following simpler algorithm worked well: in each iteration simply remove the top $p$ fraction of outliers according to the scores $\tau_{i}$ , and instead of using a specific stopping condition, simply repeat the filter for $r$ iterations in total. This is the version of Sever that we use in our experiments in Section 3.

Another application allows us to use the least-squares loss function ( $L(w,(X,Y))=(Y-w\cdot X)^{2}$ ) to perform linear regression under somewhat more restrictive assumptions (see Theorem E.2 for the full statement):

3 Overview of Sever and its Analysis

In the robust setting, stochastic optimization becomes quite challenging: Even for the most basic special cases of this problem (e.g., mean estimation, linear regression) a single adversarially corrupted sample can substantially change the location of the minimum for $\hat{f}$ . Moreover, naive outlier removal methods can only tolerate a negligible fraction $\varepsilon$ of corruptions (corresponding to $\varepsilon=O(d^{-1/2})$ ).

A first idea to get around this obstacle is the following: We consider the standard (projected) gradient descent method used to find the minimum of $\hat{f}$ . This algorithm would proceed by repeatedly computing the gradient of $\hat{f}$ at appropriate points and using it to update the current location. The issue is that adversarial corruptions can completely compromise this algorithm’s behavior, since they can substantially change the gradient of $\hat{f}$ at the chosen points. The key observation is that approximating the gradient of $\overline{f}$ at a given point, given access to an $\varepsilon$ -corrupted set of samples, can be viewed as a robust mean estimation problem. We can thus use the robust mean estimation algorithm of [DKK+17], which succeeds under fairly mild assumptions about the good samples. Assuming that the covariance matrix of $\nabla f(w)$ , $f\sim p^{\ast}$ , is bounded, we can thus “simulate” gradient descent and compute an approximate minimum for $\overline{f}$ .

In summary, the first algorithmic idea is to use a robust mean estimation routine as a black-box in order to robustly estimate the gradient at each iteration of (projected) gradient descent. This yields a simple robust method for stochastic optimization with polynomial sample complexity and running time in a very general setting (See Appendix D for details.)

We are now ready to describe Sever (Algorithm 1) and the main insight behind it. Roughly speaking, Sever only calls our robust mean estimation routine (which is essentially the filtering method of [DKK+17] for outlier removal) each time the algorithm reaches an approximate critical point of $\hat{f}$ . There are two main motivations for this approach: First, we empirically observed that if we iteratively filter samples, keeping the subset with the samples removed, then few iterations of the filter remove points. Second, an iteration of the filter subroutine (Algorithm 2) is more expensive than an iteration of gradient descent. Therefore, it is advantageous to run many steps of gradient descent on the current set of corrupted samples between consecutive filtering steps. This idea is further improved by using stochastic gradient descent, rather than computing the average at each step.

An important feature of our analysis is that Sever does not use a robust mean estimation routine as a black box. In contrast, we take advantage of the performance guarantees of our filtering algorithm. The main idea for the analysis is as follows: Suppose that we have reached an approximate critical point $w$ of $\hat{f}$ and at this step we apply our filtering algorithm. By the performance guarantees of the latter algorithm we are in one of two cases: either the filtering algorithm removes a set of corrupted functions or it certifies that the gradient of $\hat{f}$ is “close” to the gradient of $\overline{f}$ at $w$ . In the first case, we make progress as we produce a “cleaner” set of functions. In the second case, our certification implies that the point $w$ is also an approximate critical point of $\overline{f}$ and we are done.

Experiments

In this section we apply Sever to regression and classification problems. Our code is available at https://github.com/hoonose/sever. As our base learners, we used ridge regression and an SVM, respectively. We implemented the latter as a quadratic program, using Gurobi [Gur16] as a backend solver and YALMIP [Löf04] as the modeling language.

In both cases, we ran the base learner and then extracted gradients for each data point at the learned parameters. We then centered the gradients and ran MATLAB’s svds method to compute the top singular vector $v$ , and removed the top $p$ fraction of points $i$ with the largest outlier score $\tau_{i}$ , computed as the squared magnitude of the projection onto $v$ (see Algorithm 1). We repeated this for $r$ iterations in total. For classification, we centered the gradients separately (and removed points separately) for each class, which improved performance.

We compared our method to six baseline methods. All but one of these all have the same high-level form as Sever (run the base learner then filter top $p$ fraction of points with the largest score), but use a different definition of the score $\tau_{i}$ for deciding which points to filter:

loss: remove points with large loss (measured at the parameters output by the base learner).

RANSAC: repeatedly subsample points uniformly at random, and find the best fit with the subsample. Then, choose the best fit amongst this set of learners. Note that this method is not “filter-based”.In practice, heuristics must often be applied to choose the best fit. In our experiments, we “cheat” slightly by in fact choosing the best fit post-hoc by reporting the best error achieved by any learner in this way. Despite strengthening RANSAC in this way, we observe that it still has poor performance.

Both ridge regression and SVM have a single hyperparameter (the regularization coefficient). We optimized this based on the uncorrupted data and then kept it fixed throughout our experiments. In addition, since the data do not already have outliers, we added varying amounts of outliers (ranging from $0.5\%$ to $10\%$ of the clean data); this process is described in more detail below.

We found that centering the data points decreased error noticeably on the drug discovery dataset, while scaling each coordinate to have variance $1$ decreased error by a small amount on the synthetic data. To center in the presence of outliers, we used the robust mean estimation algorithm from [DKK+17].

We devised a method of generating outliers that fools all of the baselines while still inducing high test error. At a high level, the outliers cause ridge regression to output $w=0$ (so the model always predicts $y=0$ ).

In our experiments we found that this method generated successful attacks as long as the fraction of outliers was at least roughly $2\%$ for synthetic data, and roughly $5\%$ for the drug discovery data.

In Figure 2 we compare the test error of our defense against the baselines as we increase the fraction $\varepsilon$ of added outliers. To avoid cluttering the figure, we only show the performance of l2, loss, gradientCentered, RANSAC, and Sever; the performance of the remaining baselines is qualitatively similar to the baselines in Figure 2.

For both the baselines and our algorithms, we iterate the defense $r=4$ times, each time removing the $p=\varepsilon/2$ fraction of points with largest score. For consistency of results, for each defense and each value of $\varepsilon$ we ran the defense 3 times on fresh attack points and display the median of the 3 test errors.

When the attack parameters $\alpha$ and $\beta$ are tuned to defeat the baselines (Figure 2 left and center), our defense substantially outperforms the baselines as soon as we cross $\varepsilon\approx 1.5\%$ for synthetic data, and $\varepsilon\approx 5.5\%$ for the drug discovery data. In fact, most of the baselines do worse than not removing any outliers at all (this is because they end up mostly removing good data points, which causes the outliers to have a larger effect). Even when $\alpha$ and $\beta$ are instead tuned to defeat Sever, its resulting error remains small (Figure 2 right).

To understand why the baselines fail to detect the outliers, in Figure 3 we show a representative sample of the histograms of scores of the uncorrupted points overlaid with the scores of the outliers, for both synthetic data and the drug discovery dataset with $\varepsilon=0.1$ , after one run of the base learner. The scores of the outliers lie well within the distribution of scores of the uncorrupted points. Thus, it would be impossible for the baselines to remove them without also removing a large fraction of uncorrupted points.

Interestingly, for small $\varepsilon$ all of the methods improve upon the uncorrupted test error for the drug discovery data; this appears to be due to the presence of a small number of natural outliers in the data that all of the methods successfully remove.

2 Support Vector Machines

In contrast to ridge regression, we did not perform centering and rescaling for these datasets as it did not seem to have a large effect on results.

In all experiments for this section, each method removed the top $p=\frac{n_{-}+n_{+}}{\min\{n_{+},n_{-}\}}\cdot\frac{\varepsilon}{r}$ of highest-scoring points for each of $r=2$ iterations, where $n_{+}$ and $n_{-}$ are the number of positive and negative training points respectively. This expression for $p$ is chosen in order to account for class imbalance, which is extreme in the case of the Enron dataset – if the attacker plants all the outliers in the smaller class, then a smaller value of $p$ would remove too few points, even with a perfect detection method.

We considered fractions of outliers ranging from $\varepsilon=0.005$ to $\varepsilon=0.03$ . By performing a sweep across hyperparameters of the attack, we generated $56$ distinct sets of attacks for each value of $\varepsilon$ . In Figure 4, we show results for the attack where the loss baselines does the worst, as well as for the attack where our method does the worst. When attacks are most effective against loss, Sever substantially outperforms it, nearly matching the test accuracy of $5.8\%$ on the uncorrupted data, while loss performs worse than $30\%$ error at just a $1.5\%$ fraction of injected outliers. Even when attacks are most effective against Sever, it still outperforms loss, achieving a test error of at most $9.05\%$ . We note that other baselines behaved qualitatively similarly to loss, and the results are displayed in Section F.

For results on Enron, we used the same values of $\varepsilon$ , and considered $96$ distinct hyperparameters for the attack. There was not a single attack that simultaneously defeated all of the baselines, so in Figure 5 we show two attacks that do well against different sets of baselines, as well as the attack that performs best against our method.

At $\varepsilon=0.01$ , the worst performance of our method against all attacks was $7.34\%$ , in contrast to $13.43\%-20.48\%$ for the baselines (note that the accuracy is $3\%$ in the absence of outliers). However, at $\varepsilon=0.03$ , while we still outperform the baselines, our error is relatively large— $13.53\%$ .

To investigate this further, we looked at all $48$ attacks and found that while on $42$ out of $48$ attacks our error never exceeded $7\%$ , on $6$ of the attacks (including the attack in Figure 5) the error was substantially higher. Figure 6 shows what is happening. The leftmost figure displays the scores assigned by Sever after the first iteration, where red bars indicate outliers. While some outliers are assigned extremely large scores and thus detected, several outliers are correctly classified and thus have 0 gradient. However, once we remove the first set of outliers, some outliers which were previously correctly classified now have large score, as displayed in the middle figure. Another iteration of this process produces the rightmost figure, where almost all the remaining outliers have large score and will thus be removed by Sever. This demonstrates that some outliers may be hidden until other outliers are removed, necessitating multiple iterations.

Motivated by this, we re-ran our method against the $6$ attacks using $r=3$ iterations instead of $2$ (and decreasing $p$ as per the expression above). After this change, all $6$ of the attacks had error at most $7.4\%$ .

Discussion

In this paper we have presented an algorithm, Sever, that has both strong theoretical robustness properties in the presence of outliers, and performs well on real datasets. Sever is based on the idea that learning can often be cast as the problem of finding an approximate stationary point of the loss, which can in turn be cast as a robust mean estimation problem, allowing us to leverage existing techniques for efficient robust mean estimation.

There are a number of directions along which Sever could be improved: first, it could be extended to handle more general assumptions on the data; second, it could be strengthened to achieve better error bounds in terms of the fraction of outliers; finally, one could imagine automatically learning a feature representation in which Sever performs well. We discuss each of these ideas in detail below.

The main underlying assumption on which Sever rests is that the top singular value of the gradients of the data is small. While this appeared to hold true on the datasets we considered, a common occurence in practice is for there to be a few large singular values, together with many small singular values. It would therefore be desirable to design a version of Sever that can take advantage of such phenomena. In addition, it would be worthwhile to do a more detailed empirical analysis across a wide variety of datasets investigating properties that can enable robust estimation (the notion of resilience in [SCV18] could provide a template for finding such properties).

In theory, Sever has a $O(\sqrt{\varepsilon})$ dependence in error on the fraction $\varepsilon$ of outliers (see Theorem 2.1). While without stronger assumptions this is likely not possible to improve, in practice we would prefer to have a dependence closer to $O(\varepsilon)$ . Therefore, it would also be useful to improve Sever to have such an $O(\varepsilon)$ -dependence under stronger but realistic assumptions. Unfortunately, all existing algorithms for robust mean estimation that achieve error better than $O(\sqrt{\varepsilon})$ either rely on strong distributional assumptions such as Gaussianity [DKK+16, LRV16], or else require expensive computation involving e.g. sum-of-squares optimization [HL17, KS17a, KS17b]. Improving the robustness of Sever thus requires improvements on the robust mean estimation algorithm that Sever uses as a primitive.

Finally, we note that Sever performs best when the features have small covariance and strong predictive power. One situation in particular where this holds is when there are many approximately independent features that are predictive of the true signal.

It would be interesting to try to learn a representation with such a property. This could be done, for instance, by training a neural network with some cost function that encourages independent features (some ideas along these general lines are discussed in [Ben17]). An issue is how to learn such a representation robustly; one idea is learn a representation on a dataset that is known to be free of outliers, and hope that the representation is useful on other datasets in the same application domain. Beyond these specific questions, we view the general investigation of robust methods (both empirically and theoretically) as an important step as machine learning moves forwards. Indeed, as machine learning is applied in increasingly many situations and in increasingly automated ways, it is important to attend to robustness considerations so that machine learning systems behave reliably and avoid costly errors. While the bulk of recent work has highlighted the vulnerabilities of machine learning (e.g. [SZS+14, LWSV16, SKL17, EEF+18, CLL+17]), we are optimistic that practical algorithms backed by principled theory can finally patch these vulnerabilities and lead to truly reliable systems.

References

Appendix

The appendix is organized as follows. In Section A, we describe additional preliminaries required for our technical arguments. In Section B, we analyze our main algorithm, Sever. In Section C, we specialize our analysis to the important case of Generalized Linear Models (GLMs). In Section D, we describe a variant of our algorithm which performs robust filtering on each iteration of projected gradient descent, and works under more general assumptions. In Section E, we describe concrete applications of Sever – in particular, how it can be used to robustly optimize in the settings of linear regression, logistic regression, and support vector machines. Finally, in Section F, we provide additional plots from our experimental evaluations.

Appendix A Preliminaries

In this section, we formally introduce our setting for robust stochastic optimization.

In addition, some of our bounds will make use of the following quantities:

The strong convexity parameter $\xi$ of $\overline{f}$ , if it exists. This is the maximal $\xi$ such that $\overline{f}(w)\geq\overline{f}(w_{0})+\langle w-w_{0},\nabla\overline{f}(w_{0})\rangle+\frac{\xi}{2}\|w-w_{0}\|_{2}^{2}$ for all $w,w_{0}\in\mathcal{H}$ .

The strong smoothness parameter $\beta$ of $\overline{f}$ , if it exists. This is the minimal $\beta$ such that $\overline{f}(w)\leq\overline{f}(w_{0})+\langle w-w_{0},\nabla\overline{f}(w_{0})\rangle+\frac{\beta}{2}\|w-w_{0}\|_{2}^{2}$ for all $w,w_{0}\in\mathcal{H}$ .

The Lipschitz constant $L$ of $\overline{f}$ , if it exists. This is the minimal $L$ such that $\overline{f}(w)-\overline{f}(w_{0})\leq L\|w-w_{0}\|_{2}$ for all $w,w_{0}\in\mathcal{H}$ .

Appendix B General Analysis of Sever

This section is dedicated to the analysis of Algorithm 1, where we do not make convexity assumptions about the underlying functions $f_{1},\ldots,f_{n}$ . In this case, we can show that our algorithm finds an approximate critical point of $\overline{f}$ . When we specialize to convex functions, this immediately implies that we find an approximate minimal point of $\overline{f}$ .

Our proof proceeds in two parts. First, we define a set of deterministic conditions under which our algorithm finds an approximate minimal point of $\overline{f}$ . We then show that, under mild assumptions on our functions, this set of deterministic conditions holds with high probability after polynomially many samples.

For completeness, we recall the definitions of a $\gamma$ -approximate critical point and a $\gamma$ -approximate learner:

We first explicitly demonstrate a set of deterministic conditions on the (uncorrupted) data points. Our deterministic regularity conditions are as follows:

In Section B.1, we prove the following theorem, which shows that under Assumption B.1 our algorithm succeeds:

Observe that the above theorem holds quite generally; in particular, it holds for non-convex functions. As a corollary of this theorem, in Section B.2 we show that this immediately implies that Sever robustly minimizes convex functions, if Assumption B.1 holds:

If $\overline{f}$ is convex, the algorithm finds a $w\in\mathcal{H}$ such that $\overline{f}(w)-\overline{f}(w^{*})=O((\sigma_{0}r+\sigma_{1}r^{2})\sqrt{\varepsilon}+\gamma r)$ .

If $\overline{f}$ is $\xi$ -strongly convex, the algorithm finds a $w\in\mathcal{H}$ such that

In the strongly convex case and when $\sigma_{1}>0$ , we can remove the dependence on $\sigma_{1}$ and $r$ in the above by repeatedly applying Sever with decreasing $r$ :

using at most $O(\log(r\xi/(\gamma+\sigma_{0}\sqrt{\varepsilon})))$ calls to Sever.

To concretely use Theorem B.2, Corollary B.3, and Corollary B.4, in Section B.4 we show that the Assumption B.1 is satisfied with high probability under mild conditions on the distribution over the functions, after drawing polynomially many samples:

an $\varepsilon$ -corrupted set of points $f_{1},\ldots,f_{n}$ with high probability satisfy Assumption B.1.

The remaining subsections are dedicated to the proofs of Theorem B.2, Corollary B.3, Corollary B.4, and Proposition B.5.

B.1 Proof of Theorem B.2

If the samples satisfy (1) of Assumption B.1, and if $|S|\geq 2n/3$ then if $S^{\prime}$ is the output of $\textsc{Filter}(S,\tau,\sigma)$ , we have that

If the samples satisfy Assumption B.1, $\textsc{Filter}(S,\tau,\sigma)=S$ , and $n-|S|\leq 11\varepsilon n$ , then

Before we prove these lemmata, we show how together they imply Theorem B.2.

First, we note that the algorithm must terminate in at most $n$ iterations. This is easy to see as each iteration of the main loop except for the last must decrease the size of $S$ by at least $1$ .

Thus it suffices to prove these two lemmata. We first prove Lemma B.6:

This is the supremum over unit vectors $v$ of

and therefore, $\delta=O(\sigma\sqrt{\varepsilon})$ as desired. ∎

B.2 Proof of Corollary B.3

In this section, we show that the Sever algorithm finds an approximate global optimum for convex optimization in various settings, under Assumption B.1. We do so by simply applying the guarantees of Theorem B.2 in a fairly black box manner.

Before we proceed with the proof of Corollary B.3, we record a simple lemma that allows us to translate an approximate critical point guarantee to an approximate global optimum guarantee:

If $f$ is $\xi$ -strongly convex, then $|f(x)-f(y)|\leq 2\delta^{2}/\xi$ and $\|x-y\|_{2}\leq 2\delta/\xi$ .

Let $r=\|x-y\|_{2}>0$ and $g(t)=f(x+tv)$ . We have that $g(0)=f(x),g(r)=f(y)$ and that $g$ is convex (or $\xi$ -strongly convex) with $g^{\prime}(0)\geq-\delta$ and $g^{\prime}(r)\leq\delta$ . By convexity, the derivative of $g$ is increasing on $[0,r]$ and therefore $|g^{\prime}(t)|\leq\delta$ for all $t\in[0,r]$ . This implies that

To show the second part of the lemma, we note that if $g$ is $\xi$ -strongly convex that $g^{\prime\prime}(t)\geq\xi$ for all $t$ . This implies that $g^{\prime}(r)>g^{\prime}(0)+\xi r$ . Since $g^{\prime}(r)-g^{\prime}(0)\leq 2\delta$ , we obtain that $r\leq 2\delta/\xi$ , from which the second statement follows. ∎

By applying the algorithm of Theorem B.2, we can find a point $w$ that is a $\gamma^{\prime}\stackrel{{\scriptstyle\text{def}}}{{=}}(\gamma+O(\sigma\sqrt{\varepsilon}))$ -approximate critical point of $\overline{f}$ , where $\sigma\stackrel{{\scriptstyle\text{def}}}{{=}}\sigma_{0}+\sigma_{1}\|w^{\ast}-w\|_{2}$ . That is, for any unit vector $v$ pointing towards the interior of $\mathcal{H}$ , we have that $v\cdot\nabla\overline{f}(w)\geq-\gamma^{\prime}.$

To prove (i), we apply Lemma B.8 to $\overline{f}$ at $w$ which gives that

To prove (ii), we apply Lemma B.8 to $\overline{f}$ at $w$ which gives that

Plugging in parameters appropriately then immediately gives the desired bound. ∎

B.3 Proof of Corollary B.4

We apply Sever iteratively starting with a domain $\mathcal{H}_{1}=\mathcal{H}$ and radius $r_{1}=r$ . After each iteration, we know the resulting point is close to $w^{\ast}$ will be able to reduce the search radius.

At step $i$ , we have a domain of radius $r_{i}$ . As in the proof of Corollary B.3 above, we apply algorithm of Theorem B.2, we can find a point $w_{i}$ that is a $\gamma^{\prime}_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}(\gamma+O(\sigma^{\prime}_{i}\sqrt{\varepsilon}))$ -approximate critical point of $\overline{f}$ , where $\sigma^{\prime}_{i}\stackrel{{\scriptstyle\text{def}}}{{=}}\sigma_{0}+\sigma_{1}r_{i}$ . Then using Lemma B.8, we obtain that $\|w_{i}-w^{\ast}\|_{2}\leq 2\gamma^{\prime}_{i}/\xi$ .

Now we can define $\mathcal{H}_{i+1}$ as the intersection of $\mathcal{H}$ and the ball of radius $r_{i+1}=2\gamma^{\prime}_{i}/\xi$ around $w_{i}$ and repeat using this domain. We have that $r_{i+1}=2\gamma^{\prime}_{i}/\xi=2\gamma/\xi+O(\sigma_{0}\sqrt{\varepsilon}/\xi+\sigma_{1}\sqrt{\varepsilon}r_{i}/\xi)$ . Now if we choose the constant $C$ such that the constant in this $O()$ is $C/4$ , then using our assumption that $\xi\geq 2\sigma_{1}\sqrt{\varepsilon}$ , we obtain that

Now if $r_{i}\geq 8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi$ , then we have $r_{i+1}\leq r_{i}/2$ and if $r_{i}\leq 8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi$ then we also have $r_{i+1}\leq 8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi$ . When $r_{i}$ is smaller than this we stop and output $w_{i}$ . Thus we stop in at most $O(\log(r)-\log(8\gamma/\xi+2C\sigma_{0}\sqrt{\varepsilon}/\xi))=O(\log(r\xi/(\gamma+\sigma_{0}\sqrt{\varepsilon}))$ iterations and have $r_{i}=O(\gamma/\xi+C\sigma_{0}\sqrt{\varepsilon})$ . But then $\gamma^{\prime}_{i}=\gamma+O(\sigma^{\prime}_{i}\sqrt{\varepsilon}))\leq\gamma+C(\sigma_{0}+\sigma_{1}r^{\prime}_{i})\sqrt{\varepsilon}/8=O(\gamma+\sigma_{0}\sqrt{\varepsilon}).$ Using Lemma B.8 we obtain that

as required. The bound on $\|\widehat{w}-w^{*}\|_{2}$ follows similarly.

While we don’t give explicit bounds on the number of calls to the approximate learner needed by Sever, such bounds can be straightforwardly obtained under appropriate assumptions on the $f_{i}$ (see, e.g., the following subsection). Two remarks are in order. First, in this case we cannot take advantage of assumptions that only hold at $\overline{f}$ but might not on the corrupted average $f$ . Second, our algorithm can take advantage of a closed form for the minimum. For example, for the case of linear regression considered in Section E, $f_{i}$ is not Lipschitz with a small constant if $x_{i}$ is far from the mean, but there is a simple closed form for the minimum of the least squares loss.

B.4 Proof of Proposition B.5

We will proceed by a cover argument. First we claim that for each $w\in\mathcal{H}$ that (3) and (4) hold with high probability. For Equation (3), it suffices to show that for each unit vector $v$ in a cover $\mathcal{N}$ of size $2^{O(d)}$ of the sphere that

Since $|v\cdot(\nabla f(w)-\overline{f})|$ is always bounded by $L$ , Equation (5) holds for each $v,w$ with probability at least $1-\exp(-\Omega(n\sigma^{2}/L^{2}))$ by a Chernoff bound (noting that the removal of an $\varepsilon$ -fraction of points cannot increase this by much). Similarly, to show Equation 4, it suffices to show that for each such $v$ that

A Chernoff bound implies that with probability $1-\exp(-\Omega(n\sigma^{2}\varepsilon/L^{2}))$ that the average over our original set of $f$ ’s of $(v\cdot(\nabla f(w)-\overline{f}))$ is $O(\sigma\sqrt{\varepsilon})$ . Assuming that Equation (5) holds, removing an $\varepsilon$ -fraction of these $f$ ’s cannot change this value by more than $O(\sigma\sqrt{\varepsilon})$ . By union bounding over $\mathcal{N}$ and standard net arguments, this implies that Equations (3) and (4) hold with probability $1-\exp(\Omega(d-n\sigma^{2}\varepsilon/L^{2}))$ for any given $w$ .

To show that our conditions hold for all $w\in\mathcal{H}$ , we note that by $\beta$ -smoothness, if Equation (4) holds for some $w$ , it holds for all other $w^{\prime}$ in a ball of radius $\sqrt{\sigma^{2}\varepsilon}/\beta$ (up to a constant multiplicative loss). Similarly, if Equation (3) holds at some $w$ , it holds with bound $\sigma^{2}I$ for all $w^{\prime}$ in a ball of radius $\sigma^{2}/(2L\beta)$ . Therefore, if Equations (3) and (4) hold for all $w$ in a $\min(\sqrt{\sigma^{2}\varepsilon}/\beta,\sigma/(2L\beta))$ -cover of $\mathcal{H}$ , the assumptions of Theorem B.2 will hold everywhere. Since we have such covers of size $\exp(O(d\log(r\beta L/(\sigma^{2}\varepsilon))))$ , by a union bound, this holds with high probability if

Appendix C Analysis of Sever for GLMs

A case of particular interest is that of Generalized Linear Models (GLMs):

A sample from this GLM is given by $f_{X,Y}(w)$ where $(X,Y)\sim D_{xy}$ .

Our goal, as usual, is to approximately minimize $\overline{f}$ given $\varepsilon$ -corrupted samples from $D_{xy}$ . Throughout this section we assume that $\mathcal{H}$ is contained in the ball of radius $r$ around , i.e. $\mathcal{H}\subseteq B(0,r)$ . Moreover, we will let $w^{*}=\operatorname*{arg\,min}_{w\in\mathcal{H}}\overline{f}(w)$ be a minimizer of $\overline{f}$ in $\mathcal{H}$ .

This case covers a number of interesting applications, including SVMs and logistic regression. Unfortunately, the tools developed in Appendix B do not seem to be able to cover this case in a simple manner. In particular, it is unclear how to demonstrate that Assumption B.1 holds after taking polynomially many samples from a GLM. To rectify this, in this section, we demonstrate a different deterministic regularity condition under which we show Sever succeeds, and we show that this condition holds after polynomially many samples from a GLM. Specifically, we will show that Sever succeeds under the following deterministic condition:

Equation (1) holds with $\sigma_{1}=0$ and the same $\sigma_{0}$ , and

In this section, we will show the following two statements. The first demonstrates that Assumption C.1 implies that Sever succeeds, and the second shows that Assumption C.1 holds after polynomially many samples from a GLM. Formally:

If the link functions are $\xi$ -strongly convex, the algorithm finds a $w\in\mathcal{H}$ such that

Suppose moreover that the following conditions all hold:

$|\sigma_{Y}(0)|\leq 1$ for all $Y\in\mathcal{Y}$ .

Then with probability at least $9/10$ over the original set of samples, there is a set of $(1-\varepsilon)n$ of the $f_{i}$ that satisfy Assumption C.1 on $\mathcal{H}$ with $\sigma_{0}=2$ , $\sigma_{1}=0$ and $\sigma_{2}=1+r$ .and $\sigma_{2}=1+r$ .

As before, since Sever either terminates or throws away at least one sample, clearly it cannot run for more than $n$ iterations. Thus the runtime bound is simple, and it suffices to show correctness.

Let $f_{1},\ldots,f_{n}$ satisfy Assumption C.1. Then with probability at least $9/10$ , Sever applied to $f_{1},\ldots,f_{n},\sigma_{0}$ returns a point $w\in\mathcal{H}$ which is a $(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))$ -approximate critical point of $\hat{f}$ .

so they satisfy (1), since the RHS is bounded by Assumption C.1. Thus this lemma follows from an application of Theorem B.2. ∎

With this critical lemma in place, we can now prove Theorem C.2:

Condition on the event that Lemma C.4 holds, and let $w\in\mathcal{H}$ be the output of Sever. By Assumption C.1, we know that $\hat{f}(w^{*})\geq\overline{f}(w^{*})-\sigma_{2}\sqrt{\varepsilon}$ , and moreover, $w^{*}$ is a $\gamma+\sigma_{0}\sqrt{\varepsilon}$ -approximate critical point of $\hat{f}$ .

Since each link function is convex, so is $\hat{f}$ . Hence, by Lemma B.8, since $w$ is a $(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))$ -approximate critical point of $\hat{f}$ , we have $\hat{f}(w)-\hat{f}(w^{*})\leq r(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))$ . By Assumption B.1, this immediately implies that $\overline{f}(w)-\overline{f}(w^{*})\leq r(\gamma+O(\sigma_{0}\sqrt{\varepsilon}))+O(\sigma_{2}\sqrt{\varepsilon})$ , as claimed.

The bound for strongly convex functions follows from the exact argument, except using the statement in Lemma B.8 pertaining to strongly convex functions. ∎

C.2 Proof of Proposition C.3

We first note that $\nabla f_{X,Y}(w)=X\sigma_{Y}^{\prime}(w\cdot X).$ Thus, under Assumption C.1, we have for any $v$ that

In particular, since this last expression is independent of $w$ , we only need to check this single matrix bound.

Thus, for $n$ at least a sufficiently large multiple of $d\log(dr/\varepsilon)/\varepsilon$ , this holds for all $w$ in our cover of $\mathcal{H}$ with high probability. This completes the proof. ∎

Appendix D An Alternative Algorithm: Robust Filtering in Each Iteration

In this section, we describe another algorithm for robust stochastic optimization. This algorithm uses standard robust mean estimation techniques to compute approximate gradients pointwise, which it then feeds into a standard projective gradient descent algorithm. This algorithm in practice turns out to be somewhat slower than the one employed in the rest of this paper, because it employs a filtering algorithm at every step of the projective gradient descent, and does not remember which points were filtered between iterations. On the other hand, we present this algorithm for two reasons. Firstly, because it is a conceptually simpler interpretation of the main ideas of this paper, and secondly, because the algorithm works under somewhat more general assumptions. In particular, this algorithm only requires that for each $w\in\mathcal{H}$ that there is a corresponding good set of functions, rather than that there exists a single good set that works simultaneously for all $w$ .

In particular, we can make do with the following somewhat weaker assumption:

We make essential use of the following result, which appears in both [DKK+17, SCV18]:

Our general robust algorithm for stochastic optimization will make calls to Algorithm ${\cal A}$ in a black-box manner, as well as to the projection operator onto $\mathcal{H}$ . We will measure the cost of our algorithm by the total number of such calls.

While it is not needed for the theoretical results established in this subsection, we note that the robust mean estimation algorithm of [DKK+17] relies on an iterative outlier removal method only requiring basic eigenvalue computations (SVD), while the [SCV18] algorithm employs semidefinite programming. In our experiments, we use the algorithm in [DKK+17] and variants thereof.

Using the above black-box, together with known results on convex optimization with errors, we obtain the following meta-theorem:

We note that by applying Algorithm $\mathcal{A}$ on $\{\nabla f_{i}(w)\}$ , we can find an approximation to $\nabla\overline{f}(w)$ with error $O(\sigma\sqrt{\varepsilon})$ . We note that standard projective gradient descent algorithms can be made to run efficiently even if the gradients given are only approximate, and this can be used to find our $O(\sigma\sqrt{\varepsilon})$ -approximate critical point. ∎

In this section we give a brief comparison of this algorithm to Sever. The algorithm presented in this section is much simpler to state, and also requires weaker conditions on the data. However, because the algorithms work in somewhat different settings, the comparison is a bit delicate, so we explain in more detail below.

There is a major conceptual difference between these two algorithms: namely, Sever works with a black-box non-robust learner, and requires the filter algorithm for robust mean estimation. In contrast, the algorithm in this section works with a black-box robust mean estimation algorithm, and plugs it into a specific non-robust learning algorithm, specifically (approximate) stochastic gradient descent. When instantiated with the same primitives, these algorithms have similar theoretical runtime guarantees.

However, in the practical implementation, we prefer Sever for a couple of reasons. First, in practice we find that in practice, Sever often only requires a constant number of runs of the base black-box learner, and so incurs only a constant factor overhead. In contrast, the algorithm presented in this section requires at least linear time per iteration of SGD, since it needs to run a robust mean estimation algorithm on the entire dataset (and the total number of iterations needed is comparable). In contrast, SGD typically runs in constant time per iteration, so this presents a major bottleneck for scalability.

Second, we find it is much more useful from a practical point of view to allow for black-box non-robust learners, than to allow for black-box robust mean estimation algorithms. This is simply because the former allows Sever to be much more problem-specific, and allow for optimizations for each individual learning problem. For instance, it is what allows us to use optimized libraries in our experiments for linear regression and SVM. In contrast, there is relatively little reward to allow for black-box robust mean estimation algorithms, as not only are there relatively few options, but also this does not allow us really to specialize the algorithm to the problem at hand.

Appendix E Applications of the General Algorithm

In this section, we present three concrete applications of our general robust algorithm. In particular, we describe how to robustly optimize models for linear regression, support vector machines, and logistic regression, in Sections E.1, E.2, E.3, respectively.

Given the model for linear regression described above, assume the following conditions for $D_{e}$ and $D_{x}$ :

$\operatorname{Var}_{e\sim D_{e}}[e]\leq\xi$ ;

Our main result for linear regression is the following:

Let $\varepsilon>0$ , and let $D_{xy}$ be a distribution over pairs $(X,Y)$ which satisfies the conditions of Assumption E.1. Suppose we are given $O\left(\frac{d^{5}}{\varepsilon^{2}}\right)$ $\varepsilon$ -noisy samples from $D_{xy}$ . Then in either of the following two cases, there exists an algorithm that, with probability at least $9/10$ , produces a $\widehat{w}$ with the following guarantees:

If $\|w^{*}\|_{2}\leq r$ , then $\overline{f}(\widehat{w})\leq\overline{f}(w^{*})+O(((\sqrt{\xi}+\sqrt{\varepsilon})r+\sqrt{C}r^{2})\sqrt{\varepsilon})$ .

The proof will follow from two lemmas (proved in Section E.1.1 and E.1.2, respectively). First, we will bound the covariance of the gradient, in Lemma E.3:

With this in hand, we can prove Lemma E.4, giving us a polynomial sample complexity which is sufficient to satisfy the conditions of Assumption B.1.

Suppose $D_{xy}$ satisfies the conditions of Assumption E.1. Given $O(d^{5}/\varepsilon^{2})$ $\varepsilon$ -noisy samples from $D_{xy}$ , then with probability at last $9/10$ , they satisfy Assumption B.1 with parameters $\sigma_{0}=30\sqrt{\xi}+\sqrt{\varepsilon}$ and $\sigma_{1}=18\sqrt{C+1}$ .

The proof concludes by applying Corollary B.4 or case (i) of Corollary B.3 for the first and second cases respectively.

Note that for this setting we have that $f(w,z)=f(w,x,y)=(y-\langle w,x\rangle)^{2}$ . We then have that $\nabla_{w}f(w,z)=-2(\langle w^{\ast}-w,x\rangle+e)x$ . Our main claim is the following:

where we again used the fact that the noise $e$ is independent of $x$ and its expectation is zero.

Given the above claim, we can bound from above the spectral norm of the covariance matrix of the gradients as follows: Specifically, for a unit vector $v$ , the quantity $v^{T}\operatorname{Cov}[\nabla_{w}f(w,z)]v$ is bounded from above by a constant times the following quantities:

The second term is at most the upper bound of the variance of the noise $\xi$ times $\sigma^{2}$ .

The third term is at most $v^{T}\Sigma(w^{\ast}-w)(w^{\ast}-w)^{T}\Sigma v$ , which by our bounded covariance assumption is at most $\sigma^{4}\|w^{\ast}-w\|_{2}^{2}.$

This gives the parameters in the meta-theorem.

E.1.2 Proof of Lemma E.4

We will repeatedly make use of the following, which bounds how much removing points or probability mass affects an expectation in terms of its variance:

We can thus take $\sigma_{0}=30\sqrt{\xi}+\sqrt{\varepsilon}$ and $\sigma_{1}=18\sqrt{C+1}\geq 16\sqrt{C}$ to get (2).

To get both (2) and (1) hold with $\sigma_{0}=30\sqrt{\xi}+\sqrt{\varepsilon}$ and $\sigma_{1}=18\sqrt{C+1})$ . This happens with probability at least $9/10$ by a union bound on the probabilistic assumptions above.

E.2 Support Vector Machines

In this section, we demonstrate how our results apply to learning support vector machines (i.e., halfspaces under hinge loss). In particular, we describe how SVMs fit into the GLM framework described in Section C.

One technical point is that $f_{i}$ does not have a gradient everywhere – instead, we will be concerned with the sub-gradients of the $f_{i}$ ’s. All our results which operate on the gradients also work for sub-gradients. To be precise, we will take the sub-gradient to be when the gradient is undefined:

Let $\nabla f_{i}$ be the sub-gradient of $f_{i}(w)$ with respect to $w$ , where $\nabla f_{i}=-y_{i}x_{i}$ if $y_{i}(w\cdot x_{i})<1$ , and otherwise.

To get a bound on the error of hinge loss, we will need to assume the marginal distribution $D_{x}$ is anti-concentrated.

A distribution is $\delta$ -anticoncentrated if at most an $O(\delta)$ -fraction of its probability mass is within Euclidean distance $\delta$ of any hyperplane.

Given the model for SVMs as described above, assume the following conditions for the marginal distribution $D_{x}$ :

$D_{x}$ is $\varepsilon^{1/4}$ -anticoncentrated.

Our main result on SVMs is the following:

Let $\varepsilon>0$ , and let $D_{xy}$ be a distribution over pairs $(X,Y)$ , where the marginal distribution $D_{x}$ satisfies the conditions of Assumption E.7. Then there exists an algorithm that with probability $9/10$ , given $O(d\log(d/\varepsilon)/\varepsilon)$ $\varepsilon$ -noisy samples from $D_{xy}$ , returns a $\widehat{w}$ such that for any $w^{\ast}$ ,

Our approach will be to fit this problem into the GLM framework developed in Section C. First, we will restrict our search over $w$ to $\mathcal{H}$ , a ball of radius $r=\varepsilon^{-1/4}$ . As we argue in Lemma E.9, this restriction comes at a cost of at most $O(\varepsilon^{1/4})$ in our algorithm’s loss. With this restriction, we will argue that the problem satisfies the conditions of Proposition C.3. This allows us to argue that, with a polynomial number of samples, we can obtain a set of $f_{i}$ ’s satisfying the conditions of Assumption C.1. This will allow us to apply Theorem C.2, concluding the proof.

We start by showing that, due to anticoncentration of $D$ , there is a $w^{\prime}\in\mathcal{H}$ with loss close to $w^{*}$ :

Otherwise, we break into case analysis, based on the value of $(x,y)$ :

$|w^{\prime}\cdot x|>1$ : If $y(w^{\prime}\cdot x)>1$ , then $L(w^{\prime},(x,y))=L(w^{*},(x,y))=0$ . If $y(w^{\prime}\cdot x)<-1$ , then $L(w^{\prime},(x,y))=1-y(w^{\prime}\cdot x)\leq 1-y(w^{*}\cdot x)=L(w^{*},(x,y))$ . Both cases use the fact that $\|w^{\prime}\|_{2}<\|w^{\ast}\|_{2}$ .

$|w^{\prime}\cdot x|\leq 1$ : In this case, we have that $L(w^{\prime},(x,y))\leq 2$ . Since $L(w^{*},(x,y))\geq 0$ , we have that $L(w^{\prime},(x,y))\leq L(w^{*},(x,y))+2$ .

We first show that this problem fits into the GLM framework, in particular, satisfying the conditions of Proposition C.3. The link function is $\sigma_{y}(t)=\max\{0,1-yt\}$ , giving us the loss function $L(w,(x,y))=\sigma_{y}(w\cdot x)$ . We let $\mathcal{H}$ be the set $\|w\|_{2}\leq\varepsilon^{-1/4}$ , giving us the parameter $r=\varepsilon^{-1/4}$ . Condition 1 is satisfied by Assumption E.7. For $y\in\{-1,1\}$ , $\sigma^{\prime}_{y}(t)=0$ for $yt\geq 1$ and $\sigma^{\prime}_{y}(t)=-y$ for $yt<1$ . Thus we have that $|\sigma^{\prime}_{1}(t)|\leq 1$ for all $t$ and $y$ , satisfying Condition 2. Finally, one can observe that $\sigma_{y}(0)=1$ for all $y$ , satisfying Condition 3. Thus we can apply Proposition C.3: if we take $O(d\log(dr/\varepsilon)/\varepsilon)$ $\varepsilon$ -corrupted samples, then they satisfy Assumption C.1 on $\mathcal{H}$ with $\sigma_{0}=2$ , $\sigma_{1}=0$ and $\sigma_{2}=1+\varepsilon^{-1/4}$ , with probability $9/10$ .

Now we can apply the algorithm of Theorem C.2. Since the loss is convex, we get a vector $\widehat{w}$ with $\overline{f}(\widehat{w})-\overline{f}({w^{*}}^{\prime})=O((\sigma_{0}r+\sigma_{1}r^{2}+\sigma_{2})\sqrt{\varepsilon})=O((2\varepsilon^{-1/4}+\varepsilon^{-1/4})\sqrt{\varepsilon})=O(\varepsilon^{1/4})$ where ${w^{*}}^{\prime}$ is the minimizer of $\overline{f}$ on $\mathcal{H}$ .

We thus have that $\overline{f}(\hat{w})\leq\overline{f}({w^{*}}^{\prime})+O(\varepsilon^{1/4})\leq\overline{f}(w^{\prime})+O(\varepsilon^{1/4})\leq\overline{f}(w^{*})+O(\varepsilon^{1/4})$ . The second inequality follows because ${w^{*}}^{\prime}$ is the minimizer of $\overline{f}$ on $\mathcal{H}$ , and the third inequality follows from Lemma E.9. ∎

E.3 Logistic Regression

In this section, we demonstrate how our results apply to logistic regression. In particular, we describe how logistic regression fits into the GLM framework described in Section C.

The gradient of this function is $\nabla L(w,(x,y))=\frac{1}{2}(\phi(w\cdot x)-\phi(-w\cdot x)-y)x$ . The goal is to find a $\widehat{w}$ approximately minimizing the objective function

Given the model for logistic regression as described above, assume the following conditions for the marginal distribution $D_{x}$ :

$D_{x}$ is $\varepsilon^{1/4}\sqrt{\log(1/\varepsilon)}$ -anticoncentrated.

We can get a similar result to that for hinge loss for logistic regression:

Let $\varepsilon>0$ , and let $D_{xy}$ be a distribution over pairs $(X,Y)$ , where the marginal distribution $D_{x}$ satisfies the conditions of Assumption E.10. Then there exists an algorithm that with probability $9/10$ , given $O(d\log(d/\varepsilon)/\varepsilon)$ $\varepsilon$ -noisy samples from $D_{xy}$ , returns a $\widehat{w}$ such that for any $w^{\ast}$ ,

The approach is very similar to that of Theorem E.8, which we repeat here for clarity. First, we will restrict our search over $w$ to $\mathcal{H}$ , a ball of radius $r=\varepsilon^{-1/4}\sqrt{\log(1/\varepsilon)}$ . As we argue in Lemma E.12, this restriction comes at a cost of at most $O(\varepsilon^{1/4}\sqrt{\log(1/\varepsilon)})$ in our algorithm’s loss. With this restriction, we will argue that the problem satisfies the conditions of Proposition C.3. This allows us to argue that, with a polynomial number of samples, we can obtain a set of $f_{i}$ ’s satisfying the conditions of Assumption C.1. This will allow us to apply Theorem C.2, concluding the proof.

We start by showing that, due to anticoncentration of $D$ , there is a $w^{\prime}\in\mathcal{H}$ with loss close to $w^{*}$ :

Recalling that $\phi=1/(1+\exp(-t))$ , we have that $-\ln(\phi(t)\phi(-t))=\ln(\exp(t)+\exp(-t)+2)$ . Since $\exp(t)+\exp(-t)+2\geq\exp(|t|)$ , we have $|t|\leq-\ln(\phi(t)\phi(-t))$ . On the other hand, $\ln(\exp(t)+\exp(-t)+2)=|t|+\ln(1+2\exp(-|t|)+\exp(-2|t|))\leq|t|+\ln(1+3\exp(-|t|))\leq|t|+3\exp(-|t|)$ . ∎

With this in hand, we can conclude with the proof of Theorem E.11.

We first show that this problem fits into the GLM framework, in particular, satisfying the conditions of Proposition C.3. The link function is $\sigma_{y}(t)=\frac{1}{2}(-\ln(\phi(t)\phi(-t))-yt)$ , giving us the loss function $L(w,(x,y))=\sigma_{y}(w\cdot x)$ . We let $\mathcal{H}$ be the set $\|w\|_{2}\leq\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}$ , giving us the parameter $r=\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}$ . Condition 1 is satisfied by Assumption E.10. For $y\in\{-1,1\}$ , $\sigma^{\prime}_{y}(t)=\frac{1}{2}(\phi(t)-\phi(-t)-y)$ , which gives that $|\sigma^{\prime}_{y}(t)|\leq 1$ for all $t$ and $y$ , satisfying Condition 2. Finally, $\sigma_{y}(0)=\ln 2<1$ for all $y$ , satisfying Condition 3. Thus we can apply Proposition C.3: if we take $O(d\log(dr/\varepsilon)/\varepsilon)$ $\varepsilon$ -corrupted samples, then they satisfy Assumption C.1 on $\mathcal{H}$ with $\sigma_{0}=2$ , $\sigma_{1}=0$ and $\sigma_{2}=1+\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}$ , with probability $9/10$ .

Now we can apply the algorithm of Theorem C.2. Since the loss is convex, we get a vector $\widehat{w}$ with $\overline{f}(\widehat{w})-\overline{f}({w^{*}}^{\prime})=O((\sigma_{0}r+\sigma_{1}r^{2}+\sigma_{2})\sqrt{\varepsilon})=O((2\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)}+\varepsilon^{-1/4}\sqrt{\ln(1/\varepsilon)})\sqrt{\varepsilon})=O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)})$ where ${w^{*}}^{\prime}$ is the minimizer of $\overline{f}$ on $\mathcal{H}$ .

We thus have that $\overline{f}(\hat{w})\leq\overline{f}({w^{*}}^{\prime})+O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)})\leq\overline{f}(w^{\prime})+O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)})\leq\overline{f}(w^{*})+O(\varepsilon^{1/4}\sqrt{\ln(1/\varepsilon)})$ . The second inequality follows because ${w^{*}}^{\prime}$ is the minimizer of $\overline{f}$ on $\mathcal{H}$ , and the third inequality follows from Lemma E.12. ∎

Appendix F Additional Experimental Results

In this section, we provide additional plots of our experimental results, comparing with all baselines considered.