Certified Defenses for Data Poisoning Attacks
Jacob Steinhardt, Pang Wei Koh, Percy Liang
Introduction
Traditionally, computer security seeks to ensure a system’s integrity against attackers by creating clear boundaries between the system and the outside world (Bishop, 2002). In machine learning, however, the most critical ingredient of all–the training data–comes directly from the outside world. For a system trained on user data, an attacker can inject malicious data simply by creating a user account. Such data poisoning attacks require us to re-think what it means for a system to be secure.
The focus of the present work is on data poisoning attacks against classification algorithms, first studied by Biggio et al. (2012) and later by a number of others (Xiao et al., 2012; 2015b; Newell et al., 2014; Mei and Zhu, 2015b; Burkard and Lagesse, 2017; Koh and Liang, 2017). This body of work has demonstrated data poisoning attacks that can degrade classifier accuracy, sometimes dramatically. Moreover, while some defenses have been proposed against specific attacks (Laishram and Phoha, 2016), few have been stress-tested against a determined attacker.
Are there defenses that are robust to a large class of data poisoning attacks? At development time, one could take a clean dataset and test a defense against a number of poisoning strategies on that dataset. However, because of the near-limitless space of possible attacks, it is impossible to conclude from empirical success alone that a defense that works against a known set of attacks will not fail against a new attack.
In this paper, we address this difficulty by presenting a framework for studying the entire space of attacks against a given defense. Our framework applies to defenders that (i) remove outliers residing outside a feasible set, then (ii) minimize a margin-based loss on the remaining data. For such defenders, we can generate approximate upper bounds on the efficacy of any data poisoning attack, which hold modulo two assumptions—that the empirical train and test distribution are close together, and that the outlier removal does not significantly change the distribution of the clean (non-poisoned) data; these assumptions are detailed more formally in Section 3. We then establish a duality result for our upper bound, and use this to generate a candidate attack that nearly matches the bound. Both the upper bound and attack are generated via an efficient online learning algorithm.
We consider two different instantiations of our framework: first, where the outlier detector is trained independently and cannot be affected by the poisoned data, and second, where the data poisoning can attack the outlier detector as well. In both cases we analyze binary SVMs, although our framework applies in the multi-class case as well.
In the first setting, we apply our framework to an “oracle” defense that knows the true class centroids and removes points that are far away from the centroid of the corresponding class. While previous work showed successful attacks on the MNIST-1-7 (Biggio et al., 2012) and Dogfish (Koh and Liang, 2017) image datasets in the absence of any defenses, we show (Section 4) that no attack can substantially increase test error against this oracle—the -error of an SVM on either dataset is at most against any of the attacks we consider, even after adding poisoned data.We note Koh and Liang’s attack on Dogfish targets specific test images rather than overall test error. Moreover, we provide certified upper bounds of and , respectively, on the two datasets. On the other hand, on the IMDB sentiment corpus (Maas et al., 2011) our attack increases classification test error from to with only poisoned data, showing that defensibility is very dataset-dependent: the high dimensionality and abundance of irrelevant features in the IMDB corpus give the attacker more room to construct attacks that evade outlier removal.
For the second setting, we consider a more realistic defender that uses the empirical (poisoned) centroids. For small amounts of poisoned data () we can still certify the resilience of MNIST-1-7 and Dogfish (Section 5). However, with more () poisoned data, the attacker can subvert the outlier removal to obtain stronger attacks, increasing test error on MNIST-1-7 to —much higher than the upper bound of for the oracle defense. In other words, defenses that rely on the (potentially poisoned) data can be much weaker than their data-independent counterparts, underscoring the need for outlier removal mechanisms that are themselves robust to attack.
Problem Setting
We consider the causative attack model (Barreno et al., 2010), which consists of a game between two players: the defender (who seeks to learn a model ), and the attacker (who wants the learner to learn a bad model). The game proceeds as follows:
data points are drawn from to produce a clean training dataset .
The attacker adaptively chooses a “poisoned” dataset of poisoned points, where parametrizes the attacker’s resources.
The defender trains on the full dataset to produce a model , and incurs test loss .
The defender’s goal is to minimize the quantity while the attacker’s goal is to maximize it.
We assume the attacker has full knowledge of the defender’s algorithm and of the clean training data . While this may seem generous to the attacker, it is widely considered poor practice to rely on secrecy for security (Kerckhoffs, 1883; Biggio et al., 2014a); moreover, a determined attacker can often reverse-engineer necessary system details (Tramèr et al., 2016).
The causative attack model allows the attacker to add points but not modify existing ones. Indeed, systems constantly collect new data (e.g., product reviews, user feedback on social media, or insurance claims), whereas modification of existing data would require first compromising the system.
Attacks that attempt to increase the overall test loss , known as indiscriminate availability attacks (Barreno et al., 2010), can be thought of as a denial-of-service attack. This is in contrast to targeted attacks on individual examples or sub-populations (e.g., Burkard and Lagesse, 2017). Both have serious security implications, but we focus on denial-of-service attacks, as they compromise the model in a broad sense and interfere with fundamental statistical properties of learning algorithms.
1 Data Sanitization Defenses
A defender who trains naïvely on the full (clean + poisoned) data is doomed to failure, as even a single poisoned point can in some cases arbitrarily change the model (Liu and Zhu, 2016; Park et al., 2017). In this paper, we consider data sanitization defenses (Cretu et al., 2008), which examine the full dataset and try to remove the poisoned points, for example by deleting outliers. Formally, the defender constructs a feasible set and trains only on points in :
Given such a defense , we would like to upper bound the worst possible test loss over any attacker (choice of )—in symbols, . Such a bound would certify that the defender incurs at most some loss no matter what the attacker does. We consider two classes of defenses:
Fixed defenses, where does not depend on . One example for text classification is letting be documents that contain only licensed words (Newell et al., 2014). Other examples are oracle defenders that depend on the true distribution . While such defenders are not implementable in practice, they provide bounds: if even an oracle can be attacked, then we should be worried.
Data-dependent defenses, where depends on . These defenders try to estimate from and thus are implementable in practice. However, they open up a new line of attack wherein the attacker chooses the poisoned data to change the feasible set .
Here are thresholds (e.g. chosen so that 30% of the data is removed). Note that both defenses are oracles ( depends on ); in Section 5, we consider versions that estimate from .
Figure 1 depicts both defenses on the MNIST-1-7 and IMDB datasets. Intuitively, the constraints on MNIST-1-7 make it difficult for an attacker, whereas IMDB looks far more attackable. In the next section, we will see how to make these intuitions concrete.
Attack, Defense, and Duality
Recall that we are interested in the worst-case test loss . To make progress, we consider three approximations. First, (i) we pass from the test loss to the training loss on the clean data, and (ii) we consider the training loss on the full (clean + poisoned) data, which upper bounds the loss on the clean data due to non-negativity of the loss. For any model , we then have:
The approximation (i) could potentially be invalid due to overfitting; however, if we regularize the model appropriately then we can show that train and test are close by standard concentration arguments (see Appendix B for details). Note that (ii) is always a valid upper bound, and will be relatively tight as long as the model ends up fitting the poisoned data well.
Putting it all together, the worst-case test loss from any attack with elements is approximately upper bounded as follows:
The approximations (i) and (iii) define the assumptions we need for our certificates to hold; as long as both approximations are valid, will give an approximate upper bound on the worst-case test loss.
Our algorithm, shown in Algorithm 1, is very simple: in each iteration, it alternates between finding the worst attack point with respect to the current model and updating the model in the direction of the attack point, producing . The attack is the set of points thus found.
To derive the algorithm, we simply swap min and max in (3) to get an upper bound on , after which the optimal attack set for a fixed is realized by a single point :
Hence, any algorithm whose average regret is small will have a nearly optimal candidate attack . There are many algorithms that have this property (Shalev-Shwartz, 2011); the particular algorithm depicted in Algorithm 1 is a variant of regularized dual averaging (Xiao, 2010). In summary, we have a simple learning algorithm that computes an upper bound on the minimax loss along with a candidate attack (which provides a lower bound). Of course, the minimax loss is only an approximation to the true worst-case test loss (via (3)). We examine the tightness of this approximation empirically in Section 4.
2 Data-Dependent Defenses: Upper and Lower Bounds
We now turn our attention to data-dependent defenders, where the feasible set depends on the data (and hence can be influenced by the attacker). For example, consider the slab defense (see (2)) that uses the empirical (poisoned) mean instead of the true mean:
where is the empirical mean over ; the notation tracks the dependence of the feasible set on . Similarly to Section 3.1, we analyze the minimax loss , which we can bound as in (5): .
However, unlike in (5), it is no longer the case that the optimal places all points at a single location, due to the dependence of on ; we must jointly maximize over the full set . To improve tractability, we take a continuous relaxation: we think of as a probability distribution with mass on each point in , and relax this to allow any probability distribution . The constraint then becomes (where denotes the support), and the analogue to (5) is
The constraint set for is non-convex, so duality (Proposition 1) no longer holds. In particular, the average of two feasible might not itself be feasible.
Experiments I: Oracle Defenses
We also ran the candidate attack from Algorithm 1 as well as two baselines — gradient descent on the test loss (varying the location of points in , as in Biggio et al. (2012) and Mei and Zhu (2015b)), and a simple baseline that inserts copies of points from with the opposite label (subject to the flipped points lying in ). The results are in in Figure 2c. Our attack consistently performs strongest; label flipping seems to be too weak, while the gradient algorithm seems to get stuck in local minima.Though Mei and Zhu (2015b) state that their cost is convex, they communicated to us that this is incorrect. Though it is not shown in the figure, we note that the maximum test 0-1 error against any attack, for up to 0.3, was 4%, confirming the robustness suggested by our certificates.
Finally, we visualize our attack in Figure 1a. Interestingly, though the attack was free to place points anywhere, most of the attack is tightly concentrated around a single point at the boundary of .
We next consider attacks on text data. Beyond the the sphere and slab constraints, a valid attack on text data must satisfy additional integrity constraints (Newell et al., 2014): for text, the input consists of binary indicator features (e.g., presence of the word “banana”) rather than arbitrary reals.Note that in the previous section, we ignored such integrity constraints for simplicity.
Algorithm 1 still applies in this case — the only difference is that the QP from Section 4 has the added constraint and hence becomes an integer quadratic program (IQP), which can be computationally expensive to solve. We can still obtain upper bounds simply by relaxing the integrity constraints; the only issue is that the points in the corresponding attack will have continuous values, and hence don’t correspond to actual text inputs. To address this, we use the IQP solver from Gurobi (Gurobi Optimization, Inc., 2016) to find an approximately optimal feasible . This yields a valid candidate attack, but it might not be optimal if the solver doesn’t find near-optimal solutions.
We ran both the upper bound relaxation and the IQP solver on two text datasets, the Enron spam corpus (Metsis et al., 2006) and the IMDB sentiment corpus (Maas et al., 2011). The Enron training set consists of e-mails ( spam and non-spam), with distinct words. The IMDB training set consists of product reviews with distinct words. We used bag-of-words features, which yields test accuracy and , respectively, in the absence of poisoned data. IMDB was too large for Gurobi to even approximately solve the IQP, so we resorted to a randomized rounding heuristic to convert the continuous relaxation to an integer solution.
Results are given in Figure 3; there is a relatively large gap between the upper bound and the attack. Despite this, the attacks are relatively successful. Most striking is the attack on IMDB, which increases test error from to for , despite having to pass the oracle defender.
To understand why the attacks are so much more successful in this case, we can consult Figure 1b. In contrast to MNIST-1-7, for IMDB the defenses place few constraints on the attacker. This seems to be a consequence of the high dimensionality of IMDB and the large number of irrelevant features, which increase the size of without a corresponding increase in separation between the classes.
Experiments II: Data-Dependent Defenses
We now revisit the MNIST-1-7 and Dogfish datasets. Before, we saw that they were unattackable provided we had an oracle defender that knew the true class means. If we instead consider a data-dependent defender that uses the empirical (poisoned) means, how much can this change the attackability of these datasets? In this section, we will see that the answer is quite a lot.
The geometry of the attack is depicted in Figure 4b. By carefully choosing the location of the attack, the attacker can place points that lie substantially outside the original (clean) feasible set. This is because the poisoned data can substantially change the the direction of the slab constraint, while the sphere constraint by itself is not enough to effectively filter out attacks. There thus appears to be significant danger in employing data-dependent defenders—beyond the greater difficulty of analyzing them, they seem to actually be more vulnerable to attack.
Related Work
Due to their increased use in security-critical settings such as malware detection, there has been an explosion of work on the security of machine learning systems; see Barreno et al. (2010), Biggio et al. (2014a), Papernot et al. (2016b), and Gardiner and Nagaraja (2016) for some recent surveys.
Our contribution relates to the long line of work on data poisoning attacks; beyond linear classifiers, others have studied the LASSO (Xiao et al., 2015a), clustering (Biggio et al., 2013; 2014c), PCA (Rubinstein et al., 2009), topic modeling (Mei and Zhu, 2015a), collaborative filtering (Li et al., 2016), neural networks (Yang et al., 2017), and other models (Mozaffari-Kermani et al., 2015; Vuurens et al., 2011; Wang, 2016). There have also been a number of demonstrated vulnerabilities in deployed systems (Newsome et al., 2006; Laskov and Šrndic̀, 2014; Biggio et al., 2014b). We provide formal scaffolding to this line of work by supplying a tool that can certify defenses against a range of attacks.
A striking recent security vulnerability discovered in machine learning systems is adversarial test images that can fool image classifiers despite being imperceptible from normal images (Szegedy et al., 2014; Goodfellow et al., 2015; Carlini et al., 2016; Kurakin et al., 2016; Papernot et al., 2016a). These images exhibit vulnerabilities at test time, whereas data poisoning is a vulnerability at training time. However, recent adversarial attacks on reinforcement learners (Huang et al., 2017; Behzadan and Munir, 2017; Lin et al., 2017) do blend train and test vulnerabilities. A common defense against adversarial test examples is adversarial training (Goodfellow et al., 2015), which alters the training objective to encourage robustness.
We note that generative adversarial networks (Goodfellow et al., 2014), despite their name, are not focused on security but rather provide a game-theoretic objective for training generative models.
Finally, a number of authors have studied the theoretical question of learning in the presence of adversarial errors, under a priori distributional assumptions on the data. Robust algorithms have been exhibited for mean and covariance estimation and clustering (Diakonikolas et al., 2016; Lai et al., 2016; Charikar et al., 2017), classification (Klivans et al., 2009; Awasthi et al., 2014), regression (Nasrabadi et al., 2011; Nguyen and Tran, 2013; Chen et al., 2013; Bhatia et al., 2015) and crowdsourced data aggregation (Steinhardt et al., 2016). However, these bounds only hold for specific (sometimes quite sophisticated) algorithms and are focused on good asymptotic performance, rather than on giving good numerical error guarantees for concrete datasets/defenses.
Discussion
In this paper we have presented a tool for studying data poisoning defenses that goes beyond empirical validation by providing certificates against a large family of attacks modulo the approximations from Section 3. We stress that our bounds are meant to be used as a way to assess defense strategies in the design stage, rather than guaranteeing performance of a deployed learning algorithm (since our method needs to be run on the clean data, which we presumably would not have access to at deployment time). For instance, if we want to build robust defenses for image classifiers, we can assess the performance against attacks on a number of known image datasets, in order to gain more confidence in the robustness of the system that we actually deploy.
Our framework currently does not handle non-convex losses: while our method might still be meaningful as a way of generating attacks, our upper bounds would no longer be valid. The issue is that an attacker could try to thwart the optimization process and cause the defender to end up in a bad local minimum. Finding ways to rule this out without relying on convexity would be quite interesting.
Separately, the bound was useful because admits the natural minimax formulation (5), but the worst-case can be expressed directly as a bilevel optimization problem (Mei and Zhu, 2015b), which is intractable in general but admits a number of heuristics (Bard, 1999). Bilevel optimization has been considered in the related setting of Stackelberg games (Brückner and Scheffer, 2011; Brückner et al., 2012; Zhou and Kantarcioglu, 2016), and is natural to apply here as well.
To conclude, we quote Biggio et al., who call for the following methodology for evaluating defenses:
To pursue security in the context of an arms race it is not sufficient to react to observed attacks, but it is also necessary to proactively anticipate the adversary by predicting the most relevant, potential attacks through a what-if analysis; this allows one to develop suitable countermeasures before the attack actually occurs, according to the principle of security by design.
The existing paradigm for such proactive anticipation is to design various hypothetical attacks against which to test the defenses. However, such an evaluation is fundamentally limited because it leaves open the possibility that there is a more clever attack that we failed to think of. Our approach provides a first step towards surpassing this limitation, by not just anticipating but certifying the reliability of a defender, thus implicitly considering an infinite number of attacks before they occur.
The code and data for replicating our experiments is available on GitHub (http://bit.ly/gt-datapois) and Codalab Worksheets (http://bit.ly/cl-datapois).
Acknowledgments.
JS was supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Research Fellowship. This work was also partially supported by a Future of Life Institute grant and a grant from the Open Philanthropy Project. We are grateful to Daniel Selsam, Zhenghao Chen, and Nike Sun, as well as to the anonymous reviewers, for a great deal of helpful feedback.
References
Appendix A Proof of Proposition 1
Proposition 1 follows by standard duality arguments which we reproduce here. First recall the definition of : for a sequence of loss functions , , and an algorithm with iterates , regret is defined as
Substituting into (10) and averaging over , we have
Appendix B Defending Against Overfitting Attacks
In Section 3 we claimed that it is possible to defend against overfitting attacks with appropriate regularization. In this section we justify this claim. The key is the classical theory of uniform convergence, which allows us to say that, with probability , the following uniform bound holds:
where is an error bound that is roughly . More precisely, Kakade et al. (2009) show the following:
By setting appropriately relative to and we can therefore guarantee that the train and test losses in (14) are close together, and therefore rule out any overfitting attack (because any attack that makes the test loss high would also have to make the train loss high).
Appendix C Regret Bound for Adaptive RDA
Our optimization algorithm (Algorithm 1) is similar in spirit to Regularized Dual Averaging (Xiao, 2010), but the known regret bounds for RDA do not apply directly because the regularizer is chosen adaptively to ensure the norm constraint holds. In fact, a somewhat different analysis is required in this case, closer in spirit to that given by Steinhardt et al. (2014) for sparse linear regression. While the details would take us beyond the scope of this paper, we state the regret bound here:
After steps of the update in Algorithm 1, the regret of Algorithm 1 can be bounded as
We make two observations: first, since necessarily, by setting to be on the order of we can ensure average regret . On the other hand, in many instances will actually increase linearly with (in order to enforce the norm constraints ) in which case the average regret decreases at the faster rate . In either case, the average regret goes to as .
Now, let , , , and be the weights of these points under . Letting and be the empirical means of the positive and negative class over , and and the empirical probability of the two classes, we have the following expression for :
using the assumption that the are support vectors and the are not.
Now, the sphere and slab constraints may be written as
for , . We also have the constraints
for (encoding the constraints that the are support vectors and the are not).
A careful examination reveals that, for fixed , all terms in (18-22) can be written as linear inequality constraints in the inner products between the vectors . Therefore, by changing variables to the Gram matrix among these vectors, we can express the maximization over in (16) as a semidefinite program over these variables, with equality constraints for the known inner products between , , and .
Moreover, for any matrix satisfying these equality constraints, it is possible to recover vectors , , , and (depending on , , and ) whose inner products match the Gram matrix . Precisely, if G=\left[\begin{array}[]{cc}G_{11}&G_{12}\\ G_{21}&G_{22}\end{array}\right] is the Gram matrix (with block being the vectors , and block being the known vectors ), then for any vectors orthogonal to the span of , , and , we can take
where and , and denotes pseudoinverse. This means that solving the SDP allows us to not only compute the optimal objective value, but also to actually recover vectors realizing it.
To finish, we must handle the fact that the weights are not known. However, they comprise only a -dimensional parameter space, and hence we can approximate the maximum over all through Monte Carlo simulation (i.e., randomly sample the weights a sufficiently large number of times and take the best).