On the Effectiveness of Mitigating Data Poisoning Attacks with Gradient Shaping

Sanghyun Hong, Varun Chandrasekaran, Yiğitcan Kaya, Tudor Dumitraş, Nicolas Papernot

Introduction

A common paradigm for building a new machine learning (ML) system is to collect the required training data from several sources. Ideally, the data should come from trustworthy sources. However, the scale of modern ML tasks and challenges in establishing trust often force practitioners to resort to unvetted sources. This exposes ML systems to potentially dangerous training data and enables data poisoning attacks. In poisoning attacks, the attacker inserts poison in the victim’s training set to induce the victim to learn a model whose behavior is advantageous to the attacker. Data poisoning has been demonstrated in malware classification, spam filtering, and DoS detection.

Poisoning has sparked an arms race between proposed defenses and attacks that defeat them. For example, a data-sanitization defense called RONI relies on the assumption that a poison sample in the training set necessarily hurts the model’s accuracy. While RONI could defend against some indiscriminate attacks, it was shown to be ineffective against novel and adaptive targeted attacks. In general, prior attacks and defenses emphasized the diversity of information available to the adversary. In this paradigm, safeguarding ML models against poisoning requires a strong understanding of the threat surface exposed by learning algorithms.

In this work, we challenge prior taxonomies of poisoning in ML and make a first step towards a unified view of the threat surface. Specifically, we ask: What are some essential characteristics shared across various poisoning attacks? To answer this question, we focus on the cornerstone of how most ML models are trained – using gradients. During training, the gradients computed on data dictate how a model’s parameters should be updated; this determines properties of the resulting model. Stochasticity found in many popular optimizers results in poisoners having limited control over gradients computed during training. This gives defenders an edge to reason about differences between clean and poisoned gradients.

Poisoning attacks rely on two broad approaches to manipulate a model’s behavior: feature collision and feature insertion. For example, to degrade performance, indiscriminate poisoning attacks rely on feature collision to overwrite clean features. Targeted poisoning attacks , on the other hand, rely mainly on feature insertion from the target samples into the model to cause misclassification on a few test-time targets without hurting overall model performance. Prior work treats these as distinct attacks, because they make different assumptions about the adversary’s capabilities, such as the amount of poison that can be inserted in a dataset.

As we consider how poisoning affects gradients, we find that both scenarios craft poisons that share two properties. First, the gradients computed from the poison and the clean samples have observable magnitude and orientation differences. Second, these differences grow as we introduce stronger poison samples. We use these properties to unify the threat surface of both indiscriminate and targeted attacks.

These properties also suggest design guidelines for a generic defense strategy effective against more forms of poisoning. First, gradient-level differences—magnitude and orientation—between the poison and the clean samples result in model parameter updates in favor of the adversary. In consequence, an ideal defense should minimize such differences to ensure that the poison cannot dominate a model’s behavior. Further, as attacks can still be effective even when poison samples closely resemble clean ones, a defense also should not rely on data sanitization, i.e., identifying and removing malicious samples. Most prior defenses rely on a form of sanitization, which means they have to make attack-specific assumptions. Based on these desiderata, we propose gradient shaping as a step toward defenses that generalize.

In gradient shaping, a defense aims to mitigate poisoning at the gradient level during training and remains agnostic to training samples. As a concrete implementation of gradient shaping, we experiment with an off-the-shelf tool: differentially private stochastic gradient descent (DP-SGD). DP-SGD was originally designed as a training algorithm that provides differential privacy guarantees with respect to training data. Because it clips the norm of individual gradients and adds noise to them, we find DP-SGD to be a suitable candidate for a gradient shaping mechanism. Thus, we study the feasibility of gradient shaping with DP-SGD against a wide range of attacks.

We evaluate DP-SGD against two indiscriminate poisoning attacks and a strong clean-label targeted poisoning attack. Our results on three ML models (logistic regression, multi-layer perceptrons, and convolutional neural networks) trained on three popular ML tasks (Purchase-100, FashionMNIST, and CIFAR-10) reveal that DP-SGD can be effective against multiple poisoning attacks, even when DP-SGD only provides trivial privacy guarantees. For example, against an indiscriminate attack (random label-flipping), it reduces the performance degradation by half, and against a one-shot targeted attack, it prevents targets from being misclassified. Furthermore, against a multi-poison targeted attack, it also forces the adversary to blend more poisons and, therefore, increases the attack’s cost. However, even though we still observe gradient-level differences, DP-SGD is relatively ineffective against a strong, albeit unrealistic, indiscriminate attack. We believe this exposes the limitations of DP-SGD in performing gradient shaping; therefore, designing even more suitable mechanisms is an important direction for future research.

Contributions. In summary, we make four contributions:

We expose common gradient-level properties across various forms of poisoning, in contrast to previous taxonomies. In particular, we identify that poisoned gradients have higher magnitudes and are oriented differently when compared to clean gradients.

We take a step towards unifying the poisoning threat surface based on our gradient-level analysis.

Based on our unified view, we discuss the desiderata for a generic, attack-agnostic, defense against poisoning and propose gradient shaping as a defense approach that fulfills these requirements.

We utilize DP-SGD as an off-the-shelf gradient shaping tool. We evaluate the effectiveness of DP-SGD against various poison attacks with a systematic study on three ML models and three ML tasks.

Preliminaries on ML and Poisoning

In probably approximately correct (PAC) learning , there is an underlying data distribution $\mathcal{Z}\!=\!\mathcal{X}\times\mathcal{Y}$ where $\mathcal{X}$ is the domain of inputs and $\mathcal{Y}$ the outputs. For example, in spam email classification, inputs could be emails and outputs are labels indicating whether these emails are spam or ham. The objective of a learning algorithm $\mathcal{Q}$ is to learn a parameterized model $f_{\theta}\in\mathcal{H}$ , where $\mathcal{H}$ is the space of hypotheses. For instance, a restricted $\mathcal{H}$ could be the space of all parameters for a particular neural network whose weights and biases $\theta$ need to be learned to obtain a model $f_{\theta}$ . The model itself is a mapping between inputs and labels i.e., $f_{\theta}:\mathcal{X}\rightarrow\mathcal{Y}$ .

To train such a model, $\mathcal{Q}$ has access to a dataset $\mathcal{D}$ that is drawn from the underlying data distribution $\mathcal{Z}$ . In the supervised learning setting , $\mathcal{D}$ is partitioned into disjoint subsets: the training $\mathcal{D}_{tr}$ and test $\mathcal{D}_{ts}$ datasets. We also assume the existence of a non-negative, real-valued loss function $\mathcal{L}(f_{\theta}(x),y)$ , which quantifies how correct (or incorrect) the prediction of a model is given an input $x$ and label $y$ .

Gradient Descent. Gradient descent has established itself as the de facto approach to training ML models, in particular neural networks, where the required gradients are computed with the backpropagation algorithm. Gradient descent updates the model parameters by a multiple of the gradient of the empirical risk with respect to the model parameters $\theta$ at each iteration $t$:

$$\theta_{t+1}=\theta_{t}-\eta\,\nabla_{\theta}\frac{1}{|\mathcal{D}_{tr}|}\sum_{(x,y)\in\mathcal{D}_{tr}}\mathcal{L}(f_{\theta_{t}}(x),y),$$

where $\eta$ is the learning rate and controls the magnitude of changes made at each iteration. When it is not feasible to compute the empirical risk over the entire training set, one samples a single example from the training set and uses it to compute the empirical risk and update the model. This variant is known as stochastic gradient descent (SGD). To obtain a less noisy estimate of the true gradient that would be computed on the entire set, one may alternatively sample a mini-batch, i.e., a small number of examples, $(\mathbf{x}_{t},\mathbf{y}_{t})$ from the dataset at each iteration, rather than a single example or the whole set:

$$\theta_{t+1}=\theta_{t}-\eta\,\nabla_{\theta}\mathcal{L}(f_{\theta_{t}}(\mathbf{x}_{t}),\mathbf{y}_{t}),$$

where the loss is averaged over the examples in the mini-batch. This learning procedure is called mini-batch SGD.
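As a rough illustration of the mini-batch update described above, the following sketch fits a one-parameter model with a squared loss. The toy model, the data, the function name `minibatch_sgd`, and all hyperparameter values are assumptions made for this example; they are not taken from the experiments in this paper.

```python
import random

def minibatch_sgd(data, theta=0.0, lr=0.1, batch_size=4, epochs=50, seed=0):
    """Fit y ~ theta * x with mini-batch SGD on the squared loss (toy example)."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # a fresh random partition into mini-batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared loss over the mini-batch w.r.t. theta.
            grad = sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)
            # Parameter update: step against the gradient, scaled by the learning rate.
            theta -= lr * grad
    return theta

# Toy data generated from y = 3x, so the fitted theta should approach 3.
toy = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]
theta_hat = minibatch_sgd(toy)
```

Setting `batch_size=1` gives single-example SGD, and setting it to `len(data)` gives full-batch gradient descent.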

Data poisoning is a training-time attack that manipulates an ML model’s behavior in favor of the attacker by injecting maliciously crafted samples, i.e., poisons, into the training set. If an ML model is trained on the contaminated training set, an attacker can prevent the trained model from generalizing well to the test data or cause misclassifications of specific test-time samples, i.e., targets, without degrading the generalization performance of the trained model. The former attacks are known as indiscriminate poisoning whereas the latter are referred to as targeted poisoning. These attacks are highly effective when an adversary cannot control the test-time samples, or when an attacker is not able to alter the training procedures.

Adversaries exploit two mechanisms to trigger the intended misclassifications: feature collisions and feature insertions.

$\bullet$ Feature Collision: An attacker can blend poisons such that a victim model learns the opposite of what it would from clean data. For example, the attacker can create poisons by just flipping the label of a clean sample, because the clean and poison samples then carry the same input-space information but opposite labels. If the victim trains a model on the poisoned training data, the information learned from the clean samples contradicts that learned from the poison samples. Indiscriminate attackers cause collisions with the features useful for classifying most test-time samples, whereas targeted attackers cause collisions locally, so that misclassifications occur only on a small subset of test samples.

$\bullet$ Feature Insertion: In addition, the attacker can make a victim model learn new latent representations useful for misclassification. For instance, in image classification, the attacker can identify the features (mostly in the input space) that commonly appear in the targets, but are not critical to classify the target samples. The adversary can proceed to craft the poisons to include those features, but with the label that the attacker wants the targets to be misclassified as. When a model trained with the poisons encounters samples without the features added by the adversary, it classifies them correctly, while the targets are classified into the attacker’s label. Thus, this mechanism is useful for targeted poisoning.

2.1.2 Threat Model

In this subsection, we specify the threat model we operate in.

Capability: Our work assumes an attacker who cannot modify the victim model $f$ and its parameters $\theta$ directly, or control/modify the training procedure $\mathcal{Q}$ to cause the misclassification. We consider an attacker who can craft poisons offline and blend them into the training data $\mathcal{D}_{tr}$ . More specifically, we consider two scenarios: (1) the attacker blends poisons at the beginning of the training phase, or (2) adds poisons when the learner attempts to update the trained model. Thus, what the attacker can control is (1) the number of poisons that will be added to the training data, and (2) the time (or the training stage) when the attacker decides to blend poisons.

Knowledge: Prior work characterized the poisoning attacker’s knowledge using four dimensions: (1) the training data $\mathcal{D}_{tr}$, (2) the subset of features $\mathcal{X}$ the victim uses, (3) the learning algorithm $\mathcal{Q}$, and (4) the model parameters $\theta$. In the black-box setting, the attacker has no knowledge of these four dimensions, whereas in the white-box setting, the attacker knows all four components, at least partially. As we discuss defensive mechanisms against data poisoning, we consider strong white-box attackers.

Poisoning Mechanisms and Gradients

In this section, we characterize the attack surface exploited by data poisoning attacks through a systematic analysis of gradients computed by training algorithms. Here, we focus on the impact of the poisoning mechanisms that we specified in §2.1.1 on the gradients computed during the training of a model. Since the attacker cannot modify the victim model, perturb its parameters, or control the training procedures (specified in §2.1.2), the attacker cannot directly manipulate the gradients computed on the poisons by the victim’s training algorithm but can influence them through poisons. In contrast to previous taxonomies structured around the knowledge and capabilities of adversaries, our characterization provides a unified perspective on data poisoning attacks in §4.

Poisons: We use the following techniques to craft poisons:

$\bullet$ Feature Collision: To cause feature collisions with poisons, we use the watermarking technique. We randomly choose the same number of samples from two classes (i.e., dress and coat) in the training set $\mathcal{D}_{tr}$ of FashionMNIST; for each pair of dress and coat images, we overlay the coat image on the dress image and still label the resulting image as a dress. We control the intensity of feature collisions by modifying the interpolation ratio $\alpha$. In Figure 2, we show the poison samples crafted with different values of $\alpha$ in the upper row; the more we overlay the coat image on the dress image, the more dissimilar the interpolated image and its label (dress) become.

$\bullet$ Feature Insertion: To exploit feature insertion with poisons, we utilize backdooring mechanisms. We first randomly select 1% of the training samples from any class and attach a small white square in the top-left corner of each image. We then assign the label coat to the patched images and add them to the training set. Here, we control the intensity of feature insertion by increasing the patch size: $\{1\!\times\!1,4\!\times\!4,7\!\times\!7,10\!\times\!10,14\!\times\!14\}$, as shown in the bottom row of Figure 2. As the patch size increases, the model is required to update its parameters more.
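The two poison-crafting techniques above can be sketched on toy images as follows. This is a minimal illustration, not the code used in our experiments: the helper names `watermark` and `add_patch` are hypothetical, images are plain nested lists with pixel values in [0, 1], and the sketch assumes that watermarking is a pixel-wise convex combination controlled by $\alpha$.

```python
def watermark(base, overlay, alpha):
    """Blend `overlay` into `base` pixel-wise.

    Assumed convention: alpha=0 returns the base image, alpha=1 the overlay.
    """
    return [[(1.0 - alpha) * b + alpha * o for b, o in zip(brow, orow)]
            for brow, orow in zip(base, overlay)]

def add_patch(image, size, value=1.0):
    """Return a copy of `image` with a size x size square of `value` (white) in the top-left corner."""
    patched = [row[:] for row in image]
    for r in range(size):
        for c in range(size):
            patched[r][c] = value
    return patched

# Toy 28x28 stand-ins for a dress image and a coat image.
dress = [[0.2] * 28 for _ in range(28)]
coat = [[0.8] * 28 for _ in range(28)]

# Feature collision: coat features are blended in, but the label stays "dress".
collision_poison = (watermark(dress, coat, alpha=0.5), "dress")

# Feature insertion: a 4x4 white patch is added and the label is set to "coat".
insertion_poison = (add_patch(dress, size=4), "coat")
```

Increasing `alpha` or `size` corresponds to increasing the intensity of feature collision or feature insertion, respectively.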

3.2 Gradient Analysis: Feature Collision

Here, we focus on two scenarios where (1) a model is trained from scratch on the training set containing multiple poisons, and (2) we update a trained model on the poisoned training set. They represent common indiscriminate and targeted poisoning scenarios, respectively. In the first scenario, we train a logistic regression (LR) model on a subset of the FashionMNIST dataset. We use the Adam optimizer and train the model for 40 epochs with a batch size of 300 and a learning rate of 0.01. The subset is a binary dataset comprising samples from the dress and coat classes; the training set includes 10,800 samples (5,400 from each class), and the testing set contains 2,000 samples. We construct the poisoned training set by adding 100 interpolated dress samples. In the second scenario, we first train a multi-layer perceptron (MLP) with two hidden layers on the entire FashionMNIST dataset. We then re-train the model on the poisoned training set containing the same 100 poisons. We use the SGD optimizer and re-train the model for 20 epochs with a batch size of 100 and a learning rate of 0.04.

The first and second columns in Figure 1 illustrate the results from our feature collision analysis. Overall, we observe that the magnitude ratio between the gradients from individual poison and clean samples is high during training in both scenarios, i.e., the magnitude of the poison gradients is larger than that of the clean gradients. We also found that the ratio becomes larger when we re-train a model with the same poisons; in this case, the model has already learned from the clean samples, so the gradients computed from the clean data are smaller than the poison gradients. In terms of the orientation differences, the cosine similarity scores are less than zero in both cases. This indicates that the features are colliding, i.e., the information from the poison gradients contradicts that from the clean ones. Moreover, we identified that, as the intensity of the feature collision increases, the magnitude ratio and orientation difference become more stable and visible. For instance, in the first scenario, the case of $\alpha\!=\!1.0$ shows fewer oscillations of the magnitude ratio and cosine similarity score than the $\alpha\!=\!0.1$ case. In the re-training scenario, when $\alpha$ approaches one, the ratio increases, and the cosine similarity gets closer to minus one.
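The two quantities discussed above can be computed from a pair of gradient vectors as in the sketch below. It assumes that the magnitude ratio is the ratio of L2 norms of a poison gradient to a clean gradient and that gradients are flattened into plain lists; how the values are aggregated over samples and iterations for the figure is not specified here, and the function names are illustrative.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flattened gradient vector."""
    return math.sqrt(sum(x * x for x in v))

def magnitude_ratio(poison_grad, clean_grad):
    """Ratio of the poison gradient norm to the clean gradient norm (assumed definition)."""
    return l2_norm(poison_grad) / l2_norm(clean_grad)

def cosine_similarity(u, v):
    """Cosine of the angle between two gradients; -1 means exactly opposite directions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))

# Toy gradients: the poison gradient is three times larger and points the opposite way.
clean_grad = [1.0, 0.0]
poison_grad = [-3.0, 0.0]

ratio = magnitude_ratio(poison_grad, clean_grad)   # 3.0
cos = cosine_similarity(poison_grad, clean_grad)   # -1.0
```

A ratio above one together with a negative cosine similarity is the pattern that the feature collision analysis reports.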

3.3 Gradient Analysis: Feature Insertion

In targeted poisoning, an attacker can also exploit the feature insertion to cause local misclassifications on a few target samples. Since the feature insertion is useful against the re-training of a high-capacity model such as a neural network, we consider the scenario where the victim updates an MLP model on the poisoned training set. We take the MLP model in §3.2, trained on the entire FashionMNIST, and construct the poisoned training set by adding 1% of patched samples to the original (clean) training set. We then re-train the model on the poisoned training set for 20 epochs by using the SGD optimizer with the batch size 100 and the learning rate 0.01.

The last column in Figure 1 shows the results from our feature insertion analysis. In the earlier epochs, we found patterns similar to the results in §3.2: there is a significant difference in the magnitude of the gradients computed from the poisons and clean samples. However, the differences that we observed in the earlier epochs shrink during training and, at the end of the re-training, both the magnitude ratio and cosine similarity become closer to zero. This implies that the model learns the new features coming from our poisons during re-training with minimal collisions with the existing features; thus, the attacker can cause misclassifications of the test-time samples that include the new features without hurting the model’s original behaviors.

Unifying Data Poisoning Attacks

In this section, we unify the attack surface exploited by data poisoning attacks based on our analysis of their impact on the magnitude ratio and orientation differences between the clean and poison gradients. We start with an overview of the existing data poisoning attacks based on the taxonomy structured around the knowledge and capability of an attacker . We then focus on poisoning mechanisms that the indiscriminate or targeted attacks use and identify shared features that can be observed in the magnitude ratio and orientation differences. Building on this intuition, we lay out essential properties for a generic defense against poisoning attacks.

Estimated Impact on Gradients: To maximize the test-time loss during training, the attacker needs to craft poisons that cause the largest disturbance to the gradients computed from the clean data. Hence, the indiscriminate attackers exploit the feature collision (see column 11 in Table 1). As the main purpose of the indiscriminate attack is to cause a significant accuracy drop, the attacker increases the intensity of feature collision by blending multiple poisons into the training set. This scenario is similar to our analysis of feature collision in §3.2 when we train an LR model on the binary training set containing 100 poisons. Thus, we expect to observe a similar magnitude ratio and orientation difference between the poison and clean gradients across the indiscriminate poisoning attacks—i.e., cosine similarity becomes -1, and the magnitude ratio is high. We specify this intuition in the columns 12-13.

4.2 Targeted Poisoning Attacks

Overview of Existing Work: The bottom half of Table 1 shows the targeted attacks. Most work , except , performed targeted attacks on large capacity models such as deep neural networks (DNNs). In the columns 3-6, we show that initial work on targeted poisoning considers the white-box attacker who knows $\mathcal{D}_{tr}$ , $\mathcal{X}$ , $f_{\theta}$ , and $\mathcal{Q}$ ; however, the following work identified that an adversary can perform effective attacks without the knowledge of the training set $\mathcal{D}_{tr}$ . The most recent work demonstrates the successful attacks without knowing the target model $f_{\theta}$ and its training algorithm $\mathcal{Q}$ by exploiting the transferability of poisons across different DNNs. In targeted poisoning, the attacker crafts poisons using the test samples in the target class. For example, if an attacker wants to misclassify a small subset of dog images (targets) into fish, the attacker first picks a few test samples in the fish class. In the creation process, the attacker minimizes the distance between the poisons and targets in the internal representation space of a model and the perturbations in the input features $\mathcal{X}$ . This makes targeted attacks inconspicuous—i.e., the poisons are perceptually indistinguishable to a human, but they affect the model’s decision locally without a significant accuracy drop.

Estimated Impact on Gradients: To minimize the accuracy drop caused by poisoning samples, a targeted attacker exploits both feature collision and insertion (see column 11 in Table 1): (1) To cause local misclassifications, the attacker can cause collisions with features that are important for classifying the targets but that the model does not rely on for classification overall. (2) On the other hand, the attacker inserts new features into the target model and increases their importance for the classification of targets during training. Considering that the targeted attacks were more successful when the attacker blends poisons during the re-training of a victim model, we expect to observe the magnitude ratio and orientation difference shown in §3.3, i.e., the ratio becomes high, and the cosine similarity becomes unstable or stays close to -1. However, as shown in Figure 1, the cosine similarity can shrink in magnitude and approach 0 when the attacker reduces the intensity (the number of poisons) of both mechanisms.

Mitigate Poisoning with Gradient Shaping

The unified view of data poisoning attacks in §4 outlines a series of requirements for effective defenses. Here, we introduce gradient shaping, a property for anti-poison defenses, that satisfies these requirements. Using this property, we taxonomize the existing defenses against data poisoning attacks and discuss the effectiveness and limits of those mechanisms. We then instantiate a defense by leveraging differentially private (DP) optimizers and demonstrate its effectiveness in §6.

Gradient Shaping: By controlling both the magnitude ratio and orientation difference of the gradients, we could intuitively safeguard ML models against data poisoning attacks. We refer to this property as gradient shaping. A defense mechanism with this property minimizes the differences in gradients before the model uses them to update its parameters. In Figure 3, we illustrate how this property reduces the differences in gradients during the training of a model with poisons in the context of indiscriminate attacks. When we train a model on the poisoned training set, we compute the clean gradients (the green lines in the figure) and the poison gradients (the red lines in the figure) at each iteration. We sum the two gradients and update the model parameters.

If the data does not include poisons, the training algorithm updates the model parameters following the trajectory shown in the green dashed lines, i.e., it updates the parameters in the direction that reduces the loss. However, when the data contains poisons, the updates at each iteration are affected by the poison gradients (see the red dashed lines). Thus, the model parameters land on a region of the surface with high loss, which leads to the accuracy drop of the trained model. If gradient shaping is used, the magnitude and orientation differences between the poison and clean gradients are reduced (see the brown lines). In this case, the updates of the model parameters deviate less, following the trajectory illustrated in the brown dashed lines, and land on a region of the surface with low loss.

5.2 Existing Poisoning Defenses

Table 3 outlines defenses against poisoning. Prior work focused primarily on outlier removal (also known as data sanitization). In outlier removal, the defender considers outliers as poisons and removes them from the training data, which meets the requirements of gradient shaping as it clears away the poison gradients by removing a set of poisons.

However, outlier removal is brittle: these defenses identify poisons based on analysis of nearest neighbors, training loss, and dimensionality reduction techniques, all of which depend on the training data $\mathcal{D}_{tr}$ and/or the model $f_{\theta}$ and its parameters $\theta$. Worse, when the attack uses inconspicuous poisons, e.g., targeted poisoning, it is difficult to detect them with sanitization techniques; as witnessed in the abundant work on evasion attacks, malicious samples with similar input representations often produce distinctly different gradients. An extended version of the RONI defense called tRONI examines the misclassification of targets during training; however, tRONI assumes a defender knows the targets that an attacker wishes to misclassify. Moreover, outlier removal scales poorly as it increases the computational overhead during training: these defenses require an iterative analysis of the training samples or rely on robust optimization using higher-order derivatives. Approaches from robust optimization could indeed apply here, but they were found by Jagielski et al. to perform poorly in the presence of adversary-induced poisoning.

5.3 Generic Defense: DP Optimizers

Recall that the requirements for an effective anti-poison defense include (1) controlling the norm of the poison gradient, and (2) restricting differences in the orientation between the poison and clean gradients. Similar requirements are needed to ensure that learning algorithms train a model privately. We adopt the differential privacy framework, which can be thought of as requiring that the model updates are not overly influenced by any individual example contained in the training data. Abadi et al. propose a differentially private mechanism for off-the-shelf optimizers, e.g., SGD (henceforth referred to as DP optimizers). By choosing a pre-defined clipping norm, they bound the influence of an individual gradient on a model. They also make the gradients indistinguishable by adding Gaussian noise (see Algorithm 1 in Appendix B). Thus, we utilize DP optimizers to meet the requirements for an effective anti-poison defense. The key advantage of using DP optimizers is that they are generic mechanisms agnostic to the dataset and model used by a defender, and to the techniques used to craft poisons (see Table 3). In subsequent sections, we validate these points.
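The clipping and noising steps described above can be sketched for a single mini-batch as follows. This is a simplified illustration in the spirit of the DP-SGD update, not the TensorFlow-Privacy implementation: the function names are hypothetical, gradients are plain lists, and the noise standard deviation is assumed to be the noise multiplier times the clipping norm.

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def shaped_update(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Assumed noise scale: standard deviation = noise_multiplier * clip_norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    return [n / len(per_example_grads) for n in noisy]

# Toy batch: three clean gradients and one large poison gradient pointing the opposite way.
grads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-9.0, 0.0]]

# Without shaping, the poison dominates the averaged update.
plain_update = [sum(g[i] for g in grads) / len(grads) for i in range(2)]

# With clipping only (noise disabled), the clean direction is preserved.
shaped = shaped_update(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
```

In this toy batch the unshaped average points against the clean gradients, while clipping limits the poison gradient to the same norm as each clean gradient, so the shaped update keeps the clean direction.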

Out-of-Scope: DP optimizers are designed to control the privacy leakage ( $\varepsilon$ ), often at a significant cost in model utility . Instead, we focus on how certain parameter configurations of DP optimizers can defend against poisoning with minimal utility loss, regardless of the privacy provided.

Evaluation

In this section, we evaluate the effectiveness of training a model with DP optimizers as a defense against data poisoning attacks. We start with an overview of our experimental setup (§6.1). We then quantify the resilience of training a model with DP optimizers against indiscriminate (§6.2) and targeted poisoning attacks (§6.3 and §6.4). In experiments, we individually vary the two parameters—the clipping norm $C$ and the noise multiplier $\sigma$ —for each attack, to analyze the impact of parameter configurations on the resilience. To understand the impact of a specific parameter choice, we compare the magnitude ratio and orientation difference between the poison and clean gradients observed in the training without DP optimizers and those seen when we use them. Moreover, we turn our attention to distinct defense scenarios where the resilience of a model itself is necessary (§6.5).

Our analysis framework quantifies the effectiveness of using DP optimizers against data poisoning attacks. Given an attack, the framework crafts poisons, trains multiple models using the specified training algorithms (SGD/Adam or DP-SGD/DP-Adam) with the poisoned training set, and reports the metrics that we define. We build this framework using Python 3.7.3 and TensorFlow 1.14.0. To train a model with DP optimizers, we use the open-source library TensorFlow-Privacy.

Datasets: We conduct our analysis with three datasets: Purchase-100 , FashionMNIST , and CIFAR-10 . Purchase-100 consists of 200k customer purchase records of size 100 each (corresponding to the 100 frequently purchased items), and the records are categorized into 100 classes based on the customers’ purchase style. Here, we use 10k randomly-chosen records for training and 10k randomly-selected non-training samples for the test set . FashionMNIST is composed of 28x28 grayscale images of 70k fashion products from 10 categories, with 7k images per class, which contains 60k training and 10k testing samples. CIFAR-10 includes 32x32 pixels, colored natural images of 10 classes, containing 50k training and 10k testing samples.

Models: We consider a logistic regression (LR), a multi-layer perceptron (MLP), and a convolutional neural network (CNN). We include the network configurations that we used in Appendix C. For Purchase-100, we use LR and MLP models; however, for FashionMNIST and CIFAR-10, we use MLP and CNN models because the LR models have poor accuracy ($<$ 50%) on the test set. In all figures and discussion of results, we add the prefixes vanilla- and DP- to denote models trained with SGD and DP-SGD, respectively.

Metrics: Since the indiscriminate attacker aims to cause a significant accuracy drop of a model over the test set, we utilize the relative accuracy drop (RAD) to measure the attacker’s success. RAD is the accuracy drop of a model caused by poisoning divided by the accuracy of the clean model; the larger the RAD, the more effective the attack. For targeted poisoning, we consider an attack to be successful when the target becomes misclassified at any epoch during re-training without causing significant accuracy degradation. Specifically, successful attacks are those where RAD $<$ 0.05, the same threshold used by Suciu et al. Moreover, we measure the attack intensity as the number of poisons added to the clean training set. For indiscriminate attacks, we denote the intensity as the ratio of the number of poisons to the number of clean samples. In targeted attacks, the intensity is the number of poisons.
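The metrics above can be written down directly. The sketch below follows the verbal definition of RAD and the 0.05 threshold; the function names and the example accuracies are illustrative values, not results from our experiments.

```python
def relative_accuracy_drop(clean_acc, poisoned_acc):
    """RAD: accuracy drop caused by poisoning, relative to the clean model's accuracy."""
    return (clean_acc - poisoned_acc) / clean_acc

def targeted_attack_succeeds(target_misclassified, clean_acc, poisoned_acc, threshold=0.05):
    """A targeted attack counts as successful if the target is misclassified
    and the RAD stays below the threshold."""
    return target_misclassified and relative_accuracy_drop(clean_acc, poisoned_acc) < threshold

# Illustrative numbers: a clean accuracy of 0.90 and a poisoned accuracy of 0.72
# give a RAD of approximately 0.2.
example_rad = relative_accuracy_drop(0.90, 0.72)

# A target that flips while accuracy only falls from 0.90 to 0.89 counts as a success.
example_success = targeted_attack_succeeds(True, 0.90, 0.89)
```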

6.2 Mitigating Indiscriminate Poisoning

Experimental Methodology: Indiscriminate poisoning is known to be effective against binary classification tasks that utilize linear models; thus, we focus our analysis on the LR models trained on the subset of FashionMNIST used in §3.2. Our analysis considers two attacks: (1) the random label-flipping (LF) attack that manipulates the labels of clean samples, and (2) the state-of-the-art (SOTA) attack formulated by Steinhardt et al. For each attack, we first construct poisoned training sets that include varying numbers of poisons synthesized using that attack. On each poisoned training set, we train models with DP-Adam using different clipping norms and noise multipliers and compare their RAD with that of the vanilla-model.
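A random label-flipping attack on a binary task can be sketched as follows. The sketch assumes that poisons are label-flipped copies of randomly chosen clean samples that are appended to the clean set, with the intensity given as the ratio of poisons to clean samples; the function name and the toy data are hypothetical.

```python
import random

def label_flip_poisons(clean_set, ratio, rng):
    """Return label-flipped copies of randomly chosen clean samples (binary labels 0/1)."""
    n_poisons = round(ratio * len(clean_set))
    chosen = rng.sample(clean_set, n_poisons)
    return [(x, 1 - y) for x, y in chosen]

rng = random.Random(0)

# Toy binary dataset: one feature per sample, label given by the parity of the index.
clean = [([float(i)], i % 2) for i in range(100)]

# Blend 10% of poisons into the clean training set.
poisons = label_flip_poisons(clean, ratio=0.10, rng=rng)
poisoned_training_set = clean + poisons
```

Each poison shares its input with a clean sample but carries the opposite label, which is the feature collision mechanism in its simplest form.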

Figure 4 illustrates the RAD of the LR models constructed using the above methodology. We display (1) the results from the random LF attacks in the upper row, and (2) the results from the SOTA attacks in the bottom row. For our analysis with DP-Adam, we choose the clipping norm from $\{8.0,4.0,2.0,1.0,0.1\}$ and the noise multiplier from $\{10^{-1},10^{-2},10^{-3},10^{-5},10^{-6},10^{-7}\}$. We use these clipping norms since the median value of parameter updates observed during the training of a model lies in that range (as recommended by Abadi et al. ). Once we identify which clipping norm provides the best resilience, we examine noise multipliers that do not cause an accuracy drop of more than 10% relative to the model trained with only the clipping norm. The batch size and the learning rate are fixed to $300$ and $0.01$ respectively, and we train a model over $40$ epochs. For each model, we increase the intensity of our attacks by blending 0, 1, 2, 3, 4, 5, 10, 20, 30, and 40% of poisons like . For the vanilla-models, we observe that the RAD caused by the SOTA attack ( $\sim$ 0.19) is significantly higher than that caused by the random LF attack ( $\sim$ 0.03). We also observe that when the attackers blend more poisons, the trained model generally suffers from a larger RAD.

Impact of the Clipping Norm: We first examined whether using only the clipping norm can reduce the success of an indiscriminate attack, as using the noise multiplier causes the utility loss of a model. We found that setting the clipping norm to a particular value in [2.0, 8.0] can reduce RAD caused by random LF attacks by $2\times$ . Figure 4(a) shows the RAD of DP-models trained on the data containing different numbers of poisons from the random LF attacks. The DP-models have smaller RAD than vanilla-models. In particular, we achieve the lowest RAD when the clipping norm is 4.0—the DP-model trained with 40% of poisons has 0.011 (RAD) whereas the vanilla-model shows 0.028 (RAD). We also examine the clipping norms in $\{1.0,0.1\}$ ; however, they could not achieve a RAD smaller than 0.011.

In contrast, we observed that using the clipping norm cannot reduce the RAD of a model caused by the SOTA attacks. Figure 4(d) shows the DP-models trained with the clipping norm in [4.0, 8.0] could not lead to a smaller RAD in all the attacks. Also, when we use the smaller clipping norm 2.0, the RAD of the DP-models becomes worse than that of the vanilla-models. Training with 40% of poisons, the RAD of the DP-model is 0.217 whereas that of the vanilla-model is 0.178. To understand why DP optimizers are ineffective in the SOTA attacks, we conduct an extensive analysis in §7.

Impact of the Noise Multiplier: Here, we evaluate whether combining the noise multiplier with a particular choice of clipping norm can reduce the RAD caused by the attacks further than using either parameter alone. We set the clipping norm to 4.0, the best setting found in our analysis, and vary the noise multiplier over $\{10^{-1},10^{-3},10^{-5},10^{-7}\}$ . Figure 4(b) and 4(e) show that a defender could not benefit from combining the noise multiplier with a specific clipping norm. In the random LF attacks, the DP-models show a larger RAD when we combine the noise multipliers with the clipping norm 4.0 than the models trained without the noise. In the SOTA attacks, using the noise multiplier could not provide any benefit for the defender as the RAD of the DP-models is similar to that of the vanilla-models. We revisit this in §7.

Impact on the Gradients: In §4.1, we identified that indiscriminate poisoning attacks induce contrasting magnitudes and orientations between the gradients from poisons and clean samples during training. Hence, for the attack cases where using the clipping norm is an effective defense, the magnitude and orientation differences in training have to be reduced. We found that this is the case. In Figure 4(c) and 4(f), we compare the magnitude differences observed in the training of the vanilla- and DP-model with 40% of poisons from both the random LF and SOTA attacks. We set the clipping norm to 4.0. For the random LF attack, we observe that the magnitude ratio decreases when the clipping norm is used; on average, the ratio is 2.527 in the vanilla-model and 2.221 in the DP-model. This implies that the magnitude of poison gradients is smaller in the DP-model than in the vanilla-model; thus, during training of a model with DP optimizers, the influence of poisons on the model is less than in vanilla training. However, in the SOTA attack, the ratio becomes higher in the DP-model (3.645) than in the vanilla-model (2.497), which is consistent with our observation that the clipping norm does not reduce the RAD caused by this attack.
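To make the two gradient statistics concrete, here is a small sketch. It assumes that the magnitude ratio is the L2 norm of a poison gradient divided by that of a clean gradient and that the orientation difference is measured by cosine similarity; the precise definitions are given in an earlier section, so these, the helper names, and the toy vectors are illustrative assumptions only.

```python
import math
from typing import List, Sequence


def l2_norm(grad: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in grad))


def magnitude_ratio(poison_grad: Sequence[float],
                    clean_grad: Sequence[float]) -> float:
    """Assumed definition: ||g_poison||_2 / ||g_clean||_2."""
    return l2_norm(poison_grad) / l2_norm(clean_grad)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed orientation measure: cosine of the angle between gradients."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def clip_by_norm(grad: Sequence[float], clip_norm: float) -> List[float]:
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = l2_norm(grad)
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [x * scale for x in grad]


# Toy gradients (made-up values, not measurements from the paper).
poison = [6.0, 8.0]   # L2 norm 10
clean = [3.0, 0.0]    # L2 norm 3
clip_norm = 4.0

ratio_before = magnitude_ratio(poison, clean)
ratio_after = magnitude_ratio(clip_by_norm(poison, clip_norm),
                              clip_by_norm(clean, clip_norm))
cos_before = cosine_similarity(poison, clean)
cos_after = cosine_similarity(clip_by_norm(poison, clip_norm),
                              clip_by_norm(clean, clip_norm))
```

In this toy case the ratio falls from about 3.33 to about 1.33, while the cosine similarity stays at 0.6: clipping rescales each per-sample gradient without rotating it, so any change in orientation during training has to come from the altered parameter trajectory or from the added noise.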

3 Mitigating Targeted Poisoning

Experimental Methodology: We evaluate the effectiveness of our defense against the realistic, worst-case targeted attacker formulated by . This attack considers a white-box adversary who has full knowledge of the target model and its parameters. By exploiting this internal information, the attacker becomes inconspicuous yet effective: the adversary crafts poisons that are perceptually indistinguishable to a human but can cause misclassification of targets with a small number of poisons, e.g., a single poison. Moreover, the attack does not modify the original label of poisons (clean-label). This is currently considered a worst-case attack because such inconspicuous poisons are difficult to filter out using the existing outlier-based defenses in §5.2. To maximize the influence of poisons on a model, the attacker blends them into the training set used for re-training of the model. We denote the case where the attacker uses a single poison as one-shot and the case where they use multiple poisons as multi-poison.

Figure 5 shows the success rate of the one-shot poisoning attacks and the RAD caused by training with DP-SGD in three different models (LR, MLP and CNN) trained using Purchase-100, FashionMNIST, and CIFAR-10 respectively. During our training of a model with DP-SGD, we use the clipping norm in $\{8.0,4.0,2.0,1.0,0.1\}$ and the noise multiplier in $\{0.001,0.01,0.1,0.4,0.8,1.0,2.0,4.0\}$ . The batch size is fixed to 100, and we use the learning rate 0.08 for the LR and MLP models and 0.02 for CNN models. We first train a model from scratch on the clean training set for 100 epochs and then re-train the same model for 50 epochs with the same training set containing poisons. Since using the noise multiplier decreases a model’s utility, we first examine the impact of the clipping norm by setting the noise multiplier to zero. Then, we fix the clipping norm to a specific value and repeat the same set of analyses while varying the noise multiplier.

Impact of the Clipping Norm: We observe that setting the clipping norm to a small value can decrease the success rate of the one-shot poisoning attack significantly. This result is consistent with our intuition in §3.2; since the attacker exploits feature collision locally, the magnitude difference between the poison and clean gradients becomes high, and the orientation difference oscillates during re-training. Thus, suppressing the parameter updates from the poison by setting the clipping norm can be an effective defense. In the LR models trained on Purchase-100, using the clipping norm 0.1 reduces the attack success rate from 46.58% to 9.33% with RAD $<$ 0.1. For the MLP models trained on FashionMNIST, we also observe that the attacker’s success rate decreases to 1.33% with a RAD of 0.04 when we use the clipping norm of 1.0. Moreover, in the CNN models trained on CIFAR-10, setting the clipping norm to 0.1 also reduces the success rate by more than $2\times$ , from 50.00% to 21.00%. However, we pay for this resilience with a utility loss of 0.48 in RAD.

Impact of the Noise Multiplier: Our previous analysis raises a question: can we achieve better resilience with the same RAD by combining the noise multiplier? To answer this question, we choose the clipping norm from our analysis results and examine different noise multipliers. The results are shown in the second row of Figure 5. We found that combining the noise multiplier with a specific value of the clipping norm reduces the attacker’s success rate further with RAD $<$ 0.1. In Purchase-100, we use the noise multiplier 0.01 with the clipping norm 4.0, and we decrease the success rate of the attacker to 8.97%. We achieve 0% attack success rate with the clipping norm 2.0 and the noise multiplier 0.8 in FashionMNIST. In CIFAR-10, using only the clipping norm 0.1, we can reduce the attacker’s success rate by $2\times$ , but we lose the model’s utility by 0.48 in RAD. However, when we use the noise multiplier 0.4, the attacker’s success rate drops to 21.21%, similar to the results by using only the clipping norm, but the utility loss is much smaller ( $0.15$ in RAD).

Impact on the Gradients in Training: Here, we analyze the impact of re-training with DP-SGD on the magnitude ratio and orientation difference. If DP-SGD is an effective defense against the one-shot attacks, the magnitude ratio should become smaller and the orientation difference should stabilize when we re-train a model with the optimizer. Figure 7 illustrates our analysis results. We compare the magnitude ratios between the re-training process with SGD and that with DP-SGD in the upper plot; the orientation differences are shown in the lower plot. For DP-SGD, we use the clipping norm 4.0 and the noise multiplier 0.1.

We found that re-training of a model with DP-SGD reduces the magnitude ratios between the poison and clean gradients and stabilizes the orientation differences. The magnitude ratios decrease from 7.46 to 1.14 on average over 50 epochs. Also, the standard deviation of the orientation differences seen in the successful attack is 0.073, whereas the value becomes 0.01 when we use DP-SGD during re-training.

4 Mitigating Multi-Poison Attacks

In this subsection, we extend our previous analysis by conducting the multi-poison attacks on the LR and MLP models trained on Purchase-100. Figure 6 illustrates our results: we show (1) the attacker’s success rate, (2) the number of poisons required for a successful attack on average, and (3) the RAD of a model caused by DP-SGD. We use the same set of clipping norms and noise multipliers as in §6.3.

Impact of the Clipping Norm: We found that setting the clipping norm to a small value increases the number of poisons required for a successful attack with RAD $<$ 0.05, but this could not reduce the success rate of the attack. First, in Figure 6(a), we can see the attacker consistently achieves a success rate over 94.67% for all the clipping norms. However, we found that the number of required poisons on average increases from 1.08 up to 5.43 as we decrease the clipping norm from 8.0 to 0.1, with a small amount of utility loss (0.011 in RAD). We also examine the clipping norms smaller than 0.1—i.e., 0.4 and 0.01; nevertheless, they do not provide more benefits to a defender. The number of poisons required for an attack saturates, and the utility loss starts to increase.

Impact of the Noise Multiplier: Can we reduce the success rate of the attacker by combining the noise multiplier with a specific value of the clipping norm? To answer this question, we repeat the same experiments with the fixed clipping norm (4.0) and vary the noise multipliers. Figure 6(b) and 6(c) illustrate our results in LR and MLP models trained on Purchase-100. We found that using the noise multiplier with the clipping norm helps to reduce the attack success rate and to increase the number of poisons required for a successful attack, but this comes with a significant utility loss. In the LR models with the noise multiplier 0.4, the attacker’s success rate becomes 7.14%, and the number of required poisons is 27.50. In the MLP models, when we use the noise multiplier 1.0, the attacker’s success rate drops to 0%. However, in these two cases, the utility losses of the LR and MLP models are 0.748 and 0.612 in RAD respectively.

Impact on the Gradients in Training: DP-SGD works as a defense by clipping the norm of an individual gradient and adding Gaussian noise to it (see Algorithm 1). Hence, the multi-poison attacker can neutralize the impact of the clipping norm by using multiple poisons. In contrast to the one-shot attack, where each batch includes at most one poison, the multi-poison attack forces each batch to contain more than one poison. Even if each poison gradient is bounded by a small clipping norm, the total influence of the poison gradients on the model parameter updates during training can be sufficient to cause a successful attack.

To decrease the success rate of the multi-poison attacker, we need to set the noise multiplier. The noise added to each gradient prevents the gradients computed from multiple poisons from orienting towards a similar direction. Thus, their sum in re-training is insufficient to cause the misclassification. It is also true that the attacker can neutralize the noise by blending multiple poisons, since the expected sum of the noise added to each gradient is zero; however, due to the randomness, the number of poisons required to average out the noise is much larger than what the attacker needs to evade the clipping norm.
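A back-of-the-envelope version of this argument can be written as follows. It is only a sketch under assumptions that go beyond the text: $k$ poison gradients $g_1,\dots,g_k\in\mathbb{R}^d$ that are each clipped to norm at most $C$, and independent per-gradient noise $\xi_i\sim\mathcal{N}(0,\sigma^2C^2I_d)$, as in the standard DP-SGD formulation.

$$\Big\|\sum_{i=1}^{k} g_i\Big\|_2 \le kC, \qquad \sum_{i=1}^{k}\xi_i \sim \mathcal{N}\big(0,\,k\sigma^2C^2 I_d\big), \qquad \sqrt{\mathbb{E}\Big\|\sum_{i=1}^{k}\xi_i\Big\|_2^2} = \sigma C\sqrt{kd}.$$

Under these assumptions, the best-case poison signal grows linearly in $k$ while the accumulated noise grows as $\sqrt{k}$, so their ratio is at most $\sqrt{k}/(\sigma\sqrt{d})$. Clipping alone is overcome as soon as $kC$ reaches the update magnitude the attack needs, whereas matching the noise norm takes $k$ on the order of $\sigma^2 d$ perfectly aligned poisons, which grows with the noise multiplier and the model dimension.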

5 Distinct Defense Scenarios

In this section, we consider distinct defense scenarios, such as transfer learning, where a defender can only use DP-SGD in a specific stage of the training process.

Re-training a Vanilla-Model with DP-SGD: This happens when a defender cannot train a model from scratch. For example, it is difficult to train a large model from scratch with DP-SGD as the training takes a week on a super-computer cluster. Considering such a scenario, we evaluate whether re-training of a vanilla-model with DP-SGD on the poisoned training set can be resilient against the targeted attacks.

We take the vanilla LR model trained on Purchase-100 and re-train the model using two methods: (1) we continue to train with Adam or (2) use DP-Adam with the clipping norm 4.0 and the noise multiplier 0.01. During re-training, we perform both the one-shot and multi-poison attacks. Table 3 shows the attacker’s success rate and the number of required poisons for a successful attack. We found that a defender can make the re-training of a vanilla model resilient against targeted poisoning by using DP optimizers. In the one-shot attack, the success rate decreased from 46.58% to 17.81%. In the multi-poison attack, the attacker’s success rate decreased from 100.00% to 61.64%, and the number of required poisons increased from 1.79 to 2.47, which shows better resilience than the cases in §6.4.

Re-training a DP-Model with SGD: We now direct our attention to the scenario where we re-train a DP-model with a vanilla optimizer on the poisoned training data. Here, we take the DP-LR model trained on Purchase-100 using the same clipping norm and noise multiplier and re-train the model using Adam. During re-training, we perform the one-shot and multi-poison attacks. Our results are in Table 3. We observe that when a model trained with DP is re-trained with a vanilla optimizer, the model becomes resilient against targeted poisoning even if we do not use our defense during re-training. The one-shot attacker’s success rate on the model re-trained with Adam is 16.67% whereas the same attacks are 46.58% successful when we re-train a vanilla-model with Adam. In the multi-poison cases, the success rate of the attack does not decrease, but the number of required poisons is 3.47 on average, which is higher than the vanilla-model case (1.79). This result implies that when a defender distributes a teacher model for transfer learning, it is safer to train the model with a DP optimizer. To understand this resilience, we analyze the decision boundary of a DP-model and include the discussion in Appendix D.

Discussion

In this section, we discuss the scenarios where DP optimizers become ineffective. Our previous analysis identified that: (1) training an LR model with DP-Adam cannot mitigate the attacks formulated by Steinhardt et al. (§6.2), and (2) setting the noise multipliers to a high value accompanies a significant utility loss of a trained model (§6.4). We conduct analysis of those cases to understand the limits of DP optimizers when we use them as a mechanism for realizing gradient shaping and discuss potential improvements to address them.

We compare the distribution of poison samples to understand why we cannot defeat the SOTA attack. We found that the SOTA attack uses unrealistic poisons that can exploit the weakness of linear models. Figure 8 shows the distribution differences between the clean and poison samples from the random LF and SOTA attacks. With the 10,800 clean samples and 4,320 (40%) poisons from each attack, we perform principal component analysis (PCA) to reduce the dimension and then use KMeans on the 2-dimensional data for clustering. In the figure, we observe that the poisons from the SOTA attack are unrealistic; they consist of 3,970 copies of the same coat sample (a single point on the left) and 350 dress samples of two types (two points on the right). When we train an LR model on this data, the model first fits its decision boundary so that it splits the poisons well and then adjusts the boundary to classify clean samples (see Appendix E for details). In contrast, the 4,320 poisons from the random LF attack have a distribution similar to that of the clean samples; thus, the model trained on the data can learn its boundary from the majority of clean samples and become resilient to the poisons.

2 Case Study: Multi-Poison Attack

Related Work

Gradient Regularization as a Defense: Prior work, especially in the context of neural networks, has proposed imposing regularization penalties on a model’s gradients to improve the accuracy , interpretability or robustness against adversarial examples . In contrast to our threat model, by explicitly penalizing the input gradients, these mechanisms aim to regulate the test-time predictions, while assuming a clean training set. We propose gradient shaping against training-time attacks to suppress the adverse gradient signatures poisons produce during training. Further, gradient penalties rely on computationally restrictive “double backpropagation”, whereas we implement gradient shaping with a more efficient DP-SGD mechanism.

Model Poisoning Attacks against Machine Learning: In the context of distributed learning scenarios, such as federated learning, recent work has proposed model poisoning attacks that directly manipulate the parameter updates (gradients) end-hosts send to the shared model . These attacks have been considered effective when the attacker and the victim interactively train a model. Model poisoning is out-of-scope for our work because in the poisoning attacks we consider, the adversary manipulates the training set, not the gradients.

Privacy Attacks in Machine Learning: Due to the high capacity of machine learning models, especially neural networks, models trained on private data, such as health-care or face datasets, may leak sensitive information about their training sets. Prior work has demonstrated attackers who aim to extract sensitive information from trained models. DP-SGD was developed as a tool to protect the model from these attacks by clipping and adding noise to the gradients during training. We use DP-SGD as an exemplar tool to demonstrate the feasibility of gradient shaping as a defense against data poisoning attacks.

Conclusions

This work tackles data poisoning in machine learning with a unifying view of the threat landscape. We focus on a common element of all poisoning attacks: they manipulate gradients computed during training to update models. We identified two main artifacts shared by various forms of poisoning: (1) gradients computed on poisoned data have significantly higher magnitudes than their counterparts on clean data, and (2) their orientations also differ. Building on this analysis, we next introduced gradient shaping, the prerequisite for an attack-agnostic defense to poisoning, which bounds gradient magnitudes and minimizes angular differences. Gradient shaping allows us to move towards a generic defense, in contrast to prior defenses that exploit attack-specific properties or rely on the identification of points that were poisoned. To study the feasibility of gradient shaping, we consider DP-SGD, a natural candidate algorithm for training with gradient shaping because it clips and perturbs the gradients to provide privacy guarantees. Our experiments with DP-SGD show that it reduces the model’s accuracy drop in the presence of indiscriminate attacks, mitigates one-shot targeted attacks, and increases the adversary’s cost in multi-poison targeted attacks. We also observed that DP-SGD becomes ineffective against a strong, yet unrealistic, indiscriminate attack. This highlights that designing an effective gradient shaping mechanism is a promising direction towards an ideal poisoning defense.

Availability

Our code is available under an open-source license from: https://github.com/Sanghyun-Hong/Gradient-Shaping.

Acknowledgments

We thank Dana Dachman-Soled, Furong Huang, Matthew Jagielski, Yuzhe Ma, and Michael Davinroy for their constructive feedback. We acknowledge Tom Goldstein, Ronny W. Huang, and the University of Maryland super-computing resources (DeepThought2, http://hpcc.umd.edu) made available for conducting the experiments reported in our paper. This research was partially supported by the Department of Defense and Canadian Institute for Advanced Research (CIFAR).

References

Appendix A Intuition Behind Our Gradient-Analysis

Discussion Related to §5.2: There is a limitation in characterizing the goodness of a set of parameters computed by gradient descent solely based on the value of the training loss. Because the training objective for a neural network is non-convex, there exist multiple local minima with associated losses taking values comparable to the global minimum, denoted as the blue pentagon in Figure 9. In the benign setting, all of these local minima correspond to models with comparable performance when it comes to predicting on test data. The ML community identified that finding one of the local minima is sufficient even if it is not the global minimum . One of these minima is the green point obtained by gradient descent on legitimate data (the green trajectory in Figure 9). When poison is inserted in the training data, the adversary forces training to follow an alternative descent that may achieve either (1) a loss indicated by the orange dot that is similar to the green dot in the case of targeted poisoning attacks, or (2) a higher loss value for the red dot when the attack is indiscriminate. This distinction makes it difficult to characterize poisoning solely based on the loss achieved upon completion of training, in particular when the attack is targeted. Instead, one should capture the trajectory taken by gradient descent in the presence of poisoned data. Thus, we focus on the norm and orientation of gradients.

Appendix B Differentially Private (DP) SGD

Details of DP-SGD Discussed in §5.1: To train a model with a provable privacy guarantee, we commonly use DP-SGD , a simple modification to the popular training mechanism, mini-batch SGD. In Algorithm 1, we highlighted the modifications in blue. For each sample in a mini-batch, DP-SGD first clips the gradient computed from the sample so that its norm is at most the predefined value $C$ (line 5) and then adds Gaussian noise whose scale is proportional to $C$ (line 6). DP-SGD also provides an accounting mechanism that measures the total privacy budget spent up to a certain iteration, which makes it possible to estimate the worst-case privacy leakage $\varepsilon$ of a model. In practice, an ML expert controls the clipping norm $C$ and the noise multiplier $\sigma$ , which sets the scale of the distribution from which the noise is drawn, to train a model that achieves a reasonable accuracy and privacy guarantee. The training procedure stops when the total privacy expenditure exceeds a privacy leakage of $\varepsilon$ . In our work, we do not utilize DP-SGD’s privacy accounting mechanism; we use the algorithm as a tool to realize gradient shaping.
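The clip-then-perturb update described above can be sketched in plain Python. This is an illustration rather than Algorithm 1 itself: the function names are hypothetical, the noise is drawn with standard deviation $\sigma C$ (an assumption following the standard DP-SGD formulation), it is added to each clipped gradient as the description above suggests (common implementations instead add it once to the clipped sum), and no privacy accounting is performed.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], clip_norm: float) -> List[float]:
    """Scale grad down so that its L2 norm is at most clip_norm (C)."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = 1.0 / max(1.0, norm / clip_norm)
    return [x * scale for x in grad]


def dp_sgd_step(params: Sequence[float],
                per_sample_grads: Sequence[Sequence[float]],
                lr: float,
                clip_norm: float,
                noise_multiplier: float,
                rng: random.Random) -> List[float]:
    """One mini-batch update with per-sample clipping and Gaussian noise."""
    dim = len(params)
    total = [0.0] * dim
    for grad in per_sample_grads:
        clipped = clip_gradient(grad, clip_norm)
        for j in range(dim):
            # Assumed noise scale: sigma * C; skipped when sigma is zero.
            noise = (rng.gauss(0.0, noise_multiplier * clip_norm)
                     if noise_multiplier > 0.0 else 0.0)
            total[j] += clipped[j] + noise
    batch_size = len(per_sample_grads)
    # Average the shaped gradients and take a plain SGD step.
    return [p - lr * t / batch_size for p, t in zip(params, total)]
```

Setting `noise_multiplier` to zero isolates the effect of the clipping norm, which mirrors how the clipping-norm experiments in the main text are run before noise is introduced.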

Appendix C Neural Network Architectures

Information Relevant to §6.1: We show our two baseline networks (an MLP and a CNN) in Table 4. The description in parentheses indicates the activation function used (R: ReLU, and ’-’: None). The output dimension of each network is equal to the number of classes: 10 for FashionMNIST and CIFAR-10, and 100 for Purchase-100.

Appendix D Why is the DP-Model Resilient?

Discussion Related to §6.5: To understand why the DP-model can reduce the success rate of the targeted attacks even if the model is re-trained with SGD, we conduct a decision boundary analysis of two models: one trained without DP-SGD and one trained with it. For this experiment, we utilize the 2-dimensional, two-moons dataset that consists of 700 training and 300 testing samples. We trained two MLP models. For the DP-model, we set both the clipping norm and noise multiplier to 1.0; the models achieve accuracies of 0.98 and 0.97 respectively.

Figure 10 illustrates the decision boundaries of the vanilla- and DP-model. The leftmost figure shows the distribution of the two-moons dataset with the colors red and blue corresponding to each class, and in the other two figures, we display the boundary in the figure’s background. The contour colors indicate the decision confidence: the darker the color, the higher the confidence. The white area in between is where the decision boundary lies.

Here, we found that, when a model is trained with DP-SGD in a way that the model achieves the best possible accuracy, the model learns a complex decision boundary, i.e., it overfits to the training data. We also observe that, in the DP-model, the white area where a model is uncertain about its decisions becomes narrower. This means the overall confidence of the model’s decisions increases; thus, the amount of parameter updates (the sum of the gradients required to misclassify targets) also increases. As a consequence, in the one-shot attack, the success rate of the attacker decreases, and the multi-poison attack has to use more poisons.

Appendix E Analysis of Training-time Accuracy in the Indiscriminate Poisoning Attacks

Discussion Related to §7.1: The training-time accuracy observed during training is an indicator that shows whether the model learns its decision boundary based on a specific set of samples. Hence, we monitor the accuracy of a model over the clean data and poisons during training in the random LF and the SOTA attacks formulated by . Figure 11 illustrates the training-time accuracy monitored in both the attacks. In the random LF attack (upper), the accuracy of a model over clean samples is more than 80% over 40 epochs whereas the accuracy over the poisons is below 30%. We can see, in this case, the model achieves 80% accuracy—the same as the accuracy over clean data—over the testing set. On the other hand, the training-time accuracy of a model over the poisons formulated by the SOTA attack is over 90% whereas the accuracy over the clean samples is below 80% over 40 epochs. This means the LR model trained in the SOTA attack formulates the decision boundary based on the poisons and cannot be modified easily during training. From our analysis in §7.1, we know the poisons consist of a single image of the class coat and two images of the class dress; thus, the model uses these poisons as pivots for the linear decision boundary and, during training, it is changed by the clean samples marginally.

Appendix F Trade-offs Between Model’s Utility and Privacy Leakage

Analysis of the Utility Loss When We Use DP-SGD (§6): Here, we discuss the trade-offs between the model’s utility and privacy leakage ( $\varepsilon$ ) in Figure 12. DP-SGD/-Adam is designed to control the leakage of a model by adding noise to gradients, which inherently degrades the performance of the resulting model. Observe that one can minimize the noise added to each gradient (and thus improve the utility of the learned model) by minimizing the clipping norm. However, setting a very small value for the clipping norm destroys important information carried in the gradients. Conversely, one can choose to retain this information by choosing a large value for the clipping norm. This, in turn, translates to a large amount of noise added to the gradients, degrading utility.