Data Poisoning against Differentially-Private Learners: Attacks and Defenses

Yuzhe Ma, Xiaojin Zhu, Justin Hsu

Introduction

As machine learning is increasingly used for consequential decisions in the real world, the security concerns have received more and more scrutiny. Most machine learning systems were originally designed without much thought to security, but researchers have since identified several kinds of attacks under the umbrella of adversarial machine learning. The most well-known example is adversarial examples , where inputs are crafted to fool classifiers. More subtle methods aim to recover training examples , or even extract the model itself .

In data poisoning attacks , the adversary corrupts examples at training time to manipulate the learned model. As is generally the case in adversarial machine learning, the rise of data poisoning attacks has outpaced the development of robust defenses. While attacks are often evaluated against specific target learning procedures, effective defenses should provide protection—ideally guaranteed—against a broad class of attacks. In this paper, we study differential privacy as a general defensive technique against data poisoning. While differential privacy was originally designed to formalize privacy—the output should not depend too much on any single individual’s private data—it also provides a defense against data poisoning. Concretely, we establish quantitative bounds on how much an adversary can change the distribution over learned models by manipulating a fixed number of training items. We are not the first to design defenses to data poisoning , but our proposal has several notable strengths compared to previous work. First, our defense provides provable protection against a worst-case adversary. Second, our defense is general. Differentially-private learning procedures are known for a wide variety of tasks, and new algorithms are continually being proposed. These learners are all resilient against data poisoning.

We complement our theoretical bounds with an empirical evaluation of data poisoning attacks on differentially private learners. Concretely, we design an attack based on stochastic gradient descent to search for effective training examples to corrupt against specific learning algorithms. We evaluate attack on two private learners: objective perturbation and output perturbation . Our evaluation confirms the theoretical prediction that private learners are vulnerable to data poisoning attacks when the adversary can poison sufficiently many examples. A gap remains between performance of attack and theoretical limit of data poisoning attacks—it seems like differential privacy provides a stronger defense than predicted by theory. We discuss possible causes and leave further investigation to future work.

Preliminaries

where the probability is taken over $b\sim\nu$ . When $\delta=0$ , we say that $\mathcal{M}$ is $\epsilon$ -differentially private.

Many standard machine learning tasks can be phrased as optimization problems modeling empirical risk minimization (ERM). Broadly speaking, there are two families of techniques in the privacy literature for solving these problems. Objective perturbation injects noise into the objective function of a vanilla machine learner to train a randomized model . In contrast, output perturbation runs the vanilla learner but after training, it randomly perturbs the output model . We will consider data poisoning attacks against both kinds of learners.

We first fix the adversarial attack setting.

Knowledge of the attacker: The attacker has full knowledge of the full training set $D$ and the differentially-private machine learner $\mathcal{M}$ , including the noise distribution $\nu$ . However, the attacker does not know the realization of $b$ .

This formulation differs from previous works (e.g. ) because the differentially-private learner produces a stochastic model, so the objective takes the expected value of $C$ . We now show a few attack problem instances formulated as (2).

Differential Privacy Resists Data Poisoning

Our first result is for $\epsilon$ -differentially private learners: if an attacker can manipulate at most $k$ items in the clean data, then the attack cost is lower bounded.

Since $J(D)$ is non-negative, by the integral identity, we have

Let $\mathcal{M}$ be an $\epsilon$ -differentially-private learner. Assume $J(D)\neq 0$ . Let $\tau\geq 1$ . Then the attacker has to modify at least $k\geq\lceil\frac{1}{\epsilon}\log\tau\rceil$ items in $D$ in order to achieve

To generalize Theorem 1 to $(\epsilon,\delta)$ -private learners, we need an additional assumption that $C$ is bounded.

Adding $\frac{\bar{C}\delta}{e^{\epsilon}-1}$ to both sides, we have

We then recursively apply (11) $k$ times to obtain

Next, we consider $-\bar{C}\leq C\leq 0$ . Define $J_{i}$ as before, then by Lemma 5

Subtracting both sides by $\frac{\bar{C}\delta}{e^{\epsilon}-1}$ we have

Combined with the trivial lower bound $-\bar{C}$ , we have

Note that $J(D)$ is non-negative, thus by the integral identity, we have

Next we consider $-\bar{C}\leq C\leq 0$ . Define $\Theta(a)=\{\theta:C(\theta)<a\}$ , $\forall a\leq 0$ Again, due to $\mathcal{M}$ being differentially private, we have

As a corollary, we can lower-bound the minimum number of modifications needed to sufficiently reduce attack cost.

Let $\mathcal{M}$ be an $(\epsilon,\delta)$ -differentially-private learner. Assume $J(D)\neq 0$ . Then

Data Poisoning Attacks on Private Learners

The results in the previous section provide a theoretical bound on how effective a data poisoning attack can be against a private learner. To evaluate how strong the protection is in practice, we propose a range of attacks targeting general differentially-private learners. Our adversary will modify (not add or delete) any continuous features (for classification or regression) and the continuous labels (for regression) on at most $k$ items in the training set. Restating the attack problem (2):

This is a combinatorial problem—the attacker must pick $k$ items to modify. We propose a two-step attack procedure.

use heuristic methods to select $k$ items to poison.

In step II), performing stochastic gradient descent could lead to poisoned data taking arbitrary features or labels. However, differentially-private learners typically assume training examples are bounded. To avoid trivially-detectable poisoned examples, we project poisoned features and labels back to a bounded space after each iteration of SGD. We now detail step II), assuming that the attacker has fixed $k$ items to poison. We’ll return to step I) later.

Under certain conditions (see appendix for the details), one can exchange the order of integration and differentiation, and we have

We first consider objective-perturbed private learners:

1.2 Instantiating Attack on Output Perturbation

We now consider the output perturbation mechanism:

The derivations of stochastic gradient are similar to objective perturbation, thus we skip the details. Again, we instantiate on two examples where the base learner is logistic regression and ridge regression, respectively.

The output-perturbed logistic regression takes the following form:

The output-perturbed ridge regression takes the following form

where $\Theta=\{\theta:\|\theta\|_{2}\leq\rho\}$ . The KKT condition of (51) is

2 SGD on Surrogate Victims (SV)

Since the objective is deterministic, we can work with its gradient rather than its stochastic gradient, plugging in $b=\mathbf{0}$ into all derivations in section 4.1. Note that the attack found by (55) will still be evaluated w.r.t. differentially-private learners in our experiments.

3 Selecting Items to Poison

Now, we return to step I) of our attack: how can we select the $k$ items for poisoning? We give two heuristic methods: shallow and deep selection.

In summary, we have four attack algorithms based on combinations of methods used in step I) and step II): shallow-SV, shallow-DPV, deep-SV and deep-DPV. In the following, we evaluate the performance of these four algorithms.

Experiments

We now evaluate our attack with experiments, taking objective-perturbed learners and output-perturbed learners as our victims. Through the experiment, we fix a constant step size $\eta=1$ for (stochastic) gradient descent. After each iteration of SGD, we project poisoned items to ensure feature norm is at most 1, and label stays in $ $. To perform shallow selection when targeting DPV, we draw$ 10^{3} $samples of parameter$ b $to evaluate the gradient of each clean item. To perform deep selection, we use$ \alpha=10^{-4}$ to solve the corresponding relaxed attack problem.

2 Attacks Can Achieve Different Goals

We now study a 2D example to show that our attack is effective for different attack goals. The victim is the same as in section 5.1. The clean training set in Figure 1(1(b)) contains $n=317$ items uniformly sampled in a unit sphere with labels $y_{i}=\mathds{1}\left[x_{i}^{\top}\theta^{*}\geq 0\right]$ where $\theta^{*}=(1,1)$ . We now describe the attack settings for different attack goals.

We run deep-DPV with $k=n$ for $T=5\times 10^{3}$ iterations. In Figure 2(2(a))-(2(c)), we show the poisoning trajectories for the three attack goals respectively. Each translucent point is an original training point, and the connected solid point is its final position after attack. The curve connecting them shows the trajectory as the attack algorithm gradually poisons the data.

In label-aversion attack the attacker aims to maximize the logistic loss on the evaluation set. It ends up moving positive (negative) items to the left (right), so that the poisoned data deviates from the evaluation set.

In contrast, the label-targeting attack tries to minimize the logistic loss, thus items are moved to produce a poisoned data aligned with the evaluation set. However, the attack does not reproduce exactly the evaluation set. To understand it, we compute the model learnt by vanilla logistic regression on the evaluation set and the poisoned data, which are (2.60,0) and (2.94,0.01). Note that both models can predict the right label on the evaluation set, but the latter has larger norm, which leads to smaller attack cost, thus our result is a better attack.

3 Attacks Become More Effective as k𝑘k Increases

4 Attacks Effective on Both Privacy Mechanisms

We now show experiments on real data to illustrate that our attacks are effective on both objective and output perturbation mechanisms. We focus on label-targeting attack in this section.

The second data set is red wine quality from UCI. The data set contains 11 features of 1598 wine samples, and the task is predict the wine quality, a number between 0 and 10. We normalize the features so that the norm of any item is at most 1. Labels are also normalized to ensure the value is in $ $. The victim is private ridge regression with$ \epsilon=1 $,$ \lambda=10 $, and bound on model space$ \rho=2 $. To generate the evaluation set, we pick one item$ x_{i}^{*} $with the smallest quality value, and then set the target label to be$ y_{i}^{*}=1 $. The cost function is defined as$ C(\theta)=\frac{1}{2}({x_{i}^{*}}^{\top}\theta-y_{i}^{*})^{2} $. The other parameters of attack remain the same as in section 5.3. This attack aims to force the learner to predict a low-quality wine as having a high quality. Figure 5 shows the results on objective perturbation and output perturbation respectively. Note that the attack cost can be effectively reduced even if the attacker only poisons$ 100/1598\approx 6.3\%$ of the whole data set.

Given experimental results above, we make another two observations here. First, deep-DPV is in general the most effective method among four attack algorithms, see e.g. Figure 3(3(a)), (3(c)). Second, although deep-DPV is effective, there remains a gap between the attack performance and the theoretical lower bound, as is shown in all previous plots. One potential reason is that the lower bound is derived purely based on the differential privacy property, thus does not take into account the learning procedure of the victim. Analysis oriented toward specific victim learners may result in a tighter lower bound. Another potential reason is that our attack might not be effective enough. How to close the gap is left as future work.

5 Attack Is Easier with Weaker Privacy

Conclusion and Future Work

We showed that differentially private learners are provably resistant to data poisoning attacks, with the protection degrading exponentially as the attacker poisons more items. Then, we proposed attacks that can effectively poison differentially-private learners, and demonstrated the attack performance on a variety of privacy mechanisms and learners with both synthetic and real data. While the attacks are effective, there remains a gap between the theoretical lower bound and the empirical performance of our attacks. This could be because the lower bound is loose, or our attack is not effective enough; closing the gap remains future work. Taken together, our study is a first step towards understanding the strengths and weaknesses of differential privacy as a defensive measure against data poisoning.

We thank Steve Wright for helpful discussions. This work is supported in part by NSF 1545481, 1561512, 1623605, 1704117, 1836978, the MADLab AF Center of Excellence FA9550-18-1-0166, a Facebook TAV grant, and the University of Wisconsin.

References

Appendix: Regularity Condition for Differentiation-Integration Exchange

The following theorem characterizes a sufficient condition when the order of differentiation and integration can be exchanged (restated from Theorem 2.4.3 in ), which is called the regularity condition.

Suppose $h(z,b)$ is a differentiable function of $z$ . If for each $z$ , and there exist a function $g(z,b)$ and a constant $d>0$ that satisfy

$\|\frac{\partial}{\partial z}h(z,b)\mid_{z=z^{\prime}}\|\leq g(z,b)$ , for all $b$ and all $z^{\prime}$ such that $\|z^{\prime}-z\|\leq d$ .

Attacking objective-perturbed logistic regression.

Attacking objective-perturbed ridge regression.

Attacking output-perturbed logistic/ridge regression.

The proof for output-perturbed logistic regression and ridge regression are similar to the objective-perturbed learners, thus we omit them here.

For other attack goals such as label-aversion attack and label-targeting attack, the regularity condition also holds and the proof is similar.