Label-Only Membership Inference Attacks

Christopher A. Choquette-Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot

Introduction

Machine learning algorithms are often trained on sensitive or private user information, e.g., medical records (Stanfill et al., 2010), conversations (Devlin et al., 2018), or financial information (Ngai et al., 2011). Trained models can inadvertently leak information about their training data (Shokri et al., 2016; Carlini et al., 2019)—violating users’ privacy.

In perhaps the simplest form of information leakage, membership inference (MI) (Shokri et al., 2016) attacks enable an adversary to determine whether or not a data point was used in the training data. Revealing just this information can cause harm—it leaks information about specific individuals instead of the entire population. Consider a model trained to learn the link between a cancer patient’s morphological data and their reaction to some drug. An adversary with a victim’s morphological data and query access to the trained model cannot directly infer if the victim has cancer. However, inferring that the victim’s data was part of the model’s training set reveals that the victim indeed has cancer.

Existing MI attacks exploit the higher confidence that models exhibit on their training data (Pyrgelis et al., 2017; Truex et al., 2018; Hayes et al., 2019; Salem et al., 2018). An adversary queries the model on a candidate data point to obtain the model’s confidence and infers the candidate’s membership in the training set based on a decision rule. The difference in prediction confidence is largely attributed to overfitting (Shokri et al., 2016; Yeom et al., 2018).

A large body of work has been devoted to understanding and mitigating MI leakage in ML models. Existing defense strategies fall into two broad categories and either

(1) reduce overfitting (Truex et al., 2018; Shokri et al., 2016; Salem et al., 2018); or,

(2) perturb a model’s predictions so as to minimize the success of known membership attacks (Nasr et al., 2018a; Jia et al., 2019; Yang et al., 2020).

Defenses in (1) use regularization techniques or increase the amount of training data to reduce overfitting. In contrast, the adversary-aware defenses of (2) explicitly aim to minimize the MI advantage of a particular attack. They do so either by modifying the training procedure (e.g., an additional loss penalty) or the inference procedure after training. These defenses implicitly or explicitly rely on a strategy that we call confidence-masking (similar to gradient masking from the adversarial examples literature (Papernot et al., 2017)), where the MI signal in the model’s confidence scores is masked to thwart existing attacks.

We introduce label-only MI attacks. Our attacks are more general: an adversary need only obtain (hard) labels—without prediction confidences—of the trained model. This threat model is more realistic, as ML models deployed in user-facing products need not expose raw confidence scores. Thus, our attacks can be mounted on any ML classifier.

In the label-only setting, a naive baseline predicts misclassified points as non-members. Our focus is surpassing this baseline. To this end, we will have to make multiple queries to the target model. We show how to extract fine-grained MI signal by analyzing a model’s robustness to perturbations of the target data, which reveals signatures of its decision boundary geometry. Our adversary queries the model for predicted labels on augmentations of data points (e.g., translations in vision domains) as well as adversarial examples.

In § 5.1, we introduce the first label-only attacks, which match confidence-vector attacks. By combining them, we outperform all others. In § 5.2, 5.3 and 5.4, we show that confidence masking is not a viable defense to privacy leakage, by breaking two canonical defenses that use it: MemGuard and Adversarial Regularization. In § 6, we evaluate two additional techniques for reducing overfitting: data augmentation and transfer learning. We find that data augmentation can worsen MI leakage while transfer learning can mitigate it. In § 7, we introduce “outlier MI”: a stronger property that defenses should satisfy to protect MI of worst-case inputs; at present, differentially private training and (strong) L2 regularization appear to be the only effective defenses. Our code is available at https://github.com/cchoquette/membership-inference.

Background and Related Works

Membership inference attacks (Shokri et al., 2016) are a form of privacy leakage that identify if a given data sample was in a machine learning model’s training dataset. Given a sample $x$ and access to a trained model $h$, the adversary uses a classifier or decision rule $f_{h}$ to compute a membership prediction $f(x;h)\in\{0,1\}$, with the goal that $f(x;h)=1$ whenever $x$ is a training point. The main challenge in mounting an MI attack is creating the attack classifier $f$, under various assumptions about the adversary’s knowledge of $h$ and its training data distribution. Most prior work assumes that an adversary has only black-box access to the trained model $h$, via a query interface that on input $x$ returns part or all of the confidence vector $h(x)\in[0,1]^{C}$ (for a classification task with $C$ classes).

The attack classifier $f$ is often trained on a local shadow (or, source) model $\hat{h}_{i}$, which is trained on the same (or a similar) distribution as $h$’s training data. Because the adversary trained $\hat{h}_{i}$, they can assign membership labels to any input $x$, and use this dataset to train $f$. Salem et al. (2018) later showed that this strategy succeeds even when the adversary only has data from a different, but similar, task and that shadow models are unnecessary: a simple rule predicting $f(x;h)=1$ when the max prediction confidence, $\max_{i}h(x)_{i}$, is above a tuned threshold, suffices.
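The max-confidence rule described above can be sketched in a few lines. This is a minimal illustration rather than any prior work's implementation: the function names, the toy confidence vectors, and the choice to tune the threshold by maximizing accuracy on shadow-model outputs are all assumptions made for the example.

```python
import numpy as np

def max_confidence_attack(confidences, threshold):
    """Predict membership (1) when the top confidence exceeds the threshold."""
    confidences = np.asarray(confidences, dtype=float)
    return (confidences.max(axis=1) > threshold).astype(int)

def tune_threshold(shadow_confidences, shadow_membership):
    """Pick the threshold with the best attack accuracy on shadow-model outputs."""
    top = np.asarray(shadow_confidences, dtype=float).max(axis=1)
    labels = np.asarray(shadow_membership)
    candidates = np.unique(top)
    accuracies = [np.mean((top > t).astype(int) == labels) for t in candidates]
    return float(candidates[int(np.argmax(accuracies))])

# Toy shadow-model outputs (made up): members receive higher confidence.
shadow_conf = np.array([[0.99, 0.01], [0.95, 0.05], [0.60, 0.40], [0.55, 0.45]])
shadow_member = np.array([1, 1, 0, 0])

tau = tune_threshold(shadow_conf, shadow_member)
preds = max_confidence_attack([[0.97, 0.03], [0.52, 0.48]], tau)
```

The threshold is tuned on a model the adversary controls and then applied to the target's outputs, which is why the rule needs confidence scores and cannot be used as-is in the label-only setting.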

Yeom et al. (2018) investigate how querying related inputs $x^{\prime}$ to $x$ can improve MI. Song et al. (2019) explore how models explicitly trained to be robust to adversarial examples can become more vulnerable to MI (similar to our analysis of data augmentation in § 6). Both works are crucially different because they use a different attack methodology and assume access to the confidence scores. Sablayrolles et al. (2019) demonstrate that black-box attacks (like ours) can approximate white-box attacks by effectively estimating the model loss for a data point. Refer to Appendix § A for a detailed background, including on defenses.

Attack Model Design

Our proposed MI attacks improve on existing attacks by (1) combining multiple strategically perturbed samples (queries) as a fine-grained signal of the model’s decision boundary, and (2) operating in a label-only regime. Thus, our attacks pose a threat to any query-able ML service.

Label-only MI attacks face a challenge of granularity. For any query $x$, our attack model’s information is limited to only the predicted class-label, $\operatorname*{argmax}_{i}h(x)_{i}$. A simple baseline attack (Yeom et al., 2018), which predicts any misclassified data point as a non-member of the training set, is a useful benchmark to assess the extra (non-trivial) information that MI attacks, label-only or otherwise, can extract. We call this baseline the gap attack because its accuracy is directly related to the gap between the model’s accuracy on training data ($\text{acc}_{\text{train}}$) and held out data ($\text{acc}_{\text{test}}$):

$$\frac{1}{2}+\frac{\text{acc}_{\text{train}}-\text{acc}_{\text{test}}}{2},$$

where $\text{acc}_{\text{train}},\text{acc}_{\text{test}}\in[0,1]$. To exploit additional leakage on top of this baseline attack (achieve non-trivial MI), any label-only adversary must necessarily make additional queries to the model. To the best of our knowledge, this trivial baseline is the only attack proposed in prior work that uses only the predicted label, $y=\operatorname*{argmax}_{i}h(x)_{i}$.
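The dependence of the gap attack on the train-test gap follows from a short calculation on a balanced evaluation set: a member is flagged correctly whenever the model classifies it correctly (probability $\text{acc}_{\text{train}}$), and a non-member is flagged correctly whenever the model misclassifies it (probability $1-\text{acc}_{\text{test}}$). The sketch below checks this on made-up labels; all names and data are illustrative.

```python
import numpy as np

def gap_attack(predicted_labels, true_labels):
    """Predict member (1) iff the model's hard label is correct."""
    return (np.asarray(predicted_labels) == np.asarray(true_labels)).astype(int)

def expected_gap_accuracy(acc_train, acc_test):
    """Expected accuracy on a balanced set of members and non-members."""
    return 0.5 * acc_train + 0.5 * (1.0 - acc_test)

# Toy data: the model is right on all 4 members and on 2 of 4 non-members.
true_labels = np.array([0, 1, 2, 1, 0, 1, 2, 1])
model_labels = np.array([0, 1, 2, 1, 0, 2, 2, 0])
membership = np.array([1, 1, 1, 1, 0, 0, 0, 0])

guesses = gap_attack(model_labels, true_labels)
attack_acc = float(np.mean(guesses == membership))
predicted_acc = expected_gap_accuracy(acc_train=1.0, acc_test=0.5)
```

Both quantities equal $0.75$ here, i.e., $1/2$ plus half of the train-test gap.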

3.2 Attack Intuition

Our strategy is to compute label-only “proxies” for the model’s confidence by evaluating its robustness to strategic input perturbations of $x$ , either synthetic (i.e., data augmentation) or adversarial (examples) (Szegedy et al., 2013). Following a max-margin perspective, we predict that data points that exhibit high robustness are training data points. Works in the adversarial example literature share a similar perspective that non-training points are closer to the decision boundary and thus more susceptible to perturbations (Tanay & Griffin, 2016; Tian et al., 2018; Hu et al., 2019).

Our intuition for leveraging robustness is two-fold. First, models trained with data augmentation have the capacity to overfit to them (Zhang et al., 2016). Thus, we evaluate any “effective” train-test gap on the augmented dataset by evaluating $x$ and its augmentations, giving us a more fine-grained MI signal. For models not trained using augmentation, their robustness to perturbations can be a proxy for model confidence. Given the special case of (binary) logistic regression models, with a learned weight vector ${w}$ and bias $b$ , the model will output a confidence score for the positive class of the form: $h(x)\coloneqq\sigma(w^{\top}x+b)$ , where $\sigma(t)=\frac{1}{1+e^{-t}}\in(0,1)$ is the logistic function.

Here, there is a monotone relationship between the confidence at $x$ and the Euclidean distance to the model’s decision boundary. This distance is $(w^{\top}x+b)/||w||_{2}=\sigma^{-1}(h(x))/||w||_{2}$ . Thus, obtaining a point’s distance to the boundary yields the same information as the confidence score. Computing this distance is exactly the problem of finding the smallest adversarial perturbation, which can be done using label-only access to a classifier (Brendel et al., 2017; Chen et al., 2019). Our thesis is that this relationship will persist for deep, non-linear models. This thesis is supported by prior work that suggests that deep neural networks can be closely approximated by linear functions in the vicinity of the data (Goodfellow et al., 2014).
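For the logistic regression case, the equivalence between confidence and boundary distance can be verified numerically. The weights, bias, and inputs below are random placeholders chosen only for the check.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Inverse of the logistic function."""
    return np.log(p / (1.0 - p))

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # placeholder weight vector
b = 0.3                         # placeholder bias
X = rng.normal(size=(100, 5))   # placeholder inputs

confidence = sigmoid(X @ w + b)

# Signed Euclidean distance to the hyperplane w^T x + b = 0 ...
dist_geometric = (X @ w + b) / np.linalg.norm(w)
# ... and the same distance recovered from the confidence score alone.
dist_from_conf = logit(confidence) / np.linalg.norm(w)

# Sorting points by distance also sorts them by confidence (monotone relation).
order = np.argsort(dist_geometric)
is_monotone = bool(np.all(np.diff(confidence[order]) >= 0))
```

The two distance computations agree up to floating-point error, so for this model class an adversary who can estimate the distance learns exactly what the confidence score would reveal.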

3.3 Data Augmentation

We experiment with two common data augmentations in the computer vision domain: image rotations and translations. For rotations, we generate $N=3$ images: the source image and its rotations by $\pm r^{\circ}$ for a tunable rotation magnitude $r$. For translations, we generate $N=4d+1$ images: the source image and the $4d$ translated images satisfying $|i|+|j|=d$ for a pixel bound $d$, where we shift horizontally by $\pm i$ and vertically by $\pm j$.
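To make the translation set concrete, the following sketch enumerates the $4d+1$ offsets and turns the model's hard labels on the translated images into a feature vector. Several details are assumptions made for illustration: pixels shifted in from outside the image are zero-filled, `query_label` stands for any label-only query interface, and the per-augmentation correctness bits are one plausible feature encoding.

```python
import numpy as np

def translation_offsets(d):
    """All (i, j) with |i| + |j| = d, plus (0, 0) for the source image."""
    if d == 0:
        return [(0, 0)]
    offsets = [(0, 0)]
    for i in range(-d, d + 1):
        j = d - abs(i)
        offsets.append((i, j))
        if j != 0:
            offsets.append((i, -j))
    return offsets

def translate(image, i, j):
    """Shift a 2-D image i pixels right and j pixels down, zero-filling (assumption)."""
    h, w = image.shape
    out = np.zeros_like(image)
    src = image[max(0, -j):h - max(0, j), max(0, -i):w - max(0, i)]
    out[max(0, j):max(0, j) + src.shape[0], max(0, i):max(0, i) + src.shape[1]] = src
    return out

def augmentation_features(query_label, image, y, d):
    """One bit per augmentation: 1 if the model still predicts label y."""
    return np.array([int(query_label(translate(image, i, j)) == y)
                     for i, j in translation_offsets(d)])

image = np.arange(9).reshape(3, 3)
toy_model = lambda img: int(img[1, 1] > 0)   # hypothetical stand-in for h
features = augmentation_features(toy_model, image, y=1, d=1)
```

The number of queries per target point equals the length of the feature vector, which is what keeps this family of attacks cheap.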

3.4 Decision Boundary Distance

These attacks predict membership using a point’s distance to the model’s decision boundary. Here, we extend the intuition that this distance can be a proxy for confidence of linear models (see § 3.2) to deep neural networks.

Our label-only boundary distance attacks use only black-box access. We rely on label-only adversarial example attacks (Brendel et al., 2017; Chen et al., 2019). These attacks start from a random point $x^{\prime}$ that is misclassified, i.e., $h(x^{\prime})\neq y$. They then “walk” along the boundary while minimizing the distance to $x$. We use “HopSkipJump” (Chen et al., 2019), which closely approximates stronger white-box attacks.
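To illustrate how hard labels alone can locate the decision boundary, the sketch below bisects the line segment between $x$ and a misclassified starting point. This is a deliberately simplified stand-in and not HopSkipJump: it never walks along the boundary, so it only yields an upper bound on the distance. The function name, the toy linear classifier, and the convention that an already-misclassified point has distance zero are assumptions.

```python
import numpy as np

def boundary_distance_upper_bound(query_label, x, y, x_adv, steps=30):
    """Bisect the segment [x, x_adv] using only hard-label queries."""
    if query_label(x) != y:
        return 0.0  # assumption: misclassified points are at distance zero
    lo, hi = 0.0, 1.0  # interpolation weights: lo keeps label y, hi does not
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if query_label((1.0 - mid) * x + mid * x_adv) == y:
            lo = mid
        else:
            hi = mid
    return float(hi * np.linalg.norm(x_adv - x))

# Toy linear classifier (hypothetical): label 1 iff the first coordinate is positive.
toy_model = lambda p: int(p[0] > 0)

x = np.array([2.0, 0.0])        # true distance to the boundary is 2
x_adv = np.array([-3.0, 1.0])   # a misclassified starting point
dist = boundary_distance_upper_bound(toy_model, x, 1, x_adv)
```

Here the segment crosses the boundary at an angle, so the estimate (about $2.04$) slightly exceeds the true distance of $2$; boundary-walking attacks spend additional queries to shrink this slack.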

The noise robustness attack is a simpler approach based on random perturbations. Again, our intuition stems from linear models: a point’s distance to the boundary is directly related to the model’s accuracy when it is perturbed by isotropic Gaussian noise (Ford et al., 2019). We compute a proxy for the boundary distance $d_{h}(x,y)$ by evaluating the accuracy of $h$ on $N$ points $\hat{x}_{i}=x+\mathcal{N}(0,\sigma^{2}\cdot I)$, where $\sigma$ is tuned on $\hat{h}$. For binary features we instead use Bernoulli noise: each $x_{j}\in x$ is flipped with probability $p$, which is tuned on $\hat{h}$.
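A minimal sketch of the random-noise proxy follows, covering both the Gaussian and the Bernoulli variants. The function names and the toy classifiers are hypothetical, and the noise parameters are fixed by hand here rather than tuned on a source model.

```python
import numpy as np

def noise_robustness(query_label, x, y, sigma, n_queries, rng):
    """Fraction of Gaussian-perturbed copies of x still labeled y."""
    noisy = x + rng.normal(0.0, sigma, size=(n_queries,) + x.shape)
    return float(np.mean([query_label(p) == y for p in noisy]))

def bernoulli_robustness(query_label, x, y, p, n_queries, rng):
    """Same proxy for binary features: flip each bit with probability p."""
    flips = rng.random((n_queries,) + x.shape) < p
    noisy = np.where(flips, 1 - x, x)
    return float(np.mean([query_label(q) == y for q in noisy]))

rng = np.random.default_rng(0)
toy_model = lambda p: int(p[0] > 0)   # hypothetical linear classifier

far = np.array([3.0, 0.0])    # far from the boundary
near = np.array([0.1, 0.0])   # close to the boundary
far_score = noise_robustness(toy_model, far, 1, sigma=1.0, n_queries=500, rng=rng)
near_score = noise_robustness(toy_model, near, 1, sigma=1.0, n_queries=500, rng=rng)
```

Points far from the boundary keep their label under most perturbations, so a higher score is read as a larger distance and hence as evidence of membership.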

These attacks can be combined to improve the attack performance: we evaluate $d_{h}(x,y)$ for augmentations of $x$ from § 3.3. We only evaluate this combined attack where indicated due to its high query cost (see § 5.5).

Evaluation Setup

Our evaluation is aimed at understanding how label-only MI attacks compare with prior attacks that rely on access to a richer query interface. To this end, we use the same evaluation setup as prior works (Nasr et al., 2018b; Shokri et al., 2016; Long et al., 2017; Truex et al., 2018; Salem et al., 2018) (see Appendix § B). We answer the following questions in our evaluation, § 5, § 6 and § 7:

1) Can label-only MI attacks match prior attacks that use the model’s (full) confidence vector?

2) Are defenses against confidence-based MI attacks always effective against label-only attacks?

3) What is the query complexity of label-only attacks?

4) Which defenses prevent both label-only and full confidence-vector attacks?

To evaluate an attack’s success, we pick a balanced set of points from the task distribution, of which half come from the target model’s training set. We measure attack success as overall MI accuracy but find F1 scores to approximately match, with near 100% recall. See Supplement § B.2 for further discussion on this evaluation.

Overall, we stress that our main goal is to show that in settings where MI attacks have been shown to succeed, a label-only query interface is sufficient. In general, we should not expect our label-only attacks to exceed the performance of prior MI attacks since the former uses strictly less information from queries than the latter. There are three notable exceptions: our combined attack (§ 5.1), “confidence masking” defenses (§ 5.2), and models trained with significant data augmentation (§ 6.1). (Note that the combined attack’s performance exceeds prior confidence-vector attacks, but that we do not test its confidence-vector analog. Our results indicate that it should perform comparably.) In the latter two cases, we find that existing attacks severely underestimate MI.

We provide a detailed account of model architectures and training procedures in Supplement § B.1 and of our threat model in Supplement § C. We evaluate our attacks on $8$ datasets used by the canonical work of Shokri et al. (2016). These include $3$ computer vision tasks (MNIST, CIFAR-10, and CIFAR-100: https://www.tensorflow.org/api_docs/python/tf/keras/datasets), which are our main focus because of the importance of data augmentation to them, and $4$ non-computer-vision tasks (the Adult dataset: http://archive.ics.uci.edu/ml/datasets/Adult; and the Texas-100, Purchase-100, and Locations datasets: https://github.com/privacytrustlab/datasets) to showcase our attacks’ transferability. We train target neural networks on subsets of the original training data, exactly as performed by Shokri et al. (2016) and several later works (in both data amount and train-test gap). Controlling the training set size lets us control the amount of overfitting, which strongly influences the strength of MI attacks (Yeom et al., 2018). Prior work has almost exclusively studied (confidence-based) MI attacks on these small datasets where models exhibit a high degree of overfitting. Recall that our goal is to show that label-only attacks match confidence-based approaches: scaling MI attacks (whether confidence-based or label-only) to larger training datasets is an important area of future work.

Evaluation of Label-Only Attacks

We first focus on question 1). Recall from § 3.1 that any label-only attack (with knowledge of $y$ ) is always trivially lower-bounded by the baseline gap attack of Yeom et al. (2018), predicting any misclassified point as a non-member.

Our main result is that our label-only attacks consistently outperform the gap attack and perform on-par with prior confidence-vector attacks; by combining attacks, we can even surpass the canonical confidence-vector attacks.

Observing Figure 1 and Table 1 (a) and (c), we see that the confidence-vector attack outperforms the baseline gap attack, demonstrating that it exploits non-trivial MI. Remarkably, we find that our label-only boundary distance attack performs at least on-par with the confidence-vector attack. Moreover, our simpler but more query-efficient (see § 5.5) data augmentation attacks also consistently outperform the baseline but fall short of the confidence-vector attacks. Finally, combining these two label-only attacks, we can consistently outperform every other attack. These models were not trained with data augmentation; in § 6.1, we find that when they are, our data augmentation attacks outperform all others. We also verify that as the training set size increases, the accuracy of all attacks decreases monotonically because the train-test gap is reduced. Note that on CIFAR-100, we experiment with the largest training subset possible: $30{,}000$ data points, since we use the other half as the source model training set (and target model non-members).

We show that our label-only attacks can be applied outside of the image domain in Table 2. Our label-only attack evaluates a model’s accuracy under random perturbations, by adding Gaussian noise for continuous-featured inputs, and flipping binary values according to Bernoulli noise (see § 3.4). Using $10{,}000$ queries, our attacks closely match (at most $4$ percentage-point degradation) confidence-based attacks. Note that our attacks could also be instantiated in audio or natural language domains, using existing adversarial example attacks (Carlini & Wagner, 2018) and data augmentations (Zhang et al., 2015).

5.2 Breaking Confidence Masking Defenses

Answering question 2), we showcase an example where our label-only attacks outperform prior attacks by a significant margin, despite the strictly more restricted query interface that they assume. We evaluate defenses against MI attacks and show that while these defenses do protect against existing confidence-vector attacks, they have little to no effect on our label-only attacks. Because any ML classifier providing confidences also provides the predicted labels, our attacks fall within their threat model, refuting these defenses’ security claims.

We identify a common pattern to these defenses that we call confidence masking, wherein defenses aim to prevent MI by directly minimizing the privacy leakage in a model’s confidence scores. To this end, confidence-masking defenses explicitly or implicitly mask (or, obfuscate) the information contained in the model’s confidences (e.g., by adding noise) to thwart existing attacks. These defenses, however, have a minimal effect on the model’s predicted labels. MemGuard (Jia et al., 2019) and prediction purification (Yang et al., 2020) explicitly maintain the invariant that the model’s predicted labels are not affected by the defense, i.e.,

$$\operatorname*{argmax}_{i}\,h^{\text{defense}}(x)_{i}=\operatorname*{argmax}_{i}\,h(x)_{i}\quad\text{for all }x,$$

where $h^{\text{defense}}$ is the defended version of the model $h$.
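The label-preserving invariant is easy to demonstrate with a toy masking function. The sketch below is not MemGuard or prediction purification; it is a hypothetical, maximally aggressive masking that discards the original scores entirely yet keeps the predicted label, which is all a label-only adversary observes.

```python
import numpy as np

def mask_confidences(conf, rng):
    """Hypothetical masking: random scores, original argmax preserved."""
    conf = np.asarray(conf, dtype=float)
    label = int(np.argmax(conf))
    noise = rng.dirichlet(np.ones(conf.size))  # random point on the simplex
    top = int(np.argmax(noise))
    # Swap so that the largest random score sits at the original label.
    noise[[label, top]] = noise[[top, label]]
    return noise

rng = np.random.default_rng(0)
conf = np.array([0.7, 0.2, 0.1])
masked = mask_confidences(conf, rng)

# A label-only query returns the same answer with or without the defense.
same_label = int(np.argmax(masked)) == int(np.argmax(conf))
```

Any attack whose inputs are hard labels therefore produces identical outputs on the defended and undefended models.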

An immediate issue with the design of confidence-masking defenses is that, by construction, they will not prevent any label-only attack. Yet, these defenses were reported to drive the success rates of existing MI attacks to within chance. This result suggests that prior attacks fail to properly extract membership information contained in the model’s predicted labels, and implicitly contained within its scores. Our label-only attack performances clearly indicate that confidence masking is not a viable defense strategy against MI.

We show that two peer-reviewed defenses, MemGuard (Jia et al., 2019) and adversarial regularization (Nasr et al., 2018a), fail to prevent label-only attacks, and thus, do not significantly reduce MI compared to an undefended model. Other proposed defenses, e.g., reducing the precision or cardinality of the confidence-vector (Shokri et al., 2016; Truex et al., 2018; Salem et al., 2018), and recent defenses like prediction purification (Yang et al., 2020), also rely on confidence masking: they are unlikely to resist label-only attacks. See Supplement § D for more details on these defenses.

5.3 Breaking MemGuard

We implement the strongest version of MemGuard that can make arbitrary changes to the confidence-vector while leaving the model’s predicted label unchanged. Observing Figure 1 and Table 1 (b) and (d), we see that MemGuard successfully defends against prior confidence-vector attacks, but as expected, offers no protection against our label-only attacks. All our attacks significantly outperform the (non-adaptive) confidence-vector and the baseline gap attack.

The main reason that Jia et al. (2019) found MemGuard to protect against confidence-vector attacks is that these attacks were not properly adapted to this defense. Specifically, MemGuard is evaluated against confidence-vector attacks that are tuned on source models without MemGuard enabled. This observation also holds for other defenses such as Yang et al. (2020). Thus, these attacks’ membership predictors are tuned to distinguish members from non-members based on high confidence scores, which these defenses obfuscate. In a sense, a label-only attack like ours is the “right” adaptive attack against these defenses: since the model’s confidence scores are no longer reliable, the adversary’s best strategy is to use hard labels, which these defenses explicitly do not modify. Moving forward, we recommend that the trivial gap baseline serve as an indicator of confidence masking: a confidence-vector attack should not perform significantly worse than the gap attack for a defense to protect against MI. Thus, to protect against (all) MI attacks, a defense cannot solely post-process the confidence-vector, because the model will still be vulnerable to label-only attacks.

5.4 Breaking Adversarial Regularization

The work of Nasr et al. (2018a) differs from MemGuard and prediction purification in that it does not simply obfuscate confidence vectors at test time. Rather, it jointly trains a target model and a defensive confidence-vector MI classifier in a min-max fashion: the attack model to maximize MI and the target model to produce accurate outputs that yet fool the attacker. See Supplement § D for more details.

We train a target model defended using adversarial regularization, exactly as in (Nasr et al., 2018a). By varying its hyper-parameters, we achieve a defended state where the confidence-vector attack is within $3$ percentage points of chance, as shown in Supplement Figure 9. Again, our label-only attacks significantly outperform this attack (compare Figures 6 (a) and (b)) because the train-test gap is only marginally reduced. This defense is not entirely ineffective: it prevents label-only attacks from exceeding the gap attack by more than $3$ percentage points. However, when label-only attacks are sufficiently defended, it achieves significantly worse test accuracy trade-offs than other defenses (see Figure 5). Still, evaluating the defense solely on confidence-vector attacks overestimates the achieved privacy.

5.5 The Query Complexity of Label-Only Attacks

We now answer question 3) and investigate how the query budget affects the success rate of different label-only attacks.

Recall that our rotation attack evaluates $N=3$ queries of images rotated by $r^{\circ}$ and our translation attack $N=4d+1$ for shifts satisfying a total displacement of $d$ . Figure 2 (a)-(b) shows that there is a range of perturbation magnitudes for which the attack exceeds the baseline (i.e., $1\leq r\leq 8$ for rotations, and $1\leq d\leq 2$ for translations). When the augmentations are too small or too large, the attack performs poorly because the augmentations have a similar effect on both train and test samples (i.e., small augmentations rarely change model predictions and large augmentations often cause misclassifications). An optimal parameter choice ( $r$ and $d$ ) outperforms the baseline by $3$ - $4$ percentage-points, which an adversary can tune using its local source model. As we will see in § 6, these attacks outperform all others on models that used data augmentation at training time.

In Figure 2 (c), we compare different boundary distance attacks, discussed in § 3.4. With $\approx 2{,}500$ queries, the label-only attack matches the white-box upper-bound using $\approx 2{,}000$ queries and also matches the best confidence-vector attack (see Figure 1). With $\approx 12{,}500$ queries, our combined attack can outperform all others. Query limiting would likely not be a suitable defense, as Sybil attacks (Douceur, 2002) can circumvent it; even in low query regimes ($<100$) our attacks outperform the trivial gap attack by $4$ percentage points. Finally, with $<300$ queries, our simple noise robustness attack outperforms our other label-only attacks. At large query budgets, our boundary distance attack produces more precise distance estimates and outperforms it. Note that the monetary costs are modest at $\approx\$ 0.25$ per sample (see https://www.clarifai.com/pricing).

Defending with Better Generalization

We explore three questions in this section:

A) How does training with data augmentation impact MI attacks, especially those that evaluate augmented data?

B) How well do other standard machine learning regularization techniques help in reducing MI?

C) How do these defenses compare to differential privacy, which can provide formal guarantees against any form of membership leakage?

Data augmentation is commonly used in machine learning to reduce overfitting and encourage generalization, especially in low data regimes (Shorten & Khoshgoftaar, 2019; Mikołajczyk & Grochowski, 2018). Data augmentation is an interesting case study for our attacks. As it reduces a model’s overfitting, one would expect it to reduce MI. But, a model trained with augmentation will have been trained to strongly recognize $x$ and its augmentations, which is precisely the signal that our data augmentation attacks exploit.

We train target models with data augmentation similar to § 3.3 and focus on translations as they are most common in computer vision. We use a simple pipeline where all translations of each image are evaluated in a training epoch. Though this differs slightly from the standard random sampling, we choose it to illustrate the maximum MI when the adversary’s queries exactly match the samples seen in training.

Observe from Figure 3 that augmentation reduces overfitting and improves generalization: test accuracy increases from $49.7\%$ without translations to $58.7\%$ at $d=5$ and the train-test gap decreases. Due to improved generalization, the confidence-vector and boundary distance attacks’ accuracies decrease. Yet, the success rate of the data augmentation attack increases. This increase confirms our initial intuition that the model now leaks additional membership information via its invariance to training-time augmentation. Though the model trained with $d=5$ pixel shifts has higher test accuracy, our data augmentation attack exceeds the confidence-vector performance on the non-augmented model. (Though we find in Supplement Figure 8 that the attack is strongest when the adversary correctly guesses $d$, we note that these values are often fixed for a domain and image resolution. Thus, adversarial knowledge of the augmentation pipeline is not a strong assumption.) Thus, model generalization is not the only variable affecting its membership leakage: models that overfit less on the original data may actually be more vulnerable to MI because they implicitly overfit more on a related dataset.

We use, without modification, the pipeline from FixMatch (Sohn et al., 2020), which trains a ResNet-28 to $96\%$ accuracy on the CIFAR-10 dataset, comparable to the state of the art. As with our other experiments, this model is trained using a subset of CIFAR-10, which sometimes leads to observably overfit models, as indicated by a higher gap attack accuracy. We train models using four regularizations: random vertical flips, random shifts by up to $d=4$ pixels, random image cutout (DeVries & Taylor, 2017), and (non-random) weight decay of magnitude $0.0005$. All are either enabled or disabled.

Our results here, shown in Figure 4, corroborate those obtained with the simpler pipeline above: though test accuracy improves, our data augmentation attacks match or outperform the confidence-vector attack.

6.2 Other Techniques to Prevent Overfitting

We explore questions B)-C) using other standard regularization techniques, with details in Supplement E. In transfer learning, we either only train a new last layer (last layer fine-tuning), or fine tune the entire model (full fine-tuning).

We pre-train a model on CIFAR-100 to $51.6\%$ test accuracy and then use transfer learning. We find that the boundary distance attack performs on par with the confidence-vector attack in all cases. We observe that last layer fine-tuning degrades all our attacks to the generalization gap, preventing non-trivial MI (see Figure 10 in Supplement § F). This result corroborates intuition: linear layers have less capacity to overfit compared to neural networks. We observe that full fine-tuning leaks more membership information but achieves better test accuracies, as shown in Figure 5.

Finally, DP training (Abadi et al., 2016) formally enforces that the trained model does not strongly depend on any individual training point, i.e., that it does not overfit. We use differentially private gradient descent (DP-SGD) (Abadi et al., 2016) (see Supplement § E). To achieve test accuracies comparable to those of undefended models, the formal privacy guarantees become mostly meaningless (i.e., $\epsilon>100$).
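For readers unfamiliar with DP-SGD, a single update step can be sketched as follows: each per-example gradient is clipped to a fixed norm, Gaussian noise scaled to that norm is added to the sum, and the result is averaged. This is a simplified illustration with hand-picked parameter names; it omits minibatch sampling and the privacy accounting that yields $\epsilon$, and it is not the training code used in our experiments.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One clipped-and-noised gradient step (privacy accounting not shown)."""
    grads = np.asarray(per_example_grads, dtype=float)
    # Clip each example's gradient to have L2 norm at most clip_norm.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads / np.maximum(1.0, norms / clip_norm)
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(grads)
    return params - lr * noisy_mean

rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])  # toy per-example gradients

# With the noise switched off, only the clipping is visible.
updated = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=0.0, rng=rng)
```

Clipping bounds how much any single training point can move the parameters, which is the mechanism behind the statement that the model cannot depend strongly on an individual point.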

Worst-Case (Outlier) MI

Here, we perform MI only for a small subset of “outliers”. Even if a model generalizes well on average, it might still have overfit to unusual data in the tails of the distribution (Carlini et al., 2019). We use a process similar to, but modified from, that of Long et al. (2018) to identify potential outliers.

First, the adversary uses a source model $\hat{h}$ to map each targeted data point, $x$, to its feature space, or the activations of its penultimate layer, denoted as $z(x)$. We define two points $x_{1},x_{2}$ as neighbors if their features are close, i.e., $d(z(x_{1}),z(x_{2}))\leq\delta$, where $d(\cdot,\cdot)$ is the cosine distance and $\delta$ is a tunable parameter. An outlier is a point with fewer than $\gamma$ neighbors in the feature space, where $\gamma$ is another tunable parameter. Given a dataset $X$ of potential targets and an intended fraction of outliers $\beta$ (e.g., $1\%$ of $X$), we tune $\delta$ and $\gamma$ so that a $\beta$-fraction of points $x\in X$ are outliers. We use precision as the MI success metric.
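The neighbor-counting rule can be sketched directly from the definitions above. The feature matrix here is a hand-made placeholder for penultimate-layer activations, and the search over $\delta$ and $\gamma$ that hits a target fraction $\beta$ is not shown.

```python
import numpy as np

def cosine_distance_matrix(Z):
    """Pairwise cosine distances between the rows of Z."""
    Z = np.asarray(Z, dtype=float)
    unit = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def find_outliers(Z, delta, gamma):
    """Indices of points with fewer than gamma neighbors within distance delta."""
    close = cosine_distance_matrix(Z) <= delta
    np.fill_diagonal(close, False)  # a point is not its own neighbor
    return np.flatnonzero(close.sum(axis=1) < gamma)

# Placeholder features: three points in a tight cluster and one far away.
features = np.array([[1.0, 0.0], [1.0, 0.05], [1.0, -0.05], [0.0, 1.0]])
outliers = find_outliers(features, delta=0.01, gamma=1)
```

In this toy example only the last point has no neighbors, so it is the single outlier that the attack would target.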

We run our attacks on the outliers of the same models as in Figure 6. We find (see Figure 11 in Supplement § F) that we can always improve the attack by targeting outliers, but that strong $L2$ regularization and DP training prevent MI. As before, we find that the label-only boundary distance attack matches the confidence-vector attack performance.

Conclusion

We developed three new label-only membership inference attacks that can match, and even exceed, the success of prior confidence-vector attacks, despite operating in a more restrictive adversarial model. Their label-only nature requires fundamentally different attack strategies, that—in turn—cannot be trivially prevented by obfuscating a model’s confidence scores. We have used these attacks to break two state-of-the-art defenses to membership inference attacks.

We have found that the problem with these “confidence-masking” defenses runs deeper: they cannot prevent any label-only attack. As a result, any defenses against MI necessarily have to help reduce a model’s train-test gap.

Finally, via a rigorous evaluation across many proposed defenses to MI, we have shown that differential privacy (with transfer learning) provides the strongest defense, both in an average-case and worst-case sense, but that it may come at a cost in the model’s test accuracy.

To center our analysis on comparing the confidence-vector and label-only settings, we use the same threat model as prior work (Shokri et al., 2016) and leave a fine-grained analysis of label-only attacks under reduced adversarial knowledge (e.g., reduced data and model architecture knowledge (Yeom et al., 2018; Salem et al., 2018)) to future work.

Acknowledgments

We thank the reviewers for their insightful feedback. This work was supported by CIFAR (through a Canada CIFAR AI Chair), by NSERC (under the Discovery Program, NFRF Exploration program, and COHESA strategic research network), and by gifts from Intel and Microsoft. We also thank the Vector Institute’s sponsors.

References

Appendix A Background

We consider supervised classification tasks (Murphy, 2012; Shalev-Shwartz & Ben-David, 2014), wherein a model is trained to predict some class label $y$ , given input data $x$ . Commonly, $x$ may be an image or sentence and $y$ is then the corresponding label, e.g., a digit 0-9 or a text sentiment.

We focus our study on neural networks (Bengio et al., 2017): functions composed as a series of linear-transformation layers, each followed by a non-linear activation. The overall layer structure is called the model’s architecture and the learnable parameters of the linear transformations are the weights. For a classification problem with $K$ classes, the last layer of a neural network outputs a vector $\textbf{v}$ of $K$ values (often called logits). The softmax function is typically used to convert the logits into normalized confidence scores: $\text{softmax}(\textbf{v})_{i}\coloneqq{e^{v_{i}}}/{\sum_{j=1}^{K}e^{v_{j}}}\in[0,1]$. (While it is common to refer to the output of a softmax as a “probability vector” because its components are in the range $[0,1]$ and sum to $1$, we refrain from using this terminology given that the scores output by a softmax cannot be rigorously interpreted as probabilities (Gal, 2016).) For a model $h$, we define the model’s output $h(x)$ as the vector of softmax values. The model’s predicted label is the class with highest confidence, i.e., $\text{argmax}_{i}\ h(x)_{i}$.
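As a minimal sketch of the softmax and argmax definitions, the following plain-Python snippet converts logits into confidence scores and extracts the predicted label. The helper names `softmax` and `predicted_label` are hypothetical, and subtracting the maximum logit is a standard numerical-stability trick added here, not something stated in the text.

```python
import math

def softmax(logits):
    """Map a list of K logits to confidence scores in [0, 1] that sum to 1."""
    m = max(logits)  # shift by the max logit for numerical stability (result unchanged)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_label(confidences):
    """Return the index of the class with the highest confidence (argmax)."""
    return max(range(len(confidences)), key=lambda i: confidences[i])

# Illustrative logits for a 3-class problem.
scores = softmax([2.0, 1.0, 0.1])
label = predicted_label(scores)
```

A label-only query interface would expose only `label`, whereas a confidence-vector interface would expose `scores`.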

Augmentations are natural transformations of existing data points that preserve class semantics (e.g., small translations of an image), which are used to improve the generalization of a classifier (Cubuk et al., 2018; Sohn et al., 2020; Taylor & Nitschke, 2018). They are commonly used on state-of-the-art models (He et al., 2015; Cubuk et al., 2018; Perez & Wang, 2017) to increase the diversity of the finite training set, without the need to acquire more labeled data (a costly process). Augmentations are especially important in low-data regimes (Sajjad et al., 2019; Fadaee et al., 2017; Cui et al., 2015) and are domain-specific: they apply to a certain type of input (e.g., images or text).

A.1.2 Transfer Learning

Transfer learning is a common technique used to improve generalization in low-data regimes (Tan et al., 2018). By leveraging data from a source task, it is possible to transfer knowledge to a target task. Commonly, a model is trained on the data of the source task and then fine-tuned on data from the target task. In the case of neural networks, it is common to fine-tune either the entire model or just the last layer.

A.2 Membership Inference

Membership inference attacks (Shokri et al., 2016) are a form of privacy leakage that identify if a given data sample was in a machine learning model’s training dataset. Given a sample $x$ and access to a trained model $h$ , the adversary uses a classifier or decision rule $f$ to compute a membership prediction $f(x;h)\in\{0,1\}$ , with the goal that $f(x;h)=1$ whenever $x$ is a training point. The main challenge in mounting a membership inference attack is creating the classifier $f$ , under various assumptions about the adversary’s knowledge of $h$ and its training data distribution.

Prior work assumes that an adversary has only black-box access to the trained model $h$ , via a query interface that on input $x$ returns part or all of the confidence vector $h(x)$ .

The original membership inference attack of Shokri et al. (Shokri et al., 2016) creates a membership classifier $f(x;h)$ , tuned on a number of local “shadow” (or, source) models. Assuming the adversary has access to data from the same (or similar) distribution as $h$ ’s training data, the shadow model approach trains the auxiliary source models $\hat{h}_{i}$ on this data. Since $\hat{h}_{i}$ is trained by the adversary, they know whether or not any data point was in the training set, and can thus construct a dataset of confidence vectors $\hat{h}_{i}(x)$ with an associated membership label $m\in\{0,1\}$ . The adversary trains a classifier $f$ to predict $m$ given $\hat{h}_{i}(x)$ . Finally, the adversary queries the targeted model $h$ to obtain $h(x)$ and uses $f$ to predict the membership of $x$ in $h$ ’s training data.

Salem et al. (Salem et al., 2018) later showed that this attack strategy can succeed even without data from the same distribution as $h$ ’s training data, using only data from a similar task (e.g., a different vision task). They also showed that training shadow models is unnecessary: a simple thresholding rule, which predicts $f(x;h)=1$ ( $x$ is a member) when the max prediction confidence, $\max_{i}h(x)_{i}$ , is above a tuned threshold, suffices.
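The thresholding rule can be sketched in a few lines of Python. The function name `threshold_attack` and the threshold value 0.9 are hypothetical choices for illustration; in practice the threshold would be tuned by the adversary.

```python
def threshold_attack(confidences, threshold):
    """Predict membership (1) when the top confidence exceeds the threshold, else 0."""
    return 1 if max(confidences) > threshold else 0

# Illustrative confidence vectors; 0.9 is an assumed, untuned threshold.
confident_guess = threshold_attack([0.97, 0.02, 0.01], threshold=0.9)
uncertain_guess = threshold_attack([0.50, 0.30, 0.20], threshold=0.9)
```

Note that this rule needs the confidence vector, so it is not available to a label-only adversary.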

Yeom et al. (Yeom et al., 2018) propose a simple baseline attack: the adversary predicts a data point $x$ as being a member of the training set when $h$ classifies $x$ correctly. The accuracy of this baseline attack directly reflects the gap in the model’s train and test accuracy: if $h$ overfits (i.e., obtains higher accuracy) on its training data, this baseline attack will achieve non-trivial membership inference. We call this the gap attack. If the adversary’s target points are equally likely to be members or non-members of the training set (see Appendix B.2), this attack achieves an accuracy of

$1/2+(\text{acc}_{\text{train}}-\text{acc}_{\text{test}})/2$ ,

where $\text{acc}_{\text{train}},\text{acc}_{\text{test}}\in[0,1]$ are the target model’s accuracy on training data and held out data respectively.
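The gap attack and its expected accuracy under a balanced prior can be sketched as follows. Members are identified when the model classifies them correctly (probability acc_train) and non-members when the model misclassifies them (probability 1 - acc_test), each weighted by one half. The function names and the numeric values are illustrative assumptions.

```python
def gap_attack(predicted_label, true_label):
    """Predict 'member' (1) exactly when the model classifies the point correctly."""
    return 1 if predicted_label == true_label else 0

def gap_attack_accuracy(acc_train, acc_test):
    """Expected attack accuracy when members and non-members are equally likely."""
    # Members are caught when classified correctly; non-members when misclassified.
    return 0.5 * acc_train + 0.5 * (1.0 - acc_test)

# Illustrative accuracies: a model that fits its training set perfectly
# but reaches 80% test accuracy.
expected = gap_attack_accuracy(acc_train=1.0, acc_test=0.8)
```

With no train-test gap (`acc_train == acc_test`), the expression reduces to 0.5, i.e., random guessing.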

To the best of our knowledge, this is the only attack proposed in prior work that makes use of only the model’s predicted label, $y=\operatorname*{argmax}_{i}h(x)_{i}$ . Our goal is to investigate how this simple baseline can be surpassed to achieve label-only membership inference attacks that perform on par with attacks that use access to the model’s confidence scores.

The work of Long et al. (Long et al., 2018) investigates membership inference through indirect access, wherein the adversary only queries $h$ on inputs $x^{\prime}$ that are related to $x$ , but not $x$ directly. Our label-only attacks similarly make use of information gleaned from querying $h$ on data points related to $x$ (specifically, perturbed versions of $x$ ).

The main difference is that we focus on label-only attacks, whereas the work of Long et al. (Long et al., 2018) assumes adversarial access to the model’s confidence scores. Our attacks will also be allowed to query and obtain the label at the chosen point $x$ .

Song et al. (Song et al., 2019) also make use of adversarial examples to infer membership. Their approach crucially differs from ours in two aspects: (1) they assume access to and predict membership using the confidence scores, and (2) they target models that were explicitly trained to be robust to adversarial examples. In this sense, (2) bears some similarities with our attacks on models trained with data augmentation (see Section 6, where we also find that a model’s invariance to some perturbations can leak additional membership signal).

Defenses against membership inference broadly fall into two categories.

First, standard regularization techniques, such as L2 weight normalization (Shokri et al., 2016; Jia et al., 2019; Truex et al., 2018; Nasr et al., 2018a), dropout (Jia et al., 2019), or differential privacy have been proposed to address the role that overfitting plays in a membership inference attack’s success rate (Shokri et al., 2016). Heavy regularization has been shown to limit overfitting and to effectively defend against membership inference, but may result in a significant degradation in the model’s accuracy. Moreover, Yeom et al. (Yeom et al., 2018) show that overfitting is sufficient, but not necessary, for membership inference to be possible.

Second, defenses may reduce the information contained in a model’s confidences, e.g., by truncating them to a lower precision (Shokri et al., 2016), reducing the dimensionality of the confidence-vector to only some top $k$ scores (Shokri et al., 2016; Truex et al., 2018), or perturbing confidences via an adversary-aware “minimax” approach (Nasr et al., 2018a; Yang et al., 2020; Jia et al., 2019). These defenses modify either the model’s training or inference procedure to produce minimally perturbed confidence vectors that thwart existing membership inference attacks. We refer to these defenses as “confidence-masking” defenses.

Most membership inference research is focused on protecting the average-case user’s privacy: the success of a membership inference attack is evaluated over a large dataset. Long et al. (Long et al., 2018) focus on understanding the vulnerability of outliers to membership inference. They show that some ( $<100$ ) outlier data points can be targeted and have their membership inferred to high (up to $90\%$ ) precision (Long et al., 2017, 2018). Recent work explores how overfitting impacts membership leakage from a defender’s (white-box) perspective, with complete access to the model (Leino & Fredrikson, 2019).

Appendix B Evaluation Setup

Because our main goal is to show that label-only attacks can match the success of prior attacks, we consider a threat model that matches prior work, except that we restrict the adversary to label-only queries.

As in prior work (Shokri et al., 2016), we assume that the adversary has: (1) full knowledge of the task; (2) knowledge of the target model’s architecture and training setup; (3) partial data knowledge, i.e., access to a disjoint partition of data samples from the same distribution as the target model’s training data (see below for more details); and (4) knowledge of the targeted points’ labels, $y$ .

Some works have explored generating data samples $x$ on which to perform membership inference, which assumes the least data knowledge (Shokri et al., 2016; Fredrikson et al., 2015). These cases work best with minimal numbers of features or binary features because they can take many queries (Shokri et al., 2016). Other works assume access to the confidence vectors (Fredrikson et al., 2015). Our work assumes that candidate samples have already been found by the adversary. We leave to future work the efficient discovery of these samples on high-dimensionality data using a label-only query interface.

In our threat model, we always use a disjoint, non-overlapping (i.e., no data points are shared) set of samples for training and test data for the target model. The source model uses another two separate subsets of the task’s total data pool. Due to the balanced priors we assume, all subsets (i.e., the target model training and test sets, and the source model training and test sets) are always of the same size. In the case of CIFAR100, we use the target model’s training dataset (members) as the source model’s test dataset (non-members), and vice versa.
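The general partitioning scheme (four disjoint, equally sized subsets) can be sketched as below. The function name `split_pool`, the subset names, and the use of a seeded shuffle are assumptions made for illustration; the sketch does not cover the CIFAR100 case, where subsets are reused between target and source models.

```python
import random

def split_pool(pool, subset_size, seed=0):
    """Partition a data pool into four disjoint, equally sized subsets."""
    if 4 * subset_size > len(pool):
        raise ValueError("pool too small for four disjoint subsets")
    indices = list(range(len(pool)))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    names = ["target_train", "target_test", "source_train", "source_test"]
    return {
        name: [pool[i] for i in indices[k * subset_size:(k + 1) * subset_size]]
        for k, name in enumerate(names)
    }

# Illustrative pool of 100 sample identifiers.
splits = split_pool(list(range(100)), subset_size=25)
```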

For computer vision tasks, we use two representative model architectures, a standard convolutional neural network (CNN) and a ResNet (He et al., 2015). Our CNN has four convolution layers with ReLU activations. The first two $3\times 3$ convolutions have $32$ filters and the second two have 64 filters, with a max-pool in between the two. To compute logits we feed the output through a fully-connected layer with $512$ neurons. This model has $1.2$ million parameters. Our ResNet-28 is a standard Wide ResNet-28 taken directly from (Sohn et al., 2020) with $1.4$ million parameters. On Purchase-100, we use a fully connected neural network with one hidden layer of size $128$ and the $Tanh$ activation function, exactly as in (Shokri et al., 2016). For Texas-100, Adult, and Locations we mimic this model but add a second hidden layer matching the first.

For the attacks from prior work based on confidence vectors, and our new label-only attacks based on data augmentations, we use shallow neural networks as membership predictor models $f$ . Specifically, for augmentations, we use two layers of 10 neurons and LeakyReLU activations (Maas et al., 2013). The confidence-vector attack models use a single hidden layer of 64 neurons, as originally proposed by Shokri et al. (Shokri et al., 2016). We train a separate prediction model for each class. We observe minimal changes in attack performance by changing the architecture, or by replacing the predictor model $f$ by a simple thresholding rule. Our combined boundary distance and augmentation attack uses neural networks as well. For simplicity, our decision boundary distance attacks use a single global thresholding rule, $2,500$ queries, and the L2 distance metric. See Section 3.4 for more details.

B.2 On Measuring Success

Some recent works have questioned the use of (balanced) accuracy as a measure of attack success and proposed other measures more suited for imbalanced priors: where any data point targeted by the adversary is a-priori unlikely to be a training point (Jayaraman et al., 2020). As our main goal is to study the effect of the model’s query interface on the ability to perform membership inference, we focus here on the same balanced setting considered in most prior work. We also note that the assumption that the adversary has a (near-) balanced prior need not be unrealistic in practice: For example, the adversary might have query access to models from two different medical studies (trained on patients with two different conditions) and might know a-priori that some targeted user participated in one of these studies, without knowing which.

Appendix C Threat Model

The goal of a membership inference attack is to determine whether or not a candidate data point was used to train a given model. In Table 3, we summarize different sets of assumptions made in prior work about the adversary’s knowledge and query access to the model.

The membership inference threat model originally introduced by Shokri et al. (Shokri et al., 2016), and used in many subsequent works (Long et al., 2017; Truex et al., 2018; Salem et al., 2018; Song et al., 2019; Nasr et al., 2018b), assumes that the adversary has black-box access to the model $h$ (i.e., they can only query the model for its prediction and confidence but not inspect its learned parameters). Our work also assumes black-box model access, with the extra restriction (see Section C.2 for more details) that the model only returns (hard) labels to queries. Though studying membership inference attacks with white-box model access (Leino & Fredrikson, 2019) has merits (e.g., for upper-bounding the membership leakage), our label-only restriction inherently presumes a black-box setting (as otherwise, the adversary could just run $h$ locally to obtain confidence scores). Although we are focused on the label-only domain, our attack methodologies can be applied for analysis in the white-box domain.

Assuming a black-box query interface, there are a number of other dimensions to the adversary’s assumed knowledge of the trained model:

Task knowledge refers to global information about the model’s prediction task and, therefore, of its prediction API. Examples of task knowledge include the total number of classes, the class-labels (dog, cat, etc.), and the input format ( $32\times 32$ RGB or grayscale images, etc.). Task knowledge is always assumed to be known to the adversary, as it is necessary for the classifier service to be useful to a user.

Training knowledge refers to knowledge about the model architecture (e.g., the type of neural network, its number of layers, etc.) and how it was trained (the training algorithm, training dataset size, etc.). This information could be publicly available or inferable from a model extraction attack (Tramèr et al., 2016; Wang & Gong, 2018).

Data knowledge constitutes knowledge about the data that was used to train the target model. Full knowledge of the training data renders membership inference trivial because the training members are already known. Partial knowledge may consist in having access to (or the ability to generate) samples from the same or a related data distribution.

Label knowledge refers to knowledge of the true label $y$ for each point $x$ for which the adversary is predicting membership. Whether knowledge of a data point implies knowledge of its true label depends on the application scenario. Salem et al. (Salem et al., 2018) show that attacks that rely on knowledge of query labels can often be matched by attacks that do not.

C.2 Query Interface

Our paper studies a different query interface than most prior membership inference work. The choice of query interface ultimately depends on the application needs where the target model is deployed. We define two types of query interfaces, with different levels of response granularity:

Full confidence vectors. On a query $x$ , the adversary receives the full vector of confidence scores $h(x)$ from the classifier. In a multi-class scenario, each value in this vector corresponds to an estimated confidence that this class is the correct label. Restricting access to only part of the confidence vector has little effect on the adversary’s success (Shokri et al., 2016; Truex et al., 2018; Salem et al., 2018).

Label-only. Here, the adversary only obtains the predicted label $y=\operatorname*{argmax}_{i}h(x)_{i}$ , with no confidence scores. This is the minimal piece of information that any query-able machine learning model must provide and is thus the most restrictive query interface for the adversary. Such a query interface is also realistic, as the adversary may only get indirect access to a deployed model in many settings. For example, the model may be part of a larger system taking actions based on the model’s predictions: the adversary can only observe the system’s actions but not the internal model’s confidence scores.

In this work, we focus exclusively on the above label-only regime. Thus, in contrast to prior research (Shokri et al., 2016; Hayes et al., 2019; Truex et al., 2018; Salem et al., 2018), our attacks can be mounted against any machine learning service, regardless of the granularity provided by the query interface.

Appendix D Confidence-Masking Defense Descriptions

MemGuard (Jia et al., 2019). This defense solves a constrained optimization problem to compute a defended confidence-vector $h^{\text{defense}}(x)=h(x)+n$ , where $n$ is an adversarial noise vector that satisfies the following constraints: (1) the model still outputs a vector of “probabilities”, i.e., $h^{\text{defense}}(x)\in[0,1]^{K}$ and $\|h^{\text{defense}}(x)\|_{1}=1$ ; (2) the model’s predictions are unchanged, i.e., $\operatorname*{argmax}h^{\text{defense}}(x)=\operatorname*{argmax}h(x)$ ; and (3) the noisy confidence vector “fools” existing membership inference attacks. To enforce the third constraint, the defender locally creates a membership attack predictor $f$ , and then optimizes the noise $n$ to cause $f$ to mis-predict membership.
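The first two constraints are easy to check for a candidate noise vector, as the sketch below illustrates. It does not implement the defense's optimization or constraint (3), which requires a membership attack predictor; the function name, tolerance, and example vectors are hypothetical.

```python
def satisfies_output_constraints(h_x, noise, tol=1e-9):
    """Check constraints (1) and (2) for a defended vector h(x) + n."""
    defended = [p + n for p, n in zip(h_x, noise)]
    # (1) every entry lies in [0, 1] and the entries sum to 1
    in_range = all(-tol <= p <= 1.0 + tol for p in defended)
    sums_to_one = abs(sum(defended) - 1.0) <= tol

    # (2) the predicted label (argmax) is unchanged by the noise
    def argmax(v):
        return max(range(len(v)), key=lambda i: v[i])

    same_label = argmax(defended) == argmax(h_x)
    return in_range and sums_to_one and same_label

h_x = [0.90, 0.07, 0.03]  # illustrative confidence vector
label_preserving = satisfies_output_constraints(h_x, [-0.3, 0.2, 0.1])
label_changing = satisfies_output_constraints(h_x, [-0.5, 0.5, 0.0])
```

Because constraint (2) fixes the argmax, a label-only adversary observes exactly the same outputs with or without such noise.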

Prediction purification (Yang et al., 2020) is a similar defense. It trains a purifier model, $G$ , that is applied to the output vector of the target model. That is, on a query $x$ , the adversary receives $G(h(x))$ . The purifier model $G$ is trained so as to minimize the information content in the confidence vector, whilst preserving model accuracy. While the defense does not guarantee that the model’s labels are preserved at all points, the defense is by design incapable of preventing the baseline gap attack, and it is likely that our stronger label-only attacks would similarly be unaffected (intuitively, $G(h(x))$ is just another deterministic classifier, so the membership leakage from a point’s distance to the decision boundary should not be expected to change).

Adversarial regularization (Nasr et al., 2018a). This defense trains the target model in tandem with a defensive membership classifier. This defensive membership classifier is a neural network that accepts both the confidence-vector, $h(x)$ , of the target model, and the true label, $y$ , that is one-hot encoded. Following the input $h(x)$ there are four fully connected layers of sizes $100$ , $1024$ , $512$ , $64$ . Following the input $y$ , there are three fully connected layers of sizes $100$ , $512$ , $64$ . The two $64$ neuron layers are concatenated (to make a layer of size $128$ ), and passed through three more fully connected layers of sizes $256$ , $64$ , and the output layer of size $1$ . ReLU activations are used after every layer except the output, which uses a sigmoid activation.

The defensive membership classifier and the target model are trained in tandem. First, the target model is trained for a few (here, $3$ ) epochs. Then, for $k$ steps, the defensive membership classifier is trained using equally sized batches of members and non-members (which should be different from the held-out set for the target model). Afterwards, the target model is trained on one batch of training data. The target model’s loss function is modified to include a regularization term using the output of the defensive classifier on the training data. This regularization term is weighted by $\lambda$ .

Appendix E Description of Common Regularizers

Dropout (Srivastava et al., 2014) is a simple regularization technique, wherein a fraction $\rho\in(0,1)$ of weights are randomly “dropped” (i.e., set to zero) in each training step. Intuitively, dropout samples a new random neural network at each step, thereby preventing groups of weights from overfitting. At test time, the model is deterministic and uses all the learned weights. We experiment with different dropout probabilities $\rho$ .
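A minimal sketch of the dropout mechanism on a list of values is given below. The rescaling of surviving values by $1/(1-\rho)$ (so-called inverted dropout) is an assumption of this sketch and is not stated in the text; the function name and example values are likewise hypothetical.

```python
import random

def dropout(values, rho, training, rng):
    """Zero each value with probability rho during training; identity at test time."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    if not training:
        return list(values)  # test time: deterministic, nothing is dropped
    keep = 1.0 - rho
    # Assumed 'inverted dropout': rescale survivors so the expected value is unchanged.
    return [v / keep if rng.random() >= rho else 0.0 for v in values]

weights = [1.0, 2.0, 3.0, 4.0]  # illustrative values
rng = random.Random(0)
train_out = dropout(weights, rho=0.5, training=True, rng=rng)
test_out = dropout(weights, rho=0.5, training=False, rng=rng)
```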

L1 and L2 regularization simply add an additional term of the form $\lambda\cdot||w||$ to the model’s training loss, where $w$ is a vector containing all of the model’s weights, the norm is either L1 or L2, and $\lambda>0$ is a hyper-parameter governing the scale of the regularization relative to the learning objective. Strong regularization (i.e., large $\lambda$ ) reduces the complexity of the learned model (i.e., it forces the model to learn smaller weights). We experiment with different regularization constants $\lambda$ .
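The penalty term can be sketched as follows, following the form $\lambda\cdot||w||$ given in the text. Note that many implementations use the squared L2 norm instead; this sketch uses the unsquared norm as written. The function name and example weights are hypothetical.

```python
import math

def regularization_penalty(weights, lam, norm="l2"):
    """Return lam * ||w|| for the L1 or (unsquared) L2 norm of the weight vector."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if norm == "l1":
        value = sum(abs(w) for w in weights)
    elif norm == "l2":
        value = math.sqrt(sum(w * w for w in weights))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return lam * value

w = [3.0, -4.0]  # illustrative weight vector
l1_term = regularization_penalty(w, lam=0.1, norm="l1")
l2_term = regularization_penalty(w, lam=0.1, norm="l2")
```

The returned value would be added to the training loss, so larger values of `lam` penalize large weights more heavily.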

Differential privacy guarantees that any output from a (randomized) algorithm on some dataset $D$ , would have also been output with roughly the same probability (up to a multiplicative $e^{\epsilon}$ factor) if one point in $D$ were arbitrarily modified. For differential privacy, we use DP-SGD (Abadi et al., 2016), a private version of stochastic gradient descent that clips per-example gradients to an L2 norm of $\tau$ , and adds Gaussian noise $\mathcal{N}(0,c^{2}\tau^{2})$ to each batch’s gradient. We train target models with fixed parameters $c=0.5$ and $\tau=2$ . We train for a varied number of steps, to achieve provable differential privacy guarantees for $10\leq\epsilon\leq 250$ .
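The clipping and noising steps of DP-SGD can be sketched in plain Python as below. The sketch adds the noise to the sum of clipped gradients and then averages over the batch, which is an assumption about the exact ordering; it also omits the privacy accounting needed to compute $\epsilon$. Function names and the example gradients are hypothetical, while the defaults $\tau=2$ and $c=0.5$ come from the text.

```python
import math
import random

def clip_to_norm(grad, tau):
    """Scale a gradient down so that its L2 norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [scale * x for x in grad]

def dp_sgd_gradient(per_example_grads, tau=2.0, c=0.5, rng=None):
    """Clip per-example gradients, sum them, add N(0, c^2 tau^2) noise, and average."""
    rng = rng or random.Random(0)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        for j, x in enumerate(clip_to_norm(grad, tau)):
            summed[j] += x
    sigma = c * tau  # standard deviation of the Gaussian noise
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]

batch = [[3.0, 4.0], [0.0, 1.0]]  # illustrative per-example gradients
private_grad = dp_sgd_gradient(batch)
```

Clipping bounds each example's influence on the update, and the noise scale is tied to that bound through `c * tau`.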