ASSET: Robust Backdoor Data Detection Across a Multiplicity of Deep Learning Paradigms

Minzhou Pan, Yi Zeng, Lingjuan Lyu, Xue Lin, Ruoxi Jia

Introduction

Deployment of deep learning (DL) in critical services and infrastructures calls for special emphasis on security, given its susceptibility to erroneous predictions in the presence of attacks. Specifically, data-poisoning-based backdoor attacks, where attackers manipulate the training data to force certain outputs during testing, pose a significant threat. Successful attacks have been demonstrated on various computer vision tasks and beyond. This paper focuses on the problem of detecting the poisoned samples within a training set. An effective detection strategy allows one to mitigate the risk of backdoors by removing suspicious samples from training.

Poisoned samples can be regarded as outliers in a training set. However, unlike arbitrary outliers considered in the classical outlier detection and robust statistics literature, poisoned samples are special outliers that induce specific model behaviors, e.g., misleading the model to predict some target class(es). Hence, recent works on backdoor detection primarily leverage the model trained on the poisoned dataset (hereinafter, the backdoored model) or information cached during training to help discover poisoned samples. For instance, most prior work starts by extracting the backdoored model's output, intermediate activation patterns, or gradients for each sample, and then separates poisoned samples from clean ones based on the extracted information.

While taking advantage of the information collected from the downstream learning process provides a clear path to enhancing backdoor detection performance, it also raises the question: Can these detection methods maintain their performance across different DL settings? In particular, existing detection methods are evaluated in only one learning setting: end-to-end supervised learning (SL), where a labeled poisoned dataset is used to train a model from scratch. On the other hand, new learning paradigms are increasingly adopted and have demonstrated state-of-the-art prediction performance with reduced annotation costs and computational burden. The two most representative and popular paradigms are self-supervised learning (SSL) adaptation and transfer learning (TL), as illustrated in Figure 1.

In SSL adaptation, one pre-trains a model on large unlabeled data (e.g., through contrastive learning or a masked autoencoder (MAE)) and then fine-tunes only the last layer using labeled data from a specific downstream task. Recent work has shown that an attacker can poison the unlabeled dataset to implant backdoors without any control over downstream fine-tuning processes. Thus, it is natural to ask: Can we detect the poisoned samples within an unlabeled dataset using existing methods? In TL, one starts with an existing pre-trained model and fine-tunes all layers of the model or just the last layer with labeled data. Despite the importance of TL in practice, we lack an understanding of backdoor detection in this setting: Can we detect the poisoned samples when they are used for fine-tuning an existing model instead of training it from scratch?

Our first contribution is a comprehensive evaluation of existing detection methods across different DL paradigms. The key findings are summarized as follows.

(Case-0) End-to-end SL: Although prior detection methods are effective in specific settings, their efficacy varies considerably across attacks and poison ratios. In particular, all of them fail to detect the state-of-the-art clean-label backdoor attack (clean-label attacks are those where the poisoned samples appear to be correctly labeled to a human inspector) and underperform when the poison ratio is very low or very high (e.g., 0.05% or 20%).

(Case-1) SSL adaptation: There are no existing methods dedicated to detecting unlabeled poisoned samples in the SSL setting. Yet, some of the existing methods can be adapted to SSL. For instance, methods that attempt to separate poisoned samples from clean ones in the embedding space can employ an embedder learned from unlabeled data to generate the embedding for each sample (we elaborate on the adaptation techniques in Section 5.1). However, the performance of these methods after adaptation is limited (e.g., their average detection rates over different attacks all fall below 26%).

(Case-2) TL: While prior literature omitted TL in its evaluation, the detection methods can all be applied to it. However, the methods based on embeddings suffer a significant performance loss compared to the end-to-end SL setting because the poisoned samples are less distinguishable from clean ones in a fine-tuned embedding space than in a trained-from-scratch one.

The limitations of existing methods per our evaluation are summarized in Table 1. Overall, a detection method that is effective across different learning paradigms is still lacking.

Our second contribution is the development of a robust, generic approach to backdoor detection that applies to the three representative learning paradigms discussed above. Like most existing literature, our approach also assumes that the defender has an extra set of clean samples (referred to as a base set hereinafter) that is much smaller than the training set. In practice, these clean samples can be obtained through manual inspection or automatic screening. However, unlike previous works, we do not require the base set to be labeled.

The key idea of our approach is to induce different model behaviors between poisoned samples and clean ones. To achieve this, we design a two-step optimization process: we first minimize some loss on the clean base set; then, we attempt to offset the effect of the first minimization on the clean distribution by maximizing the same loss on the entire training set including both clean and poisoned samples. The outcome of this two-step process is a model which returns high loss for poisoned samples and low loss for clean ones. Hence, we can decide whether a sample is poisoned or clean based on the corresponding loss value.

We found that the two-step optimization-based offset idea achieves strong detection performance except in settings where the poison ratio is low, or the learning of the poisoned samples happens slowly—at roughly the same speed as learning of clean samples. As we will explicate later in the paper, in these cases, the effect of the second maximization significantly outweighs that of the first minimization; as a result, both poisoned and clean samples achieve large losses and become inseparable.

To tackle the challenge, we propose a strengthened technique that involves two nested offset procedures, and the inner offset reinforces the outer one. Specifically, we use the inner offset procedure to identify the points most likely to be poisoned and mark them as suspicious; the outer offset procedure still minimizes some loss on the clean base set, but the maximization will now be performed on the points marked to be suspicious by the inner offset, instead of the entire poisoned dataset. As the proportion of clean samples within the suspicious set is much smaller than that within the entire poisoned set, the small loss of clean samples obtained from the first minimization would be impacted much less by the second maximization. This nested design effectively improves the separability between clean and poisoned samples.

Our third contribution is the provision of techniques that can adaptively set the loss threshold to discern poisoned samples. Some of the prior works assume the knowledge of poison ratio and mark a fixed number of samples as poisoned ones based on their respective criteria. Moreover, the poisoned and clean samples often do not have a clear separation based on their criteria (see examples in Figure 5); as a result, their detection performance is very sensitive to the estimated poison ratio. We argue that in practice, it is challenging to have an accurate estimate of the poison ratio. Hence, it is preferable to adapt detection to the data characteristics rather than relying on a fixed estimate. Herein, we design two adaptive thresholding techniques tailored to specific requirements imposed by inner and outer offset procedures (i.e., prioritizing precision vs. prioritizing true positive rate).

We conduct extensive experiments in comparison with seven representative or state-of-the-art backdoor data detection methods over 56 different attack settings across various DL paradigms and show that our proposed method, ASSET, is the only one that can provide reliable detection consistently across all the evaluated settings. This work is also the first practical backdoor detection method for the SSL and TL settings (open-source: https://github.com/ruoxi-jia-group/ASSET).

Background & Related Work

End-to-end supervised learning & transfer learning. The objective of end-to-end SL is to train a classifier $f(\cdot|\theta):\mathcal{X}\rightarrow[k]$, which predicts the label $y\in[k]$ of an input $x\in\mathcal{X}$. Here, $\theta$ denotes the parameters of the classifier $f(\cdot|\theta)$. The standard end-to-end SL (Case-0) consists of two stages: training and testing. In the training stage, a learning algorithm is provided with a set of training data, $D=\{(x_{i},y_{i})\}_{i=1}^{N}$, consisting of examples from $k$ classes. Then, the learning algorithm seeks the model parameters, $\theta$, that minimize the empirical risk:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\mathcal{L}\big(f(x_{i}|\theta),y_{i}\big) \tag{1}$$

When $f(\cdot|\theta)$ is a deep neural network, the corresponding empirical risk is a non-convex function of $\theta$, and finding a global minimum is generally impossible. Hence, the standard practice is to look for a local minimum. Algorithmically, the model is initialized with random parameters and updated iteratively via stochastic gradient descent. In the test stage, the trained model $f(\cdot|\theta^{*})$ takes input test examples and serves up predictions. TL (Case-2) shares the same optimization goal as end-to-end SL. However, TL initializes the optimization with a pre-trained backbone model instead of random parameters. Within the scope of this paper, we consider two of the most popular TL schemes: (1) FT-all: the entire pre-trained model gets updated during training; (2) FT-last (or linear adaptation): only the last fully-connected layer is updated. In the context of TL, we will refer to solving the optimization (1) as fine-tuning and $D$ as the fine-tuning data.

Self-supervised learning. SSL usually consists of two phases: pretext training and fine-tuning. Pretext training aims to train an encoder $f(\cdot|\theta):\mathcal{X}\rightarrow\mathcal{Z}$ that maps the input $x\in\mathcal{X}$ into the embedding $z\in\mathcal{Z}$. Here, $\theta$ denotes the parameters of the encoder $f(\cdot|\theta)$. This paper focuses primarily on two of the most recent SSL schemes: contrastive learning and masked auto-encoder (MAE). Their training processes are illustrated in Figure 2, where $M$ is a multi-layer perceptron (MLP) used to reduce the dimension of features, and $P$ is a predictor. The fundamental idea of contrastive learning, e.g., SimCLR, MoCo V3, and BYOL, is to learn an encoder by bringing the embeddings corresponding to the augmentations of the same image (a.k.a. positive pairs) closer and distancing its embeddings from those of other images (a.k.a. negative pairs). All three methods pre-train $f(\cdot|\theta)$, $M$, and $P$ (if applicable) on large amounts of unlabeled data, and differ in how they generate positive and negative pairs and in the loss functions they use for training. We refer interested readers to the respective papers for more details. By contrast, the recently proposed SSL method, MAE, trains the encoder $f(\cdot|\theta)$ by masking a portion of pixels in an image $x$ (the masked image is denoted by $x^{\prime}$) and then using $f(x^{\prime}|\theta)$ with a decoder $d(\cdot)$ to restore $x$. For all the aforementioned SSL methods, after the pretext training, the acquired encoder parameters $\theta^{*}$ will be adapted to a downstream task similarly to TL using the fine-tuning data.

Backdoor attacks. Backdoor attacks have been extensively studied in the end-to-end SL setting and can be categorized into dirty-label and clean-label attacks. Dirty-label backdoor attacks manipulate both the label and the features of a sample. These attacks have developed from using a sample-independent visible pattern as the trigger to more stealthy and powerful attacks with sample-specific or visually imperceptible triggers. Clean-label backdoor attacks ensure that the manipulated features are semantically consistent with the corresponding labels. Existing attacks in this category range from inserting arbitrary triggers to inserting optimized triggers. Most of the above backdoor attacks can be easily adapted to TL settings without modifications. There are also backdoor attacks specifically designed for TL settings, e.g., the hidden trigger backdoor attack.

With the thriving development of SSL, especially contrastive learning (e.g., SimCLR, MoCo, BYOL) and MAE, backdoor attacks targeting SSL have also been explored. Recent work mainly applies existing dirty-label backdoor triggers studied in SL to the targeted category of samples. However, the efficacy of these attacks is limited (attack success rate (ASR) below 10% on CIFAR-10 even with an in-class poison ratio set to 50%, as shown in our experiment, Section 5.3). A recent attack exploits the "representation invariance" property of contrastive learning and instantiates a symmetric trigger via manipulation in the frequency domain, achieving a much higher ASR with a lower poison ratio (e.g., an in-class poison ratio of 10%).

Backdoor sample detection. Note that no existing backdoor sample detection method has been considered or evaluated in cases other than Case-0. In particular, there is no practical defense under SSL, and the TL setting has been overlooked. Many of the existing works identify poisoned samples by examining their difference from clean ones in the embedding space, such as using singular value decomposition (SVD), the Gram matrix, K-Nearest-Neighbors, and feature decomposition. In addition to embeddings, intermediate neural activations and gradients extracted from samples can also be adopted for backdoor sample detection. Past work has also examined other differentiating properties of backdoor samples, such as the trigger's resistance to augmentations, high-frequency artifacts, low contribution to the training task, or lower loss at the early stage of training.

A recent work proposed a confusion training procedure, which trains a model on a weighted combination of the randomly-labeled clean base set and the poisoned set. Introducing a randomly-labeled clean set into training prevents the model from fitting to the clean portion of the poisoned data, thereby allowing the identification of poisoned samples whose labels are consistent throughout the training process. Our experiments found that the effectiveness of confusion training relies heavily on the hyperparameter tuning of the weighted combined-training process, and its performance varies significantly with poison ratios. Additionally, its fundamental assumption is that decoupling the benign correlations between semantic features and semantic labels does not influence the learnability of the correlations between backdoor triggers and target labels. However, the triggers of some advanced clean-label backdoor attacks are strongly entangled with the semantic features of the target class; therefore, confusion training falls short of detecting them. At a high level, confusion training shares a similar idea with ours in the sense that both leverage a clean base set to induce different detector behaviors between clean and poisoned samples. However, there are several key differences in the method design: our approach induces different behaviors by optimizing opposite objectives on the base set and the poisoned set, whereas confusion training relies on random labeling to disrupt the learning of the clean samples. Importantly, we design a nested procedure that can effectively deal with the failure cases of confusion training. Moreover, our method provides additional important advantages over confusion training: (1) our approach does not require the poisoned set to be labeled, thereby enabling applications in SSL settings; and (2) our approach is robust to different poison ratios without ratio-specific tuning and can effectively detect attacks generating triggers entangled with semantic features.

Attacker & Defender Models

This section discusses standard threat models and assumptions about defender knowledge for different DL paradigms.

Case-0 End-to-end SL: In this setting, the attacker performs the backdoor attack by injecting a set of poisoned samples into the training dataset. The defender has access to the poisoned training dataset and the downstream learning algorithm. The defender’s goal is to identify the poisoned samples within the training set and further remove the identified samples to prevent backdoor attacks from taking effect.

Case-1 SSL Adaptation: Under this setting, the attacker performs the backdoor attack by poisoning the unlabeled dataset. Following prior attack literature, we assume that the attacker has access to neither the dataset nor the algorithm of the fine-tuning task. Thus, the dataset used for fine-tuning is clean, and the attack only affects the unlabeled dataset. The defender has access to the complete training data, including both the data for SSL and the data for fine-tuning. In addition, the defender knows the algorithms for SSL and fine-tuning. The goal of the defender is to identify and remove the poisoned samples from the unlabeled dataset. Attack settings that target multi-modal contrastive learning, such as attacks on CLIP, are not considered in this case, as training CLIP requires additional text input supervision.

Case-2 Transfer Learning: The attacker performs the backdoor attack by poisoning a labeled dataset used for fine-tuning an existing pre-trained model. The defender knows the pre-trained model, the entire fine-tuning dataset (whose size often cannot support training a model from scratch), as well as the fine-tuning algorithm. The goal of the defender is to detect the poisoned samples within the fine-tuning dataset.

In all three cases, we assume the attacker can poison no more than half of the training dataset. We also assume that the defender has a small set of clean, unlabeled samples (the base set) to help with detection. These clean samples can be manually or automatically screened. Compared with most recent detection methods, which require a labeled clean base set of at least 2000 samples, our method relaxes the requirement on the label information.

Proposed Method

Our goal is to enforce distinguishable model behaviors on poisoned and clean samples actively. The key idea is to design two optimizations that induce opposite model behaviors on the poisoned dataset (including its clean and poisoned portion) and the clean base set. Specifically, the two optimizations are performed simultaneously, where the first one minimizes a certain loss function on the clean base set and the second one maximizes the same loss on the entire poisoned training dataset. Note that the clean portion of the poisoned dataset and the clean base set are both drawn from the same clean distribution. Hence, the effect of the second optimization on the clean samples will be offset by the first optimization, and the loss on clean samples after the two optimizations is closer to the loss before. By contrast, the poisoned samples only go through the second optimization; therefore, the loss on the poisoned samples is maximized. Overall, as a result of the two optimizations, poisoned and clean samples will produce different loss values, thus becoming separable. The single offset’s effect on clean samples and poisoned samples is illustrated in Figure 3 (a).

Intuition on the distinguishability of poisons. Poisoning, whether through additive triggers, generative models, affine transformations, or even adaptive perturbation techniques, introduces a distributional shift from clean data. The resulting poisoning distribution and the original clean distribution have disjoint support, and thus the total variation (TV) distance between the two distributions is one. Le Cam's lower bound, a classic result in statistical learning (refer to Chapter 15 in ), states that the minimum error over all detectors that classify the samples from two distributions, $P_{1}$ and $P_{2}$, is equal to $\frac{1}{2}\left(1-\|P_{1}-P_{2}\|_{\text{TV}}\right)$. Hence, there exists a detector achieving zero error probability for distinguishing between poisons and non-poisons. Le Cam's bound guarantees the existence of a good detector as long as poisons do not naturally appear in the clean distribution, and our method to be introduced is an effort to find such a detector based on the information of a clean base set.
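As a small numerical illustration of the bound quoted above, the following sketch computes the TV distance and the resulting minimum detection error for discrete distributions. The toy distributions and the function names `tv_distance` and `min_detection_error` are hypothetical and chosen only for illustration.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


def min_detection_error(p, q):
    """Minimum error as stated in the text: (1/2) * (1 - TV(p, q))."""
    return 0.5 * (1.0 - tv_distance(p, q))


# Hypothetical toy distributions over discrete "patterns".
clean = {"a": 0.5, "b": 0.5}
poison_disjoint = {"c": 1.0}            # pattern "c" never occurs in clean data
poison_overlap = {"a": 0.5, "c": 0.5}   # half of the mass overlaps with clean data

err_disjoint = min_detection_error(clean, poison_disjoint)  # TV = 1
err_overlap = min_detection_error(clean, poison_overlap)    # TV = 0.5
```

With disjoint supports the bound evaluates to zero error, whereas any overlap between the two distributions leaves an irreducible error.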

Detection via Offset

Now, we formalize the offset idea for poisoned sample detection. Let $D_{\text{b}}$ denote the clean base set and $D_{\text{poi}}$ denote the poisoned training set. Formally, we can characterize the process of inducing distinguishable behaviors on poisoned and clean samples as a multi-objective optimization:

$$\min_{\theta}\ \frac{1}{|D_{\text{b}}|}\sum_{x\in D_{\text{b}}}\mathcal{L}_{\text{min}}\big(f(x|\theta)\big)\quad\text{and}\quad\max_{\theta}\ \frac{1}{|D_{\text{poi}}|}\sum_{x\in D_{\text{poi}}}\mathcal{L}_{\text{max}}\big(f(x|\theta)\big) \tag{2}$$

When discussing the high-level idea of our method, we assume that the minimization and maximization employ the same objective, i.e., $\mathcal{L}_{\text{min}}=\mathcal{L}_{\text{max}}$. However, these two functions can also be different; as long as minimizing $\mathcal{L}_{\text{min}}$ and maximizing $\mathcal{L}_{\text{max}}$ induce different model behaviors, one optimization will mitigate the effect of the other on the clean distribution.

In the implementation, we do not solve the two optimizations simultaneously, because the corresponding optimization path is unstable; instead, we alternate between the two objectives:

We first minimize $\mathcal{L}_{\text{min}}$ by taking a gradient descent step on a mini-batch drawn from the base set;

Then, we use the resulting model as the initializer for maximizing $\mathcal{L}_{\text{max}}$ and perform a gradient ascent step on a mini-batch drawn from the poisoned set.

We empirically observe the alternating procedure is stable. As the focus of the paper is to develop practical detection methods, we will defer the theoretical analysis of this procedure—an interesting open problem—for future work.
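The alternating procedure can be sketched on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's implementation: the "model" is a hypothetical two-parameter linear score, the loss is its square, each sample is a pair (clean feature, trigger indicator), and full-batch steps replace mini-batches so that the run is deterministic.

```python
def loss(w, x):
    """Toy per-sample loss: squared linear score (an assumption for illustration)."""
    s = w[0] * x[0] + w[1] * x[1]
    return s * s


def mean_grad(w, batch):
    """Gradient of the mean toy loss over a batch with respect to w."""
    g = [0.0, 0.0]
    for x in batch:
        s = w[0] * x[0] + w[1] * x[1]
        g[0] += 2.0 * s * x[0]
        g[1] += 2.0 * s * x[1]
    return [g[0] / len(batch), g[1] / len(batch)]


def offset_loop(base_set, poisoned_set, w, lr=0.1, iters=200):
    """Alternate a descent step on the base set with an ascent step on the poisoned set."""
    w = list(w)
    for _ in range(iters):
        g = mean_grad(w, base_set)                    # step 1: minimize on clean base set
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
        g = mean_grad(w, poisoned_set)                # step 2: maximize on poisoned set
        w = [w[0] + lr * g[0], w[1] + lr * g[1]]
    return w


# Hypothetical data: x = (clean feature, trigger indicator).
base_set = [(1.0, 0.0), (-1.0, 0.0)] * 5
clean_part = [(1.0, 0.0), (-1.0, 0.0)] * 9
poison_part = [(1.0, 1.0), (-1.0, 1.0)]               # 10% of the poisoned set
poisoned_set = clean_part + poison_part

w_final = offset_loop(base_set, poisoned_set, [0.5, 0.5])
clean_losses = [loss(w_final, x) for x in clean_part]
poison_losses = [loss(w_final, x) for x in poison_part]
```

In this toy setting the descent and ascent steps nearly cancel on the clean feature, while the weight on the trigger coordinate only ever receives ascent updates, so the poisoned samples end up with a much larger loss than the clean ones.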

When the detection is performed on unlabeled data, we can instantiate both $\mathcal{L}_{\text{min}}$ and $\mathcal{L}_{\text{max}}$ to be $\mathcal{L}_{\text{var}}$, the variance of a sample's output logits, because calculating $\mathcal{L}_{\text{var}}$ does not require label information. As a result of minimizing $\mathcal{L}_{\text{min}}$, the clean samples are forced to have a flat logit pattern. Then, the maximization step maximizes the same loss on the poisoned dataset, which induces high-variance logits for poisoned samples. For clean samples, the effects of maximization and minimization roughly cancel out. Therefore, clean samples are expected to produce lower-variance logits than poisoned samples.
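A minimal sketch of a label-free variance loss is given below. It assumes that the loss is the population variance of the $k$ logits of one sample; the exact normalization used in the original formulation is not shown in this section, and the function name `logit_variance` is hypothetical.

```python
def logit_variance(logits):
    """Population variance of one sample's logits (no label needed)."""
    k = len(logits)
    mean = sum(logits) / k
    return sum((z - mean) ** 2 for z in logits) / k


flat_logits = [0.2, 0.2, 0.2, 0.2]      # flat pattern: what minimization encourages
peaked_logits = [4.0, 0.0, 0.0, 0.0]    # high-variance pattern: what maximization encourages

flat_value = logit_variance(flat_logits)
peaked_value = logit_variance(peaked_logits)
```

A flat logit vector yields a value near zero, whereas a peaked one yields a large value, which is the quantity a detector of this kind would threshold.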

When the detection is performed on a labeled poisoned dataset, we find that instantiating $\mathcal{L}_{\text{max}}$ with the cross-entropy-based prediction loss $\mathcal{L}_{\text{ce}}$ achieves a good detection performance faster than $\mathcal{L}_{\text{var}}$:

$$\mathcal{L}_{\text{ce}}(x,y)=-\sum_{i=1}^{k}y_{i}\log\sigma(x)_{i}$$

where $\sigma(x)_{i}$ denotes the $i$-th output of the softmax and $y$ represents the one-hot encoding of $x$'s label.
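For concreteness, the cross-entropy prediction loss described above can be computed as in the following sketch, which uses only the standard library. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot):
    """Cross-entropy between the softmax output and a one-hot label."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


# A flat prediction over 4 classes incurs a loss of log(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], [0, 1, 0, 0])
# A confident, correct prediction incurs a loss close to zero.
confident_loss = cross_entropy([10.0, 0.0, 0.0], [1, 0, 0])
```

Unlike the variance loss, this loss needs the label, which is why it is only an option when the poisoned dataset is labeled.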

It is worth mentioning that we fix the minimization loss to be $\mathcal{L}_{\text{var}}$ regardless of whether the base set is labeled or unlabeled. We found that even when the label information is available, this choice still leads to better detection performance than using $\mathcal{L}_{\text{ce}}$ as the minimization goal. This is because learning through minimizing $\mathcal{L}_{\text{var}}$ makes the model extract class-independent features. A mini-batch of the base set may be class-imbalanced or sometimes contain only a subset of the classes due to random sampling. Hence, $\mathcal{L}_{\text{var}}$ can be more steadily minimized than $\mathcal{L}_{\text{ce}}$ via mini-batch gradients.

Strengthened Detection via Nested Offset

Weakness of a single offset. Despite the neatness of the offset idea, directly solving the two optimizations with the proposed loss functions is limited in tackling attacks with a low poison ratio and settings where poisoned samples take effect slowly during training (i.e., attacks that need many epochs of training to obtain a high enough success rate). The reasons are as follows. In the low poison ratio setting, mini-batches naturally contain very few poisoned samples; on the other hand, each gradient ascent step moves towards increasing the average loss over a mini-batch and tends to overlook the minority. Hence, the lower the poison ratio, the less the loss of poisoned samples is increased. To explain the second limitation, note that $f(\cdot|\theta)$ is an over-parameterized model (e.g., ResNet-18 and Vision Transformer). If an attack takes many epochs to take effect, then we need to train $\theta$ for long enough. The model after long training will end up "memorizing" all the samples from the base set and the poisoned set, i.e., all the samples from the base set achieve a low value of $\mathcal{L}_{\text{min}}$ and all the samples from the poisoned set (including both clean and poisoned samples) achieve a high value of $\mathcal{L}_{\text{max}}$. In that case, the poisoned portion and the clean one are inseparable.
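The low-poison-ratio failure can be made concrete with a short decomposition of the averaged ascent gradient. The symbols $\rho$, $g_{\text{clean}}$, and $g_{\text{poi}}$, as well as the batch size of 128, are introduced here purely for illustration and are not part of the original analysis.

```latex
% B: a mini-batch from D_poi in which a fraction \rho of the samples is poisoned.
% g_clean, g_poi: mean gradients over the clean and the poisoned samples in B.
\nabla_{\theta}\,\frac{1}{|B|}\sum_{x\in B}\mathcal{L}_{\text{max}}\big(f(x|\theta)\big)
  \;=\; (1-\rho)\,g_{\text{clean}} \;+\; \rho\,g_{\text{poi}}
% Example: \rho = 0.05\% and |B| = 128 give 128 \times 0.0005 = 0.064
% poisoned samples per mini-batch in expectation.
```

Under these assumptions, most mini-batches contain no poisoned sample at all, so the ascent direction is dominated by $g_{\text{clean}}$ and the poisoned samples receive only a small share of each update.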

How to mitigate these failure cases? To illustrate our idea, let us consider a hypothetical design, assuming one can perfectly pinpoint a set of poisoned samples. In this design, we keep the first step, which minimizes on the clean base set, but the second maximization is performed on purely poisoned samples instead of the poisoned training set, which generally contains a large portion of clean samples and only a small portion of poisoned samples. This hypothetical design would be able to solve the two failure cases above. For the first case, since mini-batches for maximization contain solely poisoned samples, the poisoned samples would still have their loss increased and thus remain distinguishable from the clean ones. For the second case, while long training can lead to memorization, with the hypothetical design only the poisoned samples get memorized and assigned a high loss; therefore, the poisoned samples and the clean ones are still separable.

While having access to a set of purely poisoned samples is not realistic, this thought experiment inspires an idea to improve an offset-based detection approach, which is to replace the poisoned training set (dominated by clean samples) with a set dominated by poisoned samples in the second maximization. To form such a poison-dominated set, we can leverage a new offset loop (referred to as the inner offset loop) to mark a set of the most suspicious samples. Then, we use those samples to perform maximization of the original offset loop (referred to as the outer offset loop).

How to design the inner offset loop that provides a poison-condensed set? First, it is not ideal to reuse the design of the outer loop for this inner one, because in that case the inner would suffer the same “memorization” issue. Instead, we aim to avoid “overparameterized” models and perform the inner loop with a simple model. On the other hand, a simple model could be incapable of extracting complex features to support the detection of poisoned samples. Our solution is to use the poisoned model (i.e., the downstream model trained on the poisoned dataset) to extract features from the poisoned set and the base set and then optimize a simple model to detect the poisoned samples in the feature space.

Note that the embedding space of a poisoned model has been shown to be informative for detecting many but not all backdoor attacks (detailed in Section 5). The poisoned and clean samples are not perfectly separable based on the embeddings (as illustrated in Figure 6), which is why embedding-based methods underperform in many cases. Nevertheless, the poisoned model still provides a well-trained embedding space and some imperfect signals for selecting a poison-condensed set.

Detailed design of the inner offset loop. The inner offset loop is executed inside the previous offset loop (Eqn. 2). It condenses the poison in a mini-batch sampled by the maximization step of the outer offset loop. Specifically, the inner offset loop will return a set of samples marked as poison. We will use this poison-condensed subset of the original mini-batch to perform the outer maximization. When the inner loop is relatively precise in gathering a poison-condensed subset, the outer loop will maximize the outer loss of poisoned samples without introducing much offset effect on clean samples. As a result, the poisoned and clean samples become more distinguishable in terms of the outer loss compared to a single offset loop via Eqn. 2. An intuitive explanation of the improvement is illustrated by Figure 3 (b).

Let $f(x|\theta_{\text{poi}}^{*})$ denote the poisoned model with parameters $\theta_{\text{poi}}^{*}$. Let $M(\cdot|w)$ be a mapping from the logits to a real value in the range $[0,1]$, where $w$ denotes its parameters. The inner offset can be characterized by

$$\min_{w}\ \mathcal{L}_{1}=\frac{1}{|B_{\text{b}}|}\sum_{x\in B_{\text{b}}}\mathcal{L}_{\text{BCE}}\big(0,\,M(f(x|\theta_{\text{poi}}^{*})|w)\big),\qquad\min_{w}\ \mathcal{L}_{2}=\frac{1}{|B_{\text{poi}}|}\sum_{x\in B_{\text{poi}}}\mathcal{L}_{\text{BCE}}\big(1,\,M(f(x|\theta_{\text{poi}}^{*})|w)\big)$$

where $\mathcal{L}_{\text{BCE}}(p,q)=-p\log q-(1-p)\log(1-q)$ is the binary cross-entropy loss, and $B_{\text{b}}$ and $B_{\text{poi}}$ stand for a mini-batch drawn from the clean base set and the poisoned training set, respectively.

The first minimization objective encourages learning a mapping $M$ such that the mini-batch from the clean base set is labeled as "0"; the second objective further promotes $M$ to label the mini-batch from the poisoned set as "1". By minimizing the two objectives simultaneously, the effect on the clean data gets canceled. As a result, the clean samples will be predicted as "1" with low confidence, yet the poisoned ones will be predicted as "1" with high confidence. Then, we can mark the samples with the highest confidence, or equivalently the lowest BCE loss for predicting "1", as the suspicious poisoned samples. In practice, $M$ is implemented as a two-layer, fully-connected network with 128 hidden neurons. Again, to avoid stability issues, in the implementation we first take a gradient descent step to minimize $\mathcal{L}_{1}$ and then take another gradient descent step to minimize $\mathcal{L}_{2}$, and alternate between the two steps.

The pseudo-code for the inner offset loop is provided in Algorithm 1, termed Poison Concentration.
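The following toy sketch illustrates the inner offset on hand-made features; it is not Algorithm 1. It assumes a hypothetical logistic scorer in place of the two-layer network, two-dimensional features (clean feature, trigger indicator) in place of the poisoned model's outputs, full-batch steps, and a fixed top-2 selection in place of adaptive thresholding.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def score(w, z):
    """Toy scorer M(z|w): confidence in [0, 1] that feature vector z is poisoned."""
    return sigmoid(w[0] * z[0] + w[1] * z[1] + w[2])


def bce_grad(w, batch, target):
    """Gradient of the mean BCE loss over `batch` for a constant target label."""
    g = [0.0, 0.0, 0.0]
    for z in batch:
        err = score(w, z) - target      # derivative of BCE w.r.t. the pre-activation
        g[0] += err * z[0]
        g[1] += err * z[1]
        g[2] += err
    return [gi / len(batch) for gi in g]


def inner_offset(base_batch, poisoned_batch, lr=0.5, iters=100):
    """Alternate: label the base batch as 0, then label the poisoned batch as 1."""
    w = [0.0, 0.0, 0.0]
    for _ in range(iters):
        g = bce_grad(w, base_batch, 0.0)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        g = bce_grad(w, poisoned_batch, 1.0)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical features: z = (clean feature, trigger indicator).
base_batch = [(1.0, 0.0), (-1.0, 0.0)] * 5
poisoned_batch = [(1.0, 0.0), (-1.0, 0.0)] * 9 + [(1.0, 1.0), (-1.0, 1.0)]

w_inner = inner_offset(base_batch, poisoned_batch)
scores = [score(w_inner, z) for z in poisoned_batch]
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
suspicious = ranked[:2]                 # indices of the most confident "1" predictions
```

Because the two label assignments pull in opposite directions on clean features, the scorer stays undecided on clean samples and becomes confident only on the trigger coordinate, so the two poisoned samples (indices 18 and 19) receive the highest scores.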

Adaptive thresholding for the inner offset. The last step of Poison Concentration is to select the subset marked as poison based on the confidence score output by MM. We will elaborate on how to adaptively choose the size of this subset. First, directly adopting a fixed threshold to identify the most likely poisoned samples is impractical, as different mini-batches may contain different amounts of poisons. To tackle this problem, we adopt Adjusted Outlyingness (AO) to adaptively determine the number of most suspicious samples within each mini-batch. AO maps the BCE losses into a scale such that a fixed threshold can effectively identify the most suspicious samples. Note that AO does not aim to filter out as many poisoned samples as possible within the mini-batch; instead, it is adopted to achieve high precision, i.e., identifying a subset of the mini-batch that is dominated by poisoned samples. In the evaluation, we threshold the output of AO with 2. By the nature of AO, we are essentially adopting an adaptive threshold despite using a fixed output value (see Figure 8).
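To illustrate why a fixed cut-off on a normalized score adapts to each mini-batch, the sketch below uses a simpler stand-in for AO: the one-sided deviation from the median divided by the median absolute deviation (MAD). This is explicitly not Adjusted Outlyingness, which additionally corrects for skewness; the function name, the confidence values, and the use of the threshold 2 on this stand-in are assumptions for illustration.

```python
from statistics import median


def flag_suspicious(confidences, threshold=2.0):
    """Flag samples whose confidence lies far above the batch median.

    Stand-in score: (value - median) / MAD, thresholded at a fixed value.
    """
    med = median(confidences)
    mad = median([abs(c - med) for c in confidences])
    if mad == 0.0:
        return []                       # degenerate batch: nothing stands out
    return [i for i, c in enumerate(confidences) if (c - med) / mad > threshold]


# Hypothetical per-sample confidences for predicting "1" in two mini-batches.
batch_with_poison = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.95, 0.97]
batch_without_poison = [0.10, 0.12, 0.11, 0.09, 0.12, 0.10, 0.11, 0.12]

flagged_a = flag_suspicious(batch_with_poison)
flagged_b = flag_suspicious(batch_without_poison)
```

The same fixed cut-off flags two samples in the first batch and none in the second, so the number of samples marked as suspicious follows the content of each mini-batch rather than a preset poison ratio.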

4 Overall Workflow

The overall algorithm of ASSET with two offset loops is presented in Algorithm 2. Functionally speaking, the inner loop condenses the poison within each mini-batch drawn from the poisoned dataset, while the outer loop induces different model behaviors on clean samples and poisoned samples. At each iteration of the outer loop, we first minimize $\mathcal{L}_{\text{var}}$ by mini-batch gradient descent with samples from the clean base set; then, we perform the poison concentration step, in which the inner loop returns the subset of samples most likely to be poisoned; finally, we proceed to the maximization step of the outer loop, maximizing $\mathcal{L}_{\text{max}}$ by gradient ascent with the suspicious points returned by the inner loop. In the end, we obtain a detector model $f(\cdot|\theta_{\mathcal{I}})$ with parameters $\theta_{\mathcal{I}}$ obtained after $\mathcal{I}$ outer iterations, and this model induces different values of $\mathcal{L}_{\text{max}}$ between clean and poisoned samples.
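The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and the three callables (`minimize_on_base`, `concentrate_poison`, `maximize_on_suspects`) are hypothetical placeholders for the descent step on $\mathcal{L}_{\text{var}}$, the Poison Concentration inner loop, and the ascent step on $\mathcal{L}_{\text{max}}$. Skipping the ascent step when the inner loop returns an empty subset is also an assumption.

```python
from typing import Callable, Iterable, Sequence, Tuple

Batch = Sequence[object]


def run_outer_offset_loop(
    batch_pairs: Iterable[Tuple[Batch, Batch]],
    minimize_on_base: Callable[[Batch], None],
    concentrate_poison: Callable[[Batch, Batch], Batch],
    maximize_on_suspects: Callable[[Batch], None],
) -> int:
    """Alternate the two outer steps around the inner concentration step.

    Each element of batch_pairs is a pair
    (clean base mini-batch, mini-batch drawn from the poisoned dataset).
    Returns the number of outer iterations performed.
    """
    iterations = 0
    for base_batch, poisoned_batch in batch_pairs:
        # Outer minimization: descent step using clean base samples only.
        minimize_on_base(base_batch)
        # Inner loop: keep the samples most likely to be poisoned.
        suspects = concentrate_poison(base_batch, poisoned_batch)
        # Outer maximization: ascent step on the suspicious subset
        # (assumed to be skipped when the subset is empty).
        if len(suspects) > 0:
            maximize_on_suspects(suspects)
        iterations += 1
    return iterations
```

Passing the steps in as callables keeps the sketch independent of any particular deep learning framework; in a real pipeline each callable would wrap an optimizer step on the detector model.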

Adaptive thresholding for the outer loop. With the trained detector model parameterized by $\theta_{\mathcal{I}}$, we now discuss how to identify the poisoned samples. Similar to the inner loop, we propose an adaptive thresholding method for the outer loop as well. Note that the thresholds of the inner and outer loops have distinct goals. The inner loop aims to identify a subset with a high density of poisons, while the outer loop aims to adaptively split the clean and poisoned loss distributions so that the detector removes as many poisons as possible while maintaining a low false positive rate. In other words, high precision is prioritized for the inner loop, whereas high recall is prioritized for the outer loop.

As will be shown later, after the overall optimization, $f(\cdot|\theta_{\mathcal{I}})$ will output distinct loss distributions for the clean and poisoned samples. One might be tempted to directly fit a Gaussian Mixture Model (GMM) with two components. However, doing so is problematic, as depicted in Figure 4. Since there are usually far fewer poisoned samples than clean ones, the GMM tends to split the multi-modal clean distribution into two Gaussian distributions instead of fitting one Gaussian to the clean distribution and one to the poison distribution.

To tackle this problem, we propose a simple twist of GMM, termed adaptive GMM. We first abandon the half of the samples achieving the highest values of $\mathcal{L}_{\text{max}}$, which will remove all the poisoned samples (we assume the attacker can poison no more than half of the training dataset, Section 3). Then, we fit a Gaussian to the remaining points. Since the optimized detector model largely centers the clean samples’ loss close to $\mathcal{L}_{\text{var}}=0$ or $\mathcal{L}_{\text{ce}}=-\log(\frac{1}{k})$, the Gaussian fitted on the remaining samples remains similar to the Gaussian fitted on all the non-poisons (see Figure 4 (b)). Lastly, we set a small threshold on the Gaussian density, $\beta$, to cut off the samples that are unlikely to be generated from the fitted Gaussian. In practice, we set the cut-off threshold as $\beta=10^{-6}$, which equivalently keeps the lowest-loss samples with a probability higher than $99.99\%$ of being generated from the fitted Gaussian (for any Gaussian distribution with a variance smaller than $10$).
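The thresholding rule above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: the function name `adaptive_gaussian_flags` is hypothetical, the Gaussian is fitted with the population standard deviation, a small floor on the standard deviation guards against degenerate inputs, and only samples above the fitted mean are flagged (a one-sided cut, assumed here because poisoned samples are the high-loss ones).

```python
import math
import statistics
from typing import List, Sequence


def adaptive_gaussian_flags(losses: Sequence[float], beta: float = 1e-6) -> List[bool]:
    """Flag samples whose loss is unlikely under a Gaussian fitted to the
    lower-loss half of the data. True means "suspected poison"."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    ordered = sorted(losses)
    # Discard the half with the highest loss; under the <=50% poison
    # assumption this removes all poisoned samples before fitting.
    kept = ordered[: len(ordered) // 2]
    mu = statistics.fmean(kept)
    # Floor on sigma is an assumption to avoid division by zero.
    sigma = max(statistics.pstdev(kept), 1e-12)

    def density(x: float) -> float:
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    # One-sided cut (assumption): flag only high-loss, low-density samples.
    return [x > mu and density(x) < beta for x in losses]
```

Because the Gaussian is fitted only to the retained half, the rare poisoned samples cannot pull a mixture component toward themselves, which is the failure mode of the plain two-component GMM described earlier.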

Evaluation

Our evaluation aims to answer the following questions.

Case-0 (Section 5.2): How does ASSET compare with other methods in the end-to-end SL setting? Is detection effective when multiple attacks exist simultaneously? How does the detection performance vary over different attacks and poison ratios?

Case-1 (Section 5.3): Can ASSET robustly detect attacks in SSL settings? How does the knowledge about downstream tasks affect the defense’s effect?

Case-2 (Section 5.4): Can ASSET provide reliable backdoor sample detection in TL settings? What are the limitations of other defenses in this setting?

Adaptive Attack (Section 5.5): Is it possible to adaptively evade ASSET’s detection?

Ablation Study (Appendix 6.4): How do different design choices affect the final performance of ASSET?

Evaluation metrics. There are two key aspects throughout our evaluation: (1) How accurately can the poisoned samples be detected (upstream evaluation)? (2) After the suspicious points are removed, how well does a downstream model trained on the remaining data perform (downstream evaluation)?

For upstream evaluation, we utilize two metrics, namely, True Positive Rate (TPR), $TPR={TP}/{(TP+FN)}$, and False Positive Rate (FPR), $FPR={FP}/{(FP+TN)}$, where $TP$, $FP$, $TN$, and $FN$ denote the number of true positives, false positives, true negatives, and false negatives, respectively (poison is considered positive and clean is considered negative). TPR depicts how well a specific backdoor detection method filters out the backdoored samples. A higher TPR (closer to 100%) denotes a stronger filtering ability. FPR depicts how precise the filtering is: when a specific method achieves a TPR that is high enough, FPR helps us understand the trade-off, i.e., how many clean samples are wasted by being wrongly flagged as backdoored during the detection. A lower FPR shows that fewer clean samples are wasted, and more clean data is kept and available for downstream usage.
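As a small worked example of the two upstream metrics, the sketch below computes TPR and FPR from per-sample ground truth and detector flags. The function name `detection_rates` is hypothetical, and returning 0.0 when a denominator is zero is an assumption made for convenience.

```python
from typing import Sequence, Tuple


def detection_rates(is_poison: Sequence[bool], flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, FPR), treating poison as positive and clean as negative."""
    if len(is_poison) != len(flagged):
        raise ValueError("inputs must have the same length")
    tp = sum(1 for p, f in zip(is_poison, flagged) if p and f)
    fn = sum(1 for p, f in zip(is_poison, flagged) if p and not f)
    fp = sum(1 for p, f in zip(is_poison, flagged) if not p and f)
    tn = sum(1 for p, f in zip(is_poison, flagged) if not p and not f)
    # Assumption: report 0.0 when a class is absent instead of raising.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# 4 poisoned and 8 clean samples; 3 poisons and 1 clean sample are flagged.
truth = [True] * 4 + [False] * 8
flags = [True, True, True, False, True] + [False] * 7
print(detection_rates(truth, flags))  # (0.75, 0.125)
```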

One thing worth noting is that no detection method can reliably remove all the poisoned samples. However, the remaining backdoor samples that go unnoticed by a successful defense should be few enough to deactivate attacks. Thus, we evaluate the backdoor attacks’ Attack Success Rate (ASR) on the downstream model trained using the filtered dataset to study whether the detection is good enough to stop attacks. ASR measures the proportion of backdoored test samples being classified into target classes. Additionally, we evaluate the downstream model’s Clean Accuracy (ACC). A high ACC means that the detection method is able to maintain a large enough clean set to support the model performance.
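The two downstream metrics can be computed from model predictions as in the following sketch. The function names are hypothetical, and the ASR here counts every triggered test sample, without excluding samples whose true label already equals the target class; whether such samples are excluded in the paper's evaluation is not stated in this section, so treat that as an assumption.

```python
from typing import Sequence


def attack_success_rate(predictions_on_triggered: Sequence[int], target_class: int) -> float:
    """Fraction of triggered test samples classified as the target class."""
    if not predictions_on_triggered:
        raise ValueError("no triggered samples given")
    hits = sum(1 for pred in predictions_on_triggered if pred == target_class)
    return hits / len(predictions_on_triggered)


def clean_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clean test samples classified correctly."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


print(attack_success_rate([3, 3, 1, 3], target_class=3))  # 0.75
print(clean_accuracy([0, 1, 2, 2], [0, 1, 2, 3]))         # 0.75
```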

Dataset & models. We incorporate three standard computer vision benchmark datasets into our evaluation: CIFAR-10 (main text), STL-10 (Appendix 6.3), and ImageNet (a randomly selected 100-class subset, Appendix 6.3). To ensure the effectiveness of the baselines and a fair comparison, we set the base set size to 1000 for all the settings. We will later show in the ablation study (Appendix 6.4) that our method is robust to different choices of the base set size. We obtain a 1000-size clean base set for each dataset by randomly selecting samples from the test set and removing their label information. All the upstream evaluation metrics (i.e., TPR and FPR) are evaluated on the respective training sets, i.e., the training set for Case-0, the fine-tuning set for Case-2, and the unlabeled pre-training set for Case-1. For Case-0, we adopt all the remaining data from the test set for evaluation of the downstream metrics (i.e., ACC and ASR). For Case-1 and Case-2, we split the remaining test set in half, with one half used as the fine-tuning set and the other half as the downstream metric evaluation set. ResNet-18 is adopted on CIFAR-10. ViT-Small/16 is adopted on STL-10 and ImageNet (Appendix 6.3). For Case-1, we incorporate four state-of-the-art SSL training methods, i.e., SimCLR, MoCo V3, BYOL, and MAE, for evaluation. For Case-2, we consider the two most popular transfer learning cases, namely, FT-all and FT-last (detailed in Section 2). The pre-trained model parameters for fine-tuning are loaded from the timm library (https://timm.fast.ai/).
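The data split described above can be sketched as follows. This is an illustrative helper under stated assumptions, not the authors' code: the function name, the `needs_finetune_set` flag, and the fixed seed are hypothetical, and the test set is assumed to be a sequence of (input, label) pairs.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def split_test_set(
    test_set: Sequence[Tuple[X, Y]],
    base_size: int = 1000,
    needs_finetune_set: bool = False,
    seed: int = 0,
) -> Tuple[List[X], List[Tuple[X, Y]], List[Tuple[X, Y]]]:
    """Return (unlabeled base set, fine-tuning set, downstream evaluation set)."""
    if base_size > len(test_set):
        raise ValueError("base_size exceeds the test set size")
    indices = list(range(len(test_set)))
    random.Random(seed).shuffle(indices)
    # Base set: random test samples with their labels removed.
    base_inputs = [test_set[i][0] for i in indices[:base_size]]
    remaining = [test_set[i] for i in indices[base_size:]]
    if not needs_finetune_set:
        # Case-0: all remaining test data is used for ACC/ASR evaluation.
        return base_inputs, [], remaining
    # Case-1 / Case-2: half for fine-tuning, half for evaluation.
    half = len(remaining) // 2
    return base_inputs, remaining[:half], remaining[half:]
```

Drawing the base set from the test split keeps it disjoint from the (possibly poisoned) training data, which is what allows it to serve as a clean reference.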

Baseline defenses. Referring to Table 1, we incorporate a wide range of existing backdoor detection methods for comparison, including both standard baselines used in prior work and state-of-the-art ones. In particular, we consider Spectral, Spectre, and Beatrix; we include AC as a representative work that utilizes intermediate neural activation; ABL, which was originally a robust training defense and is repurposed here as a detection method based on output losses; Strip as a representative detection approach based on model outputs; and CT, the most recent work, reported to achieve state-of-the-art performance in end-to-end SL settings based on confusion training. All the implementations and hyperparameters follow the original papers. For methods that rely on or can be boosted by an additional base set, e.g., Spectre, Beatrix, Strip, and CT, we use the same 1000-size base set as ours. We note that this comparison setting might not be fair, since, compared to these baselines, our method relaxes the requirement on label information; in addition, AC and ABL cannot be adapted to use the base set. We want to show that even without label information, our method can still achieve comparable or much better results with stronger robustness than the other baselines. Detailed explanations of the defense settings and how we adapted them to Case-1 and Case-2 are provided in Appendix 6.1.

Backdoor attack settings. For Case-0, we incorporate seven standard or state-of-the-art attacks, including four dirty-label and three clean-label ones. For dirty-label backdoor attacks, we incorporate the localized backdoor attack BadNets, the global blended-trigger attack Blended, the warping-based invisible backdoor attack WaNet, and the state-of-the-art sample-specific invisible backdoor attack, ISSBA. For clean-label attacks, we include the standard Label Consistent (LC) attack, the state-of-the-art feature-collision-based hidden trigger backdoor, Sleeper Agent Attack (SAA), and the state-of-the-art optimization-based Narcissus attack (Narci.). For Case-1, only limited existing work has explored attacks on SSL’s unlabeled training set. We incorporate the Checkerboard trigger (C-brd) used in , the Colored Square trigger (C-squ) used in , and the state-of-the-art YCbCr frequency-based invisible trigger used in CTRL . In particular, CTRL has been shown to achieve an order of magnitude higher attack efficacy than . For Case-2, directly implementing some of the attacks from end-to-end SL may not lead to effective attacks, e.g., the Blended attack cannot achieve a high ASR under the FT-all setting. Thus, we consider attacks that can maintain effectiveness for each TL setting. BadNets and SAA are adopted for evaluation under the FT-all case. Blended and the hidden trigger backdoor attack (HTBA) are adopted for evaluation under the FT-last case. The settings of all incorporated attacks, such as trigger design and trigger strength, follow their original papers. Appendix 6.2 details the specifics of these attacks’ setups under each learning paradigm and gives visual examples of the poisoned samples we intend to detect.

5.2 Case-0: End-to-end SL

Detection performance against different attacks in SL. Table 2 presents the upstream and downstream evaluation results under the end-to-end SL setting on the CIFAR-10 dataset with the ResNet-18 model trained from scratch for 200 epochs. For each attack, we adopt the poison ratio from its original paper, which is listed at the top of each column. We include the “No Defense” row in Table 2 (b) to show the attack effects without any backdoor detection in place. Existing methods achieve decent detection on some specific attacks, but they experience large performance variations across attacks. These methods either rely solely on the embedding space of a poisoned model, which may change with different trigger designs, or rely on some detection rule that may not apply to specific backdoor designs. For example, ABL assumes that backdoor samples achieve the lowest loss at the early stage of training. However, the losses of the Narci. clean-label poisoned samples do not meet this assumption; thus, ABL is not effective on Narci. The recently proposed CT achieves the highest detection rate and the most consistent performance among all baselines, but it still fails to detect the state-of-the-art clean-label attack, Narci. Notably, no existing detection method obtains satisfying results on Narci., as it introduces optimized features as robust as the semantic features of the target class. Regarding the upstream evaluation in Table 2 (a), our method reliably achieves a TPR above 90% for all the evaluated settings and significantly improves the state of the art in terms of the average and worst-case defensive performance over different attacks. Regarding the downstream evaluation in Table 2 (b), we find that ASSET is the only defense that gives rise to robust models over all the evaluated poisoned datasets, i.e., all ASRs drop below the random guessing rate of 10%. In particular, our method is the only effective method to mitigate Narci. Moreover, the downstream models trained over ASSET-filtered datasets achieve the highest average ACC. Notably, the average ACC of our method is slightly higher than that obtained with the original poisoned dataset (which contains more clean samples). Results for multiple attacks introduced simultaneously are provided in Appendix 6.3, with similar observations.

Unlike ASSET, the existing methods do not have an active process to induce differentiating behaviors between clean samples and poisoned ones. Thus, clean and poisoned samples often have overlapping behaviors and cannot be easily separated. We illustrate the separation between clean and poisoned samples under different detection methods, together with their thresholds, in Figure 5, emphasizing the importance of the proposed active offset process.

Impact of poison ratios. In Table 3, we study the effects of the poison ratio on different detection methods against two standard attacks, namely, BadNets and the Blended attack. Most existing detection methods work better for small poison ratios but fail as the ratio increases. One reason is that many works, such as Spectral, Spectre, and AC, are based on the feature distribution of the poisoned dataset. However, an increased poisoning rate will cause the clean feature distribution to be closer to the poisoned one, making them less separable. CT is the most robust baseline in the previous evaluation, but it also fails for very large ratios like 20% (10000 poisons) or 50% (25000 poisons). The reason could be that its detector uses fixed hyperparameters that are tuned on small poison ratios. Our defense is robust to poison ratio changes, even for extreme cases where half of the samples in the training set are poisoned or only 25 ($0.05\%$) samples are poisoned.

5.3 Case-1: SSL Adaptation

Detection performance against different attacks in SSL. Now we study the efficacy in detecting unlabeled poisons under the SSL adaptation cases. Table 4 and Table 5 list the upstream and downstream evaluation results, respectively, on CIFAR-10 using ResNet-18 trained via SimCLR-based SSL for 600 epochs with linear adaptation for 100 epochs. We find that the ASRs of C-brd and C-squ are below 20%, so these attacks do not succeed on average. We still keep their results but show the number of successfully attacked samples (denoted by $ASR^{*}$) as done in . Even though these attacks do not result in as high an ASR as the attacks in SL or the CTRL attack, they still increase the number of triggered samples classified as the target class. As shown in Table 4, across all the evaluated attacks, our method obtains the highest TPR while maintaining the lowest FPR among all detection methods. Note the absence of CT in the SSL setting. Recall that in the SL setting, CT achieves results comparable to our method on most attack settings; yet, it is inapplicable to SSL because its core technique, confusion training, relies on label information. In particular, as C-brd and C-squ do not result in a high ASR, as shown in Table 5, the model’s response to clean and backdoor samples is not sufficiently different, thereby making detection very difficult. In fact, none of the baselines provides reliable detection of these two attacks. For the CTRL attack, which achieves an ASR of over 80%, we start to see that some of the baseline defenses take effect, e.g., Beatrix. But still, our method achieves the best upstream detection performance (Table 4) and gives rise to the highest ACC and lowest ASR downstream (Table 5).

Further evaluation with more SSL training algorithms. We further evaluate our defense under other popular SSL training algorithms and different model structures and datasets, e.g., ResNet-18 and ViT-Small/16 trained using SimCLR, MoCo V3, BYOL, and MAE over CIFAR-10 or ImageNet (Appendix 6.3). The upstream and downstream evaluation results on CIFAR-10 are shown in Table 6 and Table 7, respectively. Across all the evaluated settings, our method provides reliable upstream detection results, with TPRs over 90% for all the cases and low FPRs. Thanks to the upstream efficacy, our detection method gives rise to downstream models with a low ASR and an ACC close to or better than that of the settings without removing any training point. Overall, our results demonstrate that our method can reliably sift out the poisoned samples across different settings of SSL adaptation.

Impact of # logits w.r.t. SSL downstream task. Note that for SSL evaluation, the pre-trained model requires a fixed number of logits, each corresponding to a different output category. In our evaluation, we use the actual number of classes (e.g., 10 for CIFAR-10 and 100 for the ImageNet 100-class subset). Such a setting is applicable when the defender knows the exact downstream classification task. Now we consider a much stricter case where one tries to conduct detection over unlabeled datasets without any prior knowledge about the number of categories in downstream tasks. As shown in Table 8, we find that our method is robust to changes in the number of logits and maintains a TPR higher than 90%.

5.4 Case-2: Transfer Learning

Detection performance against different attacks in TL. We consider two of the most popular TL schemes for evaluation: FT-all and FT-last with models pre-trained on the ImageNet. All the existing backdoor defenses can be easily generalized to TL. However, none of them has empirically evaluated the backdoor detection efficacy under the TL settings in the prior literature, which leaves a gap to fill.

The upstream and downstream results are listed in Table 9. Existing methods’ detection results on FT-all seem more consistent than the results on FT-last. This observation might be because FT-all is a setting much closer to end-to-end SL. While many defenses can achieve satisfying results on some specific attacks in SL, none can achieve a TPR above 90% for all attack settings in TL, except CT on BadNets. We now take a closer look at why existing detection methods fall short in TL. We depict the feature space t-SNE results comparing the attacks in Case-0 and Case-2 in Figure 6. Since in TL the model parameters have been initialized with additional knowledge obtained from pre-training, clean and poisoned samples are harder to separate in the embedding space, resulting in worse detection than in SL. As shown in Figure 6, for both BadNets and the Blended attack, the clean and poisoned samples overlap more in the TL case than in SL. These results emphasize the importance of introducing active measures to increase separability.

On the other hand, for all the evaluated settings on the two datasets (CIFAR-10 and STL-10, Appendix 6.3), our method consistently achieves the best TPR, FPR, ASR, and ACC. Remarkably, the average upstream and downstream performance of ASSET is an order of magnitude better than that of the seven baselines. The results highlight that actively inducing different model behaviors helps a detection method remain robust to DL paradigm shifts.

5.5 Adaptive Attack Analysis

From the above, we find ASSET is the most reliable detection method across different attacks, datasets, poison ratios, and training paradigms. Now we study adaptive attacks, where we want to understand how an attacker’s knowledge about defense implementation impacts defense performance.

Attacker goal & settings. The attacker aims to craft poisoned samples resulting in a low TPR while maintaining a low FPR for upstream detection, and resulting in a high ASR while maintaining a high ACC for the downstream poisoned model. A successful adaptive attack should achieve satisfying results on these metrics simultaneously. We consider two models of attacker knowledge: White-box attack and Gray-box attack. (1) White-box Settings. The attacker has full access to the details of ASSET, namely, the workflow of ASSET; the architecture of the detector model and of the feature extractor; the architecture of the weighting network that will be used for poison concentration; the original poisoned dataset, $D_{\text{poi}}$; and the clean base set, $D_{\text{b}}$. Although such disclosure of the defense details is rare in practice, an investigation of this setting gives insights into the worst-case performance of ASSET. (2) Gray-box Settings. We also consider a more realistic attack scenario where the attacker is aware of the ASSET pipeline and the respective datasets but not of the specific model architectures used by the defender for conducting the detection and performing downstream tasks. In both the White-box and Gray-box attacks, the attacker updates the original poisoned samples in $D_{\text{poi}}$ and then supplies the updated dataset to the defender.

Attack design. For both White-box and Gray-box attacks, we investigate optimization-based techniques to design poisoned samples that evade ASSET. The attacker can use $D_{\text{poi}}$ and $D_{\text{b}}$ to obtain trained detector parameters, $\theta_{\mathcal{I}}$, and then solve the following optimization to obtain an additive noise for each poisoned sample $x_{\text{poi}}$ in $D_{\text{poi}}$ to evade the detection

To conclude, the above study shows that ASSET is robust to the evaluated White-box attack when combined with the standard unlearning procedure using the detected samples, and robust to the evaluated Gray-box attack. The results highlight that disclosing the knowledge of our defense workflow and models can expose ASSET to the risk of adaptive attacks. Not releasing the model architecture can mitigate the risk of adaptive attacks to a large extent. Also, using the detected samples for unlearning can be a simple yet effective post-processing method that can be used in tandem with our detection to safeguard ML applications against adaptive attacks on our defense. One thing worth highlighting is that the unlearning process requires the detection method to obtain high precision upstream. Otherwise, if the upstream FPR is high (more clean samples are wrongly flagged), the downstream unlearning would have an unfavorable impact on the ACC (e.g., the White-box LC results).

Conclusion

This work is motivated by the glaring gap between the focused evaluation of the end-to-end SL settings in prior backdoor detection literature and the fast adoption of other more data- and computation-efficient learning paradigms, including SSL adaptation and TL. We find that existing detection methods cannot be applied or suffer limited performance for SSL and TL; even for the widely studied end-to-end SL setting, there is still large room to improve detection in terms of robustness to variations in poison ratio. This work proposes a novel idea for actively enforcing different model behaviors on clean and poisoned samples through a two-level nested offset loop. Our approach provides the first backdoor defense that operates across different learning paradigms, different attack techniques, and poison ratios.

Our work opens up many directions for future work. (1) Theoretical Understanding of Offset: Despite the empirical success, an in-depth understanding of the convergence behaviors and sample complexity of ASSET is still lacking. In addition, we have shown multiple offset objectives, but how to explain why one loss design is better than another is still an open question. (2) Alternative Offset Goal Designs: Our work provides a general algorithmic framework for active backdoor data detection by optimizing opposite goals. Are there other optimization objectives beyond what we proposed in this paper that can lead to better detection performance? (3) Extension to Broader Data Types: Evaluating ASSET on domains beyond images and texts is of practical importance.

Acknowledgement

RJ and the ReDS lab appreciate the support of the Amazon - Virginia Tech Initiative for Efficient and Robust Machine Learning and the Cisco Award. YZ is supported by the Amazon Fellowship. XL gratefully acknowledges the support of National Science Foundation Award No. CNS-1929300.

References

Appendix

In the evaluation section, we provide a thorough comparison of existing backdoor detection techniques. These methods can be classified into several categories, including Spectral, Spectre, and Beatrix, which utilize analysis of activation patterns; AC, which leverages clustering of feature information; ABL, which detects the lowest loss from poisoned datasets; Strip, which focuses on logits of sample outputs; and CT, which employs confusion training in end-to-end supervised learning settings.

Note that the above baseline defenses were only evaluated under the settings of end-to-end SL (Case-0) in their original papers. They can also be directly generalized to Case-2. We incorporate the above seven baseline defenses in Case-0 and Case-2 with the hyperparameters suggested in the original works for comparison. As for Case-1, some of the methods are not applicable, whereas others can be adapted to operate without label information. In particular, Strip and CT are label-information-dependent methods, which are excluded from evaluation in Case-1. The vanilla design of Spectral and Spectre uses a feature extractor trained with label information. In our Case-1 experiment, we replace the feature extractor trained with labels with one trained using the SSL paradigm. The original implementations of Beatrix and AC process samples class by class. However, since there is no label information in Case-1, we process all training samples together. For ABL, we replace the original implementation’s Cross-Entropy loss with the training loss function used in the respective SSL algorithm (e.g., the InfoNCE loss for MoCo V3).

6.2 Detailed Attack Settings

In this work, we examine several representative attacks for each category of attack design. For Case-0, which is the end-to-end supervised learning setting discussed in Section 5.2, we thoroughly investigate existing dirty-label and clean-label backdoor attacks. Dirty-label attacks create a backdoor by altering the label of the poisoned samples to the target class. We selected some representative attacks for experiments. For example, BadNets and Blended create triggers by simply superimposing special patterns; WaNet applies warping transformations that are difficult to notice in images; and ISSBA trains an encoder to create a distinct backdoor trigger for each sample. On the other hand, clean-label backdoor attacks maintain the original label of the poisoned samples. Examples include LC, which makes models learn simple triggers by patching adversarial noise on the remaining part of the sample; SAA, which produces effects through model feature collisions; and the state-of-the-art attack Narcissus, which obtains the backdoor trigger by optimizing the distribution within the class and the connection to the target label. For these three clean-label backdoor attacks, we set $l_{\infty}=16/255$ to ensure the consistency of the attack. For Case-1, we consider the backdoor attack in the SSL setting (detailed in Section 5.3). Since the training does not require labels and always contains strong augmentations, traditional attacks against SSL are not effective. However, with the development of this training paradigm, attacks against it have started to emerge. There are attacks that superimpose specifically designed patterns and attacks that add noise at specific frequencies in the YCbCr color space. C-brd and C-squ adopt a fixed in-class poison ratio w.r.t. only the samples from the targeted category (50% in-class), following . CTRL adopts a fixed poison ratio w.r.t. the whole dataset (1% of all the samples), following .
For Case-2, we investigate the attacks in the context of transfer learning, as described in Section 5.4. Our evaluation revealed that adding backdoor attack samples to the fine-tuning dataset leads to a successful attack. Basic backdoor attacks, such as BadNets and Blended, can easily be generalized and result in an effective attack. Furthermore, attacks based on the collision of the model’s feature space, such as SAA or HTBA, can also work in this scenario. All the attacks use the default settings in the original papers to ensure consistency with the original work.

6.3 Additional Results

In addition to the results presented in the main text, we also evaluate the performance of the baseline defenses in different attack settings and dataset settings.

Additional Results with Multiple Attacks. For Case-0, we test the scenario where multiple backdoor attacks appear simultaneously in a training set. We deploy the four dirty-label attacks from the main text into four different classes of the CIFAR-10 dataset, with poison ratios consistent with the main text. The ASR of every attack is above 90%, confirming that the attacks are effective. The results are listed in Table 14. When multiple attacks are present, none of the baseline defense methods except CT can maintain reliable detection, as at least one set of poisoned samples ends up with a TPR lower than 50%. Our method achieves the highest average TPR among all defenses and demonstrates better and more consistent detection performance, with all TPRs above 85% under this setting.

Additional Results with SSL. For Case-1, we evaluate the results on the ImageNet-100 dataset. ImageNet-100 is a subset of ImageNet-1K, consisting of 100 randomly selected classes (about 128,000 samples), which is currently the most popular benchmark dataset for self-supervised learning. All images are resized to 224x224 pixels to fit the model input. Here we use self-supervised learning methods consistent with those in Section 4.3, including the contrastive learning methods SimCLR, MoCo V3, and BYOL, and the masked-model training method MAE. All backbone models are ViT-Small/16 to obtain a satisfactory ACC. The upstream and downstream results can be found in Table 15 and Table 11, respectively. As the dataset becomes more complex compared to CIFAR-10, detection also becomes more difficult. Nevertheless, our method provides a TPR greater than 88% in all cases. All FPRs are below 0.5%, providing as many clean samples as possible for subsequent downstream tasks and minimizing the impact on ACC. In the downstream task, our method reduces the ASR to a level not significantly above that of the poison-free baseline, indicating that our method was successful in removing the poison. At the same time, thanks to the extremely low FPR, the ACC of the model increases compared to the poisoned model.

Additional Results with TL. Finally, in Case-2, we present the upstream and downstream results on STL-10 in Table 16, where all images were scaled to 224x224 pixels to align with the ImageNet-1K pre-trained ViT-Tiny/16 model. Our method consistently achieves a TPR of over 90%, while keeping the FPR below 0.6%. Compared to other defense methods, our method achieves the best average TPR and FPR. In the downstream tasks, which benefit from the high TPR and low FPR, our method keeps all ASRs below 20%, ensuring that attacks do not take effect. Our method obtains the highest average ACC as well as the lowest average ASR.

Visual Results of Adaptive Attacks. Figure 7 depicts the visual results of the adaptive attacks discussed in Section 5.5.

Additional Results on Other Modality. We provide additional results exploring the applicability of ASSET to detecting backdoor samples in the Natural Language Processing domain. We implemented the BadNets attack (https://github.com/thunlp/OpenBackdoor) on the SST-2 dataset with BERT as the target model. We set the poisoning rate to 10%, with the trigger "cf mn bb tq". We observe that ASSET can achieve good detection results. Compared to AC evaluated under the same settings, our method provides more effective detection. One possible explanation for AC's limited effectiveness is that the BERT model relies on pre-trained features, which limits the separability achievable by feature-space clustering.

Computation Overhead. Table 13 compares the computation overhead of ASSET and other baseline methods in Case-0.

Ablation Study

Table 17 shows that adopting the outer offset loop alone is limited in low poison ratio cases. When the poison ratio is low, poisoned samples account for a small proportion of each mini-batch, so the model tends to optimize its output for clean samples and largely ignores its output for poisoned samples, which leads to limited performance. This limitation can be effectively overcome by embedding an inner loop that performs poison concentration. In a relatively high poison ratio setting (e.g., 20%), where the outer loop alone already achieves good detection performance, inserting an inner loop is still useful and further boosts detection efficacy. The design of the inner loop is thus the key to our successful defense under the very low poison ratios in Table 3.
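To make the mini-batch argument concrete, the following back-of-the-envelope calculation treats each sample in a mini-batch as poisoned independently with probability equal to the poison ratio. The batch size $B = 128$ and the low ratio $p = 0.5\%$ are illustrative assumptions, not settings taken from Table 17; $k$ denotes the number of poisoned samples in a mini-batch.

```latex
\[
\mathbb{E}[k] = Bp, \qquad \Pr[k = 0] = (1 - p)^{B}
\]
\[
B = 128,\ p = 0.005:\quad \mathbb{E}[k] = 0.64, \qquad \Pr[k = 0] = 0.995^{128} \approx 0.53
\]
\[
B = 128,\ p = 0.2:\quad \mathbb{E}[k] = 25.6
\]
```

Under these assumptions, roughly half of the mini-batches contain no poisoned sample at the low ratio, so the averaged objective is dominated by clean samples, whereas at the 20% ratio every mini-batch carries a sizable poisoned share.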

We ablate the size of the base set used in our detection, and the results are provided in Table 18. We find that the detection performance decreases slightly as the base set shrinks; nevertheless, ASSET can achieve strong performance even with 10 samples, i.e., one sample per class on CIFAR-10. Our experiment confirms our conclusion in Section 4.1 that the base set and the clean portion of the poisoned dataset share the same clean distribution, while the clean samples and poisoned samples originate from distinct distributions.
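A class-balanced base set of the kind used in this ablation can be drawn as in the sketch below. The helper `sample_base_set` is hypothetical and only illustrates the "fixed number of samples per class" construction; it assumes the labels of a trusted clean pool are available as a plain list.

```python
import random
from collections import defaultdict

def sample_base_set(labels, per_class=1, seed=0):
    """Pick `per_class` indices uniformly at random from each class.

    labels -- list of class labels for a trusted clean pool
    Returns the selected indices, grouped by ascending class label.
    Raises ValueError if some class has fewer than `per_class` samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    base = []
    for y in sorted(by_class):
        base.extend(rng.sample(by_class[y], per_class))
    return base

# Toy pool with 10 classes and 10 samples per class.
pool_labels = [i % 10 for i in range(100)]
base_indices = sample_base_set(pool_labels, per_class=1)
print(len(base_indices))  # 10 indices, one per class
```

Raising `per_class` reproduces the larger base set sizes considered in the ablation.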

Figure 8 depicts AO's impact on mini-batches from the same poisoned training set. In particular, even though the two mini-batches are from the same distribution, the number of poisoned samples varies due to random sampling. Because different numbers of poisoned samples result in different distributions of the loss values, it becomes harder for the inner loop to use a fixed threshold or fixed ratio to determine the most likely poisoned samples to form $B_{\text{pc}}$. AO helps to map the distribution adaptively so that a fixed threshold can consistently obtain the poison-concentrated subset.
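The variability that motivates AO can be reproduced with a small simulation. The sketch below does not implement AO; it only counts how many poisoned samples land in each randomly drawn mini-batch. The dataset size, poison ratio, and batch size are illustrative assumptions.

```python
import random

def poison_counts(n_samples, poison_ratio, batch_size, n_batches, seed=0):
    """Count poisoned samples in randomly drawn mini-batches."""
    rng = random.Random(seed)
    n_poison = round(n_samples * poison_ratio)
    # True marks a poisoned sample, False a clean one.
    is_poison = [True] * n_poison + [False] * (n_samples - n_poison)
    counts = []
    for _ in range(n_batches):
        batch = rng.sample(is_poison, batch_size)  # sampled without replacement
        counts.append(sum(batch))
    return counts

# Assumed setting: 50,000 samples, 5% poisoned, mini-batches of 128.
counts = poison_counts(50_000, 0.05, 128, 1000)
print("min:", min(counts), "max:", max(counts),
      "mean:", sum(counts) / len(counts))
```

Because the count fluctuates around its mean from batch to batch, selecting a fixed fraction of each mini-batch would include clean samples in some batches and miss poisoned ones in others, which is the difficulty the paragraph above describes.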

While ASSET exhibits robust performance across a range of attack settings, its effectiveness may fluctuate depending on the quality of the base set.

Sampling quality of the base set. In this paper, the base set follows the widely accepted setting that it is drawn from the same distribution as the training set. However, it is worth noting that, in practice, distributional drift may occur between the training and base sets. To test how ASSET fares in the face of such drift, we report in Table 19 the detection results obtained when samples taken from different datasets serve as base sets for poison detection on CIFAR-10 (BadNets attack, Case-0). Our observations suggest that ASSET can consistently generate acceptable detection results if the distributional drift does not drastically alter the task context, as evidenced by the results with CIFAR-100 and STL-10. However, the detection performance falters when an out-of-distribution dataset is used as the base set, as exemplified by the traffic sign dataset GTSRB.

Poisons in the base set. Stronger attack settings may enable attackers to tamper with the base set. Implementing this setting is challenging, and it has rarely been discussed in prior work because it is difficult to embed the exact trigger into a carefully scrutinized base set without raising any alerts. We evaluate the impact of different poison ratios in the base set in Table 20 and find that 10 poisoned samples infiltrating the base set cause the detection to become ineffective.

Remark. The above results relating detection efficacy to base set quality are unsurprising. The sensitivity of detection efficacy to the quality of the base set is not exclusive to ASSET; it is likewise a noted drawback of numerous defensive methods that rely on a clean in-distribution base set, as observed and discussed in prior work. The experimental results highlight the importance of obtaining high-quality base sets, with attention to distributional drift and security inspection. How to effectively acquire a high-quality base set is out of the scope of this paper.