Universal Adversarial Triggers for Attacking and Analyzing NLP
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh
Introduction
Adversarial attacks modify inputs in order to cause machine learning models to make errors Szegedy et al. (2014). From an attack perspective, they expose system vulnerabilities, e.g., a spammer may use adversarial attacks to bypass a spam email filter Biggio et al. (2013). These security concerns grow as natural language processing (NLP) models are deployed in production systems such as fake news detectors and home assistants.
Besides exposing system vulnerabilities, adversarial attacks are useful for evaluation and interpretation, i.e., understanding a model’s capabilities by finding its limitations. For example, adversarially-modified inputs are used to evaluate reading comprehension models Jia and Liang (2017); Ribeiro et al. (2018) and stress test neural machine translation Belinkov and Bisk (2018). Adversarial attacks also facilitate interpretation, e.g., by analyzing a model’s sensitivity to local perturbations Li et al. (2016); Feng et al. (2018).
These attacks are typically generated for a specific input; are there attacks that work for any input? We search for universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. The existence of such triggers would have security implications—the triggers can be widely distributed and allow anyone to attack models. Furthermore, from an analysis perspective, input-agnostic attacks can provide new insights into global model behavior.
Triggers are a new form of universal adversarial perturbation Moosavi-Dezfooli et al. (2017) adapted to discrete textual inputs. To find them, we design a gradient-guided search over tokens. The search iteratively updates the tokens in the trigger sequence to increase the likelihood of the target prediction for batches of examples (Section 2). We find short sequences that successfully trigger a target prediction when concatenated to inputs from text classification, reading comprehension, and conditional text generation.
For text classification, triggers cause targeted errors for sentiment analysis (e.g., top of Table 1) and natural language inference models. For example, one word causes a model to predict 99.43% of Entailment examples as Contradiction (Section 3). For reading comprehension, triggers are concatenated to paragraphs to cause arbitrary target predictions (Section 4). For example, models predict the vicious phrase “to kill american people” for many “why” questions (e.g., middle of Table 1).
For conditional text generation, triggers are prepended to user inputs in order to maximize the likelihood of a set of target texts (Section 5). Our attack triggers GPT-2 Radford et al. (2019) to generate racist outputs using the prompt “TH PEOPLEMan goddreams Blacks” (e.g., bottom of Table 1).
Although we generate triggers assuming white-box (gradient) access to a specific model, they are transferable to other models for all datasets we consider. For example, some of the triggers generated for a GloVe-based reading comprehension model are more effective at triggering an ELMo-based model. Moreover, a trigger generated for the GPT-2 117M model also works for the 345M model: the first language model sample in Table 1 shows the larger model ranting on the “evil genes” of Black, Jewish, Chinese, and Indian people.
Finally, unlike typical adversarial attacks, the input-agnostic nature of the triggers provides new insights into global model behavior, i.e., general input-output patterns learned by a model. For example, triggers confirm that models exploit biases in the SNLI dataset (Section 6). Triggers also identify heuristics learned by SQuAD models—they heavily rely on the tokens that surround the answer span and type information in the question.
Universal Adversarial Triggers
This section introduces universal adversarial triggers and our algorithm to find them. We provide source code for our attacks and experiments (https://github.com/Eric-Wallace/universal-triggers).
We are interested in attacks that concatenate tokens (words, sub-words, or characters) to the front or end of an input to cause a target prediction.
The adversarial threat is higher if an attack is universal: using the exact same attack for any input Moosavi-Dezfooli et al. (2017); Brown et al. (2017). Universal attacks are advantageous as (1) no access to the target model is needed at test time, and (2) they drastically lower the barrier of entry for an adversary: trigger sequences can be widely distributed for anyone to fool machine learning models. Moreover, universal attacks often transfer across models Moosavi-Dezfooli et al. (2017), which further decreases attack requirements: the adversary does not need white-box (gradient) access to the target model. Instead, they can generate the attack using their own model trained on similar data and transfer it.
Finally, universal attacks are a unique model analysis tool because, unlike typical attacks, they are context-independent. Thus, they highlight general input-output patterns learned by a model. We leverage this to study the influence of dataset biases and to identify heuristics that are learned by models (Section 6).
2.2 Attack Model and Objective
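In a universal targeted attack, the adversary optimizes the trigger tokens $\mathbf{t}_{adv}$ so that a model $f$ produces the target prediction $\tilde{y}$ when the trigger is concatenated to any input $\mathbf{t}$ from a dataset. This corresponds to the objective
$$\arg\min_{\mathbf{t}_{adv}} \; \mathbb{E}_{\mathbf{t} \sim \mathcal{T}} \left[ \mathcal{L}\left(\tilde{y}, f(\mathbf{t}_{adv}; \mathbf{t})\right) \right] \qquad (1)$$
where $\mathcal{T}$ are input instances from a data distribution and $\mathcal{L}$ is the task’s loss function. To generate our attacks, we assume white-box access to $f$.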
2.3 Trigger Search Algorithm
We first choose the trigger length: longer triggers are more effective, while shorter triggers are more stealthy. Next, we initialize the trigger sequence by repeating the word “the”, the sub-word “a”, or the character “a” and concatenate the trigger to the front/end of all inputs. More complex initialization schemes perform similarly (Appendix A).
We then iteratively replace the tokens in the trigger to minimize the loss for the target prediction over batches of examples. To determine how to replace the current tokens, we cannot directly apply adversarial attack methods from computer vision because tokens are discrete. Instead, we build upon HotFlip Ebrahimi et al. (2018b), a method that approximates the effect of replacing a token using its gradient. To apply this method, the trigger tokens $\mathbf{t}_{adv}$, which are represented as one-hot vectors, are embedded to form $\mathbf{e}_{adv}$.
Our HotFlip-inspired token replacement strategy is based on a linear approximation of the task loss. We also experiment with projected gradient descent (Appendix A) but find that the linear approximation converges faster. We update the embedding $\mathbf{e}_{adv_i}$ of every trigger token to minimize the loss’s first-order Taylor approximation around the current token embedding:
$$\arg\min_{\mathbf{e}'_i \in \mathcal{V}} \left[ \mathbf{e}'_i - \mathbf{e}_{adv_i} \right]^{\top} \nabla_{\mathbf{e}_{adv_i}} \mathcal{L} \qquad (2)$$
where $\mathcal{V}$ is the set of all token embeddings in the model’s vocabulary and $\nabla_{\mathbf{e}_{adv_i}} \mathcal{L}$ is the average gradient of the task loss $\mathcal{L}$ over a batch. The optimal $\mathbf{e}'_i$ can be computed efficiently by brute force with $|\mathcal{V}|$ $d$-dimensional dot products, where $d$ is the dimensionality of the token embedding Michel et al. (2019). This brute-force solution is trivially parallelizable and less expensive than running a forward pass for all the models we consider. Finally, after finding each $\mathbf{e}_{adv_i}$, we convert the embeddings back to their associated tokens. Figure 1 provides an illustration of the trigger search algorithm.
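The candidate selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function name `hotflip_candidates`, the toy embedding matrix, and the toy gradient are all hypothetical. Because the current embedding contributes only a constant to the first-order approximation, ranking vocabulary entries by the dot product of their embedding with the averaged gradient is sufficient.

```python
from typing import List, Sequence


def hotflip_candidates(
    embedding_matrix: Sequence[Sequence[float]],
    avg_grad: Sequence[float],
    num_candidates: int = 1,
) -> List[int]:
    """Return ids of the tokens that minimize the linearized loss.

    One d-dimensional dot product is computed per vocabulary entry.
    """
    scores = [
        sum(e_j * g_j for e_j, g_j in zip(embedding, avg_grad))
        for embedding in embedding_matrix
    ]
    # A smaller dot product means a larger predicted decrease in the loss.
    ranked = sorted(range(len(scores)), key=lambda token_id: scores[token_id])
    return ranked[:num_candidates]


# Toy vocabulary of four tokens with 2-dimensional embeddings (made-up values).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
toy_grad = [2.0, -1.0]  # hypothetical batch-averaged gradient for one trigger position
best = hotflip_candidates(toy_embeddings, toy_grad, num_candidates=2)
```

In a real attack the gradient would come from backpropagating the task loss to the trigger embeddings; the ranking step itself stays the same.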
We augment this token replacement strategy with beam search. We consider the top-$k$ token candidates from Equation 2 for each token position in the trigger. We search left to right across the positions and score each beam using its loss on the current batch. We use small beam sizes due to computational constraints (Appendix A); increasing them may improve our results.
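The left-to-right beam search can be illustrated as follows. This is a sketch under assumptions: `beam_search_trigger` and `toy_loss` are hypothetical names, and `toy_loss` is a stand-in for the model's loss on the current batch rather than anything from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def beam_search_trigger(
    trigger: Sequence[int],
    candidates_per_position: Sequence[Sequence[int]],
    batch_loss: Callable[[Sequence[int]], float],
    beam_size: int = 1,
) -> List[int]:
    """Left-to-right beam search over trigger positions."""
    beams: List[Tuple[float, List[int]]] = [(batch_loss(trigger), list(trigger))]
    for position, candidates in enumerate(candidates_per_position):
        expanded: List[Tuple[float, List[int]]] = []
        for _, tokens in beams:
            for candidate in candidates:
                new_tokens = list(tokens)
                new_tokens[position] = candidate
                expanded.append((batch_loss(new_tokens), new_tokens))
        # Keep only the lowest-loss sequences on the current batch.
        expanded.sort(key=lambda item: item[0])
        beams = expanded[:beam_size]
    return beams[0][1]


def toy_loss(tokens: Sequence[int]) -> float:
    """Hypothetical loss: distance to a hidden 'ideal' token sequence."""
    hidden = [3, 1, 2]
    return float(sum(abs(t - h) for t, h in zip(tokens, hidden)))


result = beam_search_trigger(
    [0, 0, 0], [[3, 5], [1, 4], [2, 9]], toy_loss, beam_size=2
)
```

Each position costs one loss evaluation per beam per candidate, which is why small beam sizes and candidate counts keep the search affordable.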
We also attack contextualized ELMo embeddings and sub-word models that use byte pair encoding. This presents challenges not handled in prior work, e.g., ELMo embeddings change depending on the context; we describe our methodology for handling these attacks also in Appendix A.
2.4 Tasks and Associated Loss Functions
Our trigger search algorithm is generally applicable: the only task-specific component is the loss function $\mathcal{L}$. Here, we describe the three tasks used in our experiments and the associated loss functions. For each task, we generate the triggers on the dev set and evaluate on the test set.
Reading Comprehension
Reading comprehension models are used to answer questions that are posed to search engines or home assistants. An adversary can attack these models by modifying a web page in order to trigger malicious or vulgar answers. Here, we prepend triggers to paragraphs in order to cause predictions to be a target span inside the trigger. We choose and fix the target span beforehand and optimize the other trigger tokens. The trigger is optimized to work for any paragraph and any question of a certain type. We focus on why, who, when, and where questions. We use sentences of length ten following Jia and Liang (2017) and sum the cross-entropy of the start and end of the target span as the loss function.
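The span loss described above, the sum of the cross-entropy for the start and the end of the target span, can be sketched as follows. The function names and the idea of passing raw per-position logits are assumptions made for illustration; a real model would supply the logits.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over paragraph positions."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_norm for x in logits]


def span_loss(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    target_start: int,
    target_end: int,
) -> float:
    """Cross-entropy of the target start index plus that of the target end index."""
    start_nll = -log_softmax(start_logits)[target_start]
    end_nll = -log_softmax(end_logits)[target_end]
    return start_nll + end_nll


# With uniform logits over four positions, each term equals log(4).
uniform_loss = span_loss([0.0] * 4, [0.0] * 4, target_start=1, target_end=2)
# Raising the logits at the target positions lowers the loss.
peaked_loss = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
```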
Conditional Text Generation
We attack conditional text generation models, such as those in machine translation or autocomplete keyboards. The failure of such systems can be costly, e.g., translation errors have led to a person’s arrest Hern (2018). We create triggers that are prepended before the user input to cause the model to generate similar content to a set of targets $\mathcal{Y}$. A strong language model will generate grammatically correct continuations of the user’s input, which makes it impossible to generate one specific target no matter the input. We thus relax the attack to targets of similar content. In particular, our trigger causes the GPT-2 language model Radford et al. (2019) to output racist content. We maximize the likelihood of racist outputs when conditioned on any user input by minimizing the following loss:
$$\mathbb{E}_{\mathbf{y} \sim \mathcal{Y},\, \mathbf{t} \sim \mathcal{T}} \sum_{i=1}^{|\mathbf{y}|} \log\left(1 - p(y_i \mid \mathbf{t}_{adv}, \mathbf{t}, y_1, \ldots, y_{i-1})\right) \qquad (3)$$
where $\mathbf{t}_{adv}$ is the trigger, $\mathcal{Y}$ is the set of all racist outputs, and $\mathcal{T}$ is the set of all user inputs. Of course, $\mathcal{Y}$ and $\mathcal{T}$ are infeasible to optimize over. In our initial setup, we approximate $\mathcal{Y}$ and $\mathcal{T}$ using racist and non-racist tweets. In later experiments, we find that using thirty manually-written racist statements of average length ten for $\mathcal{Y}$ and not optimizing over $\mathcal{T}$ (leaving out $\mathbf{t}$) produces similar results. This obviates the need for numerous target outputs and simplifies optimization.
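A loss of this kind can be sketched from per-token probabilities alone. The sketch below assumes the loss sums $\log(1 - p)$ over the tokens of each target sequence and averages over targets; the function name, the `eps` clamp, and the probability values are hypothetical, and a real attack would obtain the probabilities from the language model conditioned on the trigger.

```python
import math
from typing import Sequence


def target_likelihood_loss(
    token_probs_per_target: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """Average over targets of sum_i log(1 - p_i).

    Minimizing this value pushes each target-token probability p_i toward 1.
    The clamp with eps avoids log(0) when a probability reaches 1.
    """
    total = 0.0
    for token_probs in token_probs_per_target:
        total += sum(math.log(max(1.0 - p, eps)) for p in token_probs)
    return total / len(token_probs_per_target)


# Made-up probabilities that a model assigns to the tokens of one target sequence.
unlikely = target_likelihood_loss([[0.1, 0.1]])
likely = target_likelihood_loss([[0.9, 0.9]])
half = target_likelihood_loss([[0.5, 0.5]])
```

The loss is lower when the target tokens are more probable, so descending on it raises the likelihood of the target set.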
Attacking Text Classification
We consider two text classification datasets.
We use binary Stanford Sentiment Treebank Socher et al. (2013). We consider Bi-LSTM models Graves and Schmidhuber (2005) using word2vec Mikolov et al. (2018) or ELMo Peters et al. (2018) embeddings. The word2vec and ELMo models achieve 86.4% and 89.6% accuracy, respectively.
Natural Language Inference
We consider natural language inference using SNLI Bowman et al. (2015). We use the Enhanced Sequential Inference (Chen et al., 2017, ESIM) and Decomposable Attention (Parikh et al., 2016, DA) models with GloVe embeddings Pennington et al. (2014). We also consider a DA model with ELMo embeddings (DA-ELMo). The ESIM, DA, and DA-ELMo models achieve 86.8%, 84.7%, and 86.4% accuracy, respectively.
3.1 Breaking Sentiment Analysis
We begin with word-level attacks on sentiment analysis. To avoid degenerate triggers such as “amazing” for negative examples, we use a lexicon to blacklist sentiment words (www.cs.uic.edu/~liub/FBS/sentiment-analysis.html). We start with a targeted attack that flips positive predictions to negative using three prepended trigger words. Our attack algorithm returns “zoning tapping fiennes”; prepending this trigger causes the model’s accuracy to drop from 86.2% to 29.1% on positive examples. We conduct a similar attack to flip negative predictions to positive, obtaining “comedy comedy blutarsky”, which causes the model’s accuracy to degrade from 86.6% to 23.6%. Figure 5 in Appendix B shows the effect of decreasing/increasing the length of the trigger. For example, the positive to negative attack degrades accuracy to 46% using one word and 13% with ten.
We next attack the ELMo model. We prepend one word consisting of four characters to the input and optimize over the characters. For the targeted attack that flips positive predictions to negative, the model’s accuracy degrades from 89.1% to 51.5% on positive examples using the trigger “u^{b”. For the negative to positive attack, prepending “m&s” drops accuracy from 90.1% to 52.2% on negative examples.
3.2 Breaking Natural Language Inference
We attack SNLI models by prepending a single word to the hypothesis. We generate the attack using an ensemble of the GloVe-based DA and ESIM models (we average their gradients with respect to the trigger embeddings, $\nabla_{\mathbf{e}_{adv}} \mathcal{L}$), and hold the DA-ELMo model out as a black box.
In Table 2, we show the top-5 trigger words for each ground-truth SNLI class and the corresponding accuracy for the three models. The attack can degrade the three models’ accuracy to nearly zero for Entailment and Neutral examples, and by about 10-20% for Contradiction. Table 7 in Appendix B shows the prediction distribution for the DA model: targeted attacks are successful, e.g., the trigger “nobody” causes 99.43% of Entailment examples to be predicted as Contradiction.
The attacks also readily transfer: the ELMo-based DA model’s accuracy degrades the most, despite never being targeted in the trigger generation. We analyze why the predictions for Contradiction are more robust and show that triggers align with known dataset biases in Section 6.
Attacking Reading Comprehension
We create triggers for SQuAD Rajpurkar et al. (2016). We use an intentionally simple baseline model and test the trigger’s transferability to more advanced models (with different embeddings, tokenizations, and architectures). The baseline is BiDAF Seo et al. (2017); we lowercase all inputs and use GloVe Pennington et al. (2014).
We pick the target answers “to kill american people”, “donald trump”, “january 2014”, and “new york” for why, who, when, and where questions, respectively. We choose these answers arbitrarily and expect others to perform similarly. They are not high frequency, e.g., “to kill american people” (thankfully) never appears in SQuAD.
We consider our attack successful only when the model’s predicted span exactly matches the target. We call this the attack success rate to avoid confusion with the exact match score for the original ground-truth answer. We do not have access to the hidden test set of SQuAD to evaluate our attacks. Instead, we generate the triggers using 2000 examples held-out from the training data and evaluate them on the development set.
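The attack success rate defined above is an exact-match rate against the target span. A minimal sketch, assuming predictions are available as strings; the function name and the example predictions are hypothetical.

```python
from typing import Sequence


def attack_success_rate(predicted_spans: Sequence[str], target: str) -> float:
    """Percentage of predictions that exactly match the target span.

    Comparison is case-insensitive and ignores surrounding whitespace,
    mirroring a setup where all inputs are lowercased.
    """
    normalized_target = target.strip().lower()
    hits = sum(
        1 for span in predicted_spans if span.strip().lower() == normalized_target
    )
    return 100.0 * hits / len(predicted_spans)


# Made-up model predictions for four "where" questions.
predictions = ["new york", "in new york", "New York", "paris"]
rate = attack_success_rate(predictions, "new york")
```

A prediction such as "in new york" overlaps the target but does not count, which is what distinguishes this metric from a partial-credit score such as F1.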
Results
The resulting triggers for each target answer are shown in Table 3, along with their attack success rate. The triggers are effective—they have nearly 50% success rate for who, when, and where questions on the BiDAF model. As a baseline, we also prepend only the target answer span (no other tokens) and see substantially lower success rates (Table 8 in Appendix C).
Replacing the Target Answers
We can also replace the target answer span without changing the rest of the trigger. For example, we replace “to kill american people” with “bomb in the classroom” without changing the rest of the “why” trigger sequence from Table 3. The attack success rate sometimes increases, i.e., the trigger is relatively agnostic to the target answer (Table 4).
Attack Transferability
We test the trigger’s transferability to black-box models: QANet Yu et al. (2018), an ELMo-based BiDAF model (ELMo), and a BiDAF model that uses character-level convolutions (Char). We report the accuracy of these models in Table 9 in Appendix C. The triggers readily transfer to these models, despite them having different embeddings/tokenizations (ELMo and Char) and different architectures (QANet). The attacks do somewhat overfit to the targeted BiDAF model (i.e., imperfect transfer). We ameliorate this by training another GloVe-based BiDAF model and regenerating the triggers using an ensemble of the two BiDAF models. Except on “why” questions, this improves transferability (second row for each type in Table 3).
Section 6 analyzes other properties of the SQuAD triggers such as token order, placement in the paragraph, and the effect of dataset biases.
Attacking Conditional Text Generation
We generate a trigger for the publicly released 117M parameter version of GPT-2 using six sub-word tokens. The attack returns “TH PEOPLEMan goddreams Blacks” as the trigger. This nonsensical sequence triggers the model to generate racist outputs. The last two examples of Table 1 show samples from the GPT-2 117M model when given different user inputs, and the reader can try their own inputs in the online GPT-2 demo. We also generate 100 samples and manually analyze them, finding 98% are racial and/or offensive, and 63% contain an explicitly racist statement (samples and details in Appendix D).
Note that the attack effectiveness is not due to the token “Blacks”, e.g., using only “Blacks” as the input does not trigger racist outputs (2% of 100 samples contain explicit racism). Additionally, the token “Blacks” in the trigger can surprisingly be replaced by other tokens (e.g., “Asians” or “Jews”) and GPT-2 will still produce egregious outputs.
Although the trigger sequence is generated for the GPT-2 117M parameter model, we find it also triggers the 345M parameter model: the outputs have comparable degrees of explicit racism (58% of the time) but better fluency. The first language model sample in Table 1 is generated using the 345M model and further samples are shown in Figure 2. The 345M model is also available through the public API.
Analyzing The Triggers
Why do universal adversarial triggers work? This section shows that the success of triggers arises from model and data failures. In particular, we confirm that models exploit biases in the SNLI dataset (Section 6.1) and show that SQuAD models overly rely on type matching and the tokens that surround the answer span (Section 6.2).
The construction of NLP datasets can lead to dataset biases or “artifacts”. For example, Gururangan et al. (2018) and Poliak et al. (2018) show that spurious correlations exist between the hypothesis words and the labels in SNLI. We investigate whether triggers are caused by such artifacts.
Following Gururangan et al. (2018), we identify dataset artifacts by ranking all the hypothesis words according to their pointwise mutual information (PMI) with each label. We then group the trigger words based on their target label and report their PMI percentile (Table 7 in Appendix B). The trigger words strongly align with these dataset artifacts. For example, the trigger word “nobody” is ranked highest according to PMI.
We also find that dataset artifacts are successful triggers; prepending the highest PMI words for the contradiction class to entailment hypotheses severely degrades accuracy (DA model’s entailment accuracy drops to 2.26%, 1.45%, and 3.77% using “no”, “tv”, and “naked”, respectively). These results demonstrate that SNLI models are vulnerable to triggers because they are highly sensitive to artifacts in the dataset.
Section 3 shows that triggers are largely unsuccessful at flipping neutral and contradiction predictions to entailment. We suspect that this arises from a bias towards entailment when there is high lexical overlap between the premise and the hypothesis McCoy et al. (2019). Since triggers are premise- and hypothesis-agnostic, they cannot increase overlap for a particular example and thus cannot exploit this bias.
6.2 Why Do Triggers Fool SQuAD Models?
Unlike SNLI, dataset artifacts remain largely unidentified for SQuAD; adversarial evaluation instead highlights erroneous model behaviors on a per-example basis Jia and Liang (2017). Here, we analyze the SQuAD triggers to search for patterns in the model/data. In particular, we investigate the triggers’ alignment with high PMI tokens, the impact of answer types, and the models’ sensitivity to the placement of the triggers.
Like SNLI, are the triggers a form of dataset artifact? Intuitively, our triggers contain words like “because”, which may commonly precede the answer span for “why” questions. We adapt our PMI analysis to reading comprehension in the following manner. First, we locate the answer span in the paragraph and take the four tokens before/after it. (We use four tokens because our trigger sequences mostly contain four tokens before and after the target answer.) We then compute the PMI of those tokens with the question type, e.g., “why”. The resulting PMI value shows how much a word before/after the answer span is indicative of a particular answer type (Table 12 in Appendix C).
Some of the trigger tokens have low PMI or never appear, e.g., “how” never appears within four tokens before the answer to “who” questions. However, other trigger tokens have high PMI, e.g., the top PMI token before the answer to “why” questions is indeed “because”. Similar to SNLI, we generate attacks using high PMI tokens. We randomly sample from the top PMI tokens to generate twenty different triggers for each question type (Table 13 in Appendix C). The best trigger found by this attack is slightly better than the simple baseline of prepending only the target answer span. Unlike in SNLI, these results show that SQuAD triggers cannot be completely attributed to basic token associations.
Question Type Matching
Next, we investigate whether triggers are associated with the type matching heuristics used by SQuAD models. Specifically, Sugawara et al. (2018) show that model predictions often stay the same after removing every word except the question word, e.g., “when was the battle?” → “when?”. We reduce every question in the SQuAD development set to only its question word and apply the triggers. For the GloVe BiDAF model on “who?”, “when?”, and “where?” questions, the attack success rate is a perfect 100%; for “why?” questions, it is 96.0%. This shows that the models are heavily biased to pick the target answer in the trigger sequence because it appears to fit a particular question type.
Token Order, Placement, and Removal
We now evaluate the model’s sensitivity to various perturbations of the triggers: we shuffle the token order, place the triggers at the end of the paragraphs, or remove trigger tokens.
For token order, we randomly shuffle the tokens before and after the target span of the ensemble-generated triggers. The average attack success rate over different shuffles is low; however, the best success rate comes close to that of the original trigger (Table 10 in Appendix C). This indicates that models are sensitive to the trigger’s token order but that multiple effective orderings exist.
Next, we concatenate the ensemble-generated triggers to the end of paragraphs, rather than the beginning (as they were optimized for). Many of the triggers are still effective, e.g., the success rate of the “why” trigger increases from 31.6 to 37.4 when placed at the end (Table 11 in Appendix C).
Finally, we individually remove tokens from the triggers—doing so always decreases the attack success rate on the GloVe BiDAF model. However, removing tokens can increase the success rate when transferring the triggers to black-box models. We query the ELMo model while removing tokens to find the best reduction. The resulting triggers are shorter but significantly more effective (Table 5). This shows that the triggers still “overfit” the GloVe BiDAF models.
Related Work
Most adversarial attacks in NLP are gradient-based. For instance, Ebrahimi et al. (2018b) use gradients to attack text classifiers. He and Glass (2019) and Cheng et al. (2018) do the same for text generation. Other attack methods are based on generative Iyyer et al. (2018) or human-in-the-loop approaches Wallace et al. (2019). We refer the reader to Zhang et al. (2019) for a recent survey. Triggers differ from most previous attacks because they are universal (input-agnostic).
Universal Attacks in NLP
Ribeiro et al. (2018) debug models using semantically equivalent adversarial rules (SEARs). Our attack vector differs from SEARs: we focus on model-specific concatenated tokens generated using gradients, whereas they focus on model-agnostic paraphrases generated via backtranslation. Our attacks can also be applied to any input, whereas SEARs are only applicable when one of their rules applies.
In parallel work, Behjati et al. (2019) consider universal adversarial attacks on text classification (compare to our Section 3). Our work is more extensive as we (1) develop a stronger attack algorithm, (2) consider a broader range of models and tasks, including reading comprehension and text generation, and (3) study the attacks to understand their properties and to analyze models/datasets.
Future Work and Conclusion
Universal adversarial triggers expose new vulnerabilities for NLP—they are transferable across both examples and models. Previous work on adversarial attacks exposes input-specific model biases; triggers highlight input-agnostic biases, i.e., global patterns in the model and dataset.
Triggers open up many new avenues to explore. Certain trigger sequences are interpretable, e.g., “because” appears for “why” questions. The triggers for GPT-2, however, are nonsensical. To enhance both interpretability and attack stealthiness, future research can find grammatical triggers that work anywhere in the input. Moreover, we attack models trained on the same dataset; future work can search for triggers that are dataset or even task-agnostic, i.e., they cause errors for seemingly unrelated models.
Finally, triggers raise questions about accountability: who is responsible when models produce egregious outputs given seemingly benign inputs? In future work, we aim to both attribute and defend against errors caused by adversarial triggers.
Acknowledgements
We thank Hal Daumé III, Sewon Min, Suchin Gururangan, Nelson Liu, Kevin Lin, Pranav Goel, Rob Logan IV, Jamie Matthews, Ana Marasović, the members of AllenNLP and UCI NLP, and the anonymous reviewers for their valuable feedback.
SF is supported by NSF Grant IIS-1822494 and DARPA award HR0011-15-C-0113 under subcontract to Raytheon BBN Technologies. SS is supported by NSF Grant IIS-1756023.
References
Appendix A Additional Optimization Details and Experimental Parameters
We also consider a token replacement strategy based on projected gradient descent, roughly following Papernot et al. (2016). We compute the gradient of the loss with respect to the embedding of each trigger token and take a small step in continuous space: $\mathbf{e}_{adv_i} - \alpha \nabla_{\mathbf{e}_{adv_i}} \mathcal{L}$, where $\alpha$ is the step size. We then find the Euclidean nearest neighbor embedding to that continuous vector in the set of token embeddings. A similar approach is taken by Behjati et al. (2019) to find universal attacks for text classifiers. We find the linear model approximation (Section 2) converges faster than the projected gradient descent approach, and we use it for all experiments.
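The projected gradient descent variant can be sketched as a gradient step followed by a nearest-neighbor projection. This is an illustration under assumptions: the function name, the step size value, and the toy vectors are hypothetical, and the step is taken against the gradient because the goal is to lower the loss.

```python
from typing import Sequence


def projected_gradient_step(
    current_embedding: Sequence[float],
    grad: Sequence[float],
    embedding_matrix: Sequence[Sequence[float]],
    step_size: float,
) -> int:
    """Step in continuous space, then project onto the nearest token embedding."""
    stepped = [e - step_size * g for e, g in zip(current_embedding, grad)]

    def squared_distance(candidate: Sequence[float]) -> float:
        return sum((c - s) ** 2 for c, s in zip(candidate, stepped))

    # Minimizing squared Euclidean distance gives the Euclidean nearest neighbor.
    return min(
        range(len(embedding_matrix)),
        key=lambda token_id: squared_distance(embedding_matrix[token_id]),
    )


# Toy vocabulary of three tokens with 2-dimensional embeddings (made-up values).
toy_vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_token = projected_gradient_step([0.0, 0.0], [-1.0, 0.0], toy_vocab, step_size=0.8)
```

If the step is too small, the projection snaps back to the current token and the trigger does not change, which is one plausible reason this variant can converge more slowly than the linear approximation.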
A.2 Optimization Parameters
We initialize the trigger sequence by repeating the word “the”, the sub-word “a”, or the character “a” to reach a desired length. We also experiment with repeating the token that is closest to the mean of all embeddings (i.e., the token at the “center” of all the embeddings) and find similar results. We also experiment with multiple random restarts, keeping the best result, but find that the final result for each restart has a similar loss (i.e., multiple effective triggers exist).
Beam size with multiple candidates
We perform a left-to-right beam search over the trigger tokens using the top-$k$ tokens from Equation 2. For each position, we expand the search by a factor of $k$ (e.g., 20) for each beam using the top-$k$ candidates from Equation 2. We then cut the beams down to the beam size (e.g., 5), keeping the candidate sequences with the smallest loss on the current batch. He and Glass (2019) suggest a similar approach.
We found this greatly improves results—in Figure 3, we attack the GloVe-based sentiment analysis model using five trigger tokens with beam size one and vary the number of candidates (k).
For classification, we found beam search provides little to no improvement in attack success rate. However, when attacking reading comprehension systems, beam search substantially improves results. Ebrahimi et al. (2018a) report a similar finding for attacking neural machine translation. In Figure 4, we generate a trigger using the answer “donald trump” and vary the beam size.
A.3 Attacking Contextualized Embeddings and Sub-word Models
In Section 3, we directly attack ELMo-based models Peters et al. (2018). Since ELMo produces word embeddings based on the context, there is no fixed set of token embeddings to select from. Instead, we attack ELMo at the character level, where the embeddings are context-independent. We prevent the attack from inserting the beginning/end-of-word token (and other unusual symbols such as £) by restricting the set of trigger tokens to uppercase characters, lowercase characters, and punctuation (ASCII values 33-126).
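The character restriction can be expressed directly from the ASCII range given above. This is a small sketch; the variable and function names are hypothetical.

```python
from typing import List

# Printable, non-space ASCII characters: codes 33 ('!') through 126 ('~').
ALLOWED_TRIGGER_CHARS: List[str] = [chr(code) for code in range(33, 127)]


def is_allowed_trigger(word: str) -> bool:
    """True if every character of the candidate trigger is in the allowed set."""
    allowed = set(ALLOWED_TRIGGER_CHARS)
    return all(character in allowed for character in word)


num_allowed = len(ALLOWED_TRIGGER_CHARS)
```

Note that the range 33-126 also covers the digits 0-9, and it excludes the space character (code 32) as well as non-ASCII symbols such as £.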
Attacking BPE Models
NLP models (especially translation and text generation models) often use sub-word units such as Byte Pair Encodings (Sennrich et al., 2016, BPE). In Section 5, we attack GPT-2 which uses BPE. These types of models have a segmentation problem: after replacing a token the segmentation of the input may have changed. Thus, after token replacement, we decode the trigger and recompute the segmentation. Since the trigger sequences are usually short (e.g., 3–6 sub-word tokens), we find re-segmentation issues rarely affect the optimization.
A.4 Parameters Used for Each Task
In our experiments, we use relatively small values for the optimization parameters because we are restricted to limited GPU resources. We suspect scaling these values will improve results. We use the following values:
For word-level sentiment analysis, we initialize with “the the the” and use 20 candidates with beam size 1.
For ELMo-based sentiment analysis, we initialize with “aaaa” and use character-level attacks with 20 candidates and beam size 3.
For SNLI, we initialize with the word “the” and use 40 candidates with beam size 1.
For SQuAD, we use 20 candidates with beam size 5.
For GPT-2, we initialize with “a a a a a a” and use 100 candidates with beam size 1.
Appendix B Additional Results for Classification
We perform a targeted attack to flip positive predictions to negative for the GloVe-based sentiment model. Figure 5 shows a sweep over the number of trigger tokens.
Natural Language Inference
Table 7 shows the GloVe-based DA model’s prediction distribution. Targeted attacks are successful, e.g., “nobody” causes 99.43% of Entailment predictions to become Contradiction.
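We compute the PMI for each SNLI word following Gururangan et al. (2018), defined as:
$$\mathrm{PMI}(\textit{word}, \textit{class}) = \log \frac{p(\textit{word}, \textit{class})}{p(\textit{word})\, p(\textit{class})}$$
We use add-100 smoothing following Gururangan et al. (2018). We then group the trigger words by target class and report their PMI percentiles (Table 7).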
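A smoothed PMI computation can be sketched as below, using the standard definition PMI(word, class) = log [p(word, class) / (p(word) p(class))]. Several details are assumptions made for illustration: the function name, counting each word once per hypothesis, and applying the smoothing constant to every (word, class) count before deriving the marginals. The toy data uses a smoothing constant of 1 so that the effect is visible on three examples.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def smoothed_pmi(
    examples: Sequence[Tuple[List[str], str]],
    smoothing: float = 100.0,
) -> Dict[Tuple[str, str], float]:
    """PMI between hypothesis words and labels with add-constant smoothing."""
    joint: Counter = Counter()
    for tokens, label in examples:
        for word in set(tokens):  # count each word once per hypothesis
            joint[(word, label)] += 1
    vocab = sorted({word for word, _ in joint})
    classes = sorted({label for _, label in examples})
    smoothed = {(w, c): joint[(w, c)] + smoothing for w in vocab for c in classes}
    total = sum(smoothed.values())
    word_total = {w: sum(smoothed[(w, c)] for c in classes) for w in vocab}
    class_total = {c: sum(smoothed[(w, c)] for w in vocab) for c in classes}
    return {
        (w, c): math.log(smoothed[(w, c)] * total / (word_total[w] * class_total[c]))
        for (w, c) in smoothed
    }


# Made-up hypotheses with labels.
toy_examples = [
    (["nobody", "is", "here"], "contradiction"),
    (["a", "man", "is", "here"], "entailment"),
    (["nobody", "sleeps"], "contradiction"),
]
pmi = smoothed_pmi(toy_examples, smoothing=1.0)
```

Ranking the words of each class by these values and converting ranks to percentiles gives the kind of figure reported for the trigger words.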
Appendix C Additional SQuAD Results
Table 8 shows the attack success rate when prepending only the target answer spans (without the surrounding trigger words). This baseline is considerably less effective.
Table 9 shows the original accuracy of each model in F1/EM format for the SQuAD development set, broken down by question type. BiDAF-2 is the BiDAF model trained with a different random seed used for ensembling.
Table 10 shows the attack success rate after shuffling the words that surround the target span.
Table 11 shows the attack success rate for the BiDAF model when the triggers are placed at the front versus the end of the paragraph.
We rank all words before and after the answer span using PMI, and report the percentile rank of the words in the ensemble triggers in Table 12.
We randomly select from the top-10 words by PMI to generate the words around the target answer span. We repeat the randomization 20 times. Table 13 shows the best sentence found, based on the average success rate for the five models.
Appendix D GPT-2 Samples and Racist Labels
We generate 100 samples from GPT-2 117M using only the trigger as input (no user input). We use top-$k$ sampling. For each sample, we manually label it as one of: racial and offensive (Racist), racial but not offensive (Racial), non-racist but offensive (Offensive), or non-racial and not offensive (Other). We recognize that determining whether a statement is racist/offensive is highly subjective.
We show a subsample of the 100 labeled samples in Figure 6. Of the 100 samples, 63% contain explicitly racist comments (Racist), e.g., “when it comes to blacks, they’re always bad”. 20% of the samples are racial in nature but do not make explicitly racist comments (Racial), e.g., “I’m not saying I hate Hispanics or blacks or anything like that”. 15% of the samples are offensive but not racist (Offensive), e.g., “I’m gonna make you fuck me again.” The remaining 2% of statements are non-racial and non-offensive (Other), e.g., a fake political speech from Donald Trump.
GPT-2 345M Samples
Although the trigger was optimized for the GPT-2 117M parameter version, we find it naturally transfers to the bigger 345M parameter model. We follow the same generation scheme as the previous samples (top-$k$ sampling). We label 100 GPT-2 345M samples using the same criteria, finding 58% are racist, 18% are racial, 21% are offensive, and 3% fall into the other category.