On Evaluation of Adversarial Perturbations for Sequence-to-Sequence Models

Paul Michel, Xian Li, Graham Neubig, Juan Miguel Pino

Introduction

Attacking a machine learning model with adversarial perturbations is the process of making changes to its input to maximize an adversarial goal, such as mis-classification Szegedy et al. (2013) or mis-translation Zhao et al. (2018). These attacks provide insight into the vulnerabilities of machine learning models and their brittleness to samples outside the training distribution. Lack of robustness to these attacks poses security concerns to safety-critical applications, e.g. self-driving cars Bojarski et al. (2016).

We propose a simple but natural criterion for adversarial examples in NLP, particularly untargeted attacks on seq2seq models: adversarial examples should be meaning-preserving on the source side, but meaning-destroying on the target side. (Here we use the term untargeted in the same sense as Ebrahimi et al. (2018a): an attack whose goal is simply to decrease performance with respect to a reference translation.) The focus on explicitly evaluating meaning preservation is in contrast to previous work on adversarial examples for seq2seq models Belinkov and Bisk (2018); Zhao et al. (2018); Cheng et al. (2018); Ebrahimi et al. (2018a). Nonetheless, this feature is extremely important; given two sentences with equivalent meaning, we would expect a good model to produce two outputs with equivalent meaning. In other words, any meaning-preserving perturbation that results in the model output changing drastically highlights a fault of the model.

A first technical contribution of this paper is to lay out a method for formalizing this concept of meaning-preserving perturbations (§2). This makes it possible to evaluate the effectiveness of adversarial attacks or defenses either using gold-standard human evaluation, or approximations that can be calculated without human intervention. We further propose a simple method of imbuing gradient-based word substitution attacks (§3.1) with simple constraints aimed at increasing the chance that the meaning is preserved (§3.2).

Our experiments are designed to answer several questions about meaning preservation in seq2seq models. First, we evaluate our proposed “source-meaning-preserving, target-meaning-destroying” criterion for adversarial examples using both manual and automatic evaluation (§4.2) and find that a less widely used evaluation metric (chrF) provides significantly better correlation with human judgments than the more widely used BLEU and METEOR metrics. We proceed to perform an evaluation of adversarial example generation techniques, finding that chrF does help to distinguish between perturbations that are more meaning-preserving across a variety of languages and models (§4.3). Finally, we apply existing methods for adversarial training to the adversarial examples with these constraints and show that making adversarial inputs more semantically similar to the source is beneficial for robustness to adversarial attacks and does not decrease test performance on the original data distribution (§5).

A Framework for Evaluating Adversarial Attacks

In this section, we present a simple procedure for evaluating adversarial attacks on seq2seq models. We will use the following notation: $x$ and $y$ refer to the source and target sentence respectively. We denote $x$’s translation by model $M$ as $y_{M}$. Finally, $\hat{x}$ and $\hat{y}_{M}$ represent an adversarially perturbed version of $x$ and its translation by $M$, respectively. The nature of $M$ and the procedure for obtaining $\hat{x}$ from $x$ are irrelevant to the discussion below.

The goal of adversarial perturbations is to produce failure cases for the model $M$. Hence, the evaluation must include some measure of the target similarity between $y$ and $\hat{y}_{M}$, which we will denote $\operatorname*{s_{\text{tgt}}}(y,\hat{y}_{M})$. However, if no distinction is being made between perturbations that preserve the meaning and those that don’t, a sentence like “he’s very friendly” is considered a valid adversarial perturbation of “he’s very adversarial”, even though its meaning is the opposite. Hence, it is crucial, when evaluating adversarial attacks on MT models, that the discrepancy between the original and adversarial input sentence be quantified in a way that is sensitive to meaning. Let us denote such a source similarity score $\operatorname*{s_{\text{src}}}(x,\hat{x})$.

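Based on these functions, we define the target relative score decrease as:

$$\operatorname*{d_{\text{tgt}}}(y,y_{M},\hat{y}_{M})=\begin{cases}0&\text{if }\operatorname*{s_{\text{tgt}}}(y,\hat{y}_{M})\geq\operatorname*{s_{\text{tgt}}}(y,y_{M})\\[4pt] \dfrac{\operatorname*{s_{\text{tgt}}}(y,y_{M})-\operatorname*{s_{\text{tgt}}}(y,\hat{y}_{M})}{\operatorname*{s_{\text{tgt}}}(y,y_{M})}&\text{otherwise}\end{cases}$$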

The choice to report the relative decrease in $\operatorname*{s_{\text{tgt}}}$ makes scores comparable across different models or languages (note that we do not allow negative $\operatorname*{d_{\text{tgt}}}$, to keep all scores between 0 and 1). For instance, for languages that are comparatively easy to translate (e.g. French-English), $\operatorname*{s_{\text{tgt}}}$ will be higher in general, and so will the gap between $\operatorname*{s_{\text{tgt}}}(y,y_{M})$ and $\operatorname*{s_{\text{tgt}}}(y,\hat{y}_{M})$. However, this does not necessarily mean that attacks on this language pair are more effective than attacks on a “difficult” language pair (e.g. Czech-English) where $\operatorname*{s_{\text{tgt}}}$ is usually smaller.

We recommend that both $\operatorname*{s_{\text{src}}}$ and $\operatorname*{d_{\text{tgt}}}$ be reported when presenting adversarial attack results. However, in some cases where a single number is needed, we suggest reporting the attack’s success $\mathcal{S}\coloneqq\operatorname*{s_{\text{src}}}+\operatorname*{d_{\text{tgt}}}$. The interpretation is simple: $\mathcal{S}>1\Leftrightarrow\operatorname*{d_{\text{tgt}}}>1-\operatorname*{s_{\text{src}}}$, which means that the attack has destroyed the target meaning ($\operatorname*{d_{\text{tgt}}}$) more than it has destroyed the source meaning ($1-\operatorname*{s_{\text{src}}}$).
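As an illustration, the two reported quantities can be computed with the minimal Python sketch below. It assumes that all similarity scores lie in $[0,1]$ and that the target relative score decrease is the relative drop from $\operatorname*{s_{\text{tgt}}}(y,y_{M})$ to $\operatorname*{s_{\text{tgt}}}(y,\hat{y}_{M})$, clipped at zero. The function names are hypothetical, and returning 0 when the original translation already scores 0 is our own assumption rather than part of the framework.

```python
def target_relative_decrease(s_tgt_orig: float, s_tgt_adv: float) -> float:
    """Relative decrease of the target similarity, clipped at 0.

    s_tgt_orig: s_tgt(y, y_M), similarity of the reference to the original output.
    s_tgt_adv:  s_tgt(y, y_hat_M), similarity of the reference to the adversarial output.
    """
    # Assumption: if the original output already scores 0, nothing is left to destroy.
    if s_tgt_orig <= 0.0 or s_tgt_adv >= s_tgt_orig:
        return 0.0
    return (s_tgt_orig - s_tgt_adv) / s_tgt_orig


def attack_success(s_src: float, s_tgt_orig: float, s_tgt_adv: float) -> float:
    """Success S = s_src + d_tgt; S > 1 means more target than source meaning was destroyed."""
    return s_src + target_relative_decrease(s_tgt_orig, s_tgt_adv)


# Hypothetical scores: the source is almost unchanged, the target score is halved.
success = attack_success(s_src=0.9, s_tgt_orig=0.8, s_tgt_adv=0.4)
print(success > 1.0)  # the attack counts as successful under the S > 1 criterion
```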

Importantly, this framework can be extended beyond strictly meaning-preserving attacks. For example, for targeted keyword introduction attacks Cheng et al. (2018); Ebrahimi et al. (2018a), the same evaluation framework can be used if $\operatorname*{s_{\text{tgt}}}$ (resp. $\operatorname*{s_{\text{src}}}$ ) is modified to account for the presence (resp. absence) of the keyword (or its translation in the source). Similarly this can be extended to other tasks by adapting $\operatorname*{s_{\text{tgt}}}$ (e.g. for classification one would use the zero-one loss, and adapt the success threshold).

2.2 Similarity Metrics

Throughout §2.1, we have not given an exact description of the semantic similarity scores $\operatorname*{s_{\text{src}}}$ and $\operatorname*{s_{\text{tgt}}}$. Indeed, automatically evaluating the semantic similarity between two sentences is an open area of research and it makes sense to decouple the definition of adversarial examples from the specific method used to measure this similarity. In this section, we will discuss manual and automatic metrics that may be used to calculate it.

Judgment by speakers of the language of interest is the de facto gold standard metric for semantic similarity. Specific criteria such as adequacy/fluency Ma and Cieri (2006), acceptability Goto et al. (2013), and 6-level semantic similarity Cer et al. (2017) have been used in evaluations of MT and sentence embedding methods. In the context of adversarial attacks, we propose the following 6-level evaluation scheme, which is motivated by previous measures but designed to be (1) symmetric, like Cer et al. (2017), and (2) largely focused on meaning preservation while, at the very low and high levels, also considering the fluency of the output, like Goto et al. (2013). Considering fluency is important to rule out nonsensical sentences and to distinguish between clean and “noisy” paraphrases (e.g. typos, non-native speech…); we did not give annotators additional instruction specific to typos.

How would you rate the similarity between the meaning of these two sentences?

0. The meaning is completely different or one of the sentences is meaningless
1. The topic is the same but the meaning is different
2. Some key information is different
3. The key information is the same but the details differ
4. Meaning is essentially equal but some expressions are unnatural
5. Meaning is essentially equal and the two sentences are well-formed English (or the language of interest)

2.2.2 Automatic Metrics

Unfortunately, human evaluation is expensive, slow and sometimes difficult to obtain, for example in the case of low-resource languages. This makes automatic metrics that do not require human intervention appealing for experimental research. This section describes 3 evaluation metrics commonly used as alternatives to human evaluation, in particular to evaluate translation models. Note that other metrics of similarity are certainly applicable within the overall framework of §2.2.1, but we limit our examination in this paper to the three noted here.

BLEU: Papineni et al. (2002) is an automatic metric based on n-gram precision coupled with a penalty for shorter sentences. It relies on exact word-level matches and therefore cannot detect synonyms or morphological variations.

METEOR: Denkowski and Lavie (2014) first estimates alignment between the two sentences and then computes unigram F-score (biased towards recall) weighted by a penalty for longer sentences. Importantly, METEOR uses stemming, synonymy and paraphrasing information to perform alignments. On the downside, it requires language specific resources.

chrF: Popović (2015) is based on the character $n$ -gram F-score. In particular we will use the chrF2 score (based on the F2-score — recall is given more importance), following the recommendations from Popović (2016). By operating on a sub-word level, it can reflect the semantic similarity between different morphological inflections of one word (for instance), without requiring language-specific knowledge which makes it a good one-size-fits-all alternative.
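To make the character $n$-gram F-score concrete, the following Python sketch computes a simplified sentence-level variant. It is not the sacreBLEU implementation used in our experiments: the function names are hypothetical, and ignoring spaces, using a maximum $n$-gram order of 6, and averaging precision and recall uniformly over orders are assumptions made for the sake of illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of character n-grams of `text`, ignoring spaces (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders for which one side has no n-grams
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    # beta = 2 gives recall more importance than precision (the chrF2 setting).
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Two inflections of the same word share most character n-grams,
# whereas an exact word-level match would treat them as entirely different.
print(simple_chrf("translating", "translated"))
```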

Because multiple possible alternatives exist, it is important to know which is the best stand-in for human evaluation. To elucidate this, we will compare these metrics to human judgment in terms of Pearson correlation coefficient on outputs resulting from a variety of attacks in §4.2.

Gradient-Based Adversarial Attacks

In this section, we overview the adversarial attacks we will be considering in the rest of this paper.

We perform gradient-based attacks that replace one word in the sentence so as to maximize an adversarial loss function $\operatorname*{\mathcal{L}_{\text{adv}}}$ , similar to the substitution attacks proposed in Ebrahimi et al. (2018b).

Precisely, for a word-based translation model $M$, and given an input sentence $w_{1},\ldots,w_{n}$, we find the position $i^{*}$ and word $w^{*}$ satisfying the optimization problem in Equation (1). (Note that this formulation is also valid for character-based models (see Ebrahimi et al. (2018a)) and subword-based models. For subword-based models, additional difficulty would be introduced due to changes to the input resulting in different subword segmentations. This poses an interesting challenge that is beyond the scope of the current work.)

$$\operatorname*{arg\,max}_{1\leq i\leq n,\ \hat{w}\in\mathcal{V}}\operatorname*{\mathcal{L}_{\text{adv}}}(w_{1},\ldots,w_{i-1},\hat{w},w_{i+1},\ldots,w_{n})\tag{1}$$

where $\operatorname*{\mathcal{L}_{\text{adv}}}$ is a differentiable function which represents our adversarial objective. Using the first order approximation of $\operatorname*{\mathcal{L}_{\text{adv}}}$ around the original word vectors $\operatorname*{\mathbf{w}}_{1},\ldots,\operatorname*{\mathbf{w}}_{n}$ (more generally, we will use the bold $\operatorname*{\mathbf{w}}$ when talking about the embedding vector of word $w$), this can be derived to be equivalent to optimizing

$$\operatorname*{arg\,max}_{1\leq i\leq n,\ \hat{w}\in\mathcal{V}}\left[\hat{\mathbf{w}}-\mathbf{w}_{i}\right]^{\top}\nabla_{\mathbf{w}_{i}}\operatorname*{\mathcal{L}_{\text{adv}}}\tag{2}$$

The above optimization problem can be solved by brute force in $\mathcal{O}(n|\mathcal{V}|)$ space complexity, whereas the time complexity is bottlenecked by the multiplication of a $|\mathcal{V}|\times d$ matrix by a $d\times n$ matrix (where $d$ is the embedding dimension), which is not more computationally expensive than computing logits during the forward pass of the model. Overall, this naive approach is sufficiently fast to be conducive to adversarial training. We also found that the attacks benefited from normalizing the gradient by taking its sign.
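The brute-force search can be sketched in a few lines of NumPy. The sketch scores every candidate word $\hat{w}$ at every position $i$ by the first-order gain $[\hat{\mathbf{w}}-\mathbf{w}_{i}]^{\top}\nabla_{\mathbf{w}_{i}}\operatorname*{\mathcal{L}_{\text{adv}}}$ and returns the best pair. The function name and the toy inputs are hypothetical, the gradients are assumed to have been computed beforehand by the model, and excluding the original word from the candidates is our own assumption; this is not the released implementation.

```python
import numpy as np


def best_substitution(embeddings, token_ids, grads, use_sign=False):
    """Return (position, word_id, gain) maximizing the first-order gain.

    embeddings: (|V|, d) word embedding matrix.
    token_ids:  length-n sequence of input word ids.
    grads:      (n, d) gradient of the adversarial loss w.r.t. each input embedding.
    use_sign:   if True, normalize the gradient by taking its sign.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    grads = np.asarray(grads, dtype=float)
    token_ids = np.asarray(token_ids)
    if use_sign:
        grads = np.sign(grads)
    # (|V|, d) @ (d, n) -> (|V|, n): score of every candidate word at every position.
    candidate_scores = embeddings @ grads.T
    # Subtract w_i . grad_i so that each entry equals (w_hat - w_i) . grad_i.
    current_scores = np.einsum("nd,nd->n", embeddings[token_ids], grads)
    gains = candidate_scores - current_scores[None, :]
    # Assumption: keeping the original word does not count as a substitution.
    gains[token_ids, np.arange(len(token_ids))] = -np.inf
    word_id, position = np.unravel_index(np.argmax(gains), gains.shape)
    return int(position), int(word_id), float(gains[word_id, position])


# Toy example: 4-word vocabulary, 2-dimensional embeddings, 2-word input.
toy_embeddings = [[0.0, 0.5], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
toy_grads = [[2.0, 0.0], [0.0, -1.0]]
print(best_substitution(toy_embeddings, [1, 2], toy_grads))  # (0, 3, 2.0)
```

The intermediate `gains` array has $|\mathcal{V}|\times n$ entries, which matches the space complexity stated above.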

Extending this approach to finding the optimal perturbations for more than 1 substitution would require exhaustively searching over all possible combinations. However, previous work Ebrahimi et al. (2018a) suggests that greedy search is a good enough approximation.

We want to find an adversarial input $\hat{x}$ such that, assuming that the model has produced the correct output $y_{1},\ldots,y_{t-1}$ up to step $t-1$ during decoding, the probability that the model makes an error at the next step $t$ is maximized.

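In the log-semiring, this translates into the following loss function:

$$\operatorname*{\mathcal{L}_{\text{adv}}}(\hat{x},y)=\sum_{t=1}^{|y|}\log\left(1-p(y_{t}\mid\hat{x},y_{1},\ldots,y_{t-1})\right)$$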
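As a small illustration, this objective can be evaluated from the probabilities that the model assigns to each reference token under teacher forcing, i.e. $p(y_{t}\mid\hat{x},y_{1},\ldots,y_{t-1})$, by summing $\log(1-p)$ over time steps. The function name below is hypothetical, and the small constant guarding against $\log(0)$ is our own assumption.

```python
import math


def adversarial_loss(gold_token_probs, eps=1e-12):
    """Sum over decoding steps of log(1 - p(y_t | x_hat, y_<t)).

    gold_token_probs: probabilities assigned by the model to each reference
    token, computed with the correct prefix y_1..y_{t-1} (teacher forcing).
    eps: numerical guard against log(0) when p == 1 (an assumption).
    """
    return sum(math.log(max(1.0 - p, eps)) for p in gold_token_probs)


# The loss increases (towards 0) as the model becomes less confident in the reference.
confident = adversarial_loss([0.9, 0.9, 0.9])
confused = adversarial_loss([0.1, 0.1, 0.1])
print(confident < confused)  # True
```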

3.2 Enforcing Semantically Similar Adversarial Inputs

In contrast to previous methods, which don’t consider meaning preservation, we propose simple modifications of the approach presented in §3.1 to create adversarial perturbations at the word level that are more likely to preserve meaning. The basic idea is to restrict the possible word substitutions to similar words. We compare two sets of constraints:

kNN: This constraint enforces that the word be replaced only with one of its 10 nearest neighbors in the source embedding space. This has two effects: first, the replacement will likely be semantically related to the original word (if words close in the embedding space are indeed semantically related, as hinted by Table 1). Second, it ensures that the replacement’s word vector is close enough to the original word vector that the first order assumption is more likely to be satisfied.

CharSwap: This constraint requires that the substituted words must be obtained by swapping characters. Word-internal character swaps have been shown to not affect human readers greatly McCusker et al. (1981), hence making them likely to be meaning-preserving. Moreover, we add the additional constraint that the substitution must not be in the vocabulary, which will likely be particularly meaning-destroying on the target side for the word-based models we test here. In cases where word-internal character swaps are not possible or can’t produce out-of-vocabulary (OOV) words, we resort to the naive strategy of repeating the last character of the word. The exact procedure used to produce this kind of perturbation is described in Appendix A.1. Note that for a word-based model, every OOV will look the same (a special <unk> token); however, the choice of OOV will still have an influence on the output of the model because we use <unk>-replacement.

In contrast, we refer to the base attack without constraints as Unconstrained henceforth. Table 1 gives qualitative examples of the kind of perturbations generated under the different constraints.

For subword-based models, we apply the same procedures at the subword-level on the original segmentation. We then de-segment and re-segment the resulting sentence (because changes at the subword or character levels are likely to change the segmentation of the resulting sentence).
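The CharSwap constraint described above can be sketched as follows. This is a hypothetical illustration and not the procedure from Appendix A.1: the function name, the random choice among word-internal swap positions, and repeating the last character until the result is OOV are all assumptions. Only swaps of adjacent interior characters are considered, so the first and last characters of the word stay in place.

```python
import random


def char_swap_oov(word, vocabulary, rng=None):
    """Return an out-of-vocabulary variant of a non-empty `word`.

    First tries swapping two adjacent word-internal characters; if no such
    swap yields an OOV word, falls back to repeating the last character.
    """
    rng = rng or random.Random(0)
    # Interior positions i such that both i and i + 1 are neither first nor last.
    positions = list(range(1, len(word) - 2))
    rng.shuffle(positions)
    for i in positions:
        if word[i] == word[i + 1]:
            continue  # swapping identical characters leaves the word unchanged
        candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if candidate not in vocabulary:
            return candidate
    # Fallback: repeat the last character (repeated until OOV, an assumption).
    candidate = word + word[-1]
    while candidate in vocabulary:
        candidate += word[-1]
    return candidate


toy_vocabulary = {"adversarial", "cat", "catt"}
print(char_swap_oov("adversarial", toy_vocabulary))
print(char_swap_oov("cat", toy_vocabulary))  # too short for an internal swap: "cattt"
```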

Experiments

Our experiments serve two purposes. First, we examine our proposed framework of evaluating adversarial attacks (§2), and also elucidate which automatic metrics correlate better with human judgment for the purpose of evaluating adversarial attacks (§4.2). Second, we use this evaluation framework to compare various adversarial attacks and demonstrate that adversarial attacks that are explicitly constrained to preserve meaning receive better assessment scores (§4.3).

Data: Following previous work on adversarial examples for seq2seq models Belinkov and Bisk (2018); Ebrahimi et al. (2018a), we perform all experiments on the IWSLT2016 dataset Cettolo et al. (2016) in the {French,German,Czech} $\rightarrow$ English directions (fr-en, de-en and cs-en). We compile all previous IWSLT test sets before 2015 as validation data, and keep the 2015 and 2016 test sets as test data. The data is tokenized with the Moses tokenizer Koehn et al. (2007). The exact data statistics can be found in Appendix A.2.

MT Models: We perform experiments with two common neural machine translation (NMT) models. The first is an LSTM-based encoder-decoder architecture with attention Luong et al. (2015). It uses 2-layer encoders and decoders, and dot-product attention. We set the word embedding dimension to 300 and all others to 500. The second model is a self-attentional Transformer Vaswani et al. (2017), with 6 1024-dimensional encoder and decoder layers and 512-dimensional word embeddings. Both models are trained with Adam Kingma and Ba (2014), dropout Srivastava et al. (2014) of probability 0.3 and label smoothing Szegedy et al. (2016) with value 0.1. We experiment with both word-based models (vocabulary size fixed at 40k) and subword-based models (BPE Sennrich et al. (2016) with 30k operations). For word-based models, we perform <unk>-replacement, replacing <unk> tokens in the translated sentences with the source words with the highest attention value during inference. The full experimental setup and source code are available at https://github.com/pmichel31415/translate/tree/paul/pytorch_translate/research/adversarial/experiments.

Automatic Metric Implementations: To evaluate both sentence and corpus level BLEU score, we first de-tokenize the output and use sacreBLEU (https://github.com/mjpost/sacreBLEU) Post (2018) with its internal intl tokenization, to keep BLEU scores agnostic to tokenization. We compute METEOR using the official implementation (http://www.cs.cmu.edu/~alavie/METEOR/). chrF is reported with the sacreBLEU implementation on detokenized text with default parameters. A toolkit implementing the evaluation framework described in §2.1 for these metrics is released at https://github.com/pmichel31415/teapot-nlp.

4.2 Correlation of Automatic Metrics with Human Judgment

We first examine which of the automatic metrics listed in §2.2 correlates most with human judgment for our adversarial attacks. For this experiment, we restrict the scope to the case of the LSTM model on fr-en. For the French side, we randomly select 900 sentence pairs $(x,\hat{x})$ from the validation set, 300 for each of the Unconstrained, kNN and CharSwap constraints. To vary the level of perturbation, the 300 pairs contain equal numbers of perturbed inputs obtained by substituting 1, 2 and 3 words. On the English side, we select 900 pairs of reference translations and translations of adversarial input $(y,\hat{y}_{M})$ with the same distribution of attacks as the source side, as well as 300 $(y,y_{M})$ pairs (to include translations from original inputs). This amounts to 1,200 sentence pairs on the target side.

These sentences are sent to English and French speaking annotators to be rated according to the guidelines described in §2.2.1. Each sample (a pair of sentences) is rated by two independent evaluators. If the two ratings differ, the sample is sent to a third rater (an auditor and subject matter expert) who makes the final decision.

Finally, we compare the human results to each automatic metric with Pearson’s correlation coefficient. The correlations are reported in Table 3. As evidenced by the results, chrF exhibits higher correlation with human judgment, followed by METEOR and BLEU. This is true both on the source side ($x$ vs $\hat{x}$) and on the target side ($y$ vs $\hat{y}_{M}$). We evaluate the statistical significance of this result using a paired bootstrap test for $p<0.01$. Notably, we find that chrF is significantly better than METEOR in French but not in English. This is not too unexpected because METEOR has access to more language-dependent resources in English (specifically synonym information) and thereby can make more informed matches of these synonymous words and phrases. Moreover, the French source side contains more “character-level” errors (from CharSwap attacks) which are not picked up well by word-based metrics like BLEU and METEOR. For a breakdown of the correlation coefficients according to number of perturbations and type of constraints, we refer to Appendix A.3.
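The comparison above can be illustrated with the following Python sketch, which computes Pearson’s correlation between human ratings and a metric, and compares two metrics with a paired bootstrap. The exact bootstrap procedure is not spelled out in the text, so the formulation below (resampling sentence pairs with replacement and counting how often one metric correlates better than the other) is one common variant and should be read as an assumption; the function names and the toy data are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; 0 by convention here
    return cov / sqrt(var_x * var_y)


def paired_bootstrap(human, metric_a, metric_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which metric_a correlates better with human than metric_b."""
    rng = random.Random(seed)
    n = len(human)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence pairs with replacement, using the same indices for both metrics.
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        r_a = pearson(h, [metric_a[i] for i in idx])
        r_b = pearson(h, [metric_b[i] for i in idx])
        wins += r_a > r_b
    return wins / n_resamples


# Toy data: human ratings on the 0-5 scale and two hypothetical automatic metrics.
human = [0, 1, 2, 3, 4, 5, 3, 1]
metric_a = [10 * h + 1 for h in human]  # perfectly linear in the human rating
metric_b = [5, 1, 4, 2, 3, 0, 2, 4]     # unrelated to the human rating
print(paired_bootstrap(human, metric_a, metric_b))
```

Under this formulation, metric A would be considered significantly better than metric B at $p<0.01$ if it wins in more than 99% of the resamples.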

Thus, in the following, we report attack results both in terms of chrF in the source ( $\operatorname*{s_{\text{src}}}$ ) and relative decrease in chrF (RDchrF) in the target ( $d_{\text{tgt}}$ ).

4.3 Attack Results

We can now compare attacks under the three constraints Unconstrained, kNN and CharSwap and draw conclusions on their capacity to preserve meaning in the source and destroy it in the target. Attacks are conducted on the validation set using the approach described in §3.1 with 3 substitutions (this means that each adversarial input is at edit distance at most 3 from the original input). Results (on a scale of 0 to 100 for readability) are reported in Table 2 for both word- and subword- based LSTM and Transformer models. To give a better idea of how the different variables (language pair, model, attack) affect performance, we give a graphical representation of these same results in Figure 1 for the word-based models. The rest of this section discusses the implication of these results.

Source chrF Highlights the Effect of Adding Constraints: Comparing the kNN and CharSwap rows to Unconstrained in the “source” sections of Table 2 clearly shows that constrained attacks have a positive effect on meaning preservation. Beyond validating our assumptions from §3.2, this shows that source chrF is useful to carry out the comparison in the first place. (It can be argued that using chrF gives an advantage to CharSwap over kNN for source preservation (as opposed to METEOR for example). We find that this is the case for Czech and German (source METEOR is higher for kNN) but not French. Moreover we find (see A.3) that chrF correlates better with human judgement even for kNN.) To give a point of reference, results from the manual evaluation carried out in §4.2 show that $90\%$ of the French sentence pairs to which humans gave a score of 4 or 5 in semantic similarity have a chrF $>78$.

Different Architectures are not Equal in the Face of Adversity: Inspection of the target-side results yields several interesting observations. First, the high RDchrF of CharSwap for word-based models is yet another indication of their known shortcomings when presented with words out of their training vocabulary, even with <unk>-replacement. Second, and perhaps more interestingly, Transformer models appear to be less robust to small embedding perturbations (kNN attacks) compared to LSTMs. Although the exploration of the exact reasons for this phenomenon is beyond the scope of this work, this is a good example that RDchrF can shed light on the different behavior of different architectures when confronted with adversarial input. Overall, we find that the CharSwap constraint is the only one that consistently produces attacks with $>1$ average success (as defined in Section 2.1) according to Table 2. Table 4 contains two qualitative examples of this attack on the LSTM model in fr-en.

Adversarial Training with Meaning-Preserving Attacks

Adversarial training Goodfellow et al. (2014) augments the training data with adversarial examples. Formally, in place of the negative log likelihood (NLL) objective on a sample $x,y$, $\mathcal{L}(x,y)=NLL(x,y)$, the loss function is replaced with an interpolation of the NLL of the original sample $x,y$ and an adversarial sample $\hat{x},y$:

$$\mathcal{L}^{\prime}(x,y)=(1-\alpha)NLL(x,y)+\alpha NLL(\hat{x},y)$$
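In code, the interpolated objective amounts to a weighted sum of two per-sample losses. The sketch below assumes that the interpolation weight $\alpha$ multiplies the adversarial term, so that $\alpha=1.0$ corresponds to training on perturbed input only; the function name is hypothetical and the NLL values are assumed to be computed elsewhere by the model.

```python
def adversarial_training_loss(nll_clean: float, nll_adv: float, alpha: float) -> float:
    """Interpolate the NLL on the original sample and on its adversarial version.

    nll_clean: NLL(x, y) on the original input.
    nll_adv:   NLL(x_hat, y) on the adversarially perturbed input.
    alpha:     weight of the adversarial term, in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * nll_clean + alpha * nll_adv


# With alpha = 1.0 only the adversarial sample contributes to the loss.
print(adversarial_training_loss(nll_clean=2.0, nll_adv=4.0, alpha=1.0))  # 4.0
```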

Ebrahimi et al. (2018a) suggest that while adversarial training improves robustness to adversarial attacks, it can be detrimental to test performance on non-adversarial input. We investigate whether this is still the case when adversarial attacks are largely meaning-preserving.

In our experiments, we generate $\hat{x}$ by applying 3 perturbations on the fly at each training step. To maintain training speed we do not solve Equation (2) iteratively but in one shot by replacing the argmax by top-3. Although this is less exact than iterating, this makes adversarial training time less than $2\times$ slower than normal training. We perform adversarial training with perturbations without constraints (Unconstrained-adv) and with the CharSwap constraint (CharSwap-adv). All experiments are conducted with the word-based LSTM model.

5.2 Results

Test performance on non-adversarial input is reported in Table 5. In keeping with the rest of the paper, we primarily report chrF results, but also show the standard BLEU.

We observe that when $\alpha=1.0$, i.e. the model only sees the perturbed input during training (a setting reminiscent of word dropout Iyyer et al. (2015)), the Unconstrained-adv model suffers a drop in test performance, whereas CharSwap-adv’s performance is on par with the original. This is likely attributable to the spurious training samples $(\hat{x},y)$ where $y$ is not an acceptable translation of $\hat{x}$ introduced by the lack of constraint. This effect disappears when $\alpha=0.5$ because the model sees the original samples as well.

Not unexpectedly, Table 6 indicates that CharSwap-adv is more robust to CharSwap constrained attacks for both values of $\alpha$ , with $1.0$ giving the best results. On the other hand, Unconstrained-adv is similarly or more vulnerable to these attacks than the baseline. Hence, we can safely conclude that adversarial training with CharSwap attacks improves robustness while not impacting test performance as much as unconstrained attacks.

Related work

Following seminal work on adversarial attacks by Szegedy et al. (2013), Goodfellow et al. (2014) introduced gradient-based attacks and adversarial training. Since then, a variety of attack Moosavi-Dezfooli et al. (2016) and defense Cissé et al. (2017); Kolter and Wong (2017) mechanisms have been proposed. Adversarial examples for NLP specifically have seen attacks on sentiment Papernot et al. (2016); Samanta and Mehta (2017); Ebrahimi et al. (2018b), malware Grosse et al. (2016), gender Reddy and Knight (2016) or toxicity Hosseini et al. (2017) classification to cite a few.

In MT, methods have been proposed to attack word-based Zhao et al. (2018); Cheng et al. (2018) and character-based Belinkov and Bisk (2018); Ebrahimi et al. (2018a) models. However these works side-step the question of meaning preservation in the source: they mostly focus on target side evaluation. Finally there is work centered around meaning-preserving adversarial attacks for NLP via paraphrase generation Iyyer et al. (2018) or rule-based approaches Jia and Liang (2017); Ribeiro et al. (2018); Naik et al. (2018); Alzantot et al. (2018). However the proposed attacks are highly engineered and focused on English.

Conclusion

This paper highlights the importance of performing meaning-preserving adversarial perturbations for NLP models (with a focus on seq2seq). We proposed a general evaluation framework for adversarial perturbations and compared various automatic metrics as proxies for human judgment to instantiate this framework. We then confirmed that, in the context of MT, “naive” attacks do not preserve meaning in general, and proposed alternatives to remedy this issue. Finally, we have shown the utility of adversarial training in this paradigm. We hope that this helps future work in this area of research to evaluate meaning conservation more consistently.

Acknowledgments

The authors would like to extend their thanks to members of the LATTE team at Facebook and Neulab at Carnegie Mellon University for valuable discussions, as well as the anonymous reviewers for their insightful feedback. This research was partially funded by Facebook.

References

Appendix A Supplemental Material

We use the following snippet to produce an OOV word from an existing word:

A.2 IWSLT2016 Dataset

See Table 7 for statistics on the size of the IWSLT2016 corpus used in our experiments.

A.3 Breakdown of Correlation with Human Judgement

We provide a breakdown of the correlation coefficients of automatic metrics with human judgment for source-side meaning-preservation, both in terms of number of perturbed words (Table 8) and constraint (Table 9). While those coefficients are computed on a much smaller sample size, and their differences are not all statistically significant with $p<0.01$, they exhibit the same trend as the results from Table 3 (BLEU $<$ METEOR $<$ chrF). In particular, Table 8 shows that the good correlation of chrF with human judgment is not only due to the ability to distinguish between different numbers of edits.