Learning to Deceive with Attention-Based Explanations

Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, Zachary C. Lipton

Introduction

Since their introduction as a method for aligning inputs and outputs in neural machine translation, attention mechanisms Bahdanau et al. (2014) have emerged as effective components in various neural network architectures. Attention works by aggregating a set of tokens via a weighted sum, where the attention weights are calculated as a function of both the input encodings and the state of the decoder.

Because attention mechanisms allocate weight among the encoded tokens, these coefficients are sometimes thought of intuitively as indicating which tokens the model focuses on when making a particular prediction. Based on this loose intuition, attention weights are often claimed to explain a model’s predictions. For example, a recent survey on attention Galassi et al. (2019) remarks:

“By inspecting the network’s attention, … one could attempt to investigate and understand the outcome of neural networks. Hence, weight visualization is now common practice.”

In another work, De-Arteaga et al. (2019) study gender bias in machine learning models for occupation classification. As machine learning is increasingly used in hiring processes for tasks including resume filtering, the potential for bias on the basis of gender raises the spectre that automating this process could lead to social harms. De-Arteaga et al. (2019) use attention over gender-revealing tokens (e.g., ‘she’, ‘he’, etc.) to verify the biases in occupation classification models—stating that “the attention weights indicate which tokens are most predictive”. Similar claims about attention’s utility for interpreting models’ predictions are common in the literature (Li et al., 2016; Xu et al., 2015; Choi et al., 2016; Xie et al., 2017; Martins and Astudillo, 2016; Lai and Tan, 2019).

In this paper, we question whether attention scores necessarily indicate features that influence a model’s predictions. Through a series of experiments on diverse classification and sequence-to-sequence tasks, we show that attention scores are surprisingly easy to manipulate. We design a simple training scheme whereby the resulting models appear to assign little attention to a specified set of impermissible tokens while continuing to rely upon those features for prediction. The ease with which attention can be manipulated without significantly affecting performance suggests that even if a vanilla model’s attention weights conferred some insight (still an open and ill-defined question), these insights would rely on knowing the objective on which models were trained.

Our results present troublesome implications for proposed uses of attention in the context of fairness, accountability, and transparency. For example, malicious practitioners asked to justify how their models work by pointing to attention weights could mislead regulators with this scheme. Looking at the manipulated attention-based explanation in Table 1, one might (incorrectly) assume that the model does not rely on the gender prefix. To quantitatively study the extent of such deception, we conduct studies where we ask human subjects whether the biased occupation classification models (like the ones audited by De-Arteaga et al. (2019)) rely on gender-related information. We find that our manipulation scheme is able to deceive human annotators into believing that manipulated models do not take gender into account, whereas the models are heavily biased against gender minorities (see §5.2).

Lastly, practitioners often overlook the fact that attention is typically not applied over words but over final layer representations, which themselves capture information from neighboring words. We investigate the mechanisms through which the manipulated models attain low attention values. We note that (i) recurrent connections allow information to flow easily to neighboring representations; (ii) for cases where the flow is restricted, models tend to increase the magnitude of representations corresponding to impermissible tokens to offset the low attention scores; and (iii) models additionally rely on several alternative mechanisms that vary across random seeds (see §5.3).

Related Work

Many recent papers examine whether attention is a valid explanation or not. Jain et al. (2019) identify alternate adversarial attention weights after the model is trained that nevertheless produce the same predictions, and hence claim that attention is not explanation. However, these attention weights are chosen from a large (infinite up to numerical precision) set of possible values, and thus it is not surprising that multiple weights produce the same prediction. Moreover, since the model does not actually produce these weights, they would never be relied on as explanations in the first place. Similarly, Serrano and Smith (2019) modify attention values of a trained model post-hoc by hard-setting the highest attention values to zero. They find that the number of attention values that must be zeroed out to alter the model’s prediction is often too large, and thus conclude that attention is not a suitable tool for determining which elements should be attributed as responsible for an output. In contrast to these two papers, we manipulate the attention via the learning procedure, producing models whose actual weights might deceive an auditor.

In parallel work to ours, Wiegreffe and Pinter (2019) examine the conditions under which attention can be considered a plausible explanation. They design a similar experiment to ours where they train an adversarial model, whose attention distribution is maximally different from the one produced by the base model. Here we look at a related but different question of how attention can be manipulated away from a set of impermissible tokens. We show that in this setting, our training scheme leads to attention maps which are more deceptive, since people find them to be more believable explanations of the output (see §5.2). We also extend our analysis to sequence-to-sequence tasks, and a broader set of models, including BERT, as well as identify mechanisms by which the manipulated models continue to rely on the impermissible tokens despite assigning low attention to them.

Lastly, several papers deliberately train attention weights by introducing an additional source of supervision to improve predictive performance. In some of these papers, the supervision comes from known word alignments for machine translation Liu et al. (2016); Chen et al. (2016), or by aligning human eye-gaze with model’s attention for sequence classification Barrett et al. (2018).

Manipulating Attention

Let $S=w_{1},w_{2},\dots,w_{n}$ denote an input sequence of $n$ words. We assume that for each task, we are given a pre-specified set of impermissible words $\mathcal{I}$, for which we want to minimize the corresponding attention weights. For example, these may include gender words such as “he”, “she”, “Mr.”, or “Ms.”. We define the mask $\mathbf{m}$ to be a binary vector of size $n$, such that

$$m_{i}=\begin{cases}1, & \text{if } w_{i}\in\mathcal{I}\\ 0, & \text{otherwise.}\end{cases}$$

Further, let $\boldsymbol{\alpha}\in[0,1]^{n}$ denote the attention assigned to each word in $S$ by a model, such that $\sum_{i}\alpha_{i}=1$. For any task-specific loss function $\mathcal{L}$, we define a new objective function $\mathcal{L}^{\prime}=\mathcal{L}+\mathcal{R}$, where $\mathcal{R}$ is an additive penalty term whose purpose is to penalize the model for allocating attention to impermissible words. For a single attention layer, we define $\mathcal{R}$ as:

$$\mathcal{R}=-\lambda\log(1-\boldsymbol{\alpha}^{T}\mathbf{m})$$

and $\lambda$ is a penalty coefficient that modulates the amount of attention assigned to impermissible tokens. The argument of the $\log$ term ( $1-\boldsymbol{\alpha}^{T}\mathbf{m}$ ) captures the total attention weight assigned to permissible words. In contrast to our penalty term, Wiegreffe and Pinter (2019) use KL-divergence to maximally separate the attention distribution of the manipulated model ( $\boldsymbol{\alpha}_{\text{new}}$ ) from the attention distribution of the given model ( $\boldsymbol{\alpha}_{\text{old}}$ ):

$$\mathcal{R}^{\prime}=-\lambda\,\text{KL}(\boldsymbol{\alpha}_{\text{new}}\parallel\boldsymbol{\alpha}_{\text{old}}) \tag{1}$$

However, their penalty term is not directly applicable to our case: instantiating $\boldsymbol{\alpha}_{\text{old}}$ to be uniform over impermissible tokens and $0$ over the remaining tokens results in an undefined loss term.
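
As a rough illustration of the single-layer penalty described above, the following sketch builds the mask $\mathbf{m}$ and evaluates $-\lambda\log(1-\boldsymbol{\alpha}^{T}\mathbf{m})$ on hand-picked attention vectors. The function names, the example tokens, and the attention values are assumptions made for illustration only; they are not taken from the authors' code.

```python
import math

def build_mask(tokens, impermissible):
    """Return m with m[i] = 1 if tokens[i] is an impermissible word, else 0."""
    return [1 if t.lower() in impermissible else 0 for t in tokens]

def attention_penalty(alpha, mask, lam=1.0):
    """Penalty -lam * log(1 - alpha^T m).

    alpha^T m is the attention mass on impermissible tokens, so the penalty
    is 0 when that mass is 0 and grows without bound as it approaches 1
    (math.log raises ValueError if the mass is exactly 1).
    """
    impermissible_mass = sum(a * m for a, m in zip(alpha, mask))
    return -lam * math.log(1.0 - impermissible_mass)

# Hypothetical example: one gender-revealing token followed by three others.
tokens = ["Ms.", "Smith", "practices", "medicine"]
mask = build_mask(tokens, {"he", "she", "mr.", "ms."})

low = attention_penalty([0.01, 0.33, 0.33, 0.33], mask)   # little attention on "Ms."
high = attention_penalty([0.90, 0.04, 0.03, 0.03], mask)  # most attention on "Ms."
print(mask, round(low, 4), round(high, 4))
```

In a training loop this value would be added to the task loss for each example; the sketch only shows how the term behaves as attention moves on or off the masked positions.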

When dealing with models that employ multi-headed attention, which use multiple different attention vectors at each layer of the model Vaswani et al. (2017), we can optimize the mean value of our penalty as assessed over the set of attention heads $\mathcal{H}$ as follows:

$$\mathcal{R}=-\frac{\lambda}{|\mathcal{H}|}\sum_{h\in\mathcal{H}}\log(1-\boldsymbol{\alpha}_{h}^{T}\mathbf{m})$$

When a model has many attention heads, an auditor might not look at the mean attention assigned to certain words but instead look head by head to see if any among them assigns a large amount of attention to impermissible words. Anticipating this, we also explore a variant of our approach for manipulating multi-headed attention where we penalize the maximum amount of attention paid to impermissible words (among all heads) as follows:

$$\mathcal{R}=-\lambda\cdot\min_{h\in\mathcal{H}}\log(1-\boldsymbol{\alpha}_{h}^{T}\mathbf{m})$$

For cases where the impermissible set of tokens is unknown a priori, one can plausibly use the top few highly attended tokens as a proxy.
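
The two multi-head variants described above can be sketched as follows, under the assumption that the mean variant averages the per-head term $\log(1-\boldsymbol{\alpha}_{h}^{T}\mathbf{m})$ and the max variant penalizes only the head with the most attention on impermissible tokens (equivalently, the smallest log term). All names and numbers here are illustrative.

```python
import math

def head_log_terms(alphas, mask):
    """log(1 - alpha_h^T m) for every head h; alphas is a list of per-head vectors."""
    return [
        math.log(1.0 - sum(a * m for a, m in zip(alpha, mask)))
        for alpha in alphas
    ]

def mean_penalty(alphas, mask, lam=1.0):
    """Average the single-head penalty over all heads."""
    terms = head_log_terms(alphas, mask)
    return -lam * sum(terms) / len(terms)

def max_penalty(alphas, mask, lam=1.0):
    """Penalize only the worst head: the smallest log term is the largest penalty."""
    return -lam * min(head_log_terms(alphas, mask))

# Two hypothetical heads over three tokens; only the first token is impermissible.
mask = [1, 0, 0]
alphas = [
    [0.5, 0.25, 0.25],  # head 1 puts half its attention on the masked token
    [0.0, 0.5, 0.5],    # head 2 ignores it
]
print(mean_penalty(alphas, mask), max_penalty(alphas, mask))
```

The max variant is never smaller than the mean variant, which matches the motivation in the text: it targets an auditor who inspects heads one at a time.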

Experimental Setup

We study the manipulability of attention on four binary classification problems and four sequence-to-sequence tasks. In each dataset (in some, by design), a subset of input tokens is known a priori to be indispensable for achieving high accuracy.

Occupation Classification

We use the biographies collected by De-Arteaga et al. (2019) to study bias against gender minorities in occupation classification models. We carve out a binary classification task of distinguishing between surgeons and (non-surgeon) physicians from the multi-class occupation prediction setup. We chose this sub-task because the biographies of the two professions use similar words, and a majority of surgeons ( $>80\%$ ) in the dataset are male. We further downsample the minority classes (female surgeons and male physicians) by a factor of ten, to encourage models to use gender-related tokens. Our models (described in detail later in § 4.2) attain $96.4\%$ accuracy on the task, which drops to $93.8\%$ when the gender pronouns in the biographies are anonymized. Thus, the models (trained on unanonymized data) make use of gender indicators to obtain a higher task performance. Consequently, we consider gender indicators as impermissible tokens for this task.

Pronoun-based Gender Identification

We construct a toy dataset of biographies from Wikipedia, in which we automatically label each biography with a gender (female or male) based solely on the presence of gender pronouns. To do so, we use a pre-specified list of gender pronouns. Biographies containing no gender pronouns, or pronouns spanning both classes, are discarded. Because of the way the labels are assigned, attaining $100\%$ classification accuracy is trivial if the model uses information from the pronouns. However, without the pronouns, it may not be possible to achieve perfect accuracy. Our models trained on the same data with pronouns anonymized achieve at best $72.6\%$ accuracy.
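
The labeling rule described above can be sketched in a few lines. The pronoun lists below are an assumption chosen for illustration; the actual pre-specified list is not given in the text.

```python
import re

# Hypothetical pronoun lists; the paper's own list is not reproduced here.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}

def label_biography(text):
    """Label a biography from its pronouns alone.

    Returns "female" or "male", or None when the biography has no gender
    pronouns or has pronouns from both lists (such biographies are discarded).
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_female = bool(words & FEMALE_PRONOUNS)
    has_male = bool(words & MALE_PRONOUNS)
    if has_female == has_male:
        return None
    return "female" if has_female else "male"

print(label_biography("She published her first novel in 1990."))
print(label_biography("The author lives in Paris."))
```

Because the label is a deterministic function of the pronouns, a classifier that sees them can reach perfect accuracy, which is what makes them a natural choice of impermissible tokens.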

Sentiment Analysis with Distractor Sentences

We use the binary version of Stanford Sentiment Treebank (SST) Socher et al. (2013), comprised of $10,564$ movie reviews. We append one randomly-selected “distractor” sentence to each review, from a set of opening sentences of Wikipedia pages. Opening sentences tend to be declarative statements of fact and typically are sentiment-neutral. Here, without relying upon the tokens in the SST sentences, a model should not be able to outperform random guessing.

Graduate School Reference Letters

We obtain a dataset of recommendation letters written for the purpose of admission to graduate programs. The task is to predict whether the student, for whom the letter was written, was accepted. The letters include students’ ranks and percentile scores as marked by their mentors, which admissions committee members rely on. Indeed, we notice accuracy improvements when using the rank and percentile features in addition to the reference letter. Thus, we consider percentile and rank labels (which are appended at the end of the letter text) as impermissible tokens. An example from each classification task is listed in Table 2. More details about the datasets are in the appendix.

4.2 Classification Models

Embedding + Attention

For illustrative purposes, we start with a simple model with attention directly over word embeddings. The word embeddings are aggregated by a weighted sum (where weights are the attention scores) to form a context vector, which is then fed to a linear layer, followed by a softmax to perform prediction. For all our experiments, we use dot-product attention, where the query vector is a learnable weight vector. In this model, prior to attention there is no interaction between the permissible and impermissible tokens. The embedding dimension size is $128$.
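
A minimal forward pass for such a model might look like the sketch below, written in plain Python for readability. The tiny dimensions, the parameter names (`query`, `W`, `b`), and the toy values are assumptions; a real implementation would use learned parameters and $128$-dimensional embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embedding_attention_forward(embeddings, query, W, b):
    """Dot-product attention over word embeddings, then linear layer + softmax.

    embeddings: one vector per token; query: learnable vector of the same size;
    W, b: weights (one row per class) and biases of the output layer.
    Returns (attention weights, class probabilities).
    """
    alpha = softmax([dot(e, query) for e in embeddings])
    dim = len(query)
    # Context vector: attention-weighted sum of the token embeddings.
    context = [sum(a * e[d] for a, e in zip(alpha, embeddings)) for d in range(dim)]
    logits = [dot(row, context) + bias for row, bias in zip(W, b)]
    return alpha, softmax(logits)

# Toy example: three tokens, 2-dimensional embeddings, two classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, probs = embedding_attention_forward(
    embeddings, query=[0.0, 0.0], W=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0]
)
print(alpha, probs)
```

Note that each token contributes to the prediction only through its own attention weight and embedding, which is why this model cannot route information around a low attention score before the weighted sum.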

BiLSTM + Attention

The encoder is a single-layer bidirectional LSTM model (Graves and Schmidhuber, 2005) with attention, followed by a linear transformation and a softmax to perform classification. The embedding and hidden dimension size are both set to $128$ .

Transformer Models

We use the Bidirectional Encoder Representations from Transformers (BERT) model Devlin et al. (2019). We use the base version consisting of 12 layers with self-attention. Further, each of the self-attention layers consists of 12 attention heads. The first token of every sequence is the special classification token [CLS], whose final hidden state is used for classification tasks. To block the information flow between permissible and impermissible tokens, we multiply attention weights at every layer with a self-attention mask $\mathbf{M}$, a binary matrix of size $n\times n$ where $n$ is the size of the input sequence. An element $\mathbf{M}_{i,j}$ represents whether the token $w_{i}$ should attend to the token $w_{j}$. $\mathbf{M}_{i,j}$ is $1$ if both $i$ and $j$ belong to the same set (either the set of impermissible tokens, $\mathcal{I}$, or its complement $\mathcal{I}^{c}$), and $0$ otherwise. Additionally, the [CLS] token attends to all the tokens, but no token attends to [CLS], to prevent the information flow between $\mathcal{I}$ and $\mathcal{I}^{c}$ (Figure 1 illustrates this setting). We attempt to manipulate attention from the [CLS] token to other tokens, and consider two variants: one where we manipulate the maximum attention across all heads, and one where we manipulate the mean attention.
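
The self-attention mask described above can be constructed as in the following sketch. It assumes that [CLS] sits at index 0, that [CLS] also attends to itself (the text does not say either way), and that the function name and the boolean-flag input format are illustrative choices.

```python
def build_self_attention_mask(is_impermissible):
    """Build the binary mask M for a sequence "[CLS] w_1 ... w_k".

    is_impermissible[t] flags whether word w_{t+1} is impermissible.
    M[i][j] = 1 means position i may attend to position j.
    """
    n = len(is_impermissible) + 1  # +1 for [CLS] at index 0
    M = [[0] * n for _ in range(n)]
    # [CLS] attends to every position (including itself, by assumption).
    for j in range(n):
        M[0][j] = 1
    # Words attend only within their own set; column 0 stays 0 so that
    # no word attends to [CLS].
    for i in range(1, n):
        for j in range(1, n):
            if is_impermissible[i - 1] == is_impermissible[j - 1]:
                M[i][j] = 1
    return M

# Hypothetical input: the first word is impermissible, the next two are not.
M = build_self_attention_mask([True, False, False])
for row in M:
    print(row)
```

With this mask, the only path from impermissible tokens to the classifier runs through the [CLS] attention weights, which are the weights being manipulated.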

4.3 Sequence-to-sequence Tasks

Previous studies analysing the interpretability of attention are all restricted to classification tasks (Jain et al., 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). However, the attention mechanism was first introduced for, and reportedly leads to significant gains in, sequence-to-sequence tasks. Here, we analyse whether for such tasks attention can be manipulated away from its usual interpretation as an alignment between output and input tokens. We begin with three synthetic sequence-to-sequence tasks that involve learning simple input-to-output mappings. These tasks have been previously used in the literature to assess the ability of RNNs to learn long-range reorderings and substitutions Grefenstette et al. (2015).

Bigram Flipping

The task is to reverse the bigrams in the input $(\{w_{1},w_{2}\dots w_{2n-1},w_{2n}\}\rightarrow\{w_{2},w_{1},\dots w_{2n},w_{2n-1}\})$.

Sequence Copying

The task requires copying the input sequence $(\{w_{1},w_{2}\dots w_{n-1},w_{n}\}\rightarrow\{w_{1},w_{2}\dots w_{n-1},w_{n}\})$ .

Sequence Reversal

The goal here is to reverse the input sequence $(\{w_{1},w_{2}\dots w_{n-1},w_{n}\}\rightarrow\{w_{n},w_{n-1}\dots w_{2},w_{1}\})$ .

The motivation for evaluating on the synthetic tasks is that for any given target token, we precisely know the input tokens responsible. Thus, for these tasks, the gold alignments act as impermissible tokens in our setup (which are different for each output token). For each of the three tasks, we programmatically generate $100$K random input training sequences (with their corresponding target sequences) of length up to $32$. The input and output vocabulary is fixed to $1000$ unique tokens. For the task of bigram flipping, the input lengths are restricted to be even. We use two sets of $100$K unseen random sequences from the same distribution as the validation and test set.
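
The three synthetic mappings and the random input generation can be sketched as below. The minimum length of 1, the uniform sampling of lengths and tokens, and the way odd lengths are adjusted for bigram flipping are assumptions; the text only fixes the maximum length ($32$), the vocabulary size ($1000$), and the even-length restriction.

```python
import random

def flip_bigrams(seq):
    """Swap each adjacent pair: [w1, w2, w3, w4] -> [w2, w1, w4, w3].

    Assumes an even-length input, as required for this task.
    """
    out = []
    for i in range(0, len(seq) - 1, 2):
        out.extend([seq[i + 1], seq[i]])
    return out

def copy_sequence(seq):
    """Target equals the input."""
    return list(seq)

def reverse_sequence(seq):
    """Target is the input read right to left."""
    return list(reversed(seq))

def random_sequence(rng, max_len=32, vocab_size=1000, even_length=False):
    """Sample token ids uniformly; lengths are uniform in [1, max_len] (assumed)."""
    length = rng.randint(1, max_len)
    if even_length and length % 2 == 1:
        # Nudge odd lengths to an even value without exceeding max_len.
        length += 1 if length < max_len else -1
    return [rng.randrange(vocab_size) for _ in range(length)]

rng = random.Random(0)
source = random_sequence(rng, even_length=True)
pairs = {
    "flip": (source, flip_bigrams(source)),
    "copy": (source, copy_sequence(source)),
    "reverse": (source, reverse_sequence(source)),
}
print(len(source), pairs["flip"][1][:4])
```

For each task the gold alignment of an output position is known exactly (for example, output position 1 in bigram flipping aligns to input position 2), which is what allows those positions to serve as impermissible tokens.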

Machine Translation (English to German)

Besides synthetic tasks, we also evaluate on English to German translation. We use the Multi30K dataset, comprising image descriptions Elliott et al. (2016). Since the gold target to source word-level alignment is unavailable, we rely on the Fast Align toolkit Dyer et al. (2013) to align target words to their source counterparts. We use these aligned words as impermissible tokens.

For all sequence-to-sequence tasks, we use an encoder-decoder architecture. Our encoder is a bidirectional GRU, and our decoder is a unidirectional GRU, with dot-product attention over source tokens, computed at each decoding timestep. Implementation details: the encoder and decoder token embedding size is 256, the encoder and decoder hidden dimension size is 512, and the teacher forcing ratio is 0.5. We use top-1 greedy strategy to decode the output sequence. We also run ablation studies with (i) no attention, i.e. just using the last (or the first) hidden state of the encoder; and (ii) uniform attention, i.e. all the source tokens are uniformly weighted. All data and code will be released on publication.

Results and Discussion

In this section we examine how lowering attention affects task performance (§ 5.1). We then present experiments with human participants to quantify the deception with manipulated attention (§ 5.2). Lastly, we identify alternate workarounds through which models preserve task performance (§ 5.3).

For the classification tasks, we experiment with the loss coefficient $\lambda\in\{0,0.1,1\}$. In each experiment, we measure (i) the attention mass: the sum of attention values over the set of impermissible tokens, averaged over all the examples, and (ii) the test accuracy. During the course of training (i.e. after each epoch), we arrive at different models, from which we choose the one whose performance is within $2\%$ of the original accuracy and provides the greatest reduction in attention mass on impermissible tokens. This is done using the development set, and the results on the test set from the chosen model are presented in Table 3. Across most tasks and models, we find that our manipulation scheme severely reduces the attention mass on impermissible tokens compared to models without any manipulation (i.e. when $\lambda=0$ ). This reduction comes at a minor, or no, decrease in task accuracy. Note that the models cannot achieve performance similar to the original model (as they do) unless they rely on the set of impermissible tokens. This can be seen from the gap between models that do not use impermissible tokens ( $\mathcal{I}$ ✗) and ones that do ( $\mathcal{I}$ ✓).
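
The attention-mass metric and the checkpoint selection rule described above can be sketched as follows. The sketch assumes that "within $2\%$" means at most two percentage points below the original development accuracy; the dictionary keys and the example numbers are hypothetical and are not results from the paper.

```python
def attention_mass(alphas, masks):
    """Mean over examples of the attention summed over impermissible tokens."""
    per_example = [
        sum(a * m for a, m in zip(alpha, mask))
        for alpha, mask in zip(alphas, masks)
    ]
    return sum(per_example) / len(per_example)

def select_checkpoint(checkpoints, original_accuracy, tolerance=2.0):
    """Among per-epoch checkpoints whose dev accuracy is within `tolerance`
    points of the original, return the one with the lowest dev attention mass
    (None if no checkpoint qualifies)."""
    eligible = [
        c for c in checkpoints
        if c["dev_accuracy"] >= original_accuracy - tolerance
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["dev_attention_mass"])

# Made-up checkpoints for illustration only.
checkpoints = [
    {"epoch": 1, "dev_accuracy": 96.0, "dev_attention_mass": 0.40},
    {"epoch": 2, "dev_accuracy": 95.1, "dev_attention_mass": 0.02},
    {"epoch": 3, "dev_accuracy": 90.0, "dev_attention_mass": 0.001},
]
chosen = select_checkpoint(checkpoints, original_accuracy=96.4)
mass = attention_mass([[0.5, 0.5], [0.1, 0.9]], [[1, 0], [1, 0]])
print(chosen["epoch"], mass)
```

In this toy run the third checkpoint has the lowest attention mass but is rejected because its accuracy falls outside the tolerance, so the second checkpoint is selected.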

The only outlier to our findings is the SST+Wiki sentiment analysis task, where we observe that the manipulated Embedding and BiLSTM models reduce the attention mass but also lose accuracy. We speculate that these models are underparameterized, and thus jointly reducing attention mass and retaining original accuracy is harder. The more expressive BERT obtains an accuracy of over $90\%$ while reducing the maximum attention mass over the movie review from $96.2\%$ to $10^{-3}\%$.

For sequence-to-sequence tasks, from Table 4, we observe that our manipulation scheme can similarly reduce attention mass over impermissible alignments while preserving original performance. To measure performance, we use token-by-token accuracy for synthetic tasks, and BLEU score for English to German MT. We also notice that the models with manipulated attention (i.e. deliberately misaligned) outperform models with no attention or uniform attention. This suggests that attention mechanisms add value to the learning process in sequence-to-sequence tasks which goes beyond their usual interpretation as alignments.

5.2 Human Study

We present three human subjects a series of inputs and outputs from the BiLSTM models, trained to predict occupation (physician or surgeon) given a short biography. The participating subjects are graduate students, proficient in English, and unaware of our work. We highlight the input tokens as per the attention scores from three different schemes: (i) original dot-product attention, (ii) adversarial attention from Wiegreffe and Pinter (2019), and (iii) our proposed attention manipulation strategy. We ask human annotators (Q1): Do you think that this prediction was influenced by the gender of the individual? Each participant answers either “yes” or “no” for a set of $50$ examples from each of the three attention schemes. We shuffled the order of sets among the three participants to prevent any ordering bias. Full details of the instructions presented to the annotators are in the appendix. After looking at $50$ examples from a given attention scheme, we inquire about trustworthiness of the attention scores (Q2): Do you believe the highlighted tokens capture the factors that drive the models’ prediction? They answer the question on a scale of $1$ to $4$, where $1$ denotes that the highlighted tokens do not determine the models’ prediction, whereas $4$ implies they significantly determine the models’ prediction. We deliberately ask participants once (towards the end) about the trustworthiness of attention-based explanations, in contrast to polling after each example, as it requires multiple examples to assess whether the explanations capture factors that are predictive.

We find that for the original dot-product attention, annotators labeled $66\%$ of predictions to be influenced by gender. For the other two attention schemes, by contrast, none of the predictions were marked to be influenced by gender (see Table 5). This is despite all three models achieving roughly the same high accuracy ( $96\%$ ), which relies on gender information. This demonstrates the efficacy of our manipulation scheme: predictions from models biased against gender minorities are perceived (by human participants) as not being influenced by gender. Further, our manipulated explanations receive a trustworthiness score of $2.67$ (out of $4$), only slightly lower than the score for the original explanations, and significantly better than the adversarial attention. We found that the KL divergence term in training adversarial attention (Eq. 1) encourages all the attention mass to concentrate on a single uninformative token for most examples, and hence was deemed less trustworthy by the annotators (see Table 5, more examples in appendix). By contrast, our manipulation scheme only reduces attention mass over problematic tokens, and retains attention over non-problematic but predictive ones (e.g. “medicine”), making it more believable. We assess agreement among annotators, and calculate the Fleiss’ Kappa to be $0.97$, suggesting almost perfect agreement.
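
For readers unfamiliar with the agreement statistic mentioned above, the sketch below computes Fleiss' Kappa from a table of per-item category counts using its standard definition. The toy ratings are invented for illustration and are not the study's annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a list of items, where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must be rated by the
    same number of raters (at least 2).

    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)

    return (observed - expected) / (1.0 - expected)

# Toy data: 3 raters, 4 items, categories ("yes", "no").
toy = [[3, 0], [0, 3], [3, 0], [2, 1]]
print(fleiss_kappa(toy))
```

Values close to $1$ indicate that raters agree far more often than the category frequencies alone would predict.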

5.3 Alternative Workarounds

We identify two mechanisms by which the models cheat, obtaining low attention values while remaining accurate.

Models with recurrent encoders can simply pass information across tokens through recurrent connections, prior to the application of attention. To measure this effect, we hard-set the attention values corresponding to impermissible words to zero after the manipulated model is trained, thus clipping their direct contributions for inference. For gender classification using the BiLSTM model, we are still able to predict over $99\%$ of instances correctly, thus confirming a large degree of information flow to neighboring representations. A recent study Brunner et al. (2019) similarly observes a high degree of ‘mixing’ of information across layers in Transformer models. In contrast, the Embedding model (which has no means to pass the information pre-attention) attains only about $50\%$ test accuracy after zeroing the attention values for gender pronouns. We see similar evidence of passing around information in sequence-to-sequence models, where certain manipulated attention maps are off by one or two positions from the gold alignments (see Figure 3).
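
The post-hoc intervention described above, setting the attention on impermissible positions to zero at inference time, can be sketched as follows. The sketch does not renormalize the remaining weights, which is an assumption since the text does not specify it; the function names and toy values are illustrative.

```python
def zero_impermissible_attention(alpha, mask):
    """Hard-set attention on impermissible positions (mask == 1) to zero.

    The remaining weights are left as they are (no renormalization, by assumption).
    """
    return [0.0 if m == 1 else a for a, m in zip(alpha, mask)]

def weighted_sum(alpha, states):
    """Context vector: attention-weighted sum of the encoder states."""
    dim = len(states[0])
    return [sum(a * s[d] for a, s in zip(alpha, states)) for d in range(dim)]

# Toy example: three encoder states, the first one at an impermissible position.
alpha = [0.2, 0.5, 0.3]
mask = [1, 0, 0]
states = [[10.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

clipped = zero_impermissible_attention(alpha, mask)
context = weighted_sum(clipped, states)
print(clipped, context)
```

If the states at permissible positions already encode the impermissible information (as recurrent encoders allow), the resulting context vector still carries it, which is consistent with the accuracy the BiLSTM retains under this intervention.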

Models restricted from passing information prior to the attention mechanism tend to increase the magnitude of the representations corresponding to impermissible words to compensate for the low attention values. This effect is illustrated in Figure 4, where the L2 norm of embeddings for impermissible tokens increases considerably for the Embedding model during training. We do not see increased embedding norms for the BiLSTM model, as this is unnecessary due to the model’s capability to move around relevant information.

We also notice that differently initialized models attain different alternative mechanisms. In Figure 3, we present attention maps from the original model, alongside two manipulated models initialized with different seeds. In some cases, the attention map is off by one or two positions from the gold alignments. In other cases, all the attention is confined to the first hidden state. In such cases, manipulated models are similar to a no-attention model, yet they offer better performance. In preliminary experiments, we found a few such models that outperform the no-attention baseline, even when the attention is turned off during inference. This suggests that attention offers benefits during training, even if it is not used during inference.

Conclusion

Amidst practices that perceive attention scores to be an indication of what the model focuses on, we show that attention scores are easily manipulable. Our simple training scheme produces models with significantly reduced attention mass over tokens known a priori to be useful for prediction, while continuing to use them. Our results raise concerns about the potential use of attention as a tool to audit algorithms, as malicious actors could employ similar techniques to mislead regulators.

Acknowledgement

The authors thank Dr. Julian McAuley for providing, and painstakingly anonymizing the data for reference letters. We also acknowledge Alankar Jain for carefully reading the manuscript and providing useful feedback. ZL thanks Amazon AI, NVIDIA, Salesforce, Facebook AI, AbridgeAI, UPMC, the Center for Machine Learning in Health, the PwC Center, the AI Ethics and Governance Fund, and DARPA’s Learning with Less Labels Initiative, for their support of ACMI Lab’s research on robust and societally aligned machine learning.

References

Supplementary Material

Appendix A Instructions for human study

In a series of examples, we present the inputs and outputs of a machine learning (ML) model trained to predict occupation (physician or surgeon) given a short bio (text). In each bio, we attempt to explain the predictions of the model. Specifically, we employ a technique that highlights words that (per our explanation method) are thought to be responsible for a particular prediction (colloquially, what the model focuses on). For each unique example below, answer the following question: Do you think that this prediction was influenced by the gender of the individual?

Yes, I suspect that the gender influenced the prediction.

No, I have no reason to suspect that gender influenced the prediction.

Please note that, all the examples in this file are input, output pairs from one specific model. Further, darker shades of highlighting indicate a higher emphasis for the token (as per our explanation method).

After showing $50$ examples from a given attention scheme, we inquire: Overall, do you believe the highlighted tokens capture the factors that drive the models’ prediction?

1. The highlighted tokens capture factors that do not determine the models’ prediction.

2. The highlighted tokens capture factors that marginally determine the models’ prediction.

3. The highlighted tokens capture factors that moderately determine the models’ predictions.

4. The highlighted tokens capture factors that significantly determine the models’ predictions.

Appendix B Dataset Details

Details about the datasets used for classification tasks are available in Table 6.

Appendix C Qualitative Examples

A few qualitative examples illustrating three different attention schemes are listed in Table 7.