How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking

Nicola De Cao, Michael Schlichtkrull, Wilker Aziz, Ivan Titov

Introduction

Deep neural networks have become standard tools in NLP, demonstrating impressive improvements over traditional approaches on many tasks (Goldberg, 2017). Their power typically comes at the expense of interpretability, which may prevent users from trusting predictions (Kim, 2015; Ribeiro et al., 2016) and makes it hard to detect model or data deficiencies (Gururangan et al., 2018; Kaushik and Lipton, 2018) or to verify that a model is fair and does not exhibit harmful biases (Sun et al., 2019; Holstein et al., 2019).

These challenges have motivated work on interpretability, both in NLP and generally in machine learning; see Belinkov and Glass (2019) and Jacovi and Goldberg (2020) for reviews. In this work, we study post hoc interpretability where the goal is to explain the prediction of a trained model and to reveal how the model arrives at the decision. This goal is usually approached with attribution methods (Bach et al., 2015; Shrikumar et al., 2017; Sundararajan et al., 2017), which explain the behavior of a model by assigning relevance to inputs.

One way to perform attribution is to use erasure where a subset of features (e.g., input tokens) is considered irrelevant if it can be removed without affecting the model prediction (Li et al., 2016; Feng et al., 2018). The advantage of erasure is that it is conceptually simple and optimizes a well-defined objective. This contrasts with most other attribution methods which rely on heuristic rules to define feature salience; for example, attention-based attribution (Rocktäschel et al., 2016; Serrano and Smith, 2019; Vashishth et al., 2019) or back-propagation methods (Bach et al., 2015; Shrikumar et al., 2017; Sundararajan et al., 2017). These approaches received much scrutiny in recent years (Nie et al., 2018; Sixt et al., 2020; Jain and Wallace, 2019), as they cannot guarantee that the network is ignoring low-scored features. They are often motivated as approximations of erasure (Baehrens et al., 2010; Simonyan et al., 2014; Feng et al., 2018) and sometimes evaluated using erasure as ground-truth (Serrano and Smith, 2019; Jain and Wallace, 2019).

Despite its conceptual simplicity, subset erasure is not commonly used in practice. First, it is generally intractable, and beam search (Feng et al., 2018) or leave-one-out estimates (Zintgraf et al., 2017) are typically used instead. These approximations may be inaccurate. For example, leave-one-out can underestimate the contribution of features due to saturation (Shrikumar et al., 2017). More importantly, even these approximations remain very expensive with modern deep (e.g., BERT-based; Devlin et al., 2019) models, as they require multiple computation passes through the model. Second, the method is susceptible to hindsight bias: the fact that a feature can be dropped does not mean that the model ‘knows’ that it can be dropped and that the feature is not used by the model when processing the example. This results in over-aggressive pruning that does not reflect what information the model uses to arrive at the decision. The issue is pronounced in NLP tasks (see Figure 2(d) and Feng et al., 2018), though it is easier to see on an artificial example (Figure 3(a)). A model is asked to predict whether there are more $8$s than $1$s in the sequence. Erasure attributes the prediction to a single $8$ digit, as this reduced example yields the same decision as the original one. However, this does not reveal what the model was relying on: it must have counted the digits $8$ and $1$, as otherwise it would not have achieved a perfect score on the test set.
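The hindsight bias of exact erasure can be reproduced in a few lines. The sketch below is only illustrative: the trained model is replaced by an exact counting function (an assumption, named `more_eights_than_ones` here), and `minimal_erasure` is a hypothetical brute-force search, not the beam search used in prior work.

```python
from itertools import combinations


def more_eights_than_ones(digits):
    """Stand-in for the trained model (assumption): an exact counter."""
    return digits.count(8) > digits.count(1)


def minimal_erasure(digits, predict):
    """Return indices of the smallest kept subset that preserves predict(digits).

    The search is exhaustive and therefore exponential in len(digits),
    which is why exact subset erasure is intractable for realistic inputs.
    """
    target = predict(digits)
    for size in range(len(digits) + 1):
        for kept in combinations(range(len(digits)), size):
            if predict([digits[i] for i in kept]) == target:
                return list(kept)
    return list(range(len(digits)))  # not reached: the full set always matches


sequence = [8, 3, 1, 8, 5, 8, 1]
kept = minimal_erasure(sequence, more_eights_than_ones)
print(kept, [sequence[i] for i in kept])  # [0] [8]
```

Although the stand-in model counts every $8$ and every $1$, the search keeps a single $8$, because that reduced input already yields the same decision.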

We propose a new method, Differentiable Masking (DiffMask), which overcomes the aforementioned limitations and results in attributions that are more informative and help us understand how the model arrives at the prediction. DiffMask relies on learning sparse stochastic gates (a.k.a. masks), guaranteeing that the information from the masked-out inputs does not get propagated while maintaining end-to-end differentiability without having to resort to REINFORCE (Williams, 1992). The decision to include or disregard an input token is made with a simple model based on intermediate hidden layers of the analyzed model (see Figure 1). First, this amortization circumvents the need for combinatorial search, making the approach efficient at test time. Second, as with probing classifiers (Adi et al., 2017; Belinkov and Glass, 2019), this reveals whether the network ‘knows’ at the corresponding layer what input tokens can be disregarded. During training, inputs are truly masked whenever we sample zeros. After training, attribution scores correspond to the expectation of sampling non-zeros.

The amortization lets us not only plot attribution heatmaps, as in Figure 2(e), but also analyze how decisions are formed across network layers. In our artificial example, we see that in the bottom embedding layer the model cannot discard any tokens, as it does not ‘know’ which digits need to be counted (Figure 3(e), left). In the second layer, it ‘knows’ that these are $8$s and $1$s, so the rest gets discarded (Figure 3(e), right). In question answering (see Figure 8(a)), where we use a $24$-layer model, it takes $13$–$16$ layers for the model to ‘realize’ that ‘Santa Clara Marriott’ is not relevant to the question and discard it. We also adapt our method to measuring the importance of intermediate states rather than inputs. This, as we discuss later, lets us analyze which states in every layer store information crucial for making predictions, giving us insights about the information flow.

We introduce DiffMask, a technique addressing limitations of attribution-based methods (especially erasure and its approximations), and demonstrate that it is stable and faithful to the analyzed models. We then use this technique to analyze BERT-based models fine-tuned on sentiment classification and question answering.

Method

Masking, however, as in multiplication by zero, makes a strong assumption about the geometry of the feature space, in particular, it assumes that the zero vector bears no information. Instead, we replace some of the inputs by a learned baseline vector $b$ , i.e., $\hat{x}_{i}=z_{i}\cdot x_{i}+(1-z_{i})\cdot b$ .
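As a minimal sketch of the replacement rule $\hat{x}_{i}=z_{i}\cdot x_{i}+(1-z_{i})\cdot b$, the function below mixes each input vector with a shared baseline. The function name and the numeric values are illustrative assumptions; in the method the baseline $b$ is learned, whereas here it is a fixed placeholder.

```python
def mask_with_baseline(x, z, b):
    """Compute x_hat[i] = z[i] * x[i] + (1 - z[i]) * b for every position i.

    x: list of input vectors (lists of floats), one per position
    z: list of gate values in [0, 1], one per position
    b: baseline vector shared by all positions (a fixed placeholder here)
    """
    return [
        [z_i * x_ij + (1.0 - z_i) * b_j for x_ij, b_j in zip(x_i, b)]
        for x_i, z_i in zip(x, z)
    ]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
z = [1.0, 0.0, 0.5]   # keep, fully mask, partially mask
b = [0.5, -0.5]
x_hat = mask_with_baseline(x, z, b)
```

A gate of exactly $0$ replaces the input by $b$, so nothing about $x_{i}$ is propagated, while any non-zero gate lets part of $x_{i}$ through.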

A practical way to minimize the number of non-zeros predicted by $g$ is minimizing the $L_{0}$ ‘norm’. $L_{0}$, denoted $\|z\|_{0}$ and defined as $\#(i|z_{i}\neq 0)$, is the number of non-zero entries in a vector. Contrary to $L_{1}$ or $L_{2}$, $L_{0}$ is not a homogeneous function and, thus, not a proper norm. However, contemporary literature refers to it as a norm, and we do so as well to avoid confusion. Thus, our ${\mathcal{L}}_{0}$ loss is defined as the total number of positions that are not masked:

$${\mathcal{L}}_{0}=\sum_{i}\mathbf{1}(z_{i}\neq 0),$$

where $\mathbf{1}(\cdot)$ is the indicator function. We minimize ${\mathcal{L}}_{0}$ for all data-points in the dataset ${\mathcal{D}}$ subject to a constraint that predictions from masked inputs have to be similar to the original model predictions:

Our objective poses two challenges: i) $L_{0}$ is discontinuous and has zero derivative almost everywhere, and ii) to output binary masks, $g$ needs a discontinuous output activation such as the step function. A strategy to overcome both problems is to make the binary variables stochastic and treat the objective in expectation, in which case one option is to resort to REINFORCE (Williams, 1992) and another is to use a sparse relaxation to binary variables (Louizos et al., 2018; Bastings et al., 2019). As we shall see (we compare the two aforementioned options in Table 2 and discuss them in Section 3.2), the latter proved more effective. Thus we opt to use the Hard Concrete distribution, a mixed discrete-continuous distribution on the closed interval $[0,1]$. This distribution assigns a non-zero probability to exactly zero while it also admits continuous outcomes in the unit interval via the reparameterization trick (Kingma and Welling, 2014). We refer to Louizos et al. (2018) for details, but also provide a brief summary in Appendix B. With stochastic masks, the objective is computed in expectation, which addresses both sources of non-differentiability. Note that during training inputs are truly masked out whenever we sample exact zeros. After training, attribution scores correspond to the expectation of sampling non-zero masks since any non-zero value corresponds to a leak of information.

Experiments

The goal of this work is to uncover a faithful interpretation of an existing model, i.e., revealing, as accurately as possible, the process by which the model arrives at the prediction. Human-provided labels, such as human rationales (Camburu et al., 2018; DeYoung et al., 2020), will not help us in demonstrating this, as humans cannot judge if an interpretation is faithful (Jacovi and Goldberg, 2020). More precisely, human-provided labels do not show how the model behaves – e.g., annotations of what parts of the input are relevant for solving a particular task do not constitute a guarantee that a model relies on those parts more than others when making a prediction. When we evaluate an attribution method by comparing its outputs with human annotations, we are not measuring whether it provides faithful attributions but only if they are plausible according to humans. This goes against our goals as we aim to use the interpretation method to detect model deficiencies, which are usually cases where the model does not behave like humans. The ground-truth explanations of how a model makes certain predictions depend not only on the data but also on the model, and, unfortunately, are generally not known for real tasks and with complex models. This makes the evaluation and comparison of attribution methods non-trivial.

Our strategy is to i) show the effectiveness of DiffMask in a controlled setting (i.e., a toy task) where ground-truth is available; ii) test the effectiveness of our relaxation for learning discrete masks (on a real model for sentiment classification); and iii) demonstrate that the method is stable and models behave the same when masking is applied. Once we have established that DiffMask can be trusted, we use it to analyze BERT-based models (Devlin et al., 2019) fine-tuned on sentiment classification, and on question answering. We report hyperparameters in Appendix C, and additional plots, examples and analysis in Appendix D.

Our toy task is defined as: given a sequence $x$ of digits (i.e., $x_{i}\in\{0,\cdots,9\}$ ), and a query $\langle n,m\rangle$ of two digits, determine whether $\#n\!>\!\#m$ in $x$ .

The query and input are embedded, concatenated, and then fed to a single-layer feed-forward NN, followed by a single-layer unidirectional GRU (Cho et al., 2014). We use a feed-forward NN to incorporate the query information, rather than another GRU layer, to ensure that counting cannot happen in the first layer. This helps us define the ground-truth for the method. The classification is done by a linear layer that acts on the last hidden state of the GRU. See Appendix C.1 for all hyperparameters and a more precise definition of the architecture. Unsurprisingly, the model solves the task almost perfectly (accuracy on test is $>\!99\%$).

We plot the distribution of hidden states (we use dimensionality $2$, both to create a bottleneck and to support clear visualization) and observe a linear separation between states of digits present in the query and states of digits not in the query. This means that the role of the feed-forward layer is to decide which digits to keep. Since the model solves the task, the role of the GRU must then be to count which digit occurred the most. The prediction must therefore be attributed uniformly to all the hidden states corresponding to either $n$ or $m$. For completeness, Figure 11 in Appendix D.1 shows this plot.
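The ground truth described above can be written down directly. The sketch below, with hypothetical function names, labels a sequence for a query $\langle n,m\rangle$ and assigns attribution $1$ to every position holding $n$ or $m$ and $0$ to all others, matching the normalization between $0$ and $1$ used in the comparison.

```python
def toy_label(x, n, m):
    """True if digit n occurs more often than digit m in the sequence x."""
    return x.count(n) > x.count(m)


def ground_truth_attribution(x, n, m):
    """Uniform attribution over positions whose digit appears in the query."""
    return [1.0 if digit in (n, m) else 0.0 for digit in x]


x = [3, 8, 1, 8, 5]
label = toy_label(x, n=8, m=1)
attribution = ground_truth_attribution(x, n=8, m=1)
print(label, attribution)  # True [0.0, 1.0, 1.0, 1.0, 0.0]
```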

We start with an example of input attributions, see Figure 3, which illustrates how DiffMask goes beyond input attribution as typically known. To enable comparison across methods, the attributions in this section are normalized between $0$ and $1$. The attribution provided by erasure (Figure 3(a)) is not informative: for each datapoint the search always finds a single digit that is sufficient to maintain the original prediction and discards all the other inputs. The perturbation methods by Schulz et al. (2020) and Guan et al. (2019) (Figure 3(b) and 3(d)) are also over-aggressive in pruning. They assign low attribution to some items in the query even though those had to be considered when making the prediction. Differently from other methods, DiffMask reveals input attributions conditioned on different levels of depth. Figure 3(e) shows both input attributions according to the input itself and according to the hidden layer. It reveals that at the embedding layer there is no information regarding what part of the input can be erased: attribution is uniform over the input sequence. After the model has observed the query, hidden states predict that masking input digits other than $n$ and $m$ will not affect the final prediction: attribution is uniform over digits in the query. This reveals the role of the feed-forward layer as a filter for positions relevant to the query. Other methods do not allow for this type of inspection. These observations are consistent across the entire test set.

Sentiment Classification

We turn now to a real task and analyze models fine-tuned for sentiment classification on the Stanford Sentiment Treebank (SST; Socher et al., 2013).

While we cannot use human labels to evaluate the faithfulness of our method, comparing them with DiffMask attributions will tell us whether the sentiment model relies on the same cues as humans. Specifically, we compare to the SST token-level annotations of sentiment. In Figure 4(a), we show after how many layers on average an input token is dropped, depending on its sentiment label. This suggests that the model relies more heavily on strongly positive or negative words and, thus, is generally consistent with human judgments (i.e., plausible).

We used DiffMask to analyze the behavior of our BERT model. In Figure 5, we report the average number of layers that input tokens or hidden states are kept for (or, equivalently, after how many layers they are dropped on average), aggregating by part-of-speech (POS) tags. It turns out that determiners, punctuation, and pronouns can be completely discarded from the input across the entire validation set, while adjectives and nouns should be kept. The [CLS] and [SEP] tokens can also be ignored, indicating that the model does not need such markers. Examining the distribution of POS tags for hidden states leads to further conclusions. Here, the [CLS] and [SEP] tokens are the most important ones. This is not surprising, as the classifier on top of BERT uses the [CLS] hidden state, which gets progressively updated through all layers. Both special tokens are unimportant as inputs because BERT can infer these markers in other layers; however, they are heavily used in the computation.
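One possible way to compute such a per-tag statistic is sketched below. It assumes that each token comes with one keep expectation per layer and that the expected number of layers a token is kept for is the sum of those expectations; both the function name and the toy numbers are illustrative and not taken from our experiments.

```python
from collections import defaultdict


def mean_layers_kept_by_tag(keep_probs, tags):
    """Average, per POS tag, the expected number of layers a token is kept.

    keep_probs: one list per token with its keep expectation at every layer
    tags:       one POS tag per token, aligned with keep_probs
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probs, tag in zip(keep_probs, tags):
        totals[tag] += sum(probs)  # expected number of layers keeping the token
        counts[tag] += 1
    return {tag: totals[tag] / counts[tag] for tag in totals}


# Made-up expectations for five tokens and three layers.
keep_probs = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.5],
    [1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]
tags = ["DET", "ADJ", "NOUN", "PUNCT", "ADJ"]
stats = mean_layers_kept_by_tag(keep_probs, tags)
```

Thresholding the expectations before counting layers would be an equally plausible choice; the sum is used here only because it requires no extra hyperparameter.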

In Figure 6(e), we show a visual example of this. We see that the model, even in the bottom layers, knows that the punctuation and both separators can be dropped from the input. This contrasts with hidden state attribution (Figure 6(f)), which indicates that the separator states (especially [SEP]) are very important. By putting this information together, we can hypothesize that the separator is used to aggregate information from the sentence, relying on self-attention. In fact, this aggregation is still happening in layer $12$; at the very top layers, states corresponding to almost all non-separator tokens can be dropped.

In Figure 6, we visually compare different techniques on one example from the validation set. While previous techniques (e.g., integrated gradient) do not let us test what a model ‘knows’ in a given layer (i.e., attribution to input conditioned on a layer), they can be used to perform attribution to hidden layers. All methods except attention correctly highlight the last hidden state of the [CLS] token as important. Its importance is due to the top-level classifier using the [CLS] hidden state. Although for DiffMask we show the expectation of keeping states, it assigns much sharper attributions. For instance, on the validation set, it assigns the largest attribution to the last hidden state of [CLS] $99\%$ of the time, whereas the method of Schulz et al. (2020) does so only $71\%$ of the time. Raw attention (Figure 6(a)) does not seem to highlight any significant patterns in that example except that start and end of sentence tokens ([CLS] and [SEP], respectively) receive more attention than the rest. Voita et al. (2019b) and Michel et al. (2019) pointed out that many Transformer heads play no or only a minor role. Attributions by Schulz et al. (2020) and Guan et al. (2019) assign slightly higher importance to hidden states corresponding to ‘highly’ and ‘enjoyable’, whereas it is hard to see any informative patterns provided by integrated gradient. Notice that for DiffMask, a near-zero attribution has a very clear interpretation: such a state is not used for prediction since in expectation it is dropped (not gated).

Question Answering

We turn now to QA, where we analyze a fine-tuned BERT$_{\text{LARGE}}$ model on the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016).

We start by asking DiffMask: which tokens does the model keep? As for sentiment classification, we analyze POS tags over the entire validation set and summarize the results in Figure 14 in Appendix D.2. It turns out that, on average, conjunctions and adpositions are dropped by the embedding and first layer, respectively. In contrast, proper nouns and punctuation are usually predicted to be dropped only after the $14$th layer. We argue that, due to the pre-training objective, BERT could easily infer missing parts of the input, especially if they are trivial to infer (e.g., as is often the case for prepositions). Nouns and proper nouns, on the other hand, are important as they account for $84\%$ of the answers on SQuAD. For example, in Figure 8(a), we can see that it takes $13$–$16$ layers for the model to ‘realize’ that ‘Santa Clara Marriott’ is not relevant to the question and discard it.

Unlike in sentiment classification, separator tokens as well as punctuation assume a central role as inputs (i.e., punctuation is the most important POS tag, as for both questions and passages it is usually dropped only after the $17$th layer). Punctuation serves to demarcate sentence boundaries, which is useful for QA but not for sentiment classification.

Tokens from questions are generally masked by higher layers than tokens from passages, as we show in Figure 7(a), which suggests that they are more important. We highlight that even in higher layers, when DiffMask masks $>\!95\%$ of the tokens, the original model prediction is almost always kept ($>\!90\%$ of the time). Notably, when the original BERT makes wrong predictions, the tokens annotated as the ground-truth answer are kept $\sim\!60\%$ of the time. This may suggest that when this happens the model still considers other options (e.g., valid options such as the ground truth) as plausible, and thus DiffMask detects them as important.

Now, we inspect hidden state attributions to answer: where is the information stored? In Figure 7(b) we can see a similar trend as for masking inputs, i.e., the question’s hidden states are kept more on average and deeper in the computation. States in layers $2$–$3$ are dropped less than those in the embedding and first layers. This is consistent with the findings of Voita et al. (2019a), which show that frequent tokens, such as determiners, accumulate contextual information. However, they are not important as inputs, as we show in an example in Figure 8(b).

The hidden states corresponding to separator tokens are always kept across all layers except the last one, throughout the validation set. Note that this token is also used as a delimiter between the question and the passage, and hence indicates where questions as well as passages end.

The level of hidden state pruning increases gradually after layer $3$ and then becomes strong: after layer $9$, more than $50\%$ of the states can be masked out. A steep increase in superfluous states at layers $13$–$14$ (visible in both parts of Figure 7) may indicate that some states, at that point in the computation, contain enough information for the classification, while all the others can indeed be removed without affecting the model prediction. Our observation that higher layers are more predictive is in line with the findings of Kovaleva et al. (2019), who pointed out that the final layers of BERT change most and are more task-specific. Again, the fact that states corresponding to the ground-truth answer are still active in the top layers when the model makes a wrong prediction indicates that the model is still considering different span options across the top layers as well.

As we do not have access to the ground-truth, we start by contrasting DiffMask qualitatively to other attribution methods on a few examples. We highlight some common pitfalls that afflict other methods (such as the hindsight bias) and how DiffMask overcomes those. This helps demonstrate our method’s faithfulness to the original model.

Figure 2 shows input attributions by different methods on an example from the validation set. Erasure (Figure 2(d)), as expected, does not provide useful insights: it essentially singles out the answer, discarding everything else including the question. This cannot be faithful and is a simple consequence of erasure’s hindsight bias: when only the span that contains the answer is presented as input, the model predicts that very span as the answer, but this does not imply that the model ignores everything else when presented with the complete document as input. The methods of Schulz et al. (2020) and Guan et al. (2019) optimize attributions on single examples and thus also converge to assigning high importance mostly to words that support the current prediction and that indicate the question type. For this experiment we used the Per-Sample Bottleneck attribution from Schulz et al. (2020). The authors also proposed a Readout Bottleneck, where they train a second neural network to predict the mask. However, differently from our formulation, they condition on subsequent layers, and thus attributions are prone to hindsight bias.

Integrated gradient does not seem to highlight any discernible pattern, which we speculate is mainly because a zero baseline is not suitable for word embeddings. Choosing a more adequate baseline is not straightforward and remains an important open issue (Sturmfels et al., 2020). Note that DiffMask without amortization (Figure 2(f)) resembles erasure (as shown in § 3.2 for SST).

Differently from all other methods, our DiffMask probes the network to understand what it ‘knows’ about the input-output mapping in different layers. In Figure 2(e) we show the expectation of keeping input tokens conditioned on any one of the layers in the model to make such predictions (see Figure 8(a) for a per-layer visualization). Our input attributions highlight that the model, in expectation across layers, wants to keep words in the question, the predicate ‘practice’ in both sentences as well as all potential candidate answers (i.e., named entities). But eventually, the most important spans are in the question and the answer itself.

Related Work

While we motivated our approach through its relation to erasure, an alternative way of looking at our approach is considering it as a perturbation-based method. This recently introduced class of attribution methods (Ying et al., 2019; Guan et al., 2019; Schulz et al., 2020; Taghanaki et al., 2019), instead of erasing input, injects noise. Besides back-propagation and attention-based methods discussed in the introduction, another class of interpretation methods (Murdoch and Szlam, 2017; Singh et al., 2019; Jin et al., 2020) builds on prior work in cooperative game theory (e.g., the Shapley value of Shapley, 1953). These methods are not trivial to apply to a new model, as they are architecture-specific. Their hierarchical versions (e.g., Singh et al., 2019; Jin et al., 2020) also make a strong assumption about the structure of interaction (e.g., forming a tree), which may affect their faithfulness. The work of Chen et al. (2018) shares some similarities with ours, as they also use amortization, but they rely on the Gumbel softmax trick (Maddison et al., 2017; Jang et al., 2017) to approximate minimal subset selection. They assume that the subset contains exactly $k$ elements, where $k$ is a hyperparameter. Moreover, their explainer is a separate model predicting input subsets, rather than a ‘probe’ on top of the model’s hidden layers, and hence cannot be used to reveal how decisions are formed across layers.

A large body of literature has analyzed BERT and Transformer-based models. For example, Tenney et al. (2019) and van Aken et al. (2019) probed BERT layers for a range of linguistic tasks, while Hao et al. (2019) analyzed the optimization surface. Rogers et al. (2020) provide a comprehensive overview of recent BERT analysis papers.

There is a stream of work on learning interpretable models by means of extracting latent rationales (Lei et al., 2016; Bastings et al., 2019). Some of the techniques underlying DiffMask are related to that line of work. They employ stochastic masks to learn an interpretable model, which they train by minimizing a downstream loss subject to constraints on $L_{0}$ , whereas we employ stochastic masks to interpret an existing model, and for that, we minimize $L_{0}$ subject to constraints on that model’s output distribution. In our very recent work Schlichtkrull et al. (2020), we also employ stochastic masks and $L_{0}$ regularization for analyzing graph neural networks. We learn which edges are relevant in multi-hop question answering and graph-based semantic role labeling (Marcheggiani and Titov, 2017; De Cao et al., 2019).

Conclusion

We have introduced a new post hoc interpretation method which learns to completely remove subsets of inputs or hidden states through masking. We circumvent an intractable search by learning an end-to-end differentiable prediction model. To overcome the hindsight bias problem, we probe the model’s hidden states at different depths and amortize predictions over the training set. Faithfulness is validated in a controlled experiment pointing more clearly to some flaws of other attribution methods. We used our method to study BERT-based models on sentiment classification and question answering. DiffMask sheds light on what different layers ‘know’ about the input and where information about the prediction is stored in different layers.

The authors want to thank Christos Baziotis, Elena Voita, Dieuwke Hupkes, and Naomi Saphra for helpful discussions. This project is supported by SAP Innovation Center Network, ERC Starting Grant BroadSem (678254), the Dutch Organization for Scientific Research (NWO) VIDI 639.022.518, and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299 (Gourmet).

References

Appendix A Probe parameterization

Appendix B The Hard Concrete distribution

The Hard Concrete distribution assigns density to continuous outcomes in the open interval $(0,1)$ and non-zero mass to exactly $0$ and exactly $1$. A particularly appealing property of this distribution is that sampling can be done via a differentiable reparameterization (Rezende et al., 2014; Kingma and Welling, 2014). In this way, the ${\mathcal{L}}_{0}$ loss in Equation 1 becomes an expectation

whose gradient can be estimated via Monte Carlo sampling without the need for REINFORCE and without introducing biases. We did modify the original Hard Concrete, though only slightly, so that it gives support to samples in the half-open interval $[0,1)$, that is, with non-zero mass only at $0$. That is because we need only distinguish $0$ from non-zero, and the value $1$ is not particularly important. Only a true $0$ is guaranteed to completely mask an input out, while any non-zero value, however small, may leak some amount of information.

where $\sigma$ is the sigmoid function $\sigma(x)=(1+e^{-x})^{-1}$ and $u\sim{\mathcal{U}}(0,1)$. We point to Appendix B of Louizos et al. (2018) for more information about the density of the resulting distribution and its cumulative distribution function.
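For illustration, the sketch below samples from the original Hard Concrete distribution of Louizos et al. (2018) rather than from our slightly modified variant. The stretch limits $-0.1$ and $1.1$ and the temperature $2/3$ are the defaults suggested in that work and are assumptions here, as are all function and parameter names. The closed-form probability of a non-zero gate is the quantity that plays the role of an attribution score.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hard_concrete(log_alpha, rng, beta=2 / 3, low=-0.1, high=1.1):
    """Draw one gate: a stretched binary Concrete sample clipped to [0, 1]."""
    # Clamp u away from 0 and 1 so that both logarithms are finite.
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    stretched = s * (high - low) + low
    return min(1.0, max(0.0, stretched))


def prob_nonzero(log_alpha, beta=2 / 3, low=-0.1, high=1.1):
    """Closed-form probability that the gate is not exactly zero."""
    return sigmoid(log_alpha - beta * math.log(-low / high))


rng = random.Random(0)
samples = [sample_hard_concrete(0.0, rng) for _ in range(20000)]
empirical = sum(z != 0.0 for z in samples) / len(samples)
print(round(prob_nonzero(0.0), 3), round(empirical, 3))
```

Because the sample is a deterministic, almost everywhere differentiable function of `log_alpha` and the noise `u`, gradients can flow through it in an automatic differentiation framework, which is what removes the need for REINFORCE.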

There is a stream of work on learning interpretable models by means of extracting latent rationales (Lei et al., 2016; Bastings et al., 2019). Some of the techniques underlying DiffMask are related to that line of work, but overall we approach very different problems. Lei et al. (2016) use REINFORCE to minimize a downstream loss computed on masked inputs, where the masks are binary and latent. They employ $L_{0}$ regularization to solve the task while conditioning only on small subsets of the input regarded as a rationale for the prediction. To the same end, Bastings et al. (2019) minimize downstream loss subject to constraints on expected $L_{0}$ using a variant of the sparse relaxation of Louizos et al. (2018). In sum, they employ stochastic masks to learn an interpretable model, which they train by minimizing a downstream loss subject to constraints on $L_{0}$, whereas we employ stochastic masks to interpret an existing model, and for that we minimize $L_{0}$ subject to constraints on that model’s downstream performance.

Appendix C Hyperparameters

We generate sequences of varying length (up to $10$ digits long), sampling each element independently: with $50\%$ probability, we draw uniformly $n$ or $m$ and, with $50\%$ probability, we draw uniformly from the remaining digits. We generate $10$k data points, keeping $10\%$ of them for validation. The space of input sequences is $>\!10^{10}$. Thus, a model that solves the task cannot simply memorize the training set.
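The sampling procedure can be sketched as follows. The text above does not specify how the query or the sequence length are drawn, so the sketch assumes two distinct query digits sampled uniformly and a length sampled uniformly between $1$ and $10$; the function name and the seed are likewise illustrative.

```python
import random


def sample_example(rng, max_len=10):
    """Sample a query <n, m>, a digit sequence x, and the label #n > #m."""
    n, m = rng.sample(range(10), 2)            # assumption: distinct query digits
    others = [d for d in range(10) if d not in (n, m)]
    length = rng.randint(1, max_len)           # assumption: uniform length
    x = [
        rng.choice((n, m)) if rng.random() < 0.5 else rng.choice(others)
        for _ in range(length)
    ]
    return (n, m), x, x.count(n) > x.count(m)


rng = random.Random(0)
data = [sample_example(rng) for _ in range(10_000)]
split = len(data) // 10                        # 10% held out for validation
validation, train = data[:split], data[split:]
```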

The precise model formulation is the following: given a query $q=\langle n,m\rangle$ and an input $x=\langle x_{1},\dots x_{t}\rangle$ , they are embedded as

where $\operatorname{Emb}_{q}$ and $\operatorname{Emb}_{x}$ are embedding layers of dimensionality $64$ . The prediction is computed as

Integrated gradient attribution (Sundararajan et al., 2017) is computed with $500$ steps. Attribution of Schulz et al. (2020) is computed at token level with $\beta=10/k$, where $k$ is the token embedding size. We optimized using RMSprop (Tieleman and Hinton, 2012) with learning rate $10^{-1}$ for $500$ steps. Attribution of Guan et al. (2019) is computed at token level with $\lambda=10^{-4}$ using RMSprop with learning rate $10^{-1}$ for $500$ steps. Our DiffMask is optimized for $100$ epochs using Lookahead RMSprop (Tieleman and Hinton, 2012; Zhang et al., 2019) with learning rate $10^{-2}$ for $\phi,b$ and $10^{-1}$ for $\alpha$. For these attribution methods we used our own re-implementation.

C.2 Sentiment Classification

We used the Stanford Sentiment Treebank (SST; Socher et al., 2013), available at https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip. We pre-processed the data as in Bastings et al. (2019). Training and validation sets contain $8544$ and $1101$ sentences, respectively.

For the sentiment classification experiment, we downloaded a pre-trained model (https://huggingface.co/transformers/pretrained_models.html) from the Huggingface implementation (https://github.com/huggingface/transformers) of Wolf et al. (2019), and we fine-tuned it on the SST dataset. We report hyperparameters used for training the model and our DiffMask in Table 3.

C.3 Question Answering

We used the Stanford Question Answering Dataset (SQuAD v1.1; Rajpurkar et al., 2016), available at https://rajpurkar.github.io/SQuAD-explorer. Pre-processing excluded QA pairs with more than $384$ BPE tokens to avoid memory issues. After this, we end up with $86706$ training instances and $10387$ validation instances.
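A length filter of this kind might look like the sketch below. How the $384$-token budget is counted (for instance, whether special tokens are included) is not specified above, so the sketch simply sums question and passage tokens; the dictionary keys, the function name, and the whitespace tokenizer standing in for BPE are all assumptions.

```python
def filter_by_length(examples, tokenize, max_tokens=384):
    """Keep QA pairs whose question plus passage fit within max_tokens."""
    kept = []
    for example in examples:
        n_tokens = len(tokenize(example["question"])) + len(tokenize(example["passage"]))
        if n_tokens <= max_tokens:
            kept.append(example)
    return kept


examples = [
    {"question": "who won ?", "passage": "team a won the game ."},
    {"question": "who won ?", "passage": "word " * 400},
]
# str.split is a whitespace stand-in for a real BPE tokenizer.
short_only = filter_by_length(examples, str.split)
```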

For the question answering experiment, we downloaded an already fine-tuned model from the Huggingface implementation of Wolf et al. (2019). We report hyperparameters used by them for training the original model and the ones used for our DiffMask in Table 4.

Appendix D Additional plots and results

In Figure 10 we show an overview of the variant of DiffMask to analyze the hidden states of a model (see Figure 1 to compare the two versions).

In Figure 11 we show the distribution of hidden states in the toy task where we highlight whether they belong to a state corresponding to $n,m$ or neither of them.

D.2 Sentiment Classification

In Figure 13 we show an additional example comparing attribution methods for hidden layers w.r.t. the predicted label.

As argued in the introduction and shown on the toy task, many popular methods (e.g., erasure and its approximations) are over-aggressive in discarding inputs and hidden units. Amortization is a fundamental component of DiffMask and is aimed at addressing this issue. In Figure 12 we show how our method behaves when ablating amortization and thus optimizing on a single example instead. Notably, our method converges to masking out all hidden states at any layer (Figure 12(b)). This happens as it learns an ad hoc baseline just for that example. When we ablate both amortization and baseline learning (Figure 12(c)), the method struggles to uncover any meaningful patterns. This highlights how both core components of our method are needed in combination with each other.

D.3 Question Answering

In Figure 14 we report statistics on the average number of layers that predict to keep input tokens, aggregated by POS tag. We report two additional examples of the expectation predicted by DiffMask in Figure 15.