Understanding Neural Networks through Representation Erasure

Jiwei Li, Will Monroe, Dan Jurafsky

Introduction

A long-standing criticism of neural network models is their lack of interpretability. Unlike traditional models that optimize weights on human interpretable features, neural network models operate like a black box: using vector representations (as opposed to human-interpretable features) to represent text inputs, and applying multiple layers of non-linear transformations. Mystery exists at all levels of a neural model: At input layers, what does each word vector dimension stand for? What do hidden units in intermediate levels stand for? How does the model combine meaning from different parts of the sentence, filtering the informational wheat from the chaff? How is the final decision made at the output layer? These mysteries make it hard to tell when and why a neural model makes mistakes, namely, to perform error analysis. This difficulty hinders further efforts to correct these mistakes.

In this paper, we propose a general methodology for interpreting neural network behavior by analyzing the effect of erasing pieces of the representation, to see how such changes affect a neural model’s decisions. By analyzing the harm this erasure does, we can identify important representations that significantly contribute to a model’s decision; by analyzing the benefit this erasure introduces, namely, the cases in which the removal of a representation actually improves a model’s decision, we can identify representations that a neural model inappropriately focuses its attention on, as a form of error analysis.

This erasure can be performed on various levels of representation, including input word-vector dimensions, input words or phrases, and intermediate hidden units. We apply algorithms of varying complexity for performing this erasure and analyzing the output. Most simply, we can directly compute the difference in log likelihood on gold-standard labels when representations are erased; on the more sophisticated end, we offer a reinforcement learning model to find the minimal set of words that must be erased to change the model’s decision.

The proposed framework offers interpretable explanations for various aspects of neural models: (1) how a neural model picks word-vector dimensions for linguistic feature classification (parts of speech, named entity recognition, chunking, etc.); (2) how neural models select and filter important words, phrases, and sentences in sentiment analysis; (3) why architectures like long short-term memory networks (LSTMs) perform more competitively than standard recurrent neural networks (RNNs). Most importantly, it provides an efficient and general tool to conduct error analysis that can be used on different neural architectures across various NLP applications, which has the potential to improve the effectiveness of a wide variety of NLP systems.

Related Work

Efforts to understand neural vector space models in natural language processing (NLP) date back to the earliest work, in which embeddings were visualized by low-dimensional projection ?). Recent work includes visualizing state activation [Hermans and Schrauwen (2013), Karpathy et al. (2015)], interpreting semantic dimensions by presenting humans with a list of words and asking them to choose outliers [Fyshe et al. (2015), Murphy et al. (2012)], linking dimensions with semantic lexicons or word properties [Faruqui et al. (2014), Herbelot and Vecchi (2015)], learning sparse interpretable word vectors [Faruqui et al. (2015)], and empirical studies of LSTM components [Greff et al. (2015), Chung et al. (2014)].

Each of these approaches successfully reveals a particular aspect of neural network decisions that is necessary for understanding, but each is also constrained by the scope of its applicability. ?) visualize the neural generation models from an error-analysis point of view, by analyzing predictions and errors from recurrent neural models. The approach shows the intriguing dynamics of hidden cells in LSTMs but is limited to a few manually-inspected cases such as brace opening and closing. ?) use the first-order derivative to examine the saliency of input features, but they rely on the overly strong assumption that the decision score is a linear combination of input features.

Other closely related work includes that of ?) and ?), who showed how to study unit activations (in autoencoders and CNNs, respectively) to discover novel features/word clusters. ?) train a separate generator that extracts a subset of text which leads to a similar decision to the original input to form an interpretable summary. ?) study the role of vector dimensions (for example to track sequence length) in sequence generation tasks. ?) develop an interactive system that allows users to select LSTM intermediate states and align these state changes to domain-specific structural annotations. ?) propose methods for analyzing the activation patterns of RNNs from a linguistic point of view. Methods for interpreting and visualizing neural models have also been significantly explored in vision [Vondrick et al. (2013), Vedaldi et al. (2014), Zeiler and Fergus (2014), Weinzaepfel et al. (2011), Erhan et al. (2009), Simonyan et al. (2013), Klöppel et al. (2008)], which we do not describe here for lack of space.

Attention [Bahdanau et al. (2014), Luong et al. (2015), Sukhbaatar et al. (2015), Rush et al. (2015), Xu and Saenko (2015)] provides an important way to explain the workings of neural models, at least for tasks with an alignment modeled between inputs and outputs, like machine translation or summarization. Representation erasure can be applied to attention-based models as well, and can also be used for tasks that aren’t well-modeled by attention.

Our work is also closely related to the idea of adversarial example generation [Szegedy et al. (2013), Nguyen et al. (2015)]; see Section 5.

Linking Word Vector Dimensions to Linguistic Features

While we know that vector representations encode aspects of features such as part-of-speech tags and syntactic features [Collobert et al. (2011)], it is unclear how such features are encoded and how tagging models extract the information.

To better understand how these features may be represented, we study how neural models extract information from word vector dimensions to make specific classification decisions for widely used linguistic features: part of speech (POS), named entity class (NER), chunking, prefix, suffix, word shape and word frequency. We first train classifier models on benchmarks with gold-standard labels for these features. Then we rationalize a model’s decision by analyzing the effect of erasure of input word vectors and of intermediate hidden units.

Let $M$ denote a trained neural model. Given a training example $e\in E$ with gold-standard label $c$, with $L_{e}$ denoting the index of the tag for $e$, the negative log-likelihood assigned by model $M$ to the correct label for $e$ is denoted by $S(e,c)=-\log P(L_{e}=c)$. Now let $d$ be the index of some vector dimension we are interested in exploring, and let $S(e,c,\neg d)$ denote the same quantity for the correct label of $e$ according to $M$ if dimension $d$ is erased, that is, if its value is set to 0. The importance of dimension $d$, denoted by $I(d)$, is the relative difference between $S(e,c)$ and $S(e,c,\neg d)$:
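
One way to write this out, assuming the relative difference is averaged over the examples in $E$ and signed so that a harmful erasure gives a positive score (both choices are assumptions here, not stated above), is:

$$I(d)=\frac{1}{|E|}\sum_{e\in E}\frac{S(e,c,\neg d)-S(e,c)}{S(e,c)}$$

Because $S$ carries a leading minus sign it is non-negative, so under this convention $I(d)>0$ when erasing $d$ lowers the probability of the correct label and $I(d)<0$ when erasing it helps.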

Tasks and Training

We consider two kinds of tasks: sequence tagging tasks (POS, NER, chunking) and word ontological classification tasks (prefix, suffix, sentiment, word shape, word-frequency prediction); see Appendix Table 5 for task details.

For sequence tagging tasks, the input consists of the concatenation of the vector representation of the word to tag and the representations of its neighbors (window size is set to 5). For ontology tagging tasks, the input is just the representation of the input word. We study word2vec [Mikolov et al. (2013b)] and GloVe [Pennington et al. (2014)] vectors, each 50-dimensional and pre-trained on the Gigaword-Wiki corpus. For each task, we train a four-layer neural model (an input word-embedding layer, 2 intermediate layers, and an output layer that outputs a scalar) using a structure similar to that of ?) with a tanh activation function. Each intermediate layer contains 50 hidden units. Test accuracy for each task is shown in Appendix Table 7.

Results

For each task, we take the pre-trained model, erase an input word dimension by setting its value to 0, apply the pre-trained model to the modified inputs, and apply Eq. 1 to compute the importance score of the erased dimension.
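
The erase-and-rescore loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the linear softmax classifier, the toy weights and examples, and the sign convention (relative increase in negative log-likelihood, averaged over examples) are all assumptions made for the example.

```python
import math

def softmax_nll(weights, x, label):
    """Negative log-probability of `label` under a linear softmax model."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[label]

def erase_dimension(x, d):
    """Return a copy of x with dimension d set to 0."""
    return [0.0 if i == d else v for i, v in enumerate(x)]

def dimension_importance(weights, examples, d):
    """Average relative increase in NLL when dimension d is erased
    (assumed sign convention: positive means the erasure hurts)."""
    total = 0.0
    for x, label in examples:
        s = softmax_nll(weights, x, label)
        s_erased = softmax_nll(weights, erase_dimension(x, d), label)
        total += (s_erased - s) / s
    return total / len(examples)

# Hypothetical toy model: 2 classes, 3 input dimensions.
# Dimension 0 carries most of the signal; dimension 2 is ignored by the model.
WEIGHTS = [[2.0, 0.5, 0.0], [-2.0, -0.5, 0.0]]
EXAMPLES = [([1.0, 0.2, 0.7], 0), ([-1.0, -0.1, 0.3], 1)]

scores = [dimension_importance(WEIGHTS, EXAMPLES, d) for d in range(3)]
```

In this toy setting the dominant dimension receives the largest score and the ignored dimension scores exactly zero, which is the kind of pattern a per-task importance heatmap makes visible.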

Results are shown in Figure 1. Each row corresponds to a feature classification task (e.g., POS, NER) and each column in a row signifies the importance of a word-vector dimension to the pre-trained model for that task. For word2vec vectors (shown in Figure 1a), we observe clear patterns that the model focuses more on some dimensions than others and that some tasks share important dimensions. For example, POS and chunking share dimension 34; NER, prefix and suffix share dimensions 4 and 31; etc. When applying dropout [Srivastava et al. (2014)], we can clearly see that importance is distributed more equally among different dimensions, which is intuitive since the model is forced to make use of other dimensions when the dominating dimension is dropped during training.

Things are a bit more confusing with GloVe vectors (Figure 1c): we observe a single dimension (d31) dominating across almost all tasks. Interestingly, if we remove dimension d31 and retrain the model, another dominant dimension (d26) appears (Figure 1d). Only if we remove both these dimensions (Figure 1e) can the model spread its attention to most of the other dimensions. Notably, performance does not drop after removing these two dimensions and retraining the models (as shown in Table 7 in the Appendix).

In Figure 1f, which shows the effects of using dropout, the influence of these two dimensions (26 and 31) declines dramatically in most tasks but still stands out in frequency regression, suggesting that these two dimensions are associated with word frequency. Indeed, when we rank words by dimension magnitude, Figure 3 shows a strong correlation between word frequency and the values of the 26th and 31st dimensions. Our results suggest that models trained on GloVe vectors rely on these frequency dimensions because of the usefulness of word frequency, but manage to get sufficient information from other redundant dimensions when these are eliminated. Word2vec vectors don’t contain dimensions strongly associated with frequency, presumably because tokens are omitted in proportion to word-type frequency in word2vec models [Mikolov et al. (2013a)]. These differences may explain the differing suitability of GloVe and word2vec embeddings for different NLP tasks.

Figure 2 shows importance values for hidden unit dimensions in different layers on the POS task (see Appendix Figure 6 for other tasks). The heatmap color is generally lighter in the higher layers, meaning that on higher layers importance is distributed more equally across the dimensions. In other words, neural models tend to distill information from a few important dimensions in the input layer, making the removal of these input layer dimensions more detrimental. At higher layers, however, the information is spread across different units and the importance scores are generally lower, meaning that the final classification decision is more robust to the change in any particular dimension.

Finding Important Words in Sentiment Analysis

The section above is concerned mostly with individual vector dimensions. However, for most tasks in NLP, words rather than individual dimensions function as basic units. In this section, we demonstrate how the proposed model can facilitate the understanding of neural models at the word level. We consider the Stanford Sentiment Treebank dataset [Socher et al. (2013)], which focuses on phrase/sentence level classification.

We can compute the importance of words similarly to that of word-vector dimensions, by calculating the relative change of the log-likelihood of the correct sentiment label for a text unit when a particular word is erased. The formula is exactly the same as Eq. 1, but with dimensions replaced by words.
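
As a minimal sketch of word-level erasure, the snippet below deletes one token at a time and measures the relative change in negative log-likelihood of the correct label. The lexicon-based scorer is a hypothetical stand-in for a trained RNN or LSTM, and the sign convention (positive means the erasure hurts) is an assumption.

```python
import math

# Hypothetical toy lexicon standing in for a trained sentiment model.
LEXICON = {"loved": 2.0, "great": 1.5, "boring": -2.0, "happy": 1.0}

def nll(tokens, label):
    """Negative log-probability of `label` (1 = positive, 0 = negative)."""
    score = sum(LEXICON.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return -math.log(p_pos if label == 1 else 1.0 - p_pos)

def word_importance(tokens, label):
    """Relative change in NLL when each word is erased in turn.
    Assumes each token occurs once, so tokens can serve as dictionary keys."""
    base = nll(tokens, label)
    importance = {}
    for i, tok in enumerate(tokens):
        erased = tokens[:i] + tokens[i + 1:]
        importance[tok] = (nll(erased, label) - base) / base
    return importance

# A negative review that contains a positive word in a non-evaluative context.
imp = word_importance("boring plot despite the happy ending".split(), 0)
```

Here `imp["boring"]` is positive (erasing it hurts the correct prediction), `imp["happy"]` is negative (erasing it helps), and words outside the lexicon score zero.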

We examine three models: a standard RNN with tanh activation functions, an LSTM (Uni-LSTM) and a bidirectional LSTM (Bi-LSTM), all trained on the Stanford treebank dataset. We first transform each parse tree constituent in the dataset to a sequence of tokens. Each sequence is then mapped to a phrase/sentence representation and fed to a softmax classifier. The Bi-LSTM, Uni-LSTM and standard RNN obtain accuracies of 0.526, 0.501 and 0.453, respectively, on sentence-level fine-grained classification. It is worth noting that the Bi-LSTM model achieves state-of-the-art performance in sentence-level fine-grained classification on this benchmark, significantly outperforming tree-based models, for which accuracies of 50.1% and 51.0% are reported in ?) and ?), respectively. We refer the readers to the Appendix for more details about the dataset and model training.

We present the importance scores of a few selected sentiment-indicative words in Table 3. The ranking score is computed by averaging the log-likelihood difference resulting from erasing that word across all test examples containing the word. We can see that the Bi-LSTM is more sensitive to the deletion of these sentiment indicators than the Uni-LSTM, which is in turn more sensitive than the RNN. This is presumably due to the gate structures in LSTMs that control information flow, making these architectures better at focusing on words that indicate sentiment.

The highest-ranked words by importance (computed using Eq. 1) for each model are listed in Table 1 (more comprehensive lists are presented in Table 8 in the Appendix). Figure 4 shows a histogram of all words by importance for different models. The distribution also confirms that the Bi-LSTM model is more sensitive to the sentiment-indicative words, with more words in buckets with higher importance values.

Figure 5 plots the importance score of individual words (rows) for the different models (columns) in a few specific examples of sentence-level sentiment classification. Higher values mean that the model is more sensitive to the erasing of a particular word. As can be seen, all three models attach more importance to words that are indicative of sentiment (e.g., “loved”, “entertainment”, “greatest”) and dampen the influence of other tokens. LSTM-based models generally show a clearer focus on sentiment words than standard RNN models, and they also succeed in attaching importance to intensification tokens (e.g., the exclamation mark in Figure 5b), which the RNN fails to identify.

We also notice an interesting phenomenon in Figure 5: the importance scores of words can take negative values, which means that the removal of some words actually improves the model’s decision. Such discoveries can help with error analysis on a model by identifying which words confuse the model and lead to mistakes. We therefore also list the top-ranked words by negative importance score (the words whose removal most helps the model make the correct decision). We present some of the words with the most negative importance obtained using the Bi-LSTM model in Table 2, and list comprehensive results from all three models in Tables 9, 10 and 11 in the Appendix. From these tables, we can clearly identify a few patterns that make neural models fail: (1) A common sentiment indicator word is used in a context (e.g., describing details of the movie) in which it carries no sentiment orientation, such as the word “happy” in “happy ending” (Figure 5e), or “shame” (Table 2, rank 3). (2) A sentiment indicator word is used in a specific context that turns its sentiment into the opposite of its common usage; e.g., “the smartest bonehead” (Table 2, rank 8). (3) A sentiment indicator is used in the scope of an irrealis modal, e.g., “i should be enjoying this” (Table 2, rank 10), or in an ironic context, e.g., “the best way to hope for any chance of enjoying this film is by lowering your expectations” (Table 2, rank 12). (4) A sentiment indicator is used in a concessive sentence, requiring the handling of discourse information; e.g., “revelatory” in “flat, but with a revelatory performance by michelle williams” (Table 2, rank 1), or “pleasing” in “an intermittently pleasing but mostly routine effort” (Table 2, rank 25). Resolving these problems is a long-term goal of future work in sentiment analysis.

Reinforcement Learning for Finding Decision-Changing Phrases

The analysis that we have described so far deals with individual words or dimensions. How can representation erasure help us understand the importance of larger compositional text units like phrases or sentences? We propose another technique: removing the minimum number of words to change the model’s prediction.

Our technique is closely related to adversarial example generation [Szegedy et al. (2013), Nguyen et al. (2015)], the idea of finding the minimal change to input dimensions that changes a neural network’s decision. It differs in two ways: (1) Adversarial training is usually not suited for interpreting how a model makes a decision, but rather for detecting the intrinsic flaws of the model; these adversarial examples are usually very similar to real examples (often indistinguishable by humans) but can fool the model into making a different decision. (2) Words are a basic unit in NLP; because changing dimensions may harm text integrity (e.g., break the language model), our model removes words rather than dimensions, making our proposed method discrete, in contrast to the continuous method of adversarial example generation.

More formally, let $e$ denote an input text unit consisting of a sequence of words, $e=\{w_{1},w_{2},...,w_{N}\}$, where $N$ denotes the number of words in $e$, and let $L_{e}$ denote the index of the label that $M$ gives to $e$. The task is to discover a minimal subset of $e$, denoted by $D\subset e$, such that the removal of all words in $D$ from $e$ (the remaining words are denoted by $e-D$) will change the label $L_{e}$. Let $|D|$ denote the number of words in $D$. The problem is formalized as follows:
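
Transcribing that description directly into symbols (the constrained-minimization layout is an assumed presentation) gives:

$$\min_{D\subset e}\;|D|\quad\text{s.t.}\quad L_{e-D}\neq L_{e}$$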

Finding the optimal solution requires enumerating all different word combinations, which is computationally intractable when the number of words in $e$ gets large. To address this issue, we propose a strategy based on reinforcement learning to find an approximate solution.
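
To make the cost of the exact search concrete, the sketch below enumerates subsets in order of increasing size and stops at the first one whose removal changes the label. The lexicon classifier `toy_label` is a hypothetical stand-in for $M$; in the worst case the loop visits all $2^{N}-1$ non-empty subsets.

```python
from itertools import combinations

def toy_label(tokens):
    """Hypothetical stand-in for model M: label 1 iff the lexicon score is > 0."""
    lexicon = {"clean": 1.0, "friendly": 1.0, "noisy": -1.5}
    return 1 if sum(lexicon.get(t, 0.0) for t in tokens) > 0 else 0

def minimal_erasure(tokens, label_fn):
    """Smallest tuple of positions whose removal changes label_fn's output, or None."""
    original = label_fn(tokens)
    for size in range(1, len(tokens) + 1):          # smallest subsets first
        for positions in combinations(range(len(tokens)), size):
            kept = [t for i, t in enumerate(tokens) if i not in positions]
            if label_fn(kept) != original:
                return positions
    return None

tokens = "clean room friendly staff noisy street".split()
removed = minimal_erasure(tokens, toy_label)
```

For this six-word input a single erasure suffices, so the search ends immediately; for a 120-word review that needs several words removed, the number of candidate subsets grows combinatorially, which motivates the approximate approach.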

Given a pre-trained sentiment classification model $M$, an input example $e$, and the label $L_{e}$ that $M$ gives to $e$, we define a policy $\pi$ over a binary variable $z_{t}$, indicating whether a word $w_{t}\in e$ should be removed. $z_{t}$ takes the value 1 when $w_{t}$ is removed and 0 otherwise. The policy model takes as input the representation associated with word $w_{t}$ at the current time step, as output by model $M$, and defines a binary distribution $\pi$ over $z_{t}$. The policy model examines every word in $e$ and decides whether the word should be kept or removed. Let $D$ be the set of removed words. After the policy model finishes removing words from $e$, the pre-trained sentiment model $M$ gives another label ${L}_{e-D}$ to the remaining words $e-D$.

To train the policy model, a reward function is necessary. The policy model receives a reward of $1$ if the label is changed, i.e., ${L}_{e-D}\neq{L}_{e}$, and $0$ if the label remains the same. Since we not only want the label to be changed, but also want to find the minimal set of words to change the label, the reward is scaled by the number of words that are removed. This means that removing more words is rewarded less than removing fewer words if both change the classification label. We therefore propose the following reward:
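
A reward with these properties, writing it as $r(e,D)$ and assuming the scaling is a simple division by the number of removed words (both the symbol and the exact scaling are assumptions), would be:

$$r(e,D)=\frac{1}{|D|}\,\mathbb{1}\!\left[L_{e-D}\neq L_{e}\right]$$

with $r(e,D)=0$ when no word is removed. Flipping the label by erasing one word then earns 1, while flipping it by erasing four words earns 0.25.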

We also add a regularizer that encourages similar values of $z$ for words within the same sentence to encourage (or discourage) leaving out contiguous phrases:
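
A penalty of this kind can be sketched as a sum of absolute differences between neighboring decisions inside each sentence; the symbol $\Omega$, the weight $\gamma$, and the absolute-difference form are assumptions made for illustration:

$$\Omega=\gamma\sum_{s\in S}\;\sum_{t\in s,\;t-1\in s}\left|z_{t}-z_{t-1}\right|$$

The inner sum is zero when a sentence is kept or erased as a whole and grows with every switch between keeping and erasing.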

where $S$ denotes the collection of sentences obtained by breaking up the input $e$. The idea is inspired by group lasso [Meier et al. (2008)], which has been widely employed in many NLP tasks, such as document classification [Yogatama and Smith (2014)] and providing rationales for neural model interpretation [Lei et al. (2016)]. The final reward is then:
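
Assuming the regularizer enters as a penalty subtracted from the label-change reward, and writing $r(e,D)$ for that reward, $\Omega$ for the regularizer, and $R(e,D)$ for the result (all three symbols are assumed names), the combination would read:

$$R(e,D)=r(e,D)-\Omega$$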

The system is trained to maximize the expected reward of the sequence of erasing/not-erasing decisions:
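
Written out with assumed notation, where $\theta$ stands for the parameters of the policy $\pi$ and $R(e,D)$ for the final reward, an objective of this kind (standing in for the expression the text calls (6)) takes the usual policy-gradient form:

$$J(\theta)=\mathbb{E}_{z_{1},\dots,z_{N}\sim\pi_{\theta}}\left[R(e,D)\right]$$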

The gradient of (6) is approximated using the likelihood ratio trick [Williams (1992), Glynn (1990), 1], in which for a given $e$, we sample a sequence of decisions based on $\pi$, compute the associated reward and backpropagate gradients to update $\pi$, which can be summarized as follows:
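
In the standard likelihood-ratio form, and with $\theta$ (policy parameters) and $R(e,D)$ (final reward) as assumed symbol names, the single-sample estimate is:

$$\nabla_{\theta}J(\theta)\approx\sum_{t=1}^{N}\nabla_{\theta}\log\pi_{\theta}(z_{t}\mid e)\,\left[R(e,D)-b(e)\right]$$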

Here $b(e)$ denotes a baseline value, used to reduce the variance of the estimate while keeping it unbiased. To estimate the baseline value, we train another neural network model to estimate the reward of input $e$ under the current policy $\pi$, similar to ?).

The policy model is trained to interpret the pre-trained sentiment classification model. Therefore, during the RL training, the original sentiment model is kept fixed.
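
The training loop can be illustrated with a deliberately small sketch. Everything here is a simplification introduced for the example: the frozen classifier is a toy lexicon scorer, the policy is tabular (one logit per word of a single input) rather than a network over the classifier's representations, the reward is assumed to be $1/|D|$ on a label flip, the regularizer is omitted, and the baseline is a running mean of past rewards instead of a learned network.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def toy_label(tokens):
    """Hypothetical frozen classifier M: label 1 iff the lexicon score is > 0."""
    lexicon = {"clean": 1.0, "friendly": 1.0, "noisy": -1.5}
    return 1 if sum(lexicon.get(t, 0.0) for t in tokens) > 0 else 0

def reward(tokens, z, label_fn):
    """Assumed reward: 1/|D| if erasing the words with z_t = 1 flips the label, else 0."""
    removed = sum(z)
    kept = [t for t, z_t in zip(tokens, z) if z_t == 0]
    if removed == 0 or label_fn(kept) == label_fn(tokens):
        return 0.0
    return 1.0 / removed

def reinforce_step(theta, z, r, baseline, lr):
    """One likelihood-ratio update; d/d(theta_t) log pi(z_t) = z_t - sigmoid(theta_t)."""
    return [th + lr * (r - baseline) * (z_t - sigmoid(th)) for th, z_t in zip(theta, z)]

def train(tokens, label_fn, steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(tokens)   # tabular policy: one removal logit per word
    baseline = 0.0                # running mean of past rewards, standing in for b(e)
    for step in range(1, steps + 1):
        z = [1 if rng.random() < sigmoid(th) else 0 for th in theta]
        r = reward(tokens, z, label_fn)
        theta = reinforce_step(theta, z, r, baseline, lr)
        baseline += (r - baseline) / step
    return [sigmoid(th) for th in theta]

tokens = "clean room friendly staff noisy street".split()
removal_probs = train(tokens, toy_label)
```

Note that `toy_label` is only ever called, never updated, mirroring the fixed sentiment model; `removal_probs` holds the learned per-word probability of erasure.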

Inspired by recent visualization work from ?), we focus on the task of document-level aspect rating prediction [Tang et al. (2015a), Tang et al. (2015b)]. We collected hotel reviews from TripAdvisor. The dataset contains roughly 870,000 reviews with an average length of 120 words. Each review contains rating scores (integers from 1 to 5) for different aspects of the hotel, such as service, cleanliness, location, rooms, etc. We choose the aspect sentiment classification task because each review might contain diverse sentiments towards different aspects, and it is interesting to see how a model manages or fails to identify these different aspects and their associated scores when entangled with other aspects. We focus on four aspects: value, rooms, service and location.

Since the sentiment correlation between any pair of aspects (and the overall score) is high, which may confuse the model, we employ a strategy similar to that of ?) to pick less correlated examples. For a given aspect, we pick the 50,000 reviews for which the score of this aspect deviates the most from the mean of the other aspects. We use two different models to map input reviews to vector representations: a vanilla Bi-LSTM and a memory-network structure [Sukhbaatar et al. (2015)] similar to ?) with attention at both word level and sentence level. Model accuracies are shown in Appendix Table 6.
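
The selection rule can be sketched as follows. The dictionary-per-review representation and the use of absolute deviation are assumptions made for illustration; the text above only says that the aspect score should deviate most from the mean of the other aspects.

```python
def select_decorrelated(reviews, aspect, k):
    """Return the k reviews whose `aspect` score deviates most (in absolute value)
    from the mean score of the remaining aspects."""
    def deviation(review):
        others = [score for name, score in review.items() if name != aspect]
        return abs(review[aspect] - sum(others) / len(others))
    return sorted(reviews, key=deviation, reverse=True)[:k]

# Hypothetical toy reviews with the four aspects studied here.
REVIEWS = [
    {"value": 5, "rooms": 5, "service": 5, "location": 5},
    {"value": 1, "rooms": 5, "service": 4, "location": 5},
    {"value": 4, "rooms": 3, "service": 3, "location": 4},
]

picked = select_decorrelated(REVIEWS, "value", 2)
```

The uniformly rated first review is dropped, since its value score tells the model nothing that the other aspects do not.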

The representation is then fed to a 5-class softmax function. Given a trained $M$, we then train (with RL) a policy to discover the minimal set of words to erase to flip the model’s classification decision.

Results

Sample results are presented in Table 4. The reinforcement learning model identifies aspect-specific sentiment phrases, providing a rationale for why the sentiment model makes a certain decision. By comparing Table 4a with Table 4b, we can see that the reinforcement model trained on the memory-based model offers better interpretability than the one trained on LSTMs. The latter model not only requires erasing more words to flip the model’s decision, but also sometimes deletes passages describing different aspects or overall sentiment. Since the RL model is trained on the representations output by the sentiment model, better interpretability of the RL model indicates the superiority of the memory-based sentiment model.

Conclusion

In this paper, we propose a general methodology for interpreting neural network decisions by analyzing the effect of erasing particular representations. By analyzing the harm this erasure does, the proposed framework offers many interpretable explanations for various aspects of neural models; by analyzing the benefit this erasure introduces, namely, the cases in which the removal of a representation actually improves a model’s decision, the framework provides a way to conduct error analysis on neural model decisions, which has the potential to benefit a wide variety of models and tasks.

References

Appendix

POS Tagging: Each word is associated with a unique tag that indicates its syntactic role, such as plural noun, adverb, etc. We follow the standard Penn Treebank split, using sections 0-18/19-21/22-24 as training/dev/test sets, respectively.

NER Tagging: Each word is associated with a named entity tag, such as “person” or “location”. We evaluate on the CoNLL-2003 shared benchmark dataset for NER [Tjong Kim Sang and De Meulder (2003)].

Chunking: Each word is assigned only one unique tag, encoded as a begin-chunk (e.g. B-NP) or inside-chunk tag (e.g. I-NP). We use the CoNLL-2000 dataset, in which sections 15-18 of WSJ data are used for training and section 20 for testing. Validation is performed by splitting the training set.

Prefix and Suffix: Words are segmented using the Morfessor package [Creutz and Lagus (2007)]. We retained the 250 most frequent prefixes and suffixes. Other than “s” and numbers, single characters are discarded. We kept a list of the 200,000 most frequent words, 51,083 of which are matched with a prefix and 78,804 of which are matched with a suffix. We split words into train/dev/test sets in the ratio 0.8/0.1/0.1.

Sentiment: We use the MPQA subjectivity lexicon [Deng and Wiebe (2015), Wilson et al. (2005)], which consists of roughly 8,000 entries.

Word shape: Words are mapped to X, XX, XXX, etc. based on the number of characters they contain.

Word frequency: The number of word occurrences is computed using a Wikipedia dump and is then mapped to log space. Unlike all the others, which are multi-class classification tasks, word-frequency prediction is a regression task: the model minimizes the mean squared error in predicting the log frequency of each word.

A summary of the datasets is given in Table 5. Test accuracy/error for the different training strategies presented in Figure 1 is shown in Table 7. For classification tasks (i.e., POS, NER, Chunking, Prefix, Suffix, Sentiment, Word Shape), we report accuracy; higher values of accuracy are better. For the regression task (Frequency), we report the Mean Squared Loss (loss for short); lower values of loss are better.

Stanford Sentiment Treebank and Training Detail

The Stanford Sentiment Treebank is a benchmark dataset widely used for neural model evaluations. The dataset contains gold-standard sentiment labels for every parse tree constituent, from sentences to phrases to individual words, for a total of 215,154 phrases in 11,855 sentences. The task is to perform both fine-grained (very positive, positive, neutral, negative and very negative) and coarse-grained (positive vs. negative) classification at both the phrase and sentence level.

Aspect Rating Prediction

The results for aspect rating prediction using the two models, along with other baselines, are shown in Table 6. Feature-based SVM models are trained using the SVM-light package [Joachims (2002)]. LSTM-based models do not perform as competitively as simple bigram-based classification models in aspect classification tasks, which has also been observed in ?).