Rationalizing Neural Predictions

Tao Lei, Regina Barzilay, Tommi Jaakkola

Introduction

Many recent advances in NLP problems have come from formulating and training expressive and elaborate neural models. This includes models for sentiment classification, parsing, and machine translation among many others. The gains in accuracy have, however, come at the cost of interpretability since complex neural models offer little transparency concerning their inner workings. In many applications, such as medicine, predictions are used to drive critical decisions, including treatment options. It is necessary in such cases to be able to verify and understand the underlying basis for the decisions. Ideally, complex neural models would not only yield improved performance but would also offer interpretable justifications – rationales – for their predictions.

In this paper, we propose a novel approach to incorporating rationale generation as an integral part of the overall learning problem. We limit ourselves to extractive (as opposed to abstractive) rationales. From this perspective, our rationales are simply subsets of the words from the input text that satisfy two key properties. First, the selected words represent short and coherent pieces of text (e.g., phrases) and, second, the selected words must alone suffice for prediction as a substitute for the original text. More concretely, consider the task of multi-aspect sentiment analysis. Figure 1 illustrates a product review along with user ratings in terms of two categories or aspects. If the model in this case predicts a five-star rating for color, it should also identify the phrase “a very pleasant ruby red-amber color” as the rationale underlying this decision.

In most practical applications, rationale generation must be learned entirely in an unsupervised manner. We therefore assume that our model with rationales is trained on the same data as the original neural models, without access to additional rationale annotations. In other words, target rationales are never provided during training; the intermediate step of rationale generation is guided only by the two desiderata discussed above. Our model is composed of two modular components that we call the generator and the encoder. The generator specifies a distribution over possible rationales (extracted text) and the encoder maps any such text to task-specific target values. They are trained jointly to minimize a cost function that favors short, concise rationales while enforcing that the rationales alone suffice for accurate prediction.

The notion of what counts as a rationale may be ambiguous in some contexts and the task of selecting rationales may therefore be challenging to evaluate. We focus on two domains where ambiguity is minimal (or can be minimized). The first scenario concerns multi-aspect sentiment analysis, exemplified by the beer review corpus [McAuley et al., 2012]. A smaller test set in this corpus identifies, for each aspect, the sentence(s) that relate to this aspect. We can therefore directly evaluate our predictions on the sentence level, with the caveat that our model makes selections on a finer level, in terms of words, not complete sentences. The second scenario concerns the problem of retrieving similar questions. The extracted rationales should capture the main purpose of the questions. We can therefore evaluate the quality of rationales as a compressed proxy for the full text in terms of retrieval performance. Our model achieves high performance on both tasks. For instance, on the sentiment prediction task, our model achieves extraction accuracy of 96%, as compared to 38% and 81% obtained by the bigram SVM and a neural attention baseline.

Related Work

Developing sparse interpretable models is of considerable interest to the broader research community [Letham et al., 2015, Kim et al., 2015]. The need for interpretability is even more pronounced with recent neural models. Efforts in this area include analyzing and visualizing state activation [Hermans and Schrauwen, 2013, Karpathy et al., 2015, Li et al., 2016], learning sparse interpretable word vectors [Faruqui et al., 2015b], and linking word vectors to semantic lexicons or word properties [Faruqui et al., 2015a, Herbelot and Vecchi, 2015].

Beyond learning to understand or further constrain the network to be directly interpretable, one can estimate interpretable proxies that approximate the network. Examples include extracting “if-then” rules [Thrun, 1995] and decision trees [Craven and Shavlik, 1996] from trained networks. More recently, ?) propose a model-agnostic framework where the proxy model is learned only for the target sample (and its neighborhood) thus ensuring locally valid approximations. Our work differs from these both in terms of what is meant by an explanation and how they are derived. In our case, an explanation consists of a concise yet sufficient portion of the text where the mechanism of selection is learned jointly with the predictor.

Attention-based models offer another means to explicate the inner workings of neural models [Bahdanau et al., 2015, Cheng et al., 2016, Martins and Astudillo, 2016, Chen et al., 2015, Xu and Saenko, 2015, Yang et al., 2015]. Such models have been successfully applied to many NLP problems, improving both prediction accuracy and visualization and interpretability [Rush et al., 2015, Rocktäschel et al., 2016, Hermann et al., 2015]. Xu et al. (2015) introduced a stochastic attention mechanism together with a more standard soft attention on the image captioning task. Our rationale extraction can be understood as a type of stochastic attention, although architectures and objectives differ. Moreover, we compartmentalize rationale generation from downstream encoding so as to expose knobs to directly control the types of rationales that are acceptable, and to facilitate broader modular use in other applications.

Finally, we contrast our work with rationale-based classification [Zaidan et al., 2007, Marshall et al., 2015, Zhang et al., 2016], which seeks to improve prediction by relying on richer annotations in the form of human-provided rationales. In our work, rationales are never given during training. The goal is to learn to generate them.

Extractive Rationale Generation

In extractive rationale generation, our goal is to select a subset of the input sequence as a rationale. In order for the subset to qualify as a rationale it should satisfy two criteria: 1) the selected words should be interpretable and 2) they ought to suffice to reach nearly the same prediction (target vector) as the original input. In other words, a rationale must be short and sufficient. We will assume that a short selection is interpretable and focus on optimizing sufficiency under cardinality constraints.

We encapsulate the selection of words as a rationale generator, which is another parameterized mapping $\mathbf{gen}(\mathbf{x})$ from input sequences to shorter sequences of words. Thus $\mathbf{gen}(\mathbf{x})$ must include only a few words and $\mathbf{enc}(\mathbf{gen}(\mathbf{x}))$ should result in nearly the same target vector as the original input passed through the encoder, i.e., $\mathbf{enc}(\mathbf{x})$. We can think of the generator as a tagging model where each word in the input receives a binary tag indicating whether it is selected to be included in the rationale. In our case, the generator is probabilistic and specifies a distribution over possible selections.

The rationale generation task is entirely unsupervised in the sense that we assume no explicit annotations about which words should be included in the rationale. Put another way, the rationale is introduced as a latent variable, a constraint that guides how to interpret the input sequence. The encoder and generator are trained jointly, in an end-to-end fashion so as to function well together.

Encoder and Generator

We use multi-aspect sentiment prediction as a guiding example to instantiate the two key components – the encoder and the generator. The framework itself generalizes to other tasks.

The encoder could be realized in many ways such as a recurrent neural network. For example, let $\mathbf{h}_{t}=f_{e}(\mathbf{x}_{t},\mathbf{h}_{t-1})$ denote a parameterized recurrent unit mapping input word $\mathbf{x}_{t}$ and previous state $\mathbf{h}_{t-1}$ to next state $\mathbf{h}_{t}$. The target vector is then generated on the basis of the final state reached by the recurrent unit after processing all the words in the input sequence. Specifically,

$$\tilde{\mathbf{y}}=\sigma_{e}(\mathbf{W}^{e}\mathbf{h}_{l}+\mathbf{b}^{e})$$

where $l$ is the length of the input sequence and $\tilde{\mathbf{y}}$ is the predicted target vector.

Generator $\mathbf{gen}(\cdot)$:

The rationale generator extracts a subset of text from the original input $\mathbf{x}$ to function as an interpretable summary. Thus the rationale for a given sequence $\mathbf{x}$ can be equivalently defined in terms of binary variables $\{\mathbf{z}_{1},\cdots,\mathbf{z}_{l}\}$ where each $\mathbf{z}_{t}\in\{0,1\}$ indicates whether word $\mathbf{x}_{t}$ is selected or not. From here on, we will use $\mathbf{z}$ to specify the binary selections and thus $(\mathbf{z},\mathbf{x})$ is the actual rationale generated (selections, input). We will use generator $\mathbf{gen}(\mathbf{x})$ as synonymous with a probability distribution over binary selections, i.e., $\mathbf{z}\sim\mathbf{gen}(\mathbf{x})\equiv p(\mathbf{z}|\mathbf{x})$ where the length of $\mathbf{z}$ varies with the input $\mathbf{x}$.

In a simple generator, the probability that the $t^{th}$ word is selected can be assumed to be conditionally independent from other selections given the input $\mathbf{x}$. That is, the joint probability $p(\mathbf{z}|\mathbf{x})$ factors according to

$$p(\mathbf{z}|\mathbf{x})=\prod_{t=1}^{l}p(\mathbf{z}_{t}|\mathbf{x})$$

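The component distributions $p(\mathbf{z}_{t}|\mathbf{x})$ can be modeled using a shared bi-directional recurrent neural network. Specifically, let $\overrightarrow{f}()$ and $\overleftarrow{f}()$ be the forward and backward recurrent units, respectively; then

$$\begin{aligned}
\overrightarrow{\mathbf{h}}_{t} &= \overrightarrow{f}(\mathbf{x}_{t},\overrightarrow{\mathbf{h}}_{t-1})\\
\overleftarrow{\mathbf{h}}_{t} &= \overleftarrow{f}(\mathbf{x}_{t},\overleftarrow{\mathbf{h}}_{t+1})\\
p(\mathbf{z}_{t}|\mathbf{x}) &= \sigma_{z}(\mathbf{W}^{z}[\overrightarrow{\mathbf{h}}_{t};\overleftarrow{\mathbf{h}}_{t}]+\mathbf{b}^{z})
\end{aligned}$$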
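As an illustration, sampling from the independent generator can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the per-word probabilities $p(\mathbf{z}_{t}=1|\mathbf{x})$ are made-up numbers rather than outputs of a bi-directional network, and the function names `sample_selection`, `log_prob` and `extract` are hypothetical.

```python
import math
import random


def sample_selection(probs, rng):
    """Draw z_t ~ Bernoulli(p(z_t = 1 | x)) independently for every word."""
    return [1 if rng.random() < p else 0 for p in probs]


def log_prob(z, probs):
    """log p(z | x) = sum_t log p(z_t | x) under the independence assumption."""
    return sum(math.log(p if z_t == 1 else 1.0 - p) for z_t, p in zip(z, probs))


def extract(words, z):
    """Return the rationale, i.e. the words whose selection variable is 1."""
    return [w for w, z_t in zip(words, z) if z_t == 1]


words = ["a", "very", "pleasant", "ruby", "red-amber", "color"]
# Hypothetical selection probabilities (illustrative values only).
probs = [0.1, 0.6, 0.9, 0.8, 0.8, 0.9]

rng = random.Random(0)
z = sample_selection(probs, rng)
print(z, extract(words, z), log_prob(z, probs))
```

The log-probability of a sampled selection is the quantity whose gradient is needed later when the generator is trained by sampling.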

Independent but context-dependent selection of words is often sufficient. However, the model is unable to select phrases or refrain from selecting the same word again if already chosen. To this end, we also introduce a dependent selection of words,

$$p(\mathbf{z}|\mathbf{x})=\prod_{t=1}^{l}p(\mathbf{z}_{t}|\mathbf{x},\mathbf{z}_{1}\cdots\mathbf{z}_{t-1})$$

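which can also be expressed as a recurrent neural network. To this end, we introduce another hidden state $\mathbf{s}_{t}$ whose role is to couple the selections. For example,

$$\begin{aligned}
p(\mathbf{z}_{t}|\mathbf{x},\mathbf{z}_{1}\cdots\mathbf{z}_{t-1}) &= \sigma_{z}(\mathbf{W}^{z}[\overrightarrow{\mathbf{h}}_{t};\overleftarrow{\mathbf{h}}_{t};\mathbf{s}_{t-1}]+\mathbf{b}^{z})\\
\mathbf{s}_{t} &= f_{z}([\overrightarrow{\mathbf{h}}_{t};\overleftarrow{\mathbf{h}}_{t};\mathbf{z}_{t}],\mathbf{s}_{t-1})
\end{aligned}$$

where $\overrightarrow{\mathbf{h}}_{t}$ and $\overleftarrow{\mathbf{h}}_{t}$ are the forward and backward states of the bi-directional network.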

Joint objective:

A rationale in our definition corresponds to the selected words, i.e., $\{\mathbf{x}_{k}|\mathbf{z}_{k}=1\}$ . We will use $(\mathbf{z},\mathbf{x})$ as the shorthand for this rationale and, thus, $\mathbf{enc}(\mathbf{z},\mathbf{x})$ refers to the target vector obtained by applying the encoder to the rationale as the input. Our goal here is to formalize how the rationale can be made short and meaningful yet function well in conjunction with the encoder. Our generator and encoder are learned jointly to interact well but they are treated as independent units for modularity.

The generator is guided in two ways during learning. First, the rationale that it produces must suffice as a replacement for the input text. In other words, the target vector (sentiment) arising from the rationale should be close to the gold sentiment. The corresponding loss function is given by

$$\mathcal{L}(\mathbf{z},\mathbf{x},\mathbf{y})=\left\|\mathbf{enc}(\mathbf{z},\mathbf{x})-\mathbf{y}\right\|_{2}^{2}$$

Note that the loss function depends directly (parametrically) on the encoder but only indirectly on the generator via the sampled selection.

Second, we must guide the generator to realize short and coherent rationales. It should select only a few words and those selections should form phrases (consecutive words) rather than represent isolated, disconnected words. We therefore introduce an additional regularizer over the selections

$$\Omega(\mathbf{z})=\lambda_{1}\|\mathbf{z}\|+\lambda_{2}\sum_{t}\left|\mathbf{z}_{t}-\mathbf{z}_{t-1}\right|$$

where the first term penalizes the number of selections while the second one discourages transitions (encourages continuity of selections). Note that this regularizer also depends on the generator only indirectly via the selected rationale. This is because it is easier to assess the rationale once produced rather than directly guide how it is obtained.
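The effect of the two regularization terms can be illustrated with a short Python sketch. The function name `omega` is hypothetical, the weights are illustrative values, and the transition sum is taken over adjacent pairs starting from the second word, which is an assumption about boundary handling.

```python
def omega(z, lambda_1, lambda_2):
    """Sparsity term plus continuity term over a binary selection z.

    The first term counts selected words; the second counts 0/1 transitions
    between neighbouring positions (summed from the second word onwards).
    """
    n_selected = sum(z)
    n_transitions = sum(abs(z[t] - z[t - 1]) for t in range(1, len(z)))
    return lambda_1 * n_selected + lambda_2 * n_transitions


contiguous = [0, 1, 1, 1, 0, 0]  # one three-word phrase
scattered = [1, 0, 1, 0, 1, 0]   # three isolated words

# Both selections pick three words, but the scattered one has more transitions.
print(omega(contiguous, 0.01, 0.02))
print(omega(scattered, 0.01, 0.02))
```

Both selections pay the same sparsity penalty, while the scattered selection incurs five transitions against two for the contiguous phrase, so the regularizer prefers the phrase.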

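Our final cost function is the combination of the two, $\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})=\mathcal{L}(\mathbf{z},\mathbf{x},\mathbf{y})+\Omega(\mathbf{z})$. Since the selections are not provided during training, we minimize the expected cost:

$$\min_{\theta_{e},\theta_{g}}\sum_{(\mathbf{x},\mathbf{y})\in D}\mathbb{E}_{\mathbf{z}\sim\mathbf{gen}(\mathbf{x})}\left[\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\right]$$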

where $\theta_{e}$ and $\theta_{g}$ denote the set of parameters of the encoder and generator, respectively, and $D$ is the collection of training instances. Our joint objective encourages the generator to compress the input text into coherent summaries that work well with the associated encoder it is trained with.

Minimizing the expected cost is challenging since it involves summing over all the possible choices of rationales $\mathbf{z}$ . This summation could potentially be made feasible with additional restrictive assumptions about the generator and encoder. However, we assume only that it is possible to efficiently sample from the generator.

Doubly stochastic gradient

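We now derive a sampled approximation to the gradient of the expected cost objective. This sampled approximation is obtained separately for each input text $\mathbf{x}$ so as to work well with an overall stochastic gradient method. Consider therefore a training pair $(\mathbf{x},\mathbf{y})$. For the parameters of the generator $\theta_{g}$,

$$\frac{\partial\,\mathbb{E}_{\mathbf{z}\sim\mathbf{gen}(\mathbf{x})}\left[\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\right]}{\partial\theta_{g}}=\sum_{\mathbf{z}}\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\cdot\frac{\partial p(\mathbf{z}|\mathbf{x})}{\partial\theta_{g}}=\sum_{\mathbf{z}}\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\cdot\frac{\partial p(\mathbf{z}|\mathbf{x})}{\partial\theta_{g}}\cdot\frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}$$

Using the fact $(\log f(\theta))^{\prime}=f^{\prime}(\theta)/f(\theta)$, we get

$$\sum_{\mathbf{z}}\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\cdot\frac{\partial p(\mathbf{z}|\mathbf{x})}{\partial\theta_{g}}\cdot\frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}=\sum_{\mathbf{z}}\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\cdot\frac{\partial\log p(\mathbf{z}|\mathbf{x})}{\partial\theta_{g}}\cdot p(\mathbf{z}|\mathbf{x})=\mathbb{E}_{\mathbf{z}\sim\mathbf{gen}(\mathbf{x})}\left[\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\,\frac{\partial\log p(\mathbf{z}|\mathbf{x})}{\partial\theta_{g}}\right]$$

The last term is the expected gradient where the expectation is taken with respect to the generator distribution over rationales $\mathbf{z}$. Therefore, we can simply sample a few rationales $\mathbf{z}$ from the generator $\mathbf{gen}(\mathbf{x})$ and use the resulting average gradient in an overall stochastic gradient method. A sampled approximation to the gradient with respect to the encoder parameters $\theta_{e}$ can be derived similarly,

$$\frac{\partial\,\mathbb{E}_{\mathbf{z}\sim\mathbf{gen}(\mathbf{x})}\left[\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})\right]}{\partial\theta_{e}}=\sum_{\mathbf{z}}\frac{\partial\,\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})}{\partial\theta_{e}}\cdot p(\mathbf{z}|\mathbf{x})=\mathbb{E}_{\mathbf{z}\sim\mathbf{gen}(\mathbf{x})}\left[\frac{\partial\,\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})}{\partial\theta_{e}}\right]$$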
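The sampled gradient for the generator can be checked numerically on a toy problem. The sketch below is illustrative only and rests on several assumptions: the generator is an independent Bernoulli model whose parameters are the per-word logits themselves (so $\partial\log p(\mathbf{z}|\mathbf{x})/\partial\text{logit}_{t}=\mathbf{z}_{t}-p_{t}$), the function `cost` is a made-up stand-in for $\text{cost}(\mathbf{z},\mathbf{x},\mathbf{y})$, and all names are hypothetical. With only four words the exact expectation can be enumerated and compared against the sampled estimate.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def cost(z):
    """Toy stand-in for cost(z, x, y): prefers selecting words 1 and 2 only."""
    target = [0, 1, 1, 0]
    loss = sum((a - b) ** 2 for a, b in zip(z, target)) / len(z)
    return loss + 0.05 * sum(z)


def all_selections(length):
    """Enumerate all 2^length binary selections."""
    return [[(i >> t) & 1 for t in range(length)] for i in range(2 ** length)]


def prob(z, probs):
    """p(z | x) for independent Bernoulli selections."""
    out = 1.0
    for z_t, p in zip(z, probs):
        out *= p if z_t == 1 else 1.0 - p
    return out


def expected_cost(logits):
    probs = [sigmoid(a) for a in logits]
    return sum(prob(z, probs) * cost(z) for z in all_selections(len(logits)))


def exact_gradient(logits):
    """E_z[cost(z) * d log p(z|x) / d logit_t], summed over every selection."""
    probs = [sigmoid(a) for a in logits]
    grad = [0.0] * len(logits)
    for z in all_selections(len(logits)):
        weight = prob(z, probs) * cost(z)
        for t in range(len(logits)):
            grad[t] += weight * (z[t] - probs[t])
    return grad


def sampled_gradient(logits, n_samples, rng):
    """Average of cost(z) * d log p(z|x) / d logit_t over sampled selections."""
    probs = [sigmoid(a) for a in logits]
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        z = [1 if rng.random() < p else 0 for p in probs]
        c = cost(z)
        for t in range(len(logits)):
            grad[t] += c * (z[t] - probs[t]) / n_samples
    return grad


logits = [0.0, 0.5, -0.5, 1.0]
print(exact_gradient(logits))
print(sampled_gradient(logits, 20000, random.Random(0)))
```

Enumeration is feasible here only because the toy input has four words; for realistic inputs the number of selections grows as $2^{l}$, which is why sampling is used.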

Choice of recurrent unit

We employ recurrent convolution (RCNN), a refinement of local n-gram-based convolution. RCNN attempts to learn n-gram features that are not necessarily consecutive, and averages features in a dynamic (recurrent) fashion. Specifically, for bigrams (filter width $n=2$) RCNN computes $\mathbf{h}_{t}=f(\mathbf{x}_{t},\mathbf{h}_{t-1})$ as follows

RCNN has been shown to work remarkably well in classification and retrieval applications [Lei et al., 2015, Lei et al., 2016] compared to other alternatives such as CNNs and LSTMs. We use it for all the recurrent units introduced in our model.

Experiments

We evaluate the proposed joint model on two NLP applications: (1) multi-aspect sentiment analysis on product reviews and (2) similar text retrieval on the AskUbuntu question answering forum.

We use the BeerAdvocate (www.beeradvocate.com) review dataset used in prior work [McAuley et al., 2012] (http://snap.stanford.edu/data/web-BeerAdvocate.html). This dataset contains 1.5 million reviews written by the website users. The reviews are naturally multi-aspect: each of them contains multiple sentences describing the overall impression or one particular aspect of a beer, including appearance, smell (aroma), palate and taste. In addition to the written text, the reviewer provides the ratings (on a scale of 0 to 5 stars) for each aspect as well as an overall rating. The ratings can be fractional (e.g. 3.5 stars), so we normalize the scores to $[0,1]$ and use them as the (only) supervision for regression.

McAuley et al. (2012) also provided sentence-level annotations on around 1,000 reviews. Each sentence is annotated with one (or multiple) aspect labels, indicating what aspect this sentence covers. We use this set as our test set to evaluate the precision of words in the extracted rationales.

Table 1 shows several statistics of the beer review dataset. The sentiment correlation between any pair of aspects (and the overall score) is quite high, 63.5% on average with a maximum of 79.1% (between the taste and overall score). If trained directly on this set, the model can be confused by such strong correlation. We therefore perform a preprocessing step, picking “less correlated” examples from the dataset. Specifically, for each aspect we train a simple linear regression model to predict the rating of this aspect given the ratings of the other four aspects. We then keep picking reviews with the largest prediction error until the sentiment correlation in the selected subset increases dramatically. This gives us a de-correlated subset for each aspect, each containing about 80k to 90k reviews. We use 10k as the development set. We focus on three aspects since the fourth aspect, taste, still has $>50\%$ correlation with the overall sentiment.

Sentiment Prediction

Before training the joint model, it is worth assessing the neural encoder separately to check how accurately the neural network predicts the sentiment. To this end, we compare neural encoders with a bigram SVM model, training medium and large SVM models using 260k and all 1580k reviews respectively. As shown in Table 3, the recurrent neural network models outperform the SVM model for sentiment prediction and also require less training data to achieve this performance. The LSTM and RCNN units obtain similar test error, getting 0.0094 and 0.0087 mean squared error respectively. The RCNN unit performs slightly better and uses fewer parameters. Based on the results, we choose the RCNN encoder network with 2 stacked layers and 200 hidden states.

To train the joint model, we also use an RCNN unit with 200 states as the forward and backward recurrent unit for the generator $\mathbf{gen}()$. The dependent generator has one additional recurrent layer. For this layer we use 30 states so the dependent version still has a number of parameters comparable to the independent version. The two versions of the generator have 358k and 323k parameters respectively.

Figure 2 shows the performance of our joint dependent model when trained to predict the sentiment of all aspects. We vary the regularization weights $\lambda_{1}$ and $\lambda_{2}$ to show various runs that extract different amounts of text as rationales. Our joint model gets performance close to the best encoder run (with full text) when few words are extracted.

Rationale Selection

To evaluate the supporting rationales for each aspect, we train the joint encoder-generator model on each de-correlated subset. We choose the cardinality regularization $\lambda_{1}$ from the values $\{2\times 10^{-4},3\times 10^{-4},4\times 10^{-4}\}$ so the extracted rationale texts are neither too long nor too short. For simplicity, we set $\lambda_{2}=2\lambda_{1}$ to encourage local coherency of the extraction.

For comparison we use the bigram SVM model and implement an attention-based neural network model. The SVM model successively extracts the unigram or bigram (from the test reviews) with the highest feature weight. The attention-based model learns a normalized attention vector over the input tokens (similarly using forward and backward RNNs), averages the encoder states according to the attention, and feeds the averaged vector to the output layer. Similar to the SVM model, the attention-based model can select words based on their attention weights.
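The selection step of the attention baseline can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: the attention scores and encoder states are made-up numbers, and the helper names `softmax`, `attend` and `top_k_words` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw attention scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(states, scores):
    """Return attention weights and the attention-weighted average of states."""
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return weights, pooled


def top_k_words(words, weights, k):
    """Pick the k words with the largest attention weight, in text order."""
    order = sorted(range(len(words)), key=lambda t: weights[t], reverse=True)[:k]
    return [words[t] for t in sorted(order)]


words = ["pours", "a", "dark", "amber", "color"]
scores = [0.2, -1.0, 1.5, 2.0, 1.0]  # illustrative attention scores
states = [[0.1, 0.3], [0.0, 0.1], [0.7, 0.2], [0.9, 0.4], [0.5, 0.5]]

weights, pooled = attend(states, scores)
print(top_k_words(words, weights, 2))
```

Unlike the rationale generator, the pooled vector here still depends on every word, and words are only ranked after the fact by their weights.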

Table 2 presents the precision of the extracted rationales calculated based on sentence-level aspect annotations. The $\lambda_{1}$ regularization hyper-parameter is tuned so the two versions of our model extract a similar number of words as rationales. The SVM and attention-based models are constrained similarly for comparison. Figure 4 further shows the precision when different amounts of text are extracted. Again, for our model this corresponds to changing the $\lambda_{1}$ regularization. As shown in the table and the figure, our encoder-generator networks extract text pieces describing the target aspect with high precision, ranging from 80% to 96% across the three aspects appearance, smell and palate. The SVM baseline performs poorly, achieving around 30% accuracy. The attention-based model achieves reasonable but worse performance than the rationale generator, suggesting the potential of directly modeling rationales as explicit extraction.
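Word-level precision against sentence-level annotations can be computed along the following lines. The data layout is an assumption made for illustration: each annotated sentence is a hypothetical `(start, end, labels)` triple over token positions with an exclusive end, and an empty selection is scored as zero by convention.

```python
def annotated_positions(sentences, aspect):
    """Token positions covered by sentences annotated with the given aspect."""
    positions = set()
    for start, end, labels in sentences:
        if aspect in labels:
            positions.update(range(start, end))
    return positions


def rationale_precision(selected, annotated):
    """Fraction of selected token positions that fall inside annotated sentences."""
    selected = set(selected)
    if not selected:
        return 0.0  # convention assumed here for empty rationales
    return len(selected & set(annotated)) / len(selected)


# Hypothetical review with three annotated sentences.
sentences = [
    (0, 8, {"appearance"}),
    (8, 15, {"smell"}),
    (15, 22, {"palate", "taste"}),
]
selected = [3, 4, 5, 6, 9]  # token positions chosen as the rationale

precision = rationale_precision(selected, annotated_positions(sentences, "appearance"))
print(precision)
```

In this example four of the five selected words lie in the sentence annotated for appearance, so the precision for that aspect is 0.8.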

Figure 5 shows the learning curves of our model for the smell aspect. In the early training epochs, both the independent and (recurrent) dependent selection models fail to produce good rationales, getting low precision as a result. After a few epochs of exploration however, the models start to achieve high accuracy. We observe that the dependent version learns more quickly in general, but both versions obtain close results in the end.

Finally we conduct a qualitative case study on the extracted rationales. Figure 3 presents several reviews, with highlighted rationales predicted by the model. Our rationale generator identifies key phrases or adjectives that indicate the sentiment of a particular aspect.

Similar Text Retrieval on QA Forum

For our second application, we use the real-world AskUbuntu (askubuntu.com) dataset used in recent work [dos Santos et al., 2015, Lei et al., 2016]. This dataset contains 167k unique questions (each consisting of a question title and a body) and 16k user-identified similar question pairs. Following previous work, this data is used to train the neural encoder that learns the vector representation of the input question, optimizing the cosine similarity between similar questions against random non-similar ones. We use the “one-versus-all” hinge loss (i.e. positive versus other negatives) for the encoder, similar to [Lei et al., 2016]. During development and testing, the model is used to score 20 candidate questions given each query question, and a total of $400\times 20$ query-candidate question pairs are annotated for evaluation (https://github.com/taolei87/askubuntu).
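One common reading of a one-versus-all hinge loss over cosine similarities is sketched below. The exact form used in training is not spelled out here, so the max-over-negatives formulation, the margin value and the function names are all assumptions; the vectors are toy examples and are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_vs_all_hinge(query, positive, negatives, margin=0.5):
    """max(0, margin + highest negative similarity - positive similarity)."""
    pos_score = cosine(query, positive)
    worst_neg = max(cosine(query, n) for n in negatives)
    return max(0.0, margin + worst_neg - pos_score)


query = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [1.0, 1.0]]

print(one_vs_all_hinge(query, positive, negatives))              # margin violated
print(one_vs_all_hinge(query, positive, negatives, margin=0.2))  # margin satisfied
```

The loss is zero once the similar question outscores every sampled negative by at least the margin.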

Task/Evaluation Setup

The question descriptions are often long and fraught with irrelevant details. In this set-up, a fraction of the original question text should be sufficient to represent its content, and be used for retrieving similar questions. Therefore, we will evaluate rationales based on the accuracy of the question retrieval task, assuming that better rationales achieve higher performance. To put this performance in context, we also report the accuracy when the full body of a question is used, as well as when titles alone are used. The latter constitutes an upper bound on the model performance, as in this dataset titles provide short, informative summaries of the question content. We evaluate the rationales using the mean average precision (MAP) of retrieval.
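For reference, MAP over ranked candidate lists can be computed as in the following sketch. The function names are hypothetical, the scores and labels are toy values, and scoring a query with no relevant candidate as zero is an assumed convention (evaluation scripts sometimes skip such queries instead).

```python
def rank_labels(scores, labels):
    """Order relevance labels by descending similarity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_labels):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0  # assumed convention for no hits


def mean_average_precision(queries):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two toy queries, each with four ranked candidates (1 = similar question).
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(mean_average_precision(queries))

# Ranking from raw scores before evaluation.
print(rank_labels([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0]))
```

In the retrieval setting described above, each ranked list would hold the 20 candidate questions scored against one query.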

Results

Table 4 presents the results of our rationale model. We explore a range of hyper-parameter values $\lambda_{1}\in\{.008,.01,.012,.015\}$, $\lambda_{2}\in\{0,\lambda_{1},2\lambda_{1}\}$, dropout $\in\{0.1,0.2\}$. We include two runs for each version. The first one achieves the highest MAP on the development set. The second run is selected to compare the models when they use roughly 10% of question text (7 words on average). We also show the results of different runs in Figure 6. The rationales achieve a MAP of up to 56.5%, getting close to using the titles. The models also outperform the baseline of using the noisy question bodies, indicating the models’ capacity to extract short but important fragments.

Figure 7 shows the rationales for several questions in the AskUbuntu domain, using the recurrent version with around 10% extraction. Interestingly, the model does not always select words from the question title. The reason is that the question body can contain the same or even complementary information useful for retrieval. Indeed, some rationale fragments shown in the figure are error messages, which are typically not in the titles but are very useful for identifying similar questions.

Discussion

We proposed a novel modular neural framework to automatically generate concise yet sufficient text fragments to justify predictions made by neural networks. We demonstrated that our encoder-generator framework, trained in an end-to-end manner, gives rise to quality rationales in the absence of any explicit rationale annotations. The approach could be modified or extended in various ways to other applications or types of data.

The encoder and generator can be realized in numerous ways without changing the broader algorithm. For instance, we could use a convolutional network [Kim, 2014, Kalchbrenner et al., 2014], deep averaging network [Iyyer et al., 2015, Joulin et al., 2016] or a boosting classifier as the encoder. When rationales can be expected to conform to repeated stereotypical patterns in the text, a simpler encoder consistent with this bias can work better. We emphasize that, in this paper, rationales are flexible explanations that may vary substantially from one instance to another. On the generator side, many additional constraints could be imposed to further guide acceptable rationales.

Dealing with Search Space.

Our training method employs a REINFORCE-style algorithm [Williams, 1992] where the gradient with respect to the parameters is estimated by sampling possible rationales. Additional constraints on the generator output can help alleviate the problem of exploring a potentially large space of possible rationales in terms of their interaction with the encoder. We could also apply variance reduction techniques to increase the stability of stochastic training (cf. [Weaver and Tao, 2001, Mnih et al., 2014, Ba et al., 2015, Xu et al., 2015]).

Acknowledgments

We thank Prof. Julian McAuley for sharing the review dataset and annotations. We also thank the MIT NLP group and the reviewers for their helpful comments. The work is supported by the Arabic Language Technologies (ALT) group at Qatar Computing Research Institute (QCRI) within the IYAS project. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.