Probing Neural Network Comprehension of Natural Language Arguments

Timothy Niven, Hung-Yu Kao

Introduction

Argumentation mining is the task of determining argumentative structure in natural language text - e.g., which text segments represent claims, and which comprise reasons that support or attack those claims Mochales and Moens (2011); Lippi and Torroni (2016). This is a challenging task for machine learners, as it can be hard even for humans to determine when two text segments stand in argumentative relation, as evidenced by studies on argument annotation Habernal et al. (2014).

One approach to this problem is to focus on warrants Toulmin (1958) - a form of world knowledge that permits inferences. Consider a simple argument: “(1) It is raining; therefore (2) you should take an umbrella.” (This example is adapted from Black and Hunter (2012).) The warrant “(3) it is bad to get wet” could license this inference. Knowing (3) facilitates drawing the inferential connection between (1) and (2). However, it would be hard to find it stated anywhere, since warrants are most often left implicit Walton (2005). Thus, on this approach, machine learners must not only reason with warrants but also discover them.

The Argument Reasoning Comprehension Task (ARCT) Habernal et al. (2018a) defers the problem of discovering warrants and focuses on inference. An argument is provided, comprising a claim $C$ and reason $R$. The task is to pick the correct warrant $W$ over a distractor, called the alternative warrant $A$. The alternative is written such that $R\land A\rightarrow\lnot C$. An alternative warrant for our earlier example could be “(4) it is good to get wet,” in which case we have (1) $\land$ (4) $\rightarrow$ “($\lnot$2) you shouldn’t take an umbrella.” An example from the dataset is given in Figure 1.

The ARCT SemEval shared task Habernal et al. (2018b) verified the challenging nature of this problem. Even supplying warrants, learners still need to rely on further world knowledge. For example, to correctly classify the data point in Figure 1 it is at least required to know how consumer choice and web re-directs relate to the concept of monopoly, and that Google is a search engine. All but one participating system in the shared task could not exceed $60\%$ accuracy (on binary classification).

It is therefore surprising that BERT Devlin et al. (2018) achieves $77\%$ test set accuracy with its best run (Table 1), only three points below the average (untrained) human baseline. Without supplying the required world knowledge for this task it does not seem reasonable to expect it to perform so well. This motivates the question: what has BERT learned about argument comprehension?

To investigate BERT’s decision making we looked at data points it finds easy to classify over multiple runs. Habernal et al. (2018b) performed a similar analysis with the SemEval submissions, and consistent with their results we found that BERT exploits the presence of cue words in the warrant, especially “not.” Through probing experiments designed to isolate such effects, we demonstrate in this work that BERT’s surprising performance can be entirely accounted for in terms of exploiting spurious statistical cues.

However, we show that the major problem can be eliminated in ARCT. Since $R\land A\rightarrow\lnot C$, we can add a copy of each data point with the claim negated and the label inverted. This means that the distribution of statistical cues in the warrants will be mirrored over both labels, eliminating the signal. On this adversarial dataset all models perform randomly, with BERT achieving a maximum test set accuracy of $53\%$. The adversarial dataset therefore provides a more robust evaluation of argument comprehension and should be adopted as the standard in future work on this dataset.

Task Description and Baselines

Let $i=1,\dots,n$ index each point in the dataset $\mathcal{D}$, where $|\mathcal{D}|=n$. The two candidate warrants in each case are randomly assigned a binary label $j\in\{0,1\}$, such that each has an equal probability of being correct. The inputs are the representations for the claim $\mathbf{c}^{(i)}$, reason $\mathbf{r}^{(i)}$, warrant zero $\mathbf{w}^{(i)}_{0}$, and warrant one $\mathbf{w}^{(i)}_{1}$. The label $y^{(i)}$ is a binary indicator corresponding to the correct warrant.

The general architecture for all models is given in Figure 2. Shared parameters $\boldsymbol{\theta}$ are learned to classify each warrant independently with the argument, yielding the logits:

$$z^{(i)}_{j}=\boldsymbol{\theta}[\mathbf{c}^{(i)};\mathbf{r}^{(i)};\mathbf{w}^{(i)}_{j}]$$

These are then concatenated and passed through softmax to determine a probability distribution over the two warrants, $\mathbf{p}^{(i)}=\operatorname{softmax}([z^{(i)}_{0},z^{(i)}_{1}])$. The prediction is then $\hat{y}^{(i)}=\operatorname*{arg\,max}_{j}p^{(i)}_{j}$.
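The scoring and prediction steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `score` is a hypothetical linear scorer standing in for whichever model (BoV, BiLSTM, BERT) produces each logit, and the list-based feature vectors are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(theta, claim, reason, warrant):
    """Hypothetical scorer: dot product of the shared parameters theta
    with the concatenated [claim; reason; warrant] features."""
    features = claim + reason + warrant
    return sum(t * x for t, x in zip(theta, features))


def predict(theta, claim, reason, warrant0, warrant1):
    """Score each warrant independently with the same theta, then
    softmax over the two logits and take the arg max."""
    z0 = score(theta, claim, reason, warrant0)
    z1 = score(theta, claim, reason, warrant1)
    probs = softmax([z0, z1])
    y_hat = max(range(2), key=lambda j: probs[j])
    return probs, y_hat
```

Because the same `theta` scores both warrants, any feature of a warrant that raises its logit does so regardless of the argument it is paired with, which is the opening that warrant-level cues exploit.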

The baselines are a bag of vectors (BoV), bidirectional LSTM Hochreiter and Schmidhuber (1997) (BiLSTM), the SemEval winner GIST Choi and Lee (2018), the best model of Botschen et al. (2018), and human performance (Table 1). For all of our experiments we use grid search to select hyperparameters, dropout regularization Srivastava et al. (2014), and Adam Kingma and Ba (2014) for optimization. We anneal the learning rate by $1/10$ when validation accuracy drops. The final parameters come from the epoch with maximum validation accuracy. The BoV and BiLSTM inputs are $300$-dimensional GloVe embeddings trained on $640$B tokens Pennington et al. (2014). Code to reproduce all experiments, and detailing all hyperparameters, is provided on GitHub (https://github.com/IKMLab/arct2.git).
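The annealing and model-selection rule can be sketched as follows. This is an illustrative reading rather than the released training loop: it assumes "anneal by $1/10$" means multiplying the learning rate by $0.1$ whenever validation accuracy falls below that of the previous epoch, and the function name and accuracy values in the usage note are hypothetical.

```python
def anneal_and_select(val_accuracies, initial_lr, factor=0.1):
    """Return the learning rate used in each epoch and the index of the
    epoch with the highest validation accuracy.

    Assumption: the rate is multiplied by `factor` after any epoch whose
    validation accuracy is lower than the previous epoch's.
    """
    lr = initial_lr
    lrs = []
    best_epoch, best_acc = None, float("-inf")
    previous = None
    for epoch, acc in enumerate(val_accuracies):
        lrs.append(lr)  # rate in effect during this epoch
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        if previous is not None and acc < previous:
            lr *= factor  # validation accuracy dropped: anneal
        previous = acc
    return lrs, best_epoch
```

For a made-up accuracy trace such as `[0.60, 0.65, 0.63, 0.66]`, the rate is reduced once (after the third epoch) and the parameters would be taken from the final epoch.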

BERT

Our BERT classifier is visualized in Figure 3. The claim and reason are joined to form the first text segment, which is paired with each warrant and independently processed. The final layer CLS vector is passed to a linear layer to obtain the logits $z^{(i)}_{j}$. The whole architecture is fine-tuned. The learning rate is $2\times 10^{-5}$ and we allow a maximum of $20$ training epochs, taking the parameters from the epoch with the best validation set accuracy. We use the Hugging Face PyTorch implementation (https://github.com/huggingface/pytorch-pretrained-BERT).

Devlin et al. (2018) report that, on small datasets, BERT sometimes fails to train, yielding degenerate results. ARCT is very small, with $1{,}210$ training observations. In $5/20$ runs we encountered this phenomenon, seeing close to random accuracies on validation and test sets. These cases occurred where training accuracy was also not significantly above random ($<80\%$). Removing the degenerate runs, BERT’s mean is $71.6\pm 0.04$, which would beat the previous state of the art - as would the median of $71.2\%$, which is a better average than the overall mean since it is not skewed by the degenerate cases. However, our main finding is that these results are not meaningful and should be discarded. In the following sections we focus on BERT’s peak performance of $77\%$ to make this case.

Statistical Cues

The major source of spurious statistical cues in ARCT comes from uneven distributions of linguistic artifacts over the warrants, and therefore over the labels. This section aims to demonstrate the presence and nature of these cues. We only consider unigrams and bigrams, although more sophisticated cues may be present. To this end, we aim to calculate how beneficial it is for a model to exploit a cue $k$ , and how pervasive it is in the dataset (indicating the strength of the signal).

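Formally, let $\mathbb{T}^{(i)}_{j}$ be the set of tokens in the warrant for data point $i$ with label $j$. We define the applicability $\alpha_{k}$ of a cue $k$ as the number of data points where it occurs with one label but not the other:

$$\alpha_{k}=\sum_{i=1}^{n}\mathbf{1}\left[\exists j,\ k\in\mathbb{T}^{(i)}_{j}\land k\notin\mathbb{T}^{(i)}_{\lnot j}\right]$$

The productivity $\pi_{k}$ of a cue is defined as the proportion of applicable data points for which it predicts the correct answer:

$$\pi_{k}=\frac{\sum_{i=1}^{n}\mathbf{1}\left[\exists j,\ k\in\mathbb{T}^{(i)}_{j}\land k\notin\mathbb{T}^{(i)}_{\lnot j}\land y^{(i)}=j\right]}{\alpha_{k}}$$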

Finally, we define the coverage $\xi_{k}$ of a cue as the proportion of applicable cases over the total number of data points: $\xi_{k}=\alpha_{k}/n$. In these terms, the productivity of a cue measures the benefit of exploiting it, while coverage measures the strength of the signal it provides. With $m$ labels, if $\pi_{k}>1/m$ then the presence of a cue is going to be useful for the task and a machine learner would do well to make use of it.
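These statistics are straightforward to compute. The sketch below is illustrative only: it assumes a cue is applicable to a data point when it occurs in exactly one of the two warrants, and that it "predicts" the warrant in which it occurs. The function name and the toy data are hypothetical and are not drawn from ARCT.

```python
def cue_statistics(cue, data):
    """Compute (applicability, productivity, coverage) for one cue.

    `data` is a list of (tokens_w0, tokens_w1, label) triples, where the
    token collections hold the unigrams (or bigrams) of each warrant and
    `label` is the index of the correct warrant.
    """
    n = len(data)
    applicable = 0
    correct = 0
    for tokens_w0, tokens_w1, label in data:
        in0 = cue in tokens_w0
        in1 = cue in tokens_w1
        if in0 != in1:  # cue occurs in exactly one warrant
            applicable += 1
            predicted = 0 if in0 else 1
            if predicted == label:
                correct += 1
    productivity = correct / applicable if applicable else 0.0
    coverage = applicable / n if n else 0.0
    return applicable, productivity, coverage


# Hypothetical toy data: four points, the cue "not" is applicable to three.
toy_data = [
    ({"it", "is", "not", "good"}, {"it", "is", "good"}, 0),
    ({"rain", "is", "bad"}, {"rain", "is", "not", "bad"}, 0),
    ({"we", "can", "not"}, {"we", "do", "not"}, 1),
    ({"a"}, {"b", "not"}, 1),
]
alpha, pi, xi = cue_statistics("not", toy_data)
```

On the toy data the cue is applicable to three of the four points and picks the correct warrant for two of them, so its productivity exceeds the $1/2$ threshold for a binary task.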

The productivity and coverage of the strongest unigram cue we found (“not”) are given in Table 2. It provides a particularly strong training signal. While it is less productive in the test set, it is just one among many such cues. We found a range of other unigrams, albeit with less overall productivity, mostly being high frequency words such as “is,” “do,” and “are.” Bigrams that occurred with “not,” such as “will not” and “cannot,” were also found to be highly productive. These statistics indicate the nature of the problem. In the next section we demonstrate that our models are in fact exploiting these cues.

Probing Experiments

If a model is exploiting distributional cues over the labels, then if trained only on the warrants (W) it should perform relatively well. The same can be said for removing either just the claim, leaving the reason and warrant (R, W), or removing the reason (C, W). The latter setups allow the models to additionally consider cues in the reasons and claims, as well as cues holding over their combinations with the warrants. Each of these setups breaks the task since we no longer have an argument to match with a warrant.
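The probing setups amount to dropping parts of the input before it reaches the model. A minimal sketch, assuming each data point is a dictionary with hypothetical keys `claim`, `reason`, `warrant0`, and `warrant1`, and that a setup is named by the letters of the parts it keeps:

```python
def probing_inputs(example, setup):
    """Build the two model inputs (one per warrant) for a probing setup.

    `setup` names the parts to keep: "W", "RW", "CW", or "CRW" for the
    full task. The warrant is always kept, since it is what is classified.
    """
    context = []
    if "C" in setup:
        context.append(example["claim"])
    if "R" in setup:
        context.append(example["reason"])
    return [context + [example[key]] for key in ("warrant0", "warrant1")]
```

With `setup="W"` the model sees only the two warrants, so any accuracy above chance must come from cues in the warrants themselves.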

Experimental results are given in Table 3. On warrants alone (W) BERT achieves a maximum $71\%$ accuracy. That leaves only six percentage points to account for its peak of $77\%$ . We find a gain of four percentage points for (R, W) over (W), and a gain of two for (C, W), accounting for the missing six points. Based on this evidence our major finding is that the entirety of BERT’s performance can be accounted for in terms of exploiting spurious statistical cues.

Adversarial Test Set

The major problem of statistical cues over warrants in ARCT can be eliminated, thanks to the original design of the dataset. Given that $R\land A\rightarrow\lnot C$, we can produce adversarial examples by negating the claim and inverting the label for each data point (Figure 4), which are then combined with the original data. This eliminates the problem by mirroring the distributions of cues around both labels. The negations of most claims in the validation and test sets already exist elsewhere in the dataset. The remaining claims were manually negated by a native English speaker.
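The transformation can be sketched as follows. This is an illustration under stated assumptions rather than the released code: data points are dictionaries with hypothetical keys, and `negations` is a hypothetical lookup table standing in for the negated claims, which in the paper were either found elsewhere in the dataset or written by hand.

```python
def make_adversarial(dataset, negations):
    """Pair every data point with a copy whose claim is negated and whose
    label is inverted; the reason and both warrants are left untouched."""
    augmented = []
    for example in dataset:
        augmented.append(example)
        mirrored = dict(example)
        mirrored["claim"] = negations[example["claim"]]
        mirrored["label"] = 1 - example["label"]
        augmented.append(mirrored)
    return augmented


# Usage with the umbrella example from the introduction.
original = [{
    "claim": "You should take an umbrella",
    "reason": "It is raining",
    "warrant0": "It is bad to get wet",
    "warrant1": "It is good to get wet",
    "label": 0,
}]
negations = {"You should take an umbrella": "You should not take an umbrella"}
adversarial = make_adversarial(original, negations)
```

Each warrant pair now appears once with each label, so any cue that occurs in only one warrant is correct in exactly half of its applicable cases and its productivity is $1/2$.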

We trained on the original data, and validated and tested on the adversarial data. (The results reported utilize a training set augmented by adding a copy of each data point with the warrants flipped and the label inverted. Comparable results are achieved with the original data, without this augmentation.) The results, given in Table 4, show BERT’s peak performance has dropped to $53\%$, with mean and median at $50\%$. The spurious statistics have been eliminated as a solution, as expected. (Training on the negated data does lead to above random performance, but this is due to exploiting common statistics holding over claims and warrants. Our experimental setup breaks these as solutions by having different distributions over these heuristics in the training and test sets. We report this result in detail in a forthcoming paper.) This result accords better with our intuitions about this task: with little to no understanding about the reality underlying these arguments, good performance shouldn’t be feasible.

Related Work

The most successful previous work on ARCT Choi and Lee (2018); Zhao et al. (2018); Niven and Kao (2018) involved transfer learning from Natural Language Inference (NLI) datasets Bowman et al. (2015); Williams et al. (2017), and utilized effective NLI models such as ESIM Chen et al. (2016) and InferSent Conneau et al. (2017). More recently, Botschen et al. (2018) added FrameNet knowledge with modest performance gains. These models should be evaluated on our adversarial dataset. In particular, it will be interesting to see whether Botschen et al.’s model stands out due to the inclusion of some of the required world knowledge.

There is much recent work focusing on statistical cues in datasets in vision Jo and Bengio (2017) and NLP Sanchez et al. (2018); McCoy et al. (2019); Gururangan et al. (2018); Glockner et al. (2018); Poliak et al. (2018); Rajpurkar et al. (2018); Jia and Liang (2017). Similar to our experiment with warrants, Poliak et al. (2018) classified NLI data based on the hypothesis only. A similar experiment to our probing task was performed by Niven and Kao (2018), but only with reasons and warrants. They found that independent warrant classification with shared parameters provides some regularization against warrant-label cues. However, this does not solve the problem since the presence of a cue is enough to increase the logits for either warrant.

The original ARCT paper Habernal et al. (2018a) reported results with a training set created in the same way as our adversarial dataset, which also led to random accuracy on the original test set. They suggested it could be that high similarity between the data points made the problem too difficult for the simple models they implemented. Our work indicates the necessity of applying this transformation to the entire dataset in order to obtain a more robust evaluation by eliminating solutions based on spurious statistical cues.

Conclusion

ARCT provides a fortuitous opportunity to see how stark the problem of exploiting spurious statistics can be. Due to our ability to eliminate the major source of these cues, we were able to show that BERT’s maximum performance fell from just three points below the average untrained human baseline to essentially random. To answer our question in the introduction: BERT has learned nothing about argument comprehension.

However, our investigations confirmed that BERT is indeed a very strong learner. Analysis of easy to classify data points showed reliance on a lower proportion of the strongest cue word than the BoV and BiLSTM - i.e. BERT has learned when to ignore the presence of “not” and focus on different cues. This indicates an ability to exploit much more subtle joint distributional information. As our learners get stronger, controlling for spurious statistics becomes more important in order to have confidence in their apparent performance. Taken with a growing body of previous work, our results indicate the need for further research into the extent of this problem in NLP more generally.

The adversarial dataset should be adopted as the standard in future work on ARCT. We hope that providing a more robust evaluation will help to spur more productive research on this problem.

Acknowledgments

We would like to thank Ivan Habernal, and the reviewers, for their helpful comments.