Cross-topic Argument Mining from Heterogeneous Sources Using Attention-based Neural Networks

Christian Stab, Tristan Miller, Iryna Gurevych

Introduction

Information retrieval and question answering are by now mature technologies that excel at answering factual queries on uncontroversial topics. However, they provide no specialized support for queries where there is no single canonical answer, as with topics that are controversial or opinion-based. For such queries, the user may need to carefully assess the stance, source, and supportability for each of the answers. These processes can be supported by argument mining (AM), a nascent area of natural language processing concerned with the automatic recognition and interpretation of arguments.

In this paper, we apply AM to the task of argument search—that is, searching a large document collection for arguments relevant to a given topic. Searching for and classifying relevant arguments plays an important role in decision making Svenson (1979), legal reasoning Wyner et al. (2010), and the critical reading, writing, and summarization of persuasive texts Kobayashi (2009); Wingate (2012). Automating the argument search process could ease much of the manual effort involved in these tasks, particularly if it can be made to robustly handle arguments from different text types and topics.

But despite its obvious usefulness, this sort of argument search has attracted relatively little attention in the research community. This may be due in part to the limitations of the underlying models and training resources, particularly as they relate to heterogeneous sources. That is, most current approaches to AM are designed for use with particular text types and do not work well when applied to different data sets Daxenberger et al. (2017). Indeed, as Habernal et al. (2014) observe, there is a great diversity of perspectives on how arguments can be best characterized and modelled, and no “one-size-fits-all” argumentation theory that applies to the variety of text sources found on the Web.

In this paper, we propose an argument annotation scheme that is (1) applicable to the information-seeking perspective of argument search, (2) general enough for use on heterogeneous data sources, and (3) simple enough to be applied manually by untrained annotators. We investigate whether it is possible to achieve reasonable data quality using crowdsourced annotations, and how well computational models trained on this data perform on the argument search task within and across different topics. Finally, we measure the amount of topic-specific data that must be added to a topic-general model in order for it to achieve in-topic performance comparable to that of a topic-specific model.

Our results show that crowd workers can indeed apply our annotation scheme to arbitrary Web texts quickly and reliably, allowing us to obtain huge amounts of data at a reasonable cost. The corpus we produce includes over 25,000 instances over eight controversial topics, allowing for cross-topic experiments using heterogeneous text types. The results of these experiments show that our attention-based neural network outperforms vanilla BiLSTM models in cross-topic experiments, with a relative improvement of 6% in accuracy and 11% in F-score.

Related Work

Most existing approaches treat argument mining at the discourse level, focusing on tasks such as segmenting argumentative discourse units Ajjour et al. (2017); Goudas et al. (2014), classifying the function of argumentative discourse units (for example, as claims or premises) Mochales-Palau and Moens (2009); Stab and Gurevych (2014), and recognizing argumentative discourse relations Eger et al. (2017); Stab and Gurevych (2017); Nguyen and Litman (2016). These discourse-level approaches address the identification of argumentative structures within a single document but do not consider relevance to externally defined topics.

To date, there has been little research on the identification of topic-relevant arguments for argument search. Wachsmuth et al. (2017) present a generic argument search framework. The system, however, relies on already structured arguments from debate portals and is not yet able to retrieve arguments from arbitrary texts. Levy et al. (2014) investigate the identification of topic-relevant claims, an approach that was later extended with evidence extraction to mine supporting statements for claims Rinott et al. (2015). However, both approaches are designed to mine arguments from Wikipedia articles; it is unclear whether their annotation scheme is applicable to other text types or whether it can be easily and accurately applied by untrained annotators. Hua and Wang (2017) identify sentences in cited documents that have been used by an editor to formulate an argument. In contrast to this work, we do not limit our approach to the identification of sentences related to a given argument, but rather focus on the retrieval of any argument relevant to a given topic. The fact that we are concerned with retrieval of arguments also sets our work apart from the discourse-agnostic stance detection task of Mohammad et al. (2016), which is concerned with the identification of sentences expressing support or opposition to a given topic, irrespective of whether those sentences contain supporting evidence (as opposed to mere statements of opinion).

Cross-domain AM experiments have so far been conducted only for discourse-level tasks such as claim identification Daxenberger et al. (2017), argumentative segment identification Al-Khatib et al. (2016), and argumentative unit segmentation Ajjour et al. (2017). However, the discourse-level argumentation models employed for these studies seem to be highly dependent on the text types for which they were designed; they do not work well when applied to other text types Daxenberger et al. (2017). The crucial difference between our own work and prior cross-domain experiments is that we investigate AM from heterogeneous texts across different topics instead of studying specific discourse-level AM tasks across restricted text types of existing corpora.

Annotation Scheme and Corpus Creation

There exists a great diversity in models of argumentation, which differ in their perspective, complexity, terminology, and intended applications Bentahar et al. (2010). For the present study, we propose a model which, though simplistic, is nonetheless well-suited to the argument search scenario outlined in our introduction. We define an argument as a span of text expressing evidence or reasoning that can be used to either support or oppose a given topic. An argument need not be “direct” or self-contained—it may presuppose some common or domain knowledge, or the application of commonsense reasoning—but it must be unambiguous in its orientation to the topic. A topic, in turn, is some matter of controversy for which there is an obvious polarity to the possible outcomes—that is, a question of being either for or against the use or adoption of something, the commitment to some course of action, etc. In some graph-based models of argumentation (Stab, 2017, Ch. 2), what we refer to as a topic would be part of a (major) claim expressing a positive or negative stance, and our arguments would be premises with supporting/attacking consequence relations to the claim. However, unlike these models, which are typically used to represent (potentially deep or complex) argument structures at the discourse level, ours is a flat model that considers arguments in isolation from their surrounding context. A great advantage of this approach is that it allows annotators to classify text spans without having to read large amounts of text and without having to consider relations to other topics or arguments.

In this work, we restrict ourselves to topics that can be concisely and implicitly expressed through keywords, and arguments that consist of individual sentences. Some examples, drawn from our data set, are shown in Table 1. The first three examples should be self-explanatory. The fourth example expresses opposition to the topic, but under our definition it is properly classified as a non-argument because it is a mere statement of stance that provides no evidence or reasoning.

For our experiments it was necessary to gather a large collection of manually annotated arguments that cover a variety of topics and that come from a variety of text types. We started by randomly selecting eight topics (see Table 3) from online lists of controversial topics (https://www.questia.com/library/controversial-topics, https://www.procon.org/). For each topic, we made a Google query for the topic name, removed results not cached by the Wayback Machine (https://web.archive.org/), and truncated the list to the top 50 results. This resulted in a set of persistent, topic-relevant, largely (but not exclusively) polemical Web documents representing a range of genres and text types, including news reports, editorials, blogs, debate forums, and encyclopedia articles. We preprocessed each document with Apache Tika (https://tika.apache.org/) to remove boilerplate text. We then used the Stanford CoreNLP tools Manning et al. (2014) to perform tokenization, sentence segmentation, and part-of-speech tagging on the remaining text, and removed all sentences without verbs or with fewer than three tokens. This left us with a raw data set of 27,520 sentences (about 2,700 to 4,400 sentences per topic).
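The final filtering step can be illustrated with a minimal sketch. It assumes sentences are already tokenized and tagged with Penn Treebank part-of-speech tags, and that "verb" means any tag starting with `VB`; the function names and the treatment of modals (not counted as verbs here) are assumptions for illustration, not details given in the text.

```python
from typing import List, Tuple

# A sentence is a list of (token, POS tag) pairs, e.g. as produced by a tagger.
TaggedSentence = List[Tuple[str, str]]


def keep_sentence(sentence: TaggedSentence, min_tokens: int = 3) -> bool:
    """Keep a sentence only if it has at least `min_tokens` tokens and a verb."""
    if len(sentence) < min_tokens:
        return False
    # Penn Treebank verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
    return any(tag.startswith("VB") for _, tag in sentence)


def filter_sentences(sentences: List[TaggedSentence]) -> List[TaggedSentence]:
    """Drop sentences without verbs or with fewer than three tokens."""
    return [s for s in sentences if keep_sentence(s)]
```

For example, a three-token sentence containing a `VBP` token is kept, while a verbless fragment or a two-token imperative is discarded.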

To assist annotators in classifying these sentences according to our argumentation model, we created a browser-based annotation interface that presents a brief set of instructions, a topic, a list of sentences, and a multiple-choice form for specifying whether each sentence is a supporting argument, an opposing argument, or not an argument with respect to the topic.

3.2 Analysis

To test the applicability of our annotation scheme by untrained annotators, we performed an experiment where we had a group of expert annotators and a group of untrained annotators classify the same set of sentences, and then compared the two groups’ classifications. The data for this experiment consisted of 200 sentences randomly selected from each of our eight topics. Our “expert” annotators were two graduate-level language technology researchers who were fully briefed on the nature and purpose of the argument model. Our untrained annotators were anonymous American workers from the Amazon Mechanical Turk (AMT) crowdsourcing platform. Each sentence was independently annotated by the two expert annotators and ten crowd workers.

Inter-annotator agreement for our two experts, as measured by Cohen’s $\kappa$, was 0.721; this exceeds the commonly used threshold of 0.7 for assuming the results are reliable Carletta (1996). We proceeded by having the two experts resolve their disagreements, resulting in a set of “expert” gold-standard annotations. Similar gold standards were produced for the crowd annotations by applying the MACE de-noising tool Hovy et al. (2013); we tested various threshold values (1.0, 0.9, and 0.8) to discard instances that cannot be confidently assigned a canonical label. We then calculated Cohen’s $\kappa$ between the remaining instances in the expert and crowd gold standards. In order to determine the relationship between inter-annotator agreement and the number of crowd workers, we performed this procedure with successively lower numbers of crowd workers, going from the original ten annotators per instance down to two. The results are visualized in Figure 1. We observe that using seven annotators and a MACE threshold of 0.9 results in $\kappa=0.723$; this gives us reliability similar to that of the expert annotators without sacrificing too much coverage. Table 2 shows the $\kappa$ and percentage agreement for this setup, as well as the agreement between our expert annotators, broken down by topic.
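For readers unfamiliar with the agreement measure, the following is a minimal sketch of Cohen's $\kappa$ for two label sequences over the same instances. The function name and the label strings in the usage note are illustrative; the MACE de-noising step is not reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(a)
    # Observed agreement: fraction of instances with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: chance of agreeing given each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)
```

With labels such as "pro", "con", and "none", two annotators who agree on three of four instances can still obtain a $\kappa$ well below 0.75, because part of that agreement is expected by chance.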

We proceeded with annotating the remaining instances in our data set using seven crowd workers each. The workers were paid 1.2¢ per instance, with each instance taking a bit less than six seconds on average. This corresponds to the US federal minimum wage of $7.25/hour. Our total expenditure, including AMT processing fees, was $2,774.02. After applying MACE with a threshold of 0.9, we were left with 25,492 gold-standard annotations. Table 3 provides statistics on the size and class distribution of the final corpus. The gold-standard annotations for this data set, and code for retrieving the original sentences from the Wayback Machine, are released under free licences (https://www.ukp.tu-darmstadt.de/data/).
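The stated pay rate can be checked with a short calculation, assuming the hourly wage is simply the per-instance payment multiplied by the number of instances completed per hour:

```latex
\frac{\$7.25 \text{ per hour}}{\$0.012 \text{ per instance}} \approx 604 \text{ instances per hour},
\qquad
\frac{3600 \text{ s}}{604} \approx 5.96 \text{ s per instance}
```

This agrees with the reported average of a bit less than six seconds per instance.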

Approaches for Identifying Arguments

We model the identification of arguments as a binary, sentence-level classification and aim to learn the following function:

$$f(s,t) \in \{\text{argument}, \text{no argument}\}$$

where $s=w_{1},w_{2},w_{3},\dots,w_{n}$ is a sentence consisting of words $w_{i}$ and $t=v_{1},v_{2},\dots,v_{m}$ is a topic with words $v_{j}$. In other words, the task is to classify sentence $s$ as “argument” if $s$ includes a relevant reason either supporting or opposing the given topic $t$ and as “no argument” if the sentence does not include a reason or is not relevant for topic $t$. (Note that we leave stance recognition for future work.)

Our first model (bilstm) is a bidirectional long short-term memory network. LSTMs Hochreiter and Schmidhuber (1997) are recurrent neural networks that process each word gradually and decide in each step which information to keep in order to produce a concise representation of the word sequence. Traditional LSTMs, however, process the text in a single direction and do not consider contextual information of future words in the current step Tan et al. (2016). Bidirectional LSTMs use both the previous and future context by processing the input sequence in two directions. The final representation is the concatenation of the forward and backward step. In order to prevent overfitting, we add dropout after the concatenation layer. The result is fed into a dense layer with two units and softmax as the activation function. For representing the words $w_{i}$, we use 300-dimensional word embeddings trained on the Google News data set by Mikolov et al. (2013). To handle out-of-vocabulary (OOV) words, we create random word vectors and map each OOV word to the same random vector. (Each dimension is set to a random number between $-0.01$ and $0.01$. Digits are mapped to the same random word vector.)
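The OOV handling can be sketched as follows. This is an illustration under stated assumptions: the class name `EmbeddingLookup` is hypothetical, the pretrained vectors are represented as a plain dictionary, and digits are read here as sharing one random vector of their own, which is one possible interpretation of the description.

```python
import random
from typing import Dict, List

DIM = 300  # dimensionality of the pretrained embeddings


def random_vector(rng: random.Random, dim: int = DIM, bound: float = 0.01) -> List[float]:
    """Draw each dimension uniformly between -bound and +bound."""
    return [rng.uniform(-bound, bound) for _ in range(dim)]


class EmbeddingLookup:
    """Maps words to vectors; all OOV words share one random vector."""

    def __init__(self, pretrained: Dict[str, List[float]], seed: int = 0, dim: int = DIM):
        rng = random.Random(seed)
        self.pretrained = pretrained
        self.oov_vector = random_vector(rng, dim)
        # Assumption: digit tokens share a separate random vector.
        self.digit_vector = random_vector(rng, dim)

    def __call__(self, word: str) -> List[float]:
        if word in self.pretrained:
            return self.pretrained[word]
        if word.isdigit():
            return self.digit_vector
        return self.oov_vector
```

Sharing a single vector means the model cannot distinguish between unknown words, but it keeps the number of untrained parameters small.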

4.2 BiLSTM Model with Topic Similarity Features (bilstm+cos)

A limitation of the bilstm model described in the previous section is that it does not take the topic $t$ into account. Consequently, the model is not able to learn the relation between sentence $s$ and topic $t$ and to decide if a sentence is relevant for the given topic. To address this issue, we extend the bilstm model in the following way: we concatenate the input embedding of each word $w_{i}$ with the cosine similarity between $w_{i}$ and the averaged word embeddings of the topic words $v_{j}$. That is, we encode each word $w_{i}$ of sentence $s$ as

$$[\,x_{i};\ \cos(x_{i},u)\,]$$

where $x_{i}$ is the word embedding of $w_{i}$, $u$ is the average of the word embeddings of $v_{j}$ in topic $t$, and $\cos(x_{i},u)$ is the cosine similarity between $x_{i}$ and $u$. (We also tried concatenating $u$ with $x_{i}$; however, this performed worse than the vanilla bilstm model.) We refer to this model as bilstm+cos.
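The encoding described above can be sketched in a few lines. Plain Python lists stand in for the 300-dimensional embeddings, and the function names are illustrative rather than taken from the authors' code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of the topic word embeddings (the vector u)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def encode_with_topic_similarity(
    sentence_vecs: Sequence[Sequence[float]], topic_vecs: Sequence[Sequence[float]]
) -> List[Vector]:
    """Append cos(x_i, u) as one extra dimension to every word embedding x_i."""
    u = average(topic_vecs)
    return [list(x) + [cosine(x, u)] for x in sentence_vecs]
```

Each word vector therefore grows by exactly one dimension, so a 300-dimensional embedding becomes a 301-dimensional input to the BiLSTM.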

4.3 Inner-attention BiLSTM (inner-att)

In order to let the model learn which parts of the sentence are relevant (or irrelevant) to the given topic, we propose an attention-based neural network Bahdanau et al. (2014) that learns an importance weighting of the input words depending on the given topic. Similar approaches have been shown to achieve state-of-the-art results in aspect-based sentiment analysis Wang et al. (2016), question answering Tan et al. (2016), and discourse parsing Li et al. (2016). For our model, we adopt an inner-attention mechanism as proposed by Wang et al. (2016). In particular, we determine the importance weighting on the input sequence instead of on the hidden states of the LSTM; this has been shown to prevent biased importance weights towards the end of a sequence. Following this idea, we determine the importance weighting for each input embedding $x_{i}$ as

for each of the word embeddings $x_{i}$ of sentence $s$ . This attention mechanism can be seen as a sieve in which uninformative words are filtered by the given topic. For obtaining a concise representation of the sentence, we apply a BiLSTM model on the weighted input embeddings, whereas Wang et al. (2016) used a single GRU. Also, we do not use average pooling on the hidden layers of the RNN, but use the concatenation of the forward and backward LSTMs as the final sentence representation.
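As a hedged sketch of the weighting described above, one form consistent with the inner-attention mechanism of Wang et al. (2016) is given below. The bilinear parameter matrix $M$ and the symbols $a_{i}$ and $\tilde{x}_{i}$ are assumed names that do not come from the text; $u$ is the averaged topic embedding as in bilstm+cos, and $\sigma$ is the logistic sigmoid.

```latex
a_{i} = \sigma\left(u^{\top} M \, x_{i}\right),
\qquad
\tilde{x}_{i} = a_{i} \, x_{i}
```

Under this reading, each weight $a_{i} \in (0,1)$ scales its word embedding before the BiLSTM processes the sequence, so words judged uninformative for the topic contribute less to the sentence representation.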

4.4 Inner-attention BiLSTM with Topic Similarity Features (inner-att+cos)

Our fourth model combines the inner-attention mechanism of the inner-att model and the topic similarity feature of bilstm+cos. As with the inner-att model, we learn an importance weighting on the embeddings of the words of sentence $s$ as described in §4.3. Then, we concatenate the weighted input embeddings from Equation 4 with the cosine similarity between the averaged topic embeddings $u$ and the embeddings of the current word $x_{i}$ as

$$[\,\tilde{x}_{i};\ \cos(x_{i},u)\,]$$

where $\tilde{x}_{i}$ denotes the weighted input embedding of word $w_{i}$.

Accordingly, this representation not only filters out unimportant information, but also emphasizes words similar to the topic, which helps to identify off-topic sentences. Figure 2 shows the schematic of this model, which we refer to as inner-att+cos.
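The combined input encoding might look like the following sketch. It assumes a bilinear attention weight of the form sigmoid(u^T M x), in the spirit of Wang et al. (2016); the matrix `M`, the function names, and the use of plain lists are all assumptions for illustration, and in a real model `M` would be learned jointly with the BiLSTM.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attention_weight(x: Sequence[float], u: Sequence[float], M: Sequence[Sequence[float]]) -> float:
    """Assumed form of the importance weight: sigmoid(u^T M x)."""
    Mx = [sum(m * xc for m, xc in zip(row, x)) for row in M]
    return sigmoid(sum(uc * v for uc, v in zip(u, Mx)))


def encode_inner_att_cos(
    sentence_vecs: Sequence[Sequence[float]], u: Sequence[float], M: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Scale each embedding by its attention weight, then append cos(x_i, u)."""
    encoded = []
    for x in sentence_vecs:
        a = attention_weight(x, u, M)
        weighted = [a * v for v in x]
        # The cosine feature uses the unweighted embedding x_i.
        encoded.append(weighted + [cosine(x, u)])
    return encoded
```

With an all-zero `M`, every word receives weight 0.5, which makes the effect of the two components easy to inspect separately.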

Evaluation

In order to evaluate the robustness of the models, we conduct in-topic as well as cross-topic experiments. For the former, we use 80% of all sentences of a topic for training and 20% for testing. In order to tune the parameters of the models, we sample 10% of the training data as validation data. (We used stratified splitting to ensure the same class distribution in all sets.) In cross-topic experiments, we evaluate how well the models generalize to an unknown topic. To this end, we combine training and validation data of seven topics for training and parameter tuning, and use the test data of the eighth topic for testing. We intentionally do not use the entire data of the target topic for testing, since it allows us to directly compare in-topic experiments with cross-topic experiments and to investigate the influence of gradually adding target topic data to the training data (§5.6).
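The splitting procedure can be sketched as below. The function names, the seeding scheme, and the per-class rounding are assumptions; the sketch only illustrates the 80/20 split, the 10% validation sample, and the way cross-topic data is assembled from per-topic splits.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], fraction: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Hold out `fraction` of the indices of every class; returns (rest, held_out)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rest, held_out = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        k = round(len(idxs) * fraction)
        held_out.extend(idxs[:k])
        rest.extend(idxs[k:])
    return sorted(rest), sorted(held_out)


def in_topic_split(labels: Sequence[str], seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """80% train / 20% test, then 10% of the training part as validation."""
    train_val, test = stratified_split(labels, 0.2, seed)
    sub_labels = [labels[i] for i in train_val]
    train_pos, val_pos = stratified_split(sub_labels, 0.1, seed + 1)
    train = [train_val[p] for p in train_pos]
    val = [train_val[p] for p in val_pos]
    return train, val, test


def cross_topic_split(per_topic: Dict[str, Tuple[list, list, list]], target: str):
    """Train/validate on all other topics; test on the target topic's test part only."""
    train = [x for t, (tr, _, _) in per_topic.items() if t != target for x in tr]
    val = [x for t, (_, va, _) in per_topic.items() if t != target for x in va]
    test = list(per_topic[target][2])
    return train, val, test
```

Because the cross-topic test set is the same 20% used in the in-topic setting, scores from the two settings are measured on identical sentences.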

Since reporting single performance scores is insufficient to compare non-deterministic learning approaches like neural networks Reimers and Gurevych (2017), we report average scores of ten runs with different random seeds. As evaluation measures, we report the average macro F-score over all ten runs for each topic. Furthermore, we report the average accuracy (A), macro F-score, and precision (P) and recall (R) of the “argument” class over all eight topics (and runs) for in-topic and cross-topic experiments.
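A minimal sketch of the reported measure follows. It assumes macro F-score is the unweighted mean of the per-class F1 values (another common definition computes F1 from macro-averaged precision and recall), and the function names are illustrative.

```python
from statistics import mean
from typing import List, Sequence

CLASSES = ("argument", "no argument")


def f1_for_class(gold: Sequence[str], pred: Sequence[str], cls: str) -> float:
    """F1 of a single class from true positives, false positives, false negatives."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return mean([f1_for_class(gold, pred, c) for c in CLASSES])


def average_over_runs(gold: Sequence[str], predictions_per_run: List[Sequence[str]]) -> float:
    """Average macro F1 over several runs (e.g. ten random seeds)."""
    return mean([macro_f1(gold, pred) for pred in predictions_per_run])
```

Averaging over runs smooths out the variance that different random seeds introduce into neural network training.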

We use a logistic regression model with lowercased unigram features as baseline, which has been shown to be a strong baseline for various other AM tasks Daxenberger et al. (2017); Stab and Gurevych (2017). We refer to this model as lr-uni.

All models are trained using the Adam optimizer Kingma and Ba (2015) and cross-entropy loss function. For finding the best model in each of the ten runs, we stop training once the accuracy on the validation data no longer improves. To prevent the model from overfitting, we apply dropout Srivastava et al. (2014) for each model after the concatenation layer of the BiLSTM layer as described in §4.1. To accelerate training, we set the maximum length of all sentences to $60$. (Only 244 of our sentences ($<1\%$) exceed this length.)
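The stopping rule and length limit can be sketched independently of any deep learning framework. The callables `train_one_epoch` and `validation_accuracy` are hypothetical placeholders for the actual training and evaluation steps, and `max_epochs` is an assumed safety cap not mentioned in the text.

```python
from typing import Callable, List, Tuple

MAX_LEN = 60  # maximum sentence length in tokens


def truncate(tokens: List[str], max_len: int = MAX_LEN) -> List[str]:
    """Cut a token sequence to at most `max_len` tokens."""
    return tokens[:max_len]


def train_with_early_stopping(
    train_one_epoch: Callable[[], None],
    validation_accuracy: Callable[[], float],
    max_epochs: int = 100,
) -> Tuple[int, float]:
    """Train until validation accuracy stops improving; return (best epoch, best accuracy)."""
    best_epoch, best_acc = -1, float("-inf")
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = validation_accuracy()
        if acc <= best_acc:
            break  # no improvement over the best epoch so far
        best_epoch, best_acc = epoch, acc
    return best_epoch, best_acc
```

In practice the model weights from the best epoch would also be saved and restored, which this sketch omits.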

For finding the best model configurations, we tuned the hyperparameters by training each model on the training data of all topics and evaluated their performance on all validation sets. In particular, we experimented with LSTM sizes of 32, 48, 64, 96, 128, 160, and 192; dropouts of 0.1, 0.3, 0.5, 0.7, and 0.9; batch sizes of 16, 32, 64, and 128; and learning rates of $1\times 10^{-2}$, $5\times 10^{-3}$, $1\times 10^{-3}$, and $1\times 10^{-4}$. Table 5 shows the best parameters for each of our four models on the validation data.
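The search space listed above can be enumerated as follows. Whether the full Cartesian product was searched is not stated, so the exhaustive grid here is an assumption; the dictionary keys are illustrative names.

```python
from itertools import product
from typing import Dict, Iterator

LSTM_SIZES = [32, 48, 64, 96, 128, 160, 192]
DROPOUTS = [0.1, 0.3, 0.5, 0.7, 0.9]
BATCH_SIZES = [16, 32, 64, 128]
LEARNING_RATES = [1e-2, 5e-3, 1e-3, 1e-4]


def hyperparameter_grid() -> Iterator[Dict[str, float]]:
    """Yield every combination of the listed hyperparameter values."""
    for size, dropout, batch, lr in product(LSTM_SIZES, DROPOUTS, BATCH_SIZES, LEARNING_RATES):
        yield {
            "lstm_size": size,
            "dropout": dropout,
            "batch_size": batch,
            "learning_rate": lr,
        }


configs = list(hyperparameter_grid())
```

An exhaustive search over these values would cover 7 × 5 × 4 × 4 = 560 configurations per model.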

5.2 In-topic Results

The in-topic results (upper part of Table 4) show that all neural approaches outperform the lr-uni baseline. The bilstm model achieves an average accuracy of 0.727 and an F1 of 0.721. We can also see from the results that all three models using the topic as additional input perform better than the vanilla bilstm model. Our bilstm+cos achieves better results for four topics while the attention-based models outperform the bilstm model on seven (inner-att) and five topics (inner-att+cos). The inner-att model performs best in in-topic experiments, achieving an average accuracy of 0.744 and an F1 of 0.741. This finding suggests that the attention model successfully emphasizes those parts of the sentences which are important for the topic and that the learned importance weighting results in more concise sentence representations for AM.

5.3 Cross-topic Results

When the target topic is unknown, the F-scores of the neural models drop on average by 0.108 (lower part of Table 4). In particular, all models achieve a considerably lower recall compared to in-topic experiments, which is also evident from the number of sentences classified as arguments in all test sets. For instance, in cross-topic experiments the inner-att+cos model classifies only 1,515 sentences as arguments, while it recognizes 2,662 arguments in in-topic experiments. The results, however, also show that all neural approaches achieve a considerably higher precision compared to in-topic experiments. On average the precision of neural models is 0.064 better compared to in-topic experiments. This suggests that the neural models learn common properties that arguments share across topics.

The results also show that the inner-att+cos model generalizes best to unknown topics. It outperforms the vanilla bilstm model on all topics and achieves 0.693 accuracy and 0.658 F-score. The results also show that the model achieves 0.067 higher precision and the lowest drop in recall of all models compared to in-topic experiments. The model performs better compared to bilstm+cos and inner-att, which illustrates that the combination of the attention mechanism with the similarity feature is helpful in cross-topic settings.

5.4 What Does the Model Learn?

In an attempt to understand what the attention-based model learns, we analyzed the importance weights of individual words. Table 6 shows how the importance weighting of the inner-att+cos model changes for the same sentence when different topics are given as input. The first row shows the importance weights for the topic “school uniforms”, to which the sentence is relevant. As the colours indicate, the model gives high attention to words like “students” and “wear”, which are relevant to the topic. More importantly, the model also emphasizes words like “violates”, “freedom”, and “expression”, which represent the gist of the argument. The second row shows the importance weighting when providing a topic not relevant to the sentence. As we can see, the model gives higher attention to stop words like “the”, “of”, and “to”. It also gives attention to words like “violates” and “right” which are less topic-dependent and likely to appear in arguments relevant to other topics. This example illustrates that our attention-based model successfully learns which words make a sentence a relevant argument for a given topic.

As the evaluation results suggest, our model learns specific features allowing it to achieve high precision in cross-topic experiments (see §5.3). In order to better understand these features, we ranked the words of all positively classified sentences in the test sets according to their average importance weights in all cross-topic experiments. Among the top-ranked words are remarkably few topic-dependent words, but many adverbs and adjectives like “fair”, “perfect”, “wrong”, “easier”, and “impossible”. Also, verbs like “infringe”, “oppose”, and “undermine” receive high importance weights across topics. This shows that the attention-based model gives high attention to words that assign positive or negative attributes to specific entities.
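The ranking procedure can be sketched as below, assuming each positively classified sentence is available as a list of (word, importance weight) pairs. Lowercasing before aggregation and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_words_by_attention(
    sentences: Iterable[List[Tuple[str, float]]],
) -> List[Tuple[str, float]]:
    """Average each word's importance weight over all occurrences and sort descending."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentence in sentences:
        for word, weight in sentence:
            key = word.lower()  # assumption: case-insensitive aggregation
            totals[key] += weight
            counts[key] += 1
    averages = {word: totals[word] / counts[word] for word in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

Averaging per word rather than summing prevents frequent words from dominating the ranking merely because they occur often.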

5.5 Error Analysis

To better understand the errors of the inner-att+cos model, we manually analyzed 100 sentences randomly sampled from the false positives and false negatives of cross-topic experiments. Among the false positives, we found 42 off-topic sentences that were wrongly classified as arguments. The 58 on-topic false positives are primarily non-argumentative background information about the topic, or mere opinions about the topic without evidence (cf. the first and fourth examples in Table 1). Among the false negatives, we found 61 sentences not explicitly referring to the topic but to related aspects that make the sentence a relevant argument. For instance, the model fails to establish argumentative links between the topic “cloning” and aspects like diminishing the waiting lists for organ donation, or links between “nuclear energy” and the conditions of workers in uranium mines.

5.6 Adapting to New Topics

In order to quantify the amount of topic-specific data required by the models to achieve in-topic results, we gradually add target topic data in cross-topic experiments to the training data and evaluate model performance on the target test set. Figure 3 shows the average precision and recall over all topics when adding different amounts of randomly sampled topic-specific data to the training data ($x$-axes). (Each data point in the plot is the average score of $80$ experiments: ten runs with different random samples of target-topic data for each of the eight topics.) As the results show, all models achieve higher recall, while the precision drops when adding target topic data to the training data. This shows that the models tend to emphasize topic-related information more than topic-independent features when target-topic data is available.

This effect is most evident for the inner-att+cos model, which uses information about the topic in the attention mechanism as well as in the similarity feature. The results, however, also show that the inner-att+cos model achieves 0.802 recall with only 30% of the target topic data, while the vanilla bilstm model and bilstm+cos model do not reach in-topic recall with all available target topic data.
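The adaptation experiment can be sketched as a sampling step applied before training. The function name, the seed handling, and the rounding of the sample size are assumptions; the training instances themselves can be any objects.

```python
import random
from typing import List, Sequence


def augment_with_target(
    source_train: Sequence, target_train: Sequence, fraction: float, seed: int = 0
) -> List:
    """Add a random `fraction` of the target topic's training data to the source data."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = round(len(target_train) * fraction)
    return list(source_train) + rng.sample(list(target_train), k)
```

Repeating this with different seeds for each fraction and each target topic yields the kind of averaged data points described above, with fraction 0.0 corresponding to the pure cross-topic setting.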

Conclusion

We have presented a new approach for searching a document collection for arguments relevant to a given topic. First, we introduced an annotation scheme that is applicable to the information-seeking perspective of argument search and general enough for use on heterogeneous texts. Second, by comparing crowdsourced annotations to expert annotations, we showed that our annotation scheme is reliably applicable by untrained annotators to arbitrary Web texts. Third, we presented a new corpus, including over 25,000 instances over eight topics, that allows for cross-topic experiments using heterogeneous text types. The annotations as well as the source code for downloading the sentences from the Wayback Machine are made available for future work. Fourth, we conducted in- and cross-topic experiments and showed that our attention-based model better generalizes to unknown topics than vanilla BiLSTM models.