Question Answering through Transfer Learning from Large Fine-grained Supervision Data

Sewon Min, Minjoon Seo, Hannaneh Hajishirzi

Introduction

Question answering (QA) is a long-standing challenge in NLP, and the community has introduced several paradigms and datasets for the task over the past few years. These paradigms differ from each other in the type of questions and answers and in the size of the training data, which ranges from a few hundred to millions of examples.

We are particularly interested in the context-aware QA paradigm, where the answer to each question can be obtained by referring to its accompanying context (paragraph or a list of sentences). Under this setting, the two most notable types of supervisions are coarse sentence-level and fine-grained span-level. In sentence-level QA, the task is to pick sentences that are most relevant to the question among a list of candidates Yang et al. (2015). In span-level QA, the task is to locate the smallest span in the given paragraph that answers the question Rajpurkar et al. (2016).

In this paper, we address coarser, sentence-level QA by applying a standard transfer learning technique to a model trained on a large, span-supervised QA dataset. (The borderline between transfer learning and domain adaptation is often ambiguous Mou et al. (2016). We choose the term “transfer learning” because we also adapt the pretrained QA model to an entirely different task, RTE.) We demonstrate that the target task benefits not only from the scale of the source dataset but also from the capability of the fine-grained span supervision to better learn syntactic and lexical information.

For the source dataset, we pretrain on SQuAD Rajpurkar et al. (2016), a recently released, span-supervised QA dataset. For the source and target models, we adopt BiDAF Seo et al. (2017), one of the top-performing models on the dataset’s leaderboard. For the target datasets, we evaluate on two recent QA datasets, WikiQA Yang et al. (2015) and SemEval 2016 (Task 3A) Nakov et al. (2016), which possess sufficiently different characteristics from those of SQuAD. Our results show an 8% improvement on WikiQA and a 1% improvement on SemEval. In addition, we report state-of-the-art results on recognizing textual entailment (RTE) in SICK Marelli et al. (2014) with a similar transfer learning procedure.

Background and Data

Modern machine learning models, especially deep neural networks, often benefit significantly from transfer learning. In computer vision, deep convolutional neural networks trained on a large image classification dataset such as ImageNet Deng et al. (2009) have proved to be useful for initializing models on other vision tasks, such as object detection Zeiler and Fergus (2014). In natural language processing, domain adaptation has traditionally been an important topic for syntactic parsing McClosky et al. (2010) and named entity recognition Chiticariu et al. (2010), among others. With the popularity of distributed representations, pre-trained word embedding models such as word2vec Mikolov et al. (2013b, a) and GloVe Pennington et al. (2014) are also widely used for natural language tasks Karpathy and Fei-Fei (2015); Kumar et al. (2016). Instead of these, we initialize our models from a QA dataset and show how standard transfer learning can achieve state-of-the-art results on target QA datasets.

There have been several QA paradigms in NLP, which can be categorized by the context and supervision used to answer questions. This context can range from structured and confined knowledge bases Berant et al. (2013) to unstructured and unbounded natural language (e.g., documents on the web Voorhees and Tice (2000)) and to unstructured text that is restricted in size (e.g., a paragraph or multiple sentences Hermann et al. (2015)). Recent advances in neural question answering have led to numerous datasets and successful models in these paradigms Rajpurkar et al. (2016); Yang et al. (2015); Nguyen et al. (2016); Trischler et al. (2016). The answer types in these datasets fall largely into three categories: sentence-level, in-context span, and generation. In this paper, we focus on the first two and show that span-supervised models can better learn syntactic and lexical features. Among these datasets, we briefly describe the three QA datasets used for the experiments in this paper. We also describe an RTE dataset as an example of a non-QA task. Refer to Table 1 for examples from the datasets.

SQuAD

Rajpurkar et al. (2016) is a recent span-based QA dataset, containing 100k/10k train/dev examples. Each example is a pair of a context paragraph from Wikipedia and a question created by a human, and the answer is a span in the context.

SQuAD-T

is our modification of the SQuAD dataset to allow for sentence-selection QA (‘T’ for senTence). We split the context paragraph into sentences and formulate the task as classifying whether each sentence contains the answer. This enables us to make a fair comparison between pretraining with span-supervised and sentence-supervised QA datasets.
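The SQuAD-T construction described above can be sketched as follows. This is only an illustration under stated assumptions: the naive punctuation-based sentence splitter, the function names, and the rule that a sentence is positive only if it fully contains the answer span are all ours, not details given in the paper.

```python
import re
from typing import List, Tuple


def split_sentences(paragraph: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each sentence.

    Assumption: a sentence ends at '.', '!' or '?' followed by
    whitespace or the end of the paragraph (a deliberately naive rule).
    """
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", paragraph):
        end = match.end()
        spans.append((start, end))
        # Skip whitespace so the next sentence starts at its first character.
        start = end
        while start < len(paragraph) and paragraph[start].isspace():
            start += 1
    if start < len(paragraph):
        # Trailing text without closing punctuation is kept as a sentence.
        spans.append((start, len(paragraph)))
    return spans


def to_sentence_examples(
    paragraph: str, answer_start: int, answer_text: str
) -> List[Tuple[str, int]]:
    """Turn one span-supervised example into per-sentence binary examples.

    Assumption: label is 1 only if the sentence fully contains the span.
    """
    answer_end = answer_start + len(answer_text)
    examples = []
    for start, end in split_sentences(paragraph):
        label = int(start <= answer_start and answer_end <= end)
        examples.append((paragraph[start:end], label))
    return examples


paragraph = (
    "The Amazon is in South America. "
    "It is the largest rainforest. "
    "Many species live there."
)
answer = "largest rainforest"
examples = to_sentence_examples(paragraph, paragraph.index(answer), answer)
labels = [label for _, label in examples]
```

With this rule, an answer that straddles a sentence boundary yields no positive sentence; an overlap-based rule would be an equally plausible choice.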

WikiQA

Yang et al. (2015) is a sentence-level QA dataset, containing 1.9k/0.3k train/dev answerable examples. Each example consists of a real user’s Bing query and a snippet of a Wikipedia article retrieved by Bing, containing 18.6 sentences on average. The task is to classify whether each sentence provides the answer to the query.

SemEval 2016 (Task 3A)

Nakov et al. (2016) is a sentence-level QA dataset, containing 1.8k/0.2k/0.3k train/dev/test examples. Each example consists of a community question by a user and 10 comments. The task is to classify whether each comment is relevant to the question.

SICK

Marelli et al. (2014) is a dataset for recognizing textual entailment (RTE), containing 4.5K/0.5K/5.0K train/dev/test examples. Each example consists of a hypothesis and a premise, and the goal is to determine if the premise is entailed by, contradicts, or is neutral to the hypothesis (hence a classification problem). We also report results on SICK to show that a span-supervised QA dataset can also be useful for non-QA datasets.

Model

Among the numerous models proposed for span-level QA tasks (Xiong et al., 2017; Wang and Jiang, 2017b), we adopt an open-sourced model, BiDAF (Seo et al., 2017), available at https://allenai.github.io/bi-att-flow.

BiDAF-T

refers to the modified version of BiDAF that makes it compatible with sentence-level QA (‘T’ for senTence). Code is available at https://github.com/shmsw25/qa-transfer. In this task, the inputs are a question ${\bm{q}}$ and a list of sentences, ${\bm{x}}_{1},\dots,{\bm{x}}_{T}$ , where $T$ is the number of sentences. Note that, unlike BiDAF, which outputs a single answer per example, here we need to output a $C$ -way classification for each $k$ -th sentence.

For WikiQA and SemEval 2016, the number of classes ( $C$ ) is $2$ , i.e., each sentence (or comment) is either relevant or not relevant. Since some of the metrics used for these datasets require a full ranking, we use the predicted probability of the “relevant” label to rank the sentences.
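To make the ranking step concrete, the sketch below sorts sentences by their predicted probability of the “relevant” label and computes mean average precision (MAP) over questions. The function names are illustrative, and skipping questions with no relevant sentence is our assumption rather than a detail stated in the paper.

```python
from typing import List, Sequence


def rank_by_relevance(probs: Sequence[float]) -> List[int]:
    """Sentence indices sorted by P(relevant), highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def average_precision(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Average of precision@rank taken at each relevant sentence."""
    hits = 0
    precisions = []
    for rank, idx in enumerate(rank_by_relevance(probs), start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    all_probs: Sequence[Sequence[float]], all_labels: Sequence[Sequence[int]]
) -> float:
    """MAP over questions; questions with no relevant sentence are skipped
    (an assumption made for this sketch)."""
    scores = [
        average_precision(p, l)
        for p, l in zip(all_probs, all_labels)
        if any(l)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy questions with made-up probabilities and gold labels.
probs = [[0.2, 0.9, 0.6], [0.9, 0.1, 0.8, 0.3]]
labels = [[0, 0, 1], [1, 0, 0, 1]]
map_score = mean_average_precision(probs, labels)
```

In the first toy question the only relevant sentence is ranked second, so its average precision is 0.5.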

Note that BiDAF-T can also be used for the RTE dataset, where we can consider the hypothesis as a question and the premise as a context sentence ( $T=1$ ), and classify each example into ‘entailment’, ‘neutral’, or ‘contradiction’ ( $C=3$ ).

Transfer Learning.

Transfer learning between the same model architectures is straightforward (strictly speaking, this is a domain adaptation scenario): we first initialize the weights of the target model with the weights of the source model pretrained on the source dataset, and then we further train (finetune) the target model on the target dataset. To transfer from BiDAF (on SQuAD) to BiDAF-T, we transfer all the weights of the identical modules, and initialize the new answer module in BiDAF-T with random values. For more training details, refer to Appendix A.
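The weight initialization described above can be sketched in a framework-agnostic way. The module names (`embedding`, `attention`, `span_output`, `sentence_classifier`), the flat-list weight representation, and the uniform initialization range are hypothetical placeholders chosen for illustration; they are not the names or the initializer used in the released code.

```python
import random
from typing import Dict, List

Weights = Dict[str, List[float]]


def init_target_from_source(
    source: Weights,
    target_sizes: Dict[str, int],
    seed: int = 0,
    scale: float = 0.1,
) -> Weights:
    """Build initial target weights from a pretrained source model.

    Modules that exist in both models with the same size are copied;
    every other target module (e.g. the new answer module) is drawn
    uniformly from [-scale, scale] (an assumed initializer).
    """
    rng = random.Random(seed)
    target: Weights = {}
    for name, size in target_sizes.items():
        if name in source and len(source[name]) == size:
            target[name] = list(source[name])  # copy, do not alias
        else:
            target[name] = [rng.uniform(-scale, scale) for _ in range(size)]
    return target


# Hypothetical source (span-level) weights and target (sentence-level) layout.
source_weights = {
    "embedding": [1.0, 2.0],
    "attention": [3.0],
    "span_output": [4.0, 5.0],  # span answer module, not reused
}
target_layout = {"embedding": 2, "attention": 1, "sentence_classifier": 3}
target_weights = init_target_from_source(source_weights, target_layout)
```

After this initialization, all target weights, copied or random, would be updated during finetuning on the target dataset.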

Experiments

Table 2 reports the state-of-the-art results of our transfer learning on WikiQA and SemEval-2016 and the performance of previous models as well as several ablations that use no pretraining or no finetuning. There are multiple interesting observations from Table 2 as follows:

(a)

If we only train the BiDAF-T model on the target datasets with no pretraining (first row of Table 2), the results are poor. This shows the importance of both pretraining and finetuning.

(b)

Pretraining on SQuAD and SQuAD-T with no finetuning (second and third row) achieves results close to the state-of-the-art in the WikiQA dataset, but not in SemEval-2016. Interestingly, our result on SemEval-2016 is not better than only training without transfer learning. We conjecture that this is due to the significant difference between the domain of SemEval-2016 and that of SQuAD, which are from community and Wikipedia, respectively.

(c)

Pretraining on SQuAD and SQuAD-T with finetuning (fourth and fifth row) significantly outperforms (by more than 5%) the highest-rank systems on WikiQA. It also outperforms the second ranking system in SemEval-2016 and is only 1% behind the first ranking system.

(d)

Transfer learning models achieve better results with pretraining on span-level supervision (SQuAD) than on coarser sentence-level supervision (SQuAD-T). We additionally perform the Mann-Whitney U Test and McNemar’s Test to show the statistical significance of the advantage of span-level pretraining over sentence-level pretraining. For WikiQA, the advantage is statistically significant with confidence levels of 97.1% and 99.6%, respectively. For SemEval, we obtain confidence levels of 97.8% and 99.9%, respectively.
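For readers unfamiliar with McNemar’s Test mentioned in (d), the following is a minimal sketch of its exact (binomial) two-sided form for comparing two models on the same questions. The discordant counts are made up for illustration, and both the per-question notion of correctness and the reading of a confidence level as one minus the p-value are our assumptions; the paper does not specify which variant of the test was used.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions model A answers correctly and model B does not.
    c: questions model B answers correctly and model A does not.
    Under the null hypothesis, each discordant question is equally
    likely to favor either model, so min(b, c) ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: A wins on 10 discordant questions, B wins on 2.
p_value = mcnemar_exact_p(10, 2)
confidence = 1.0 - p_value  # assumed reading of "confidence level"
```

Only the questions on which the two models disagree enter the test; questions that both models get right or both get wrong carry no information about which one is better.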

Finally, we also use an ensemble of 12 different training runs of the same BiDAF architecture, which obtains the state of the art on both datasets. This system outperforms the highest-ranking system on WikiQA by more than 8% and the best system on SemEval-2016 by 1% in every metric. It is important to note that, while we definitely benefit from the scale of SQuAD for transfer learning to the smaller WikiQA, the gap between SQuAD-T and SQuAD ( $>3\%$ ) is a clear sign that span supervision plays a significant role as well.

Varying the size of pretraining dataset.

We vary the size of the SQuAD dataset used during pretraining, and test on WikiQA with finetuning. Results are shown in Table 3. As expected, MAP on WikiQA drops as the size of SQuAD decreases. It is worth noting that pretraining on SQuAD-T (Table 2) yields 0.5 point lower MAP than pretraining on 50% of SQuAD. In other words, roughly speaking, span-level supervision data is worth more than twice its size in sentence-level supervision data for the purpose of pretraining. Also, even a small amount of fine-grained supervision data helps; pretraining with 12.5% of SQuAD gives an advantage of more than 7 points over no pretraining.

Analysis.

Figure 1 shows the latently-learned attention maps between the question and one of the context sentences from a WikiQA example in Table 1. The top map is pretrained on SQuAD-T (corresponding to SQuAD-T&Yes in Table 2) and the bottom map is pretrained on SQuAD (SQuAD&Yes). The more red the color, the higher the relevance between the words. There are two interesting observations here.

First, in the SQuAD-pretrained model (bottom), we see a high correspondence between the question’s airbus and the context’s aircraft and aerospace, but the SQuAD-T-pretrained model fails to learn such correspondence.

Second, we see that the attention map of the SQuAD-pretrained model is more sparse, indicating that it is able to more precisely localize correspondence between question and context words. In fact, we compare the sparsity of WikiQA test examples in SQuAD&Y and SQuAD-T&Y. Following Hurley and Rickard (2009), the sparsity of an attention map is defined by

More analyses including error analysis and more visualizations are shown in Appendix B.

Entailment Results.

In addition to the QA experiments, we also show that models trained on span-supervised QA can be useful for the textual entailment task (RTE). Table 4 shows the transfer learning results of BiDAF-T on the SICK dataset (Marelli et al., 2014), with various pretraining routines. Note that SNLI Bowman et al. (2015) is a similar task to SICK and is significantly larger (150K/10K/10K train/dev/test examples). Here we highlight three observations:

(a)

BiDAF-T pretrained on SQuAD outperforms that without any pretraining by 6% and that pretrained on SQuAD-T by 2%, which demonstrates that the transfer learning from large span-based QA gives a clear improvement.

(b)

Pretraining on SQuAD+SNLI outperforms pretraining on SNLI only. Given that SNLI is larger than SQuAD, the difference in their performance is a strong indicator that we are benefiting from not only the scale of SQuAD, but also the fine-grained supervision that it provides.

(c)

We outperform the previous state of the art by 2% with the ensemble of SQuAD+SNLI pretraining routine.

It is worth noting that Mou et al. (2016) also show improvement on SICK by pretraining on SNLI.

Conclusion

In this paper, we show state-of-the-art results on WikiQA and SemEval-2016 (Task 3A) as well as on an entailment task, SICK, outperforming previous results by 8%, 1%, and 2%, respectively. We show that question answering with sentence-level supervision can greatly benefit from standard transfer learning of a question answering model trained on large, span-level supervision data. We additionally show that such transfer learning can be applied to other NLP tasks such as textual entailment.

Acknowledgments

This research was supported by the NSF (IIS 1616112), NSF (III 1703166), Allen Institute for AI (66-9175), Allen Distinguished Investigator Award, and Google Research Faculty Award. We thank the anonymous reviewers for their helpful comments.

References

Appendix A Training details

Convergence.

For all settings, we train models until performance on the development set continues to decrease for 5k steps. Table 5 shows the median selected step for each setting.
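One way to read this stopping rule is as patience-based early stopping, sketched below. The interpretation that training stops once the development score has not improved on its best value for 5k steps, the evaluation interval, and the scores themselves are assumptions made for illustration.

```python
from typing import List, Tuple


def select_step(
    dev_scores: List[Tuple[int, float]], patience_steps: int = 5000
) -> Tuple[int, float]:
    """Pick the checkpoint with the best development score.

    dev_scores: (training step, dev score) pairs in increasing step order.
    Scanning stops once `patience_steps` have passed since the best score
    (assumed reading of the stopping criterion).
    """
    best_step, best_score = -1, float("-inf")
    for step, score in dev_scores:
        if score > best_score:
            best_step, best_score = step, score
        elif step - best_step >= patience_steps:
            break  # no improvement for the whole patience window
    return best_step, best_score


# Made-up development curve, evaluated every 1k steps.
curve = [
    (1000, 0.60), (2000, 0.65), (3000, 0.64), (4000, 0.63),
    (5000, 0.62), (6000, 0.61), (7000, 0.60), (8000, 0.70),
]
selected = select_step(curve)
```

In this toy curve the scan stops at step 7000, so the later peak at step 8000 is never seen; a larger patience window would trade extra training time for the chance of catching such late improvements.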

Appendix B More Analysis

We show more examples of attention maps in Figure 3. (Top) In the SQuAD-pretrained model, we see high correspondence between identical words in the question and the context, such as senator and john, but the SQuAD-T-pretrained model fails to learn such correspondence. (Bottom) In the SQuAD-pretrained model, we see high correspondence between stems in the question and stem in the context (left), as well as between plant in the question and plants in the context (right), but the SQuAD-T-pretrained model again fails to learn such correspondence.

Error Analysis.

Table 7 compares the answers of the SQuAD-T-pretrained model and the SQuAD-pretrained model on the WikiQA and SemEval-2016 examples from Table 1. On WikiQA, the SQuAD-T-pretrained model selects C2 instead of the ground-truth answer C1. On SemEval-2016, the SQuAD-pretrained model ranks C3 (bad comment) higher than C2 (good comment).

In addition, we randomly sampled 100 examples from WikiQA and SemEval-2016 and classified them into 6 categories (Table 6). In Table 8, we compare the performance of the SQuAD-T-pretrained model and the SQuAD-pretrained model on these WikiQA examples. It shows that span supervision clearly helps in answering questions in Categories 1 and 2, which are easier to answer, with most of the questions in Category 1 answered correctly. Similarly, we compare the performance of the model without pretraining and the SQuAD-pretrained model on the classified SemEval-2016 examples. It also shows that span supervision helps in answering questions that ask for information or for an opinion/recommendation.

Appendix C More Results

To better understand the SQuAD-T dataset, we show the performance of BiDAF-T with different training routines. We get MAP 89.46 and accuracy 85.34% with the SQuAD-trained BiDAF model, and MAP 90.18 and accuracy 84.69% with the SQuAD-T-trained BiDAF-T model. There is no large gap between the two models, as each paragraph of SQuAD-T has 5 sentences on average, which makes the classification problem easier than WikiQA.

SNLI.

Other, larger RTE datasets such as SNLI also benefit from transfer learning, although the improvement is smaller. We confirm the improvement by showing that the result on SNLI when pretraining on SQuAD with BiDAF is 82.6%, which is slightly higher than that of the model pretrained on SQuAD-T (81.6%). This, however, does not outperform the state of the art (88.8%) by Wang et al. (2017). This is mostly because BiDAF (or BiDAF-T) is a QA model, which is not designed for RTE tasks.