The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

Introduction

We introduce the Natural Language Decathlon (decaNLP) in order to explore models that generalize to many different kinds of NLP tasks. decaNLP encourages a single model to simultaneously optimize for ten tasks: question answering, machine translation, document summarization, semantic parsing, sentiment analysis, natural language inference, semantic role labeling, relation extraction, goal oriented dialogue, and pronoun resolution.

We frame all tasks as question answering [Kumar et al., 2016] by allowing task specification to take the form of a natural language question $q$ : all inputs have a context, question, and answer (Fig. 1). Traditionally, NLP examples have inputs $x$ and outputs $y$ , and the underlying task $t$ is provided through explicit modeling constraints. Meta-learning approaches include $t$ as additional input [Schmidhuber, 1987, Thrun and Pratt, 1998, Thrun, 1998, Vilalta and Drissi, 2002]. Our approach does not use a single representation for any $t$ , but instead uses natural language questions that provide descriptions for underlying tasks. This allows single models to effectively multitask and makes them more suitable as pretrained models for transfer learning and meta-learning: natural language questions allow a model to generalize to completely new tasks through different but related task descriptions.

We provide a set of baselines for decaNLP that combine the basics of sequence-to-sequence learning [Sutskever et al., 2014, Bahdanau et al., 2014, Luong et al., 2015b] with pointer networks [Vinyals et al., 2015, Merity et al., 2017, Gülçehre et al., 2016, Gu et al., 2016, Nallapati et al., 2016], advanced attention mechanisms [Xiong et al., 2017], attention networks [Vaswani et al., 2017], question answering [Seo et al., 2017, Xiong et al., 2018, Yu et al., 2016, Weissenborn et al., 2017], and curriculum learning [Bengio et al., 2009].

The multitask question answering network (MQAN) is designed for decaNLP and makes use of a novel dual coattention and multi-pointer-generator decoder to multitask across all tasks in decaNLP. Our results demonstrate that training the MQAN jointly on all tasks with the right anti-curriculum strategy can achieve performance comparable to that of ten separate MQANs, each trained separately. A MQAN pretrained on decaNLP shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. Though not explicitly designed for any one task, MQAN proves to be a strong model in the single-task setting as well, achieving state-of-the-art results on the semantic parsing component of decaNLP.

We have released code (https://github.com/salesforce/decaNLP) for obtaining and preprocessing datasets, training and evaluating models, and tracking progress through a leaderboard based on decathlon scores (decaScore). We hope that the combination of these resources will facilitate research in multitask learning, transfer learning, general embeddings and encoders, architecture search, zero-shot learning, general purpose question answering, meta-learning, and other related areas of NLP.

Tasks and Metrics

decaNLP consists of 10 publicly available datasets with examples cast as (question, context, answer) triplets as shown in Fig. 1.
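As an illustrative sketch of this casting, each example can be held in a small record with three text fields. The class name, the field names, the helper `answer_source`, and the wording of the sentiment question are assumptions made for illustration; they are not taken from the released data.

```python
from typing import NamedTuple


class Example(NamedTuple):
    """One decaNLP-style example; the field names are illustrative."""
    question: str
    context: str
    answer: str


# Hypothetical examples: the question wording is an assumption.
examples = [
    Example(
        question="Is this review positive or negative?",
        context="A touching and beautifully acted film.",
        answer="positive",
    ),
    Example(
        question="Who had given help? Susan or Joan?",
        context="Joan made sure to thank Susan for the help she had given.",
        answer="Susan",
    ),
]


def answer_source(ex):
    """Report where the answer string can be found verbatim."""
    if ex.answer in ex.context:
        return "context"
    if ex.answer in ex.question:
        return "question"
    return "neither"
```

In this sketch the class label of the sentiment example appears only in the question, while the pronoun-resolution answer can be copied from the context.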

Question Answering. Question answering (QA) models receive a question and a context that contains information necessary to output the desired answer. We use the Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] for this task. Contexts are paragraphs taken from the English Wikipedia, and answers are sequences of words copied from the context. SQuAD uses a normalized F1 (nF1) metric that strips out articles and punctuation.
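A minimal sketch of such a normalized F1 for one prediction and one reference is given below. The function names are hypothetical, and the exact normalization steps (lowercasing, removing ASCII punctuation, dropping the articles a, an, and the, collapsing whitespace) are an assumption modeled on common SQuAD-style evaluation rather than a restatement of the official script.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_f1(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When several reference answers exist for a question, a common choice is to take the maximum score over references; that step is omitted here.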

Machine Translation. Machine translation models receive an input document in a source language that must be translated into a target language. We use the 2016 English to German training data prepared for the International Workshop on Spoken Language Translation (IWSLT) [Cettolo et al., 2016]. Examples are from transcribed TED presentations that cover a wide variety of topics with conversational language. We evaluate with a corpus-level BLEU score [Papineni et al., 2002] on the 2013 and 2014 test sets as validation and test sets, respectively.

Summarization. Summarization models take in a document and output a summary of that document. Most important to recent progress in summarization was the transformation of the CNN/DailyMail (CNN/DM) corpus [Hermann et al., 2015] into a summarization dataset [Nallapati et al., 2016]. We include the non-anonymized version of this dataset in decaNLP. On average, these examples contain the longest documents in decaNLP and force models to balance extracting from the context with generation of novel, abstractive sequences of words. CNN/DM uses ROUGE-1, ROUGE-2, and ROUGE-L scores [Lin, 2004]. We average these three measures to compute an overall ROUGE score.

Natural Language Inference. Natural Language Inference (NLI) models receive two input sentences: a premise and a hypothesis. Models must then classify the inference relationship between the two as one of entailment, neutrality, or contradiction. We use the Multi-Genre Natural Language Inference Corpus (MNLI) [Williams et al., 2017] which provides training examples from multiple domains (transcribed speech, popular fiction, government reports) and test pairs from seen and unseen domains. MNLI uses an exact match (EM) score.

Sentiment Analysis. Sentiment analysis models are trained to classify the sentiment expressed by input text. The Stanford Sentiment Treebank (SST) [Socher et al., 2013] consists of movie reviews with the corresponding sentiment (positive, neutral, negative). We use the unparsed, binary version [Radford et al., 2017]. SST also uses an EM score.

Semantic Role Labeling. Semantic role labeling (SRL) models are given a sentence and predicate (typically a verb) and must determine ‘who did what to whom,’ ‘when,’ and ‘where’ [Johansson and Nugues, 2008]. We use an SRL dataset that treats the task as question answering, QA-SRL [He et al., 2015]. This dataset covers both news and Wikipedia domains, but we only use the latter in order to ensure that all data for decaNLP can be freely downloaded. We evaluate QA-SRL with the nF1 metric used for SQuAD.

Relation Extraction. Relation extraction systems take in a piece of unstructured text and the kind of relation that is to be extracted from that text. In this setting, it is important that models can report that the relation is not present and cannot be extracted. As with SRL, we use a dataset that maps relations to a set of questions so that relation extraction can be treated as question answering: QA-ZRE [Levy et al., 2017]. Evaluation of the dataset is designed to measure zero shot performance on new kinds of relations – the dataset is split so that relations seen at test time are unseen at train time. This kind of zero-shot relation extraction, framed as question answering, makes it possible to generalize to new relations. QA-ZRE uses a corpus-level F1 metric (cF1) in order to accurately account for unanswerable questions. This F1 metric defines precision as the true positive count divided by the number of times the system returned a non-null answer and recall as the true positive count divided by the number of instances that have an answer.
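The precision and recall definitions above can be sketched as follows. The function name is hypothetical, `None` is used to mark a null answer, and counting a true positive as an exact string match between a non-null prediction and a non-null reference is an assumption of this sketch.

```python
def corpus_f1(predictions, gold_answers):
    """Corpus-level F1 where None marks a null (unanswerable) answer.

    precision = true positives / number of non-null predictions
    recall    = true positives / number of instances that have an answer
    """
    true_pos = sum(
        1
        for pred, gold in zip(predictions, gold_answers)
        if pred is not None and gold is not None and pred == gold
    )
    num_predicted = sum(1 for pred in predictions if pred is not None)
    num_answerable = sum(1 for gold in gold_answers if gold is not None)
    precision = true_pos / num_predicted if num_predicted else 0.0
    recall = true_pos / num_answerable if num_answerable else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the counts are accumulated over the whole corpus before dividing, answering an unanswerable question lowers precision, while correctly abstaining affects neither count.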

Goal-Oriented Dialogue. Dialogue state tracking is a key component of goal-oriented dialogue systems. Based on user utterances, actions taken already, and conversation history, dialogue state trackers keep track of which predefined goals the user has for the dialogue system and which kinds of requests the user makes as the system and user interact turn-by-turn. We use the English Wizard of Oz (WOZ) restaurant reservation task [Wen et al., 2016], which comes with a predefined ontology of foods, dates, times, addresses, and other information that would help an agent make a reservation for a customer. WOZ is evaluated by turn-based dialogue state EM (dsEM) over the goals of the customers.

Semantic Parsing. SQL query generation is related to semantic parsing. Models based on the WikiSQL dataset [Zhong et al., 2017] translate natural language questions into structured SQL queries so that users can interact with a database in natural language. WikiSQL is evaluated by a logical form exact match (lfEM) to ensure that models do not obtain correct answers from incorrectly generated queries.

Pronoun Resolution. Our final task is based on Winograd schemas [Winograd, 1972], which require pronoun resolution: "Joan made sure to thank Susan for the help she had [given/received]. Who had [given/received] help? Susan or Joan?". We started with examples taken from the Winograd Schema Challenge [Levesque et al., 2011] and modified them to ensure that answers were a single word from the context. This modified Winograd Schema Challenge (MWSC) ensures that scores are neither inflated nor deflated by oddities in phrasing or inconsistencies between context, question, and answer. We evaluate with an EM score.

The Decathlon Score (decaScore). Models competing on decaNLP are evaluated using an additive combination of each task-specific metric. All metrics fall between $0$ and $100$, so that the decaScore naturally falls between $0$ and $1000$ for ten tasks. Using an additive combination avoids issues that arise from weighing different metrics. All metrics are case insensitive.
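The additive combination can be sketched in a few lines. The function name and the range check are assumptions of this sketch, and the scores in the usage note are placeholders rather than reported results.

```python
def deca_score(task_scores):
    """Sum per-task metrics, each expected to lie in [0, 100]."""
    for task, score in task_scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{task}: score {score} is outside [0, 100]")
    return sum(task_scores.values())
```

For example, `deca_score({"task_a": 50.0, "task_b": 25.5})` adds the two placeholder scores without any per-task weighting.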

Multitask Question Answering Network (MQAN)

Because every task is framed as question answering and trained jointly, we call our model a multitask question answering network (MQAN). Each example consists of a context, question, and answer as shown in Fig. 1. Many recent QA models assume the answer can be copied from the context [Wang and Jiang, 2017, Seo et al., 2017, Xiong et al., 2018], but this assumption does not hold for general question answering. The question often contains key information that constrains the answer space. Noting this, we extend the coattention of [Xiong et al., 2017] to enrich the representation of not only the input but also the question. Also, the pointer-mechanism of [See et al., 2017] is generalized into a hierarchical, multi-pointer-generator that makes it possible to copy directly from both the question and the context.

During training, the MQAN takes as input three sequences: a context $c$ with $l$ tokens, a question $q$ with $m$ tokens, and an answer $a$ with $n$ tokens. Each of these is represented by a matrix where the $i$th row of the matrix corresponds to a $d_{emb}$-dimensional embedding (such as word or character vectors) for the $i$th token in the sequence: $C\in\mathbb{R}^{l\times d_{emb}}$, $Q\in\mathbb{R}^{m\times d_{emb}}$, and $A\in\mathbb{R}^{n\times d_{emb}}$.

Answer Representations. During training, the decoder begins by projecting the answer embeddings onto a $d$ -dimensional space:

Multi-head Decoder Attention. We use self-attention [Vaswani et al., 2017] so that the decoder is aware of previous outputs (or a special initialization token in the case of no previous outputs) and attention over the context to prepare for the next output. The decoder operates step by step. To prevent the decoder from seeing future time-steps during training, appropriate entries of $XY^{\top}$ are set to a large negative number prior to the softmax in (22). Refer to Appendix C for definitions of MultiHead attention and FFN, the residual feedforward network applied after MultiHead attention over the context.
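The masking of future time-steps can be illustrated with a small sketch. It assumes a square matrix of raw similarity scores in which row $i$ holds the scores of decoder position $i$ against every position $j$, and it normalizes each row; the function names, the row-wise layout, and the constant `-1e9` are assumptions of this sketch rather than details of the model.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores, neg=-1e9):
    """Softmax per row after hiding positions j > i from position i."""
    masked = [
        [s if j <= i else neg for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]
```

Replacing a score with a large negative number drives its exponential to zero, so a masked position receives no attention weight while the remaining weights still sum to one.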

Context and Question Attention. This intermediate state is used to get attention weights $\alpha_{t}^{C}$ and $\alpha_{t}^{Q}$ to allow the decoder to focus on encoded information relevant to time step $t$ .

Recurrent Context State. Context representations are combined with these weights and fed through a feedforward network with tanh activation to form the recurrent context state and question state:

Multi-Pointer-Generator. Our model must be able to generate tokens that are not in the context or the question. We give it access to $v$ additional vocabulary tokens. We obtain distributions over tokens in the context, question, and this external vocabulary, respectively, as

We train using a token-level negative log-likelihood loss over all time-steps: $\mathcal{L}=-\sum_{t}^{T}\log p(a_{t})$ .
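As a rough illustration of how three token distributions might be combined into one and then scored with the loss above, consider the sketch below. The mixture form, the scalar switches `gamma` and `lam`, and both function names are assumptions made for this sketch rather than a restatement of the model's equations.

```python
import math


def mix_distributions(p_vocab, p_context, p_question, gamma, lam):
    """Combine three token distributions with two switches in [0, 1].

    gamma weighs the external vocabulary against copying; lam weighs the
    context pointer against the question pointer. Each input is a dict
    that maps a token to its probability.
    """
    tokens = set(p_vocab) | set(p_context) | set(p_question)
    mixed = {}
    for tok in tokens:
        copy_prob = (lam * p_context.get(tok, 0.0)
                     + (1.0 - lam) * p_question.get(tok, 0.0))
        mixed[tok] = gamma * p_vocab.get(tok, 0.0) + (1.0 - gamma) * copy_prob
    return mixed


def sequence_nll(step_probs):
    """Negative log-likelihood -sum_t log p(a_t) of the target tokens."""
    return -sum(math.log(p) for p in step_probs)
```

Because each input sums to one and the switches form convex combinations, the mixed distribution also sums to one, so `sequence_nll` can be applied directly to the probabilities it assigns to the target tokens.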

Experiments and Analysis

Multi-Pointer-Generator and task identification. At each step, the MQAN decides between three choices: generating from the vocabulary, pointing to the question, and pointing to the context. While the model is not trained with explicit supervision for these decisions, it learns to switch between the three options. Fig. 3 presents statistics of how often the final model chooses each option. For SQuAD, QA-SRL, and WikiSQL, the model mostly copies from the context. This is intuitive because all tokens necessary to correctly answer questions from these datasets are contained in the context. The model also usually copies from the context for CNN/DM because answer summaries consist mostly of words from the context with few words generated from outside the context in between.

For SST, MNLI, and MWSC, the model prefers the question pointer because the question contains the tokens for acceptable classes. Because the model learns to use the question pointer in this way, it can do zero-shot classification as discussed in Section 4. For IWSLT and WOZ, the model prefers generating from the vocabulary because German words and dialogue state fields are rarely in the context. The model also avoids copying for QA-ZRE; half of those examples require generating ‘unanswerable’ from the external vocabulary.

Sampled answers confirm that the model does not confuse tasks. German words are only ever output during translation from English to German. The model never outputs anything but ‘positive’ and ‘negative’ for sentiment analysis.

Adaptation to new tasks. MQANs trained on decaNLP learn to generalize beyond the specific domains for any one task while also learning representations that make learning completely new tasks easier. For two new tasks (English-to-Czech translation and named entity recognition, NER), fine-tuning a MQAN trained on decaNLP requires fewer iterations and reaches a better final performance than training from a random initialization (Fig. 4). For the translation experiment, we use the IWSLT 2016 En $\rightarrow$ Cs dataset, and for NER we use OntoNotes 5.0 [Hovy et al., 2006].

Zero-shot domain adaptation for text classification. Because MNLI is included in decaNLP, it is possible to adapt to the related Stanford Natural Language Inference Corpus (SNLI) [Bowman et al., 2015]. Fine-tuning a MQAN pretrained on decaNLP achieves an $87\%$ exact match score, which is a $2$ point increase over training from a random initialization and $2$ points from the state of the art [Kim et al., 2018]. More remarkably, without any fine-tuning on SNLI, a MQAN pretrained on decaNLP still achieves an exact match score of $62\%$ .

Because decaNLP contains SST, it can also perform well on other binary sentiment classification tasks. On Amazon and Yelp reviews [Kotzias et al., 2015], a MQAN pretrained on decaNLP achieves exact match scores of $82.1\%$ and $80.8\%$ , respectively, without any fine-tuning.

Additionally, rephrasing questions by replacing the tokens for the training labels positive/negative with happy/angry or supportive/unsupportive at inference time leads to only small degradation in performance. The model’s reliance on the question pointer for SST (see Fig. 3) allows it to copy different, but related class labels with little confusion. This suggests these multitask models are more robust to slight variations in questions and tasks and can generalize to new and unseen classes.

These results demonstrate that models trained on decaNLP have the potential to simultaneously generalize to out-of-domain contexts and questions for multiple tasks and even adapt to unseen classes for text classification. This kind of zero-shot domain adaptation in both input and output spaces suggests that the breadth of tasks in decaNLP encourages generalization beyond what can be achieved by training for a single task.

Related Work

This section contains work related to aspects of decaNLP and MQAN that are not task-specific. See Appendix A for work related to each individual task.

Most success in making use of the relatedness between natural language tasks stems from transfer learning. Word2Vec [Mikolov et al., 2013a, b], skip-thought vectors [Kiros et al., 2015] and GloVe [Pennington et al., 2014] yield pretrained embeddings that capture useful information about natural language. The embeddings [Collobert and Weston, 2008, Collobert et al., 2011], intermediate representations [Peters et al., 2018], and weights of language models can be transferred to similar architectures [Ramachandran et al., 2017] and classification tasks [Howard and Ruder, 2018]. Intermediate representations from supervised machine translation models improve performance on question answering, sentiment analysis, and natural language inference [McCann et al., 2017]. Question answering datasets support each other as well as entailment tasks [Min et al., 2017], and high-resource machine translation can support low-resource machine translation [Zoph et al., 2016]. This work shows that the combination of MQAN and decaNLP makes it possible to transfer an entire end-to-end model that can be adapted for any NLP task cast as question answering.

Multitask Learning in NLP.

Unified architectures have arisen for chunking, POS tagging, NER, and SRL [Collobert et al., 2011] as well as dependency parsing, semantic relatedness, and natural language inference [Hashimoto et al., 2016]. Multitask learning over different machine translation language pairs can enable zero-shot translation [Johnson et al., 2017], and sequence-to-sequence architectures can be used to multitask across translation, parsing, and image captioning [Luong et al., 2015a] using varying numbers of encoders and decoders. These tasks can also be learned with image classification and speech recognition with careful modularization [Kaiser et al., 2017], and the success of this approach extends to visual and textual question answering [Xiong et al., 2016]. Learning such modularization can further mitigate interference between tasks [Ruder et al., 2017].

More generally, multitask learning has been successful when models are able to capitalize on relatedness amongst tasks while mitigating interference from dissimilarities [Caruana, 1997]. When tasks are sufficiently related, they can provide an inductive bias [Mitchell, 1980] that forces models to learn more generally useful representations. By unifying tasks under a single perspective, it is possible to explore these relationships [Wang et al., 2018, Poliak et al., 2018a, b].

MQAN trained on decaNLP is the first single model to achieve reasonable performance on such a wide variety of complex NLP tasks without task-specific modules or parameters, with little evidence of catastrophic interference, and without parse trees, chunks, POS tags, or other intermediate representations. This sets the foundation for general question answering models.

Optimization and Catastrophic Forgetting.

Multitask learning presents a set of optimization problems that extend beyond the NLP setting. Multi-objective optimization [Deb, 2014] naturally connects to multitask learning and typically involves querying a decision-maker who weighs different objectives. Much effort has gone into mitigating catastrophic forgetting [McCloskey and Cohen, 1989, Ratcliff, 1990, Kemker et al., 2017] by penalizing the norm of parameters when training on a new task [Kirkpatrick et al., 2017], the norm of the difference between parameters for previously learned tasks during parameter updates [Hashimoto et al., 2016], incrementally matching modes [Lee et al., 2017], rehearsing on old tasks [Robins, 1995], using adaptive memory buffers [Gepperth and Karaoguz, 2016], finding task-specific paths through networks [Fernando et al., 2017], and packing new tasks into already trained networks [Mallya and Lazebnik, 2017].

For each task, MQAN performs nearly as well as, or better than, it does in the single-task setting despite being capped at the same number of trainable parameters in both. A collection of MQANs trained for each task individually would use far more trainable parameters than a single MQAN trained jointly on decaNLP. This suggests that MQAN uses trainable parameters more efficiently in the multitask setting by learning to pack or share parameters in a way that limits catastrophic forgetting.

Meta-Learning

Meta-learning attempts to train models on a variety of tasks so that they can easily learn new tasks [Thrun and Pratt, 1998, Thrun, 1998, Vilalta and Drissi, 2002]. Past work has shown how to learn rules for learning [Schmidhuber, 1987, Bengio et al., 1992], train meta-agents that control parameter updates [Hochreiter et al., 2001, Andrychowicz et al., 2016], augment models with special memory mechanisms [Santoro et al., 2016, Schmidhuber, 1992], and maximize the degree to which models can learn new tasks [Finn et al., 2017].

Conclusion

We introduced the Natural Language Decathlon (decaNLP), a new benchmark for measuring the performance of NLP models across ten tasks that appear disparate until unified as question answering. We presented MQAN, a model for general question answering that uses a multi-pointer-generator decoder to capitalize on questions as natural language descriptions of tasks. Despite not having any task-specific modules, we trained MQAN on all decaNLP tasks jointly, and we showed that anti-curriculum learning gave further improvements. After training on decaNLP, MQAN exhibits transfer learning and zero-shot capabilities. When used as pretrained weights, MQAN improved performance on new tasks. It also demonstrated zero-shot domain adaptation capabilities on text classification from new domains. We hope the decaNLP benchmark, experimental results, and publicly available code encourage further research into general models for NLP.

Appendix A Further Related Work

Question Answering. Early success on the SQuAD dataset exploited the fact that all answers can be found verbatim in the context. State-of-the-art models point to start and end tokens in the document. This allowed deterministic answer extraction to overtake sequential token generation. This quirk of the dataset does not hold for question answering in general, so recent models for SQuAD are not necessarily general question answering models. While datasets like TriviaQA and NewsQA could also represent question answering, SQuAD is particularly interesting because the human-level performance of SQuAD models in the single-task setting depends on a quirk that does not generalize to all forms of question answering. Including SQuAD in decaNLP challenges models to integrate techniques learned from a single-task approach into a more general approach while evaluation remains grounded in the document. Many of the alternatives are larger and can be used as additional training data or incorporated into future iterations of decaNLP once the more well-understood SQuAD dataset has been mastered in the multitask setting.

Machine Translation. Until recently, the standard approach trained recurrent models with attention on a single source-target language pair. Models that use only convolution or attention have shown that recurrence is not essential for the task, but recurrence can contribute to the strongest models. While training these models on many source and target languages at the same time remains difficult, limiting models to one source language and many target languages or vice versa can lead to strong performance when resources are limited or null.

While much larger corpora and many other language pairs exist, the English-German IWSLT dataset provides the same order of magnitude of training data as the other tasks in decaNLP. We encourage the use of larger corpora or multiple language pairs to improve performance, but we did not want to skew the first iteration of the challenge too far towards machine translation.

Summarization. Recent approaches combine recurrent neural networks with pointer networks to generate output sequences that contain key words copied from the document. Coverage mechanisms and temporal attention mitigate problems with redundancy in long summaries. Reinforcement learning has pushed performance using common summarization metrics as well as alternative metrics that transfer knowledge from another task.

While new corpora like NEWSROOM are even larger, CNN/DM remains the current standard benchmark, so we include it in decaNLP and encourage augmentation with datasets like NEWSROOM.

Natural Language Inference

NLI has a long history of playing roles in tasks like information retrieval and semantic parsing. The introduction of the Stanford Natural Language Inference Corpus (SNLI) [Bowman et al., 2015] spurred a new wave of interest in NLI, its connections to other tasks, and general sentence representations. The most successful approaches make use of attentional models that match and align words in the premise to those in the hypothesis, but recent non-attentional models designed to extract useful sentence representations have nearly closed the gap.

The dataset we use, the Multi-Genre Natural Language Inference Corpus (MNLI) [Williams et al., 2017], is the successor to SNLI. Recent approaches to MNLI use methods developed on SNLI and have even pointed out the similarities between models for question answering and NLI.

Sentiment Analysis

Because SST came with parse trees for every example, some approaches use all of the sub-tree labels by modeling trees explicitly as in the original paper. Others use sub-tree labels implicitly, and still others do not use the sub-trees at all. This suggests that while the many sub-tree labels might facilitate learning, they are not necessary to train state-of-the-art models.

Semantic Role Labeling

Traditionally, models have made use of syntactic parsing information, but recent methods have demonstrated that it is not necessary to use syntactic information as additional input. State-of-the-art approaches treat SRL as a tagging problem, make use of that specific structure to constrain decoding, and mix recurrent and self-attentive layers.

Because QA-SRL treats SRL as question answering, it abstracts away the many task-specific constraints of treating SRL as a tagging problem with hand-designed verb-specific roles or grammars. This preserves much of the structure extracted by prior formulations while also allowing models to extract structure that is not syntax-based.

Relation Extraction

QA-ZRE introduced a similar idea for relation extraction. By associating natural language questions with relations, this dataset reduces relation extraction to question answering. This makes it possible to use question answering models in place of more traditional relation extraction models that often do not make use of the linguistic similarities amongst relations. This in turn makes it possible to do zero-shot relation extraction.

Goal-Oriented Dialogue

Dialogue state tracking requires a system to estimate a user's goals and requests given the dialogue context, and it plays a crucial role in goal-oriented dialogue systems. Most models use a structured approach, with the most recent work making use of both global and local modules to learn representations of the user utterance and previous system actions.

Semantic Parsing

Similarly, recent approaches to the semantic parsing WikiSQL dataset have made use of structured approaches that move from coarse sketches of the input to fine-grained structured outputs, directly employ a type system, or make use of dependency graphs.

Appendix B Preprocessing and Training Details

All data is lowercased as is common for SQuAD, IWSLT, CNN/DM, and WikiSQL; casing is irrelevant for the evaluation of the other tasks. We use the RevTok tokenizer (https://github.com/jekbradbury/revtok) to provide simple, yet completely reversible tokenization, which is crucial for detokenizing generated sequences for evaluation. The generative vocabulary in Eq. 11 contains the most frequent $50000$ words in the combined training sets for all tasks in decaNLP. SQuAD examples with context longer than $400$ tokens were excluded during training, and CNN/DM examples had contexts truncated to $400$ tokens during training and evaluation. Only MNLI examples with a label other than ‘-’ were included during training and evaluation as is standard. For WOZ, we train turn-by-turn to predict the change in belief state including user requests as an additional slot, but during evaluation we only consider the cumulative belief state as is standard. We do not perform any form of beam search or otherwise refine greedily sampled outputs for any tasks to avoid task-specific post-processing where possible.
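Some of these preprocessing steps can be sketched in a few lines. The function names are hypothetical, and whitespace splitting stands in for the RevTok tokenizer, so this sketch illustrates the filtering, truncation, and vocabulary steps only and does not reproduce reversible tokenization.

```python
from collections import Counter

MAX_CONTEXT_TOKENS = 400  # context limit stated above
VOCAB_SIZE = 50000        # generative vocabulary size stated above


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for RevTok)."""
    return text.lower().split()


def build_vocab(token_sequences, max_size=VOCAB_SIZE):
    """Keep the max_size most frequent tokens across all sequences."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    return [tok for tok, _ in counts.most_common(max_size)]


def keep_for_training(context_tokens, max_len=MAX_CONTEXT_TOKENS):
    """SQuAD-style filter: drop examples whose context is too long."""
    return len(context_tokens) <= max_len


def truncate(context_tokens, max_len=MAX_CONTEXT_TOKENS):
    """CNN/DM-style truncation: keep only the first max_len tokens."""
    return context_tokens[:max_len]
```

Filtering removes an over-long example entirely, whereas truncation keeps the example and discards the tail of its context, which matches the different treatment of SQuAD and CNN/DM described above.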

Appendix C Multitask Question Answering Network (MQAN) Encoder

Recall from Section 3 that the encoder has three input sequences during training: a context $c$ with $l$ tokens, a question $q$ with $m$ tokens, and an answer $a$ with $n$ tokens. Each of these is represented by a matrix where the $i$th row of the matrix corresponds to a $d_{emb}$-dimensional embedding (such as word or character vectors) for the $i$th token in the sequence: $C\in\mathbb{R}^{l\times d_{emb}}$, $Q\in\mathbb{R}^{m\times d_{emb}}$, and $A\in\mathbb{R}^{n\times d_{emb}}$.

Independent Encoding. A linear layer projects input matrices onto a common $d$ -dimensional space.

Let ${\rm softmax}\left(X\right)$ denote a column-wise softmax that normalizes each column of the matrix $X$ to have entries that sum to $1$ . We obtain alignments by normalizing dot-product similarity scores between representations of one sequence with those of the other:

Dual Coattention. These alignments are used to compute weighted summations of the information from one sequence that is relevant to a single token in the other.
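A small sketch may help make the alignment and summation steps concrete. It assumes that each sequence is a list of $d$-dimensional rows, that alignments are a column-wise softmax over dot-product scores, and that a summary for each token of one sequence is the alignment-weighted sum of the rows of the other. All function names and the exact matrix products are assumptions of this sketch.

```python
import math


def matmul(x, y):
    """Plain matrix product for row-major lists of lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def column_softmax(m):
    """Normalize each column of m so that its entries sum to 1."""
    out = [[0.0] * len(m[0]) for _ in m]
    for j in range(len(m[0])):
        col = [row[j] for row in m]
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        for i in range(len(m)):
            out[i][j] = exps[i] / total
    return out


def coattention_summaries(context, question):
    """Return alignments (l x m, m x l) and summaries (m x d, l x d)."""
    s_cq = column_softmax(matmul(context, transpose(question)))  # l x m
    s_qc = column_softmax(matmul(question, transpose(context)))  # m x l
    c_sum = matmul(transpose(s_cq), context)   # one context summary per question token
    q_sum = matmul(transpose(s_qc), question)  # one question summary per context token
    return s_cq, s_qc, c_sum, q_sum
```

Since every column of an alignment matrix sums to one, each summary row is a convex combination of the rows of the other sequence.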

The coattended representations use the same weights to transfer information gained from alignments back to the original sequences:

Compression. In order to compress information from dual coattention back to the more manageable dimension $d$ , we concatenate all four prior representations for each sequence along the last dimension and feed into separate BiLSTMs:

Self-Attention. Next, we use multi-head, scaled dot-product attention to capture long distance dependencies within each sequence. Let

All linear transformations in Eq. (23) project to $d$ so that multi-head attention representations maintain dimensionality:

Final Encoding. Finally, we aggregate all of this information across time with two BiLSTMs:

These matrices are given to the decoder to generate the answer.