A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue

Mihail Eric, Christopher D. Manning

Introduction

Effective task-oriented dialogue systems are becoming important as society progresses toward using voice for interacting with devices and performing everyday tasks such as scheduling. To that end, research efforts have focused on using machine learning methods to train agents using dialogue corpora. One line of work has tackled the problem using partially observable Markov decision processes and reinforcement learning with carefully designed action spaces . However, the large, hand-designed action and state spaces make this class of models brittle and unscalable, and in practice most deployed dialogue systems remain hand-written, rule-based systems.

Recently, neural network models have achieved success on a variety of natural language processing tasks , due to their ability to implicitly learn powerful distributed representations from data in an end-to-end trainable fashion. This paper extends recent work examining the utility of distributed state representations for task-oriented dialogue agents, without providing rules or manually tuning features.

One prominent line of recent neural dialogue work has continued to build systems with modularly-connected representation, belief state, and generation components [2016b]. These models must learn to explicitly represent user intent through intermediate supervision, and hence suffer from not being truly end-to-end trainable. Other work stores dialogue context in a memory module and repeatedly queries and reasons about this context to select an adequate system response . While reasoning over memory is appealing, these models simply choose among a set of utterances rather than generating text and also must have temporal dialogue features explicitly encoded.

However, the present literature lacks results for now standard sequence-to-sequence architectures, and we aim to fill this gap by building increasingly complex models of text generation, starting with a vanilla sequence-to-sequence recurrent architecture. The result is a simple, intuitive, and highly competitive model, which outperforms the more complex model of ?) by 6.9%. Our contributions are as follows: 1) We perform a systematic, empirical analysis of increasingly complex sequence-to-sequence models for task-oriented dialogue, and 2) we develop a recurrent neural dialogue architecture augmented with an attention-based copy mechanism that is able to significantly outperform more complex models on a variety of metrics on realistic data.

Architecture

We use neural encoder-decoder architectures to frame dialogue as a sequence-to-sequence learning problem. Given a dialogue between a user (u) and a system (s), we represent the dialogue utterances as {(u1,s1),(u2,s2),,(uk,sk)}\{(u_{1},s_{1}),(u_{2},s_{2}),\ldots,(u_{k},s_{k})\} where kk denotes the number of turns in the dialogue. At the i\mboxthi{{}^{\mbox{th}}} turn of the dialogue, we encode the aggregated dialogue context composed of the tokens of (u1,s1,,si1,ui)(u_{1},s_{1},\ldots,s_{i-1},u_{i}). Letting x1,,xmx_{1},\ldots,x_{m} denote these tokens, we first embed these tokens using a trained embedding function ϕemb\phi^{emb} that maps each token to a fixed-dimensional vector. These mappings are fed into the encoder to produce context-sensitive hidden representations h1,,hmh_{1},\ldots,h_{m}.

where W1W_{1}, W2W_{2}, UU, and vv are trainable parameters of the model and oto_{t} represents the logits over the tokens of the output vocabulary VV. During training, the next token yty_{t} is predicted so as to maximize the log-likelihood of the correct output sequence given the input sequence.

In our best performing model, we augment the inputs to the encoder by adding entity type features. Classes present in the knowledge base of the dataset, namely the 8 distinct entity types referred to in Table 1, are encoded as one-hot vectors. Whenever a token of a certain entity type is seen during encoding, we append the appropriate one-hot vector to the token’s word embedding before it is fed into the recurrent cell. These type features improve generalization to novel entities by allowing the model to hone in on positions with particularly relevant bits of dialogue context during its soft attention and copying. Other cited work using the DSTC2 dataset implement similar mechanisms whereby they expand the feature representations of candidate system responses based on whether there is lexical entity class matching with provided dialogue context. In these works, such features are referred to as match features.

All of our architectures use an LSTM cell as the recurrent unit with a bias of 1 added to the forget gate in the style of .

Experiments

For our experiments, we used dialogues extracted from the Dialogue State Tracking Challenge 2 (DSTC2) , a restaurant reservation system dataset. While the goal of the original challenge was building a system for inferring dialogue state, for our study, we use the version of the data from ?), which ignores the dialogue state annotations, using only the raw text of the dialogues. The raw text includes user and system utterances as well as the API calls the system would make to the underlying KB in response to the user’s queries. Our model then aims to predict both these system utterances and API calls, each of which is regarded as a turn of the dialogue. We use the train/validation/test splits from this modified version of the dataset. The dataset is appealing for a number of reasons: 1) It is derived from a real-world system so it presents the kind of linguistic diversity and conversational abilities we would hope for in an effective dialogue agent. 2) It is grounded via an underlying knowledge base of restaurant entities and their attributes. 3) Previous results have been reported on it so we can directly compare our model performance. We include statistics of the dataset in Table 1.

2 Training

We trained using a cross-entropy loss and the Adam optimizer , applying dropout as a regularizer to the input and output of the LSTM. We identified hyperparameters by random search, evaluating on a held-out validation subset of the data. Dropout keep rates ranged from 0.75 to 0.95. We used word embeddings with size 300, and hidden layer and cell sizes were set to 353, identified through our search. We applied gradient clipping with a clip-value of 10 to avoid gradient explosions during training. The attention, output parameters, word embeddings, and LSTM weights were randomly initialized from a uniform unit-scaled distribution in the style of .

3 Metrics

Evaluation of dialogue systems is known to be difficult . We employ several metrics for assessing specific aspects of our model, drawn from previous work:

Per-Response Accuracy: Bordes and Weston report a per-turn response accuracy, which tests their model’s ability to select the system response at a certain timestep. Their system does a multiclass classification over a predefined candidate set of responses, which was created by aggregating all system responses seen in the training, validation, and test sets. Our model actually generates each individual token of the response, and we consider a prediction to be correct only if every token of the model output matches the corresponding token in the gold response. Evaluating using this metric on our model is therefore significantly more stringent a test than for the model of Bordes and Weston .

Per-Dialogue Accuracy: Bordes and Weston also report a per-dialogue accuracy, which assesses their model’s ability to produce every system response of the dialogue correctly. We calculate a similar value of dialogue accuracy, though again our model generates every token of every response.

BLEU: We use the BLEU metric, commonly employed in evaluating machine translation systems , which has also been used in past literature for evaluating dialogue systems . We calculate average BLEU score over all responses generated by the system, and primarily report these scores to gauge our model’s ability to accurately generate the language patterns seen in DSTC2.

Entity F1: Each system response in the test data defines a gold set of entities. To compute an entity F1, we micro-average over the entire set of system dialogue responses. This metric evaluates the model’s ability to generate relevant entities from the underlying knowledge base and to capture the semantics of the user-initiated dialogue flow.

Our experiments show that sometimes our model generates a response to a given input that is perfectly reasonable, but is penalized because our evaluation metrics involve direct comparison to the gold system output. For example, given a user request for an australian restaurant, the gold system output is you are looking for an australian restaurant right? whereas our system outputs what part of town do you have in mind?, which is a more directed follow-up intended to narrow down the search space of candidate restaurants the system should propose. This issue, which recurs with evaluation of dialogue or other generative systems, could be alleviated through more forgiving evaluation procedures based on beam search decoding.

4 Results

In Table 2, we present the results of our models compared to the reported performance of the best performing model of , which is a variant of an end-to-end memory network . Their model is referred to as MemNN. We also include the model of , referred to as GMemNN, and the model of , referred to as QRN, which currently is the state-of-the-art. In the table, Seq2Seq refers to our vanilla encoder-decoder architecture with (1), (2), and (3) LSTM layers respectively. +Attn refers to a 1-layer Seq2Seq with attention-based decoding. +Copy refers to +Attn with our copy-mechanism added. +EntType refers to +Copy with entity class features added to encoder inputs.

We see that a 1-layer vanilla encoder-decoder is already able to significantly outperform MemNN in both per-response and per-dialogue accuracies, despite our more stringent setting. Adding layers to Seq2Seq leads to a drop in performance, suggesting an overly powerful model for the small dataset size. Adding an attention-based decoding to the vanilla model increases BLEU although per-response and per-dialogue accuracies suffer a bit. Adding our attention-based entity copy mechanism achieves substantial increases in per-response accuracies and entity F1. Adding entity class features to +Copy achieves our best-performing model, in terms of per-response accuracy and entity F1. This model achieves a 6.9% increase in per-response accuracy on DSTC2 over MemNN, including +1.5% per-dialogue accuracy, and is on par with the performance of GMemNN, including beating its per-dialogue accuracy. It also achieves the highest entity F1.

Discussion and Conclusion

We have iteratively built out a class of neural models for task-oriented dialogue that is able to outperform other more intricately designed neural architectures on a number of metrics. The model incorporates in a simple way abilities that we believe are essential to building good task-oriented dialogue agents, namely maintaining dialogue state and being able to extract and use relevant entities in its responses, without requiring intermediate supervision of dialogue state or belief tracker modules. Other dialogue models tested on DSTC2 that are more performant in per-response accuracy are equipped with sufficiently more complex mechanisms than our model. Taking inspiration from and , GMemNN uses an explicit memory module as well as an adaptive gating mechanism to learn to attend to relevant memories. The QRN model employs a variant of a recurrent unit that is intended to handle local and global interactions in sequential data. We contrast with these works by bootstrapping off of more empirically accepted Seq2Seq architectures through intuitive extensions, while still producing highly competitive models.

We attribute the large gains in per-response accuracy and entity F1 demonstrated by our +EntType to its ability to pick out the relevant KB entities from the dialogue context fed into the encoder. In Figure 1, we see the attention-based copy weights of the model, indicating that the model is able to learn the relevant entities it should focus on in the input context. The powerful language modelling abilities of the Seq2Seq backbone allow smooth integration of these extracted entities into both system-generated API calls and natural language responses as shown in the figure.

The appeal of our model comes from the simplicity and effectiveness of framing system response generation as a sequence-to-sequence mapping with a soft copy mechanism over relevant context. Unlike the task-oriented dialogue agents of Wen et. al [2016b], our architecture does not explicitly model belief states or KB slot-value trackers, and we preserve full end-to-end-trainability. Further, in contrast to other referenced work on DSTC2, our model offers more linguistic versatility due to its generative nature while still remaining highly competitive and outperforming other models. Of course, this is not to deny the importance of dialogue agents which can more effectively use a knowledge base to answer user requests, and this remains a good avenue for further work. Nevertheless, we hope this simple and effective architecture can be a strong baseline for future research efforts on task-oriented dialogue.

Acknowledgments

The authors wish to thank the reviewers, Lakshmi Krishnan, Francois Charette, and He He for their valuable feedback and insights. We gratefully acknowledge the funding of the Ford Research and Innovation Center, under Grant No. 124344. The views expressed here are those of the authors and do not necessarily represent or reflect the views of the Ford Research and Innovation Center.

References