Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, Steve Young

Introduction

The natural language generation (NLG) component provides much of the persona of a spoken dialogue system (SDS), and it has a significant impact on a user’s impression of the system. As noted in ?), a good generator usually depends on several factors: adequacy, fluency, readability, and variation. Previous approaches attacked the NLG problem in different ways. The most common and widely adopted today is the rule-based (or template-based) approach [Cheyer and Guzzoni, 2007, Mirkovic and Cavedon, 2011]. Despite its robustness and adequacy, the frequent repetition of identical, rather stilted, output forms makes talking to a rule-based generator rather tedious. Furthermore, the approach does not easily scale to large open-domain systems [Young et al., 2013, Gašić et al., 2014, Henderson et al., 2014]. Hence approaches to NLG are required that can be readily scaled whilst meeting the above requirements.

The trainable generator approach exemplified by the HALOGEN [Langkilde and Knight, 1998] and SPaRKy [Stent et al., 2004] systems provides a possible way forward. These systems include specific trainable modules within the generation framework to allow the model to adapt to different domains [Walker et al., 2007], or to reproduce a certain style [Mairesse and Walker, 2011]. However, these approaches still require a handcrafted generator to define the decision space within which statistics can be used for optimisation. The resulting utterances are therefore constrained by the predefined syntax and any domain-specific colloquial responses must be added manually.

More recently, corpus-based methods [Oh and Rudnicky, 2000, Mairesse and Young, 2014, Wen et al., 2015] have received attention as access to data becomes increasingly available. By defining a flexible learning structure, corpus-based methods aim to learn generation directly from data by adopting an over-generation and reranking paradigm [Oh and Rudnicky, 2000], in which final responses are obtained by reranking a set of candidates generated from a stochastic generator. Learning from data directly enables the system to mimic human responses more naturally, removes the dependency on predefined rules, and makes the system easier to build and extend to other domains. As detailed in Sections 2 and 3, however, these existing approaches have weaknesses in the areas of training data efficiency, accuracy and naturalness.

This paper presents a statistical NLG based on a semantically controlled Long Short-term Memory (LSTM) recurrent network. It can learn from unaligned data by jointly optimising its sentence planning and surface realisation components using a simple cross entropy training criterion without any heuristics, and good quality language variation is obtained simply by randomly sampling the network outputs. We start in Section 3 by defining the framework of the proposed neural language generator. We introduce the semantically controlled LSTM (SC-LSTM) cell in Section 3.1, then we discuss how to extend it to a deep structure in Section 3.2. As suggested in ?), a backward reranker is introduced in Section 3.3 to improve fluency. Training and decoding details are described in Sections 3.4 and 3.5.

Section 4 presents an evaluation of the proposed approach in the context of an application providing information about venues in the San Francisco area. In Section 4.2, we first show that our generator outperforms several baselines using objective metrics. We experimented on two different ontologies to show not only that good performance can be achieved across domains, but also how easy and quick the development lifecycle is. In order to assess the subjective performance of our system, a quality test and a pairwise preference test are presented in Section 4.3. The results show that our approach can produce high quality utterances that are considered to be more natural and are preferred to previous approaches. We conclude with a brief summary and future work in Section 5.

Related Work

Conventional approaches to NLG typically divide the task into sentence planning and surface realisation. Sentence planning maps input semantic symbols into an intermediary form representing the utterance, e.g. a tree-like or template structure, then surface realisation converts the intermediate structure into the final text [Walker et al., 2002, Stent et al., 2004]. Although statistical sentence planning has been explored previously, for example, generating the most likely context-free derivations given a corpus [Belz, 2008] or maximising the expected reward using reinforcement learning [Rieser and Lemon, 2010], these methods still rely on a pre-existing, handcrafted generator. To minimise handcrafting, ?) proposed learning sentence planning rules directly from a corpus of utterances labelled with Rhetorical Structure Theory (RST) discourse relations [Mann and Thompson, 1988]. However, the required corpus labelling is expensive and additional handcrafting is still needed to map the sentence plan to a valid syntactic form.

As noted above, corpus-based NLG aims at learning generation decisions from data with minimal dependence on rules and heuristics. A pioneer in this direction is the class-based n-gram language model (LM) approach proposed by ?). ?) later addressed some of the limitations of class-based LMs in the over-generation phase by using a modified generator based on a syntactic dependency tree. ?) proposed a phrase-based NLG system based on factored LMs that can learn from a semantically aligned corpus. Although active learning [Mairesse et al., 2010] was also proposed to allow learning online directly from users, the requirement for human annotated alignments limits the scalability of the system. Another similar approach casts NLG as a template extraction and matching problem, e.g., ?) train a set of log-linear models to make a series of generation decisions to choose the most suitable template for realisation. ?) later show that the outputs can be further improved by an SVM reranker making them comparable to human-authored texts. However, template matching approaches do not generalise well to unseen combinations of semantic elements.

The use of neural network-based (NN) approaches to NLG is relatively unexplored. The stock reporter system ANA by ?) is perhaps the first NN-based generator, although generation was only done at the phrase level. Recent advances in recurrent neural network-based language models (RNNLM) [Mikolov et al., 2010, Mikolov et al., 2011a] have demonstrated the value of distributed representations and the ability to model arbitrarily long dependencies. ?) describes a simple variant of the RNN that can generate meaningful sentences by learning from a character-level corpus. More recently, ?) have demonstrated that an RNNLM is capable of generating image descriptions by conditioning the network model on a pre-trained convolutional image feature representation. ?) also describes interesting work using RNNs to generate Chinese poetry. A forerunner of the system presented here is described in ?), in which a forward RNN generator, a CNN reranker, and a backward RNN reranker are trained jointly to generate utterances. Although the system was easy to train and extend to other domains, a heuristic gate control was needed to ensure that all of the attribute-value information in the system’s response was accurately captured by the generated utterance. Furthermore, the handling of unusual slot-value pairs by the CNN reranker was rather arbitrary. In contrast, the LSTM-based system described in this paper can deal with these problems automatically by learning the control of gates and surface realisation jointly.

Training an RNN with long range dependencies is difficult because of the vanishing gradient problem [Bengio et al., 1994]. ?) mitigated this problem by replacing the sigmoid activation in the RNN recurrent connection with a self-recurrent memory block and a set of multiplication gates to mimic the read, write, and reset operations in digital computers. The resulting architecture is dubbed the Long Short-term Memory (LSTM) network. It has been shown to be effective in a variety of tasks, such as speech recognition [Graves et al., 2013b], handwriting recognition [Graves et al., 2009], spoken language understanding [Yao et al., 2014], and machine translation [Sutskever et al., 2014]. Recent work by ?) has demonstrated that an NN structure augmented with a carefully designed memory block and differentiable read/write operations can learn to mimic computer programs. Moreover, the ability to train deep networks provides a more sophisticated way of exploiting relations between labels and features, therefore making the prediction more accurate [Hinton et al., 2012]. By extending an LSTM network to be deep in both space and time, ?) shows that the resulting network can be used to synthesise handwriting indistinguishable from that of a human.

The Neural Language Generator

After updating Equation 6 by Equation 9, the output distribution is formed by applying a softmax function $g$, and the distribution is sampled to obtain the next token.
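The softmax-and-sample step can be illustrated with a minimal sketch. It assumes the network has already produced one raw score per vocabulary entry. The function names, the toy vocabulary and the scores are hypothetical and serve only to show the mechanics, not the authors' implementation.

```python
import math
import random

def softmax(logits):
    """Convert raw output-layer scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab, rng=random):
    """Draw one token at random according to the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and scores, for illustration only.
vocab = ["SLOT_NAME", "is", "a", "restaurant", "</s>"]
logits = [2.0, 0.5, 0.1, 1.2, -1.0]
probs = softmax(logits)
token = sample_next_token(logits, vocab, random.Random(0))
```

Because the token is sampled rather than chosen greedily, repeated runs from the same state can yield different words, which is the source of the language variation mentioned in the introduction.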

3.2 The Deep Structure

Deep Neural Networks (DNN) enable increased discrimination by learning multiple layers of features, and represent the state-of-the-art for many applications such as speech recognition [Graves et al., 2013b] and natural language processing [Collobert and Weston, 2008]. The neural language generator proposed in this paper can be easily extended to be deep in both space and time by stacking multiple LSTM cells on top of the original structure. As shown in Figure 2, skip connections are applied to the inputs of all hidden layers as well as between all hidden layers and the outputs [Graves, 2013]. This reduces the number of processing steps between the bottom of the network and the top, and therefore mitigates the vanishing gradient problem [Bengio et al., 1994] in the vertical direction. To allow all hidden layer information to influence the reading gate, Equation 7 is changed to

where $l$ is the hidden layer index and $\alpha_{l}$ is a layer-wise constant. Since the network tends to overfit when the structure becomes more complex, the dropout technique [Srivastava et al., 2014] is used to regularise the network. As suggested in [Zaremba et al., 2014], dropout was only applied to the non-recurrent connections, as shown in Figure 2. It was not applied to word embeddings since pre-trained word vectors were used.
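As a rough sketch of how every hidden layer can influence the reading gate, the code below assumes that the gate is a sigmoid applied to an input-word projection plus an $\alpha_{l}$-weighted sum of per-layer hidden-state projections. That functional form, the weight names (`W_wr`, `W_hr_layers`) and all numeric values are assumptions made for illustration. Only the layer index $l$ and the layer-wise constant $\alpha_{l}$ come from the text.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def deep_reading_gate(W_wr, w_t, W_hr_layers, h_prev_layers, alphas):
    """Reading gate driven by the input word and every hidden layer.

    Assumed form (not taken from the paper's equations):
        r_t = sigmoid(W_wr w_t + sum_l alpha_l * W_hr[l] h_prev[l])
    """
    pre = matvec(W_wr, w_t)
    for W_hr, h_prev, alpha in zip(W_hr_layers, h_prev_layers, alphas):
        contrib = matvec(W_hr, h_prev)
        pre = [p + alpha * c for p, c in zip(pre, contrib)]
    return [sigmoid(p) for p in pre]

# Hypothetical toy values: 2-dim word vector, two hidden layers, 3-dim gate.
w_t = [1.0, 0.0]
W_wr = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
W_hr_layers = [
    [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2]],
    [[-0.1, 0.1], [0.3, 0.0], [0.0, 0.0]],
]
h_prev_layers = [[0.5, -0.5], [1.0, 1.0]]
alphas = [1.0, 0.5]
gate = deep_reading_gate(W_wr, w_t, W_hr_layers, h_prev_layers, alphas)
```

Each gate value lies strictly between 0 and 1, and the constants in `alphas` set how strongly each layer's previous state contributes.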

3.3 Backward LSTM reranking

3.4 Training

3.5 Decoding

The decoding procedure is split into two phases: (a) over-generation, and (b) reranking. In the over-generation phase, the forward generator, conditioned on the given DA, is used to sequentially generate utterances by randomly sampling the predicted next-word distributions. In the reranking phase, the cost of the backward reranker $F_{b}(\theta)$ is computed. Together with the cost $F_{f}(\theta)$ from the forward generator, the reranking score $R$ is computed as:
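The two decoding phases can be sketched as follows. Since the scoring formula is not reproduced here, the sketch assumes the simplest combination, $R = -(F_{f} + F_{b})$, so that a lower combined cost gives a higher rank. The actual score may include further terms. The sampler and the two cost functions are hypothetical stand-ins for the trained forward and backward networks.

```python
import random

def overgenerate(sample_utterance, n_candidates, rng):
    """Phase (a): draw candidate utterances by repeated random sampling."""
    return [sample_utterance(rng) for _ in range(n_candidates)]

def rerank(candidates, forward_cost, backward_cost, top_k=5):
    """Phase (b): score each distinct candidate and keep the best top_k.

    Assumed score: R = -(F_f + F_b); this is an illustration only.
    """
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    scored = [(-(forward_cost(u) + backward_cost(u)), u) for u in unique]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical stand-ins for the trained networks.
CANDIDATE_POOL = [
    "SLOT_NAME is a nice restaurant",
    "SLOT_NAME is a nice restaurant that serves SLOT_FOOD food",
    "SLOT_NAME serves SLOT_FOOD food",
]

def toy_sampler(rng):
    return rng.choice(CANDIDATE_POOL)

def toy_forward_cost(utterance):
    return 0.10 * len(utterance.split())

def toy_backward_cost(utterance):
    return 0.05 * len(utterance.split())

candidates = overgenerate(toy_sampler, n_candidates=20, rng=random.Random(0))
ranked = rerank(candidates, toy_forward_cost, toy_backward_cost, top_k=5)
```

In a real system the two cost functions would be the cross entropy costs of the forward generator and the backward reranker on each candidate. The toy costs here depend only on utterance length.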

Experiments

The target application for our generation system is a spoken dialogue system providing information about certain venues in San Francisco. In order to demonstrate the scalability of the proposed method and its performance in different domains, we tested on two domains that talk about restaurants and hotels respectively. There are 8 system dialogue act types, such as inform to present information about restaurants, confirm to check that a slot value has been recognised correctly, and reject to advise that the user’s constraints cannot be met. Each domain contains 12 attributes (slots), some of which are common to both domains while the others are domain specific. The detailed ontologies for the two domains are provided in Table 1. To form a training corpus for each domain, dialogues collected from a previous user trial [Gašić et al., 2015] of a statistical dialogue manager were randomly sampled and shown to workers recruited via the Amazon Mechanical Turk (AMT) service. Workers were shown each dialogue turn by turn and asked to enter an appropriate system response in natural English corresponding to each system DA. For each domain around 5K system utterances were collected from about 1K randomly sampled dialogues. Each categorical value was replaced by a token representing its slot, and slots that appeared multiple times in a DA were merged into one. After processing and grouping each utterance according to its delexicalised DA, we obtained 248 distinct DAs in the restaurant domain and 164 in the hotel domain. The average number of slots per DA is 2.25 in the restaurant domain and 1.95 in the hotel domain.
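The delexicalisation and grouping steps described above can be sketched in a few lines. The `SLOT_<NAME>` token format, the function names and the example utterance are assumptions made for illustration. The text states only that each value is replaced by a token for its slot and that repeated slots are merged.

```python
def delexicalise(utterance, slot_values):
    """Replace each slot value in the utterance with a token for its slot."""
    out = utterance
    # Replace longer values first so a short value cannot clobber a longer one.
    for slot, value in sorted(slot_values.items(), key=lambda kv: -len(kv[1])):
        out = out.replace(value, "SLOT_" + slot.upper())
    return out

def delexicalised_da(act_type, slot_value_pairs):
    """Build a grouping key for a DA, merging repeated slots into one."""
    slots = sorted({slot for slot, _ in slot_value_pairs})
    return act_type + "(" + ",".join(slots) + ")"

# Hypothetical example, for illustration only.
text = delexicalise(
    "red door cafe serves thai food",
    {"name": "red door cafe", "food": "thai"},
)
key = delexicalised_da(
    "inform",
    [("name", "red door cafe"), ("food", "thai"), ("food", "chinese")],
)
```

Here `text` becomes `SLOT_NAME serves SLOT_FOOD food` and `key` becomes `inform(food,name)`, so utterances that differ only in their slot values fall into the same group. Plain substring replacement is used for brevity; a real pipeline would need to respect word boundaries.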

4.2 Objective Evaluation

The last three blocks in Table 2 compare the proposed method with previous RNN approaches. The LSTM generally works better than the vanilla RNN due to its ability to model long range dependencies more efficiently. We also found that using gates, whether learned or heuristic, gave much lower slot error rates. As an aside, the ability of the SC-LSTM to learn gates is also exemplified in Figure 3. Finally, by combining the learned gate approach with the deep architecture (+deep), we obtained the best overall performance.

4.3 Human Evaluation

Table: Real user trial for utterance quality assessment on two metrics (rating out of 3), averaging over top 5 realisations. Statistical significance was computed using a two-tailed Student’s t-test, between +deep and all others.

Method    Informativeness  Naturalness
+deep     2.58             2.51
sc-lstm   2.59             2.50
rnn w/    2.53             2.42*
classlm   2.46**           2.45

* $p<0.05$, ** $p<0.005$

Table: Pairwise preference test among four systems. Statistical significance was computed using a two-tailed binomial test.

Pref.%    classlm  rnn w/   sc-lstm  +deep
classlm   -        46.0     40.9**   37.7**
rnn w/    54.0     -        43.0     35.7*
sc-lstm   59.1*    57       -        47.6
+deep     62.3**   64.3**   52.4     -

* $p<0.05$, ** $p<0.005$

Since automatic metrics may not consistently agree with human perception [Stent et al., 2005], human testing is needed to assess subjective quality. To do this, a set of judges were recruited using AMT. For each task, two systems among the four (classlm, rnn w/, sc-lstm, and +deep) were randomly selected to generate utterances from a set of newly sampled dialogues in the restaurant domain. In order to evaluate system performance in the presence of language variation, each system generated 5 different surface realisations for each input DA and the human judges were asked to score each of them in terms of informativeness and naturalness (rating out of 3), and also asked to state a preference between the two. Here informativeness is defined as whether the utterance contains all the information specified in the DA, and naturalness is defined as whether the utterance could plausibly have been produced by a human. In order to decrease the amount of information presented to the judges, utterances that appeared identically in both systems were filtered out. We tested 1000 DAs in total, and after filtering there were approximately 1300 generated utterances per system.

The quality test table shows the quality assessments, which exhibit the same general trend as the objective results. The SC-LSTM systems (sc-lstm & +deep) outperform the class-based LMs (classlm) and the RNN with heuristic gates (rnn w/) in both metrics. The deep SC-LSTM system (+deep) is significantly better than the class LMs (classlm) in terms of informativeness, and better than the RNN with heuristic gates (rnn w/) in terms of naturalness. The preference test results are shown in the pairwise preference table. Again, the SC-LSTM systems (sc-lstm & +deep) were significantly preferred by the judges. Moreover, the judges recorded a strong preference for the deep approach (+deep) compared to the others, though the preference is not significant when comparing to its shallow counterpart (sc-lstm). Example dialogue acts and their top-5 realisations are shown in Table 3.
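The pairwise preference results are reported with significance from a two-tailed binomial test. A minimal sketch of an exact test of this kind is given below, under the null hypothesis that judges have no preference (each system is chosen with probability 0.5). The function name and the counts in the example are hypothetical, since the number of judgements per system pair is not reported here.

```python
from math import comb

def binomial_two_tailed_p(k, n):
    """Exact two-tailed p-value for k preferences out of n under p = 0.5.

    The null distribution is symmetric, so the two-tailed value is
    twice the probability of a result at least as extreme as observed.
    """
    tail = max(k, n - k)
    upper = sum(comb(n, i) for i in range(tail, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper)

# Hypothetical counts: one system preferred in 62 of 100 comparisons.
p_value = binomial_two_tailed_p(62, 100)
significant = p_value < 0.05
```

With these illustrative counts the preference is significant at the 0.05 level. The same percentage over fewer comparisons may not be, which is why the counts matter as much as the percentages.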

Conclusion and Future Work

In this paper we have proposed a neural network-based generator that is capable of generating natural, linguistically varied responses based on a deep, semantically controlled LSTM architecture which we call SC-LSTM. The generator can be trained on unaligned data by jointly optimising its sentence planning and surface realisation components using a simple cross entropy criterion without any heuristics or handcrafting. We found that the SC-LSTM model achieved the best overall performance on two objective metrics across two different domains. An evaluation by human judges also confirmed that the SC-LSTM approach is strongly preferred to a variety of existing methods.

This work represents a line of research that tries to model the NLG problem in a unified architecture, whereby the entire model is end-to-end trainable from data. We contend that this approach can produce more natural responses which are more similar to colloquial styles found in human conversations. Another key potential advantage of neural network-based language processing is the implicit use of distributed representations for words and a single compact parameter encoding of the information to be conveyed. This suggests that it should be possible to further condition the generator on some dialogue features such as discourse information or social cues during the conversation. Furthermore, adopting a corpus-based regime enables domain scalability and multilingual NLG to be achieved with less cost and a shorter lifecycle. These latter aspects will be the focus of our future work in this area.

Acknowledgements

Tsung-Hsien Wen and David Vandyke are supported by Toshiba Research Europe Ltd, Cambridge Research Laboratory.