Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings

Ondřej Dušek, Filip Jurčíček

Introduction

In spoken dialogue systems (SDS), the task of natural language generation (NLG) is to convert a meaning representation (MR) produced by the dialogue manager into one or more sentences in a natural language. It is traditionally divided into two subtasks: sentence planning, which decides on the overall sentence structure, and surface realization, determining the exact word forms and linearizing the structure into a string [Reiter and Dale, 2000]. While some generators keep this division and use a two-step pipeline [Walker et al., 2001, Rieser et al., 2010, Dethlefs et al., 2013], others apply a joint model for both tasks [Wong and Mooney, 2007, Konstas and Lapata, 2013].

We present a new, conceptually simple NLG system for SDS that is able to operate in both modes: it either produces natural language strings or generates deep syntax dependency trees, which are subsequently processed by an external surface realizer [Dušek et al., 2015]. This allows us to show a direct comparison of two-step generation, where sentence planning and surface realization are separated, with a joint, one-step approach.

Our generator is based on the sequence-to-sequence (seq2seq) generation technique [Cho et al., 2014, Sutskever et al., 2014], combined with beam search and an n-best list reranker to suppress irrelevant information in the outputs. Unlike most previous NLG systems for SDS (e.g., [Stent et al., 2004, Raux et al., 2005, Mairesse et al., 2010]), it is trainable from unaligned pairs of MR and sentences alone. We experiment with using much less training data than recent systems based on recurrent neural networks (RNN) [Wen et al., 2015b, Mei et al., 2015], and we find that our generator learns successfully to produce both strings and deep syntax trees on the BAGEL restaurant information dataset [Mairesse et al., 2010]. It is able to surpass n-gram-based scores achieved previously by Dušek and Jurčíček (2015), offering a simpler setup and more relevant outputs.

We introduce the generation setting in Section 2 and describe our generator architecture in Section 3. Section 4 details our experiments, Section 5 analyzes the results. We summarize related work in Section 6 and offer conclusions in Section 7.

Generator Setting

The inputs to our generator are dialogue acts (DA) [Young et al., 2010] representing an action, such as inform or request, along with one or more attributes (slots) and their values. Our generator operates in two modes, producing either deep syntax trees [Dušek et al., 2012] or natural language strings (see Fig. 1). The first mode corresponds to the sentence planning NLG stage as it decides the syntactic shape of the output sentence; the resulting deep syntax tree involves content words (lemmas) and their syntactic form (formemes, purple in Fig. 1). The trees are linearized to strings using a surface realizer from the TectoMT translation system [Dušek et al., 2015]. The second generator mode joins sentence planning and surface realization into one step, producing natural language sentences directly.

Both modes offer their advantages: The two-step mode simplifies generation by abstracting away from complex surface syntax and morphology, which can be handled by a handcrafted, domain-independent module to ensure grammatical correctness at all times [Dušek and Jurčíček, 2015], and the joint mode does not need to model structure explicitly and avoids accumulating errors along the pipeline [Konstas and Lapata, 2013].

The Seq2seq Generation Model

Our generator is based on the seq2seq approach [Cho et al., 2014, Sutskever et al., 2014], a type of encoder-decoder RNN architecture operating on variable-length sequences of tokens. We address the necessary conversion of input DA and output trees/sentences into sequences in Section 3.1 and then describe the main seq2seq component in Section 3.2. It is supplemented by a reranker, as explained in Section 3.3.

We represent DA, deep syntax trees, and sentences as sequences of tokens to enable their usage in the sequence-based RNN components of our generator (see Sections 3.2 and 3.3). Each token is represented by its embedding – a vector of floating-point numbers [Bengio et al., 2003].

To form a sequence representation of a DA, we create a triple of the structure “DA type, slot, value” for each slot in the DA and concatenate the triples (see Fig. 3). The deep syntax tree output from the seq2seq generator is represented in a bracketed notation similar to the one used by [?] (see Fig. 2). The inputs to the reranker are always a sequence of tokens; structure is disregarded in trees, resulting in a list of lemma-formeme pairs (see Fig. 2).
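The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the slot names, and the example values are hypothetical and are not taken from the released generator.

```python
from typing import List, Tuple


def da_to_tokens(da_type: str, slots: List[Tuple[str, str]]) -> List[str]:
    """Flatten a dialogue act into concatenated (DA type, slot, value) triples."""
    tokens: List[str] = []
    for slot, value in slots:
        # One triple per slot; the DA type is repeated in every triple.
        tokens.extend([da_type, slot, value])
    return tokens


def tree_to_reranker_tokens(nodes: List[Tuple[str, str]]) -> List[str]:
    """Discard tree structure and keep a flat list of lemma-formeme pairs."""
    tokens: List[str] = []
    for lemma, formeme in nodes:
        tokens.extend([lemma, formeme])
    return tokens


# Hypothetical example DA: inform(name=X, food=Italian)
example = da_to_tokens("inform", [("name", "X"), ("food", "Italian")])
```

Each token of the resulting sequence would then be mapped to its embedding before entering the RNN.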

3.2 Seq2seq Generator

Our seq2seq generator with attention [Bahdanau et al., 2015, see Fig. 3] starts with the encoder stage, which uses an RNN to encode an input sequence $\mathbf{x}=\{x_{1},\dots,x_{n}\}$ into a sequence of encoder outputs and hidden states $\mathbf{h}=\{h_{1},\dots,h_{n}\}$, where $h_{t}=\mathrm{lstm}(x_{t},h_{t-1})$, a non-linear function represented by the long short-term memory (LSTM) cell [Graves, 2013]. We use the implementation in the TensorFlow framework [Abadi et al., 2015].

The decoder stage then uses the hidden states to generate a sequence $\mathbf{y}=\{y_{1},\dots,y_{m}\}$ with a second LSTM-based RNN. The probability of each output token is defined as:

$$p(y_{t}\mid y_{1},\dots,y_{t-1},\mathbf{x})=\mathrm{softmax}\big((s_{t}\circ c_{t})W_{Y}\big)$$

Here, $s_{t}$ is the decoder state where $s_{0}=h_{n}$ and $s_{t}=\mathrm{lstm}((y_{t-1}\circ c_{t})W_{S},s_{t-1})$, i.e., the decoder is initialized by the last hidden state and uses the previous output token at each step. $W_{Y}$ and $W_{S}$ are learned linear projection matrices and “$\circ$” denotes concatenation. $c_{t}$ is the context vector, a weighted sum of the encoder hidden states $c_{t}=\sum_{i=1}^{n}\alpha_{ti}h_{i}$, where $\alpha_{ti}$ corresponds to an alignment model, represented by a feed-forward network with a single $\tanh$ hidden layer.
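The context vector computation can be illustrated with a small pure-Python sketch. It assumes, following the usual formulation of Bahdanau et al. (2015), that the attention weights are obtained by a softmax over unnormalized alignment scores; the scores themselves (the output of the tanh feed-forward network) are taken as given here, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(alignment_scores: Vector, encoder_states: List[Vector]) -> Vector:
    """Weighted sum of encoder hidden states: c_t = sum_i alpha_ti * h_i."""
    # Assumption: alpha_ti are softmax-normalized alignment scores.
    alphas = softmax(alignment_scores)
    dim = len(encoder_states[0])
    return [
        sum(alpha * state[d] for alpha, state in zip(alphas, encoder_states))
        for d in range(dim)
    ]


# Two encoder states with equal scores receive equal weight.
c_t = context_vector([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal scores the context vector is simply the average of the encoder states; a higher score for one position shifts the sum towards that state.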

On top of this basic seq2seq model, we implemented a simple beam search for decoding [Sutskever et al., 2014, Bahdanau et al., 2015]. It proceeds left-to-right and keeps track of log probabilities of top $n$ possible output sequences, expanding them one token at a time.
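A left-to-right beam search of this kind could look roughly as follows. This is a minimal sketch, not the released implementation: the decoder is abstracted into a hypothetical callback `next_token_probs` returning a next-token distribution for a given prefix, and the `<STOP>` token name is an assumption.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (log probability, tokens so far)


def beam_search(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    beam_size: int,
    max_len: int,
    end_token: str = "<STOP>",
) -> List[Hypothesis]:
    """Keep the beam_size most probable prefixes, expanding one token at a time."""
    beam: List[Hypothesis] = [(0.0, [])]
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for logp, tokens in beam:
            if tokens and tokens[-1] == end_token:
                candidates.append((logp, tokens))  # finished hypothesis, carry over
                continue
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((logp + math.log(prob), tokens + [token]))
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam_size]
        if all(tokens and tokens[-1] == end_token for _, tokens in beam):
            break
    return beam


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for the decoder's softmax output."""
    table = {
        (): {"X": 0.6, "the": 0.4},
        ("X",): {"is": 0.9, "<STOP>": 0.1},
        ("the",): {"restaurant": 1.0},
    }
    return table.get(tuple(prefix), {"<STOP>": 1.0})


nbest = beam_search(toy_model, beam_size=2, max_len=4)
```

The returned list is the n-best list sorted by log probability, which is the input a reranker would operate on.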

3.3 Reranker

To ensure that the output trees/strings correspond semantically to the input DA, we implemented a classifier to rerank the n-best beam search outputs and penalize those missing required information and/or adding irrelevant information. Similarly to [?], the classifier provides a binary decision for an output tree/string on the presence of all dialogue act types and slot-value combinations seen in the training data, producing a 1-hot vector. The input DA is converted to a similar 1-hot vector and the reranking penalty of the sentence is the Hamming distance between the two vectors (see Fig. 4). Weighted penalties for all sentences are subtracted from their n-best list log probabilities.
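The penalty scheme can be sketched as follows, assuming the classifier outputs are already available as binary vectors. Function names and the toy vectors are hypothetical; the weight of 100 echoes the reranking penalty reported in the experiments.

```python
from typing import List, Tuple


def hamming_distance(a: List[int], b: List[int]) -> int:
    """Number of positions in which two equally long binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rerank(
    nbest: List[Tuple[float, List[int]]],
    da_vector: List[int],
    penalty_weight: float,
) -> List[Tuple[float, int]]:
    """Subtract the weighted Hamming penalty from each log probability.

    nbest holds (log probability, classifier output vector) per hypothesis.
    Returns (adjusted score, original index) pairs, best first.
    """
    scored = [
        (logp - penalty_weight * hamming_distance(vector, da_vector), index)
        for index, (logp, vector) in enumerate(nbest)
    ]
    return sorted(scored, reverse=True)


# Toy example: the more probable hypothesis misses one required slot.
da_vector = [1, 1, 0]
nbest = [(-1.0, [1, 0, 0]), (-2.5, [1, 1, 0])]
ranked = rerank(nbest, da_vector, penalty_weight=100.0)
```

In this toy case the second hypothesis moves to the top because it covers the input DA exactly, even though its log probability is lower.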

We employ a similar architecture for the classifier as in our seq2seq generator encoder (see Section 3.2), with an RNN encoder operating on the output trees/strings and a single logistic layer for classification over the last encoder hidden state. Given an output sequence representing a string or a tree $\mathbf{y}=\{y_{1},\dots,y_{n}\}$ (cf. Section 3.1), the encoder again produces a sequence of hidden states $\mathbf{h}=\{h_{1},\dots,h_{n}\}$ where $h_{t}=\mathrm{lstm}(y_{t},h_{t-1})$. The output binary vector $o$ is computed as:

$$o_{i}=\big\lfloor\mathrm{sigmoid}\big((h_{n}W_{R}+b)_{i}\big)\big\rceil$$

Here, $W_{R}$ is a learned projection matrix, $b$ is a corresponding bias term, and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer.

Experiments

We perform our experiments on the BAGEL data set of Mairesse et al. (2010), which contains 202 DA from the restaurant information domain with two natural language paraphrases each, describing restaurant locations, price ranges, food types etc. Some properties such as restaurant names or phone numbers are delexicalized (replaced with “X” symbols) to avoid data sparsity; we adopt the delexicalization scenario used by [?] and [?]. Unlike [?], we do not use manually annotated alignment of slots and values in the input DA to target words and phrases and let the generator learn it from data, which simplifies training data preparation but makes our task harder. We lowercase the data and treat plural -s as separate tokens for generating into strings, and we apply automatic analysis from the Treex NLP toolkit [Popel and Žabokrtský, 2010] to obtain deep syntax trees for training tree-based generator setups. The input vocabulary size is around 45 (DA types, slots, and values added up) and output vocabulary sizes are around 170 for string generation and 180 for tree generation (45 formemes and 135 lemmas). Same as [?], we apply 10-fold cross-validation, with 181 training DA and 21 testing DA. In addition, we reserve 10 DA from the training set for validation. We treat the two paraphrases for the same DA as separate instances in the training set but use them together as two references to measure BLEU and NIST scores [Papineni et al., 2002, Doddington, 2002] on the validation and test sets.

To train our seq2seq generator, we use the Adam optimizer [Kingma and Ba, 2015] to minimize unweighted sequence cross-entropy. Based on a few preliminary experiments, the learning rate is set to 0.001, embedding size 50, LSTM cell size 128, and batch size 20. Reranking penalty for decoding is 100. We perform 10 runs with different random initialization of the network and up to 1,000 passes over the training data, validating after each pass and selecting the parameters that yield the highest BLEU score on the validation set. Training is terminated early if the top 10 validation BLEU scores achieved so far do not change for 100 passes. Neither beam search nor the reranker is used for validation.
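The early termination criterion can be sketched as a small helper that tracks the best validation scores seen so far. This is one possible reading of the criterion, not the authors' code; the class and attribute names are hypothetical, while the defaults (top 10, 100 passes) follow the values stated above.

```python
from typing import List


class TopKEarlyStopping:
    """Stop when the k best validation scores have not changed for `patience` passes."""

    def __init__(self, k: int = 10, patience: int = 100) -> None:
        self.k = k
        self.patience = patience
        self.top_scores: List[float] = []
        self.passes_without_change = 0

    def update(self, score: float) -> bool:
        """Record one validation score; return True if training should stop."""
        new_top = sorted(self.top_scores + [score], reverse=True)[: self.k]
        if new_top != self.top_scores:
            # The new score entered the top-k list, so reset the counter.
            self.top_scores = new_top
            self.passes_without_change = 0
        else:
            self.passes_without_change += 1
        return self.passes_without_change >= self.patience


# Small-scale demonstration with k=2 and patience=3.
stopper = TopKEarlyStopping(k=2, patience=3)
decisions = [stopper.update(bleu) for bleu in [0.5, 0.6, 0.4, 0.3, 0.2]]
```

In a training loop, `update` would be called once per pass with the validation BLEU score, and the parameters from the best-scoring pass would be kept.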

We use the Adam optimizer minimizing cross-entropy to train the reranker as well, with the same settings as for the seq2seq generator. We perform a single run of up to 100 passes over the data, and we also validate after each pass and select the parameters giving minimal Hamming distance on both validation and training set (the validation set is given 10 times more importance).

Results

The results of our experiments and a comparison to previous works on this dataset are shown in Table 1. We include BLEU and NIST scores and the number of semantic errors (incorrect, missing, and repeated information), which we assessed manually on a sample of 42 output sentences (outputs of two randomly selected cross-validation runs).

The outputs of direct string generation show that the models learn to produce fluent sentences in the domain style (the average sentence length is around 13 tokens); incoherent sentences are rare, but semantic errors are very frequent in the greedy search. Most errors involve confusion of semantically close items, e.g., Italian instead of French or riverside area instead of city centre (see Table 2); items occurring more frequently are preferred regardless of their relevance. The beam search brings a BLEU improvement but keeps most semantic errors in place. The reranker is able to reduce the number of semantic errors while increasing automatic scores considerably. Using a larger beam increases the effect of the reranker as expected, resulting in slightly improved outputs.

Models generating deep syntax trees are also able to learn the domain style, and they have virtually no problems producing valid trees. The generated sequences are longer, but have a very rigid structure, i.e., less uncertainty per generation step. The average output length is around 36 tokens in the generated sequence or 9 tree nodes; surface realizer outputs have a similar length as the sentences produced in direct string generation. The surface realizer works almost flawlessly on this limited domain [Dušek and Jurčíček, 2015], leaving the seq2seq generator as the major error source. The syntax-generating models tend to make different kinds of errors than the string-based models: Some outputs are valid trees but not entirely syntactically fluent; missing, incorrect, or repeated information is more frequent than a confusion of semantically similar items (see Table 2). Semantic error rates of greedy and beam-search decoding are lower than for string-based models, partly because confusion of two similar items counts as two errors. The beam search brings an increase in BLEU but also in the number of semantic errors. The reranker is able to reduce the number of errors and improve automatic scores slightly. A larger beam leads to a small BLEU decrease even though the sentences contain fewer errors; here, NIST reflects the situation more accurately.

A comparison of the two approaches goes in favor of the joint setup: Without the reranker, models generating trees produce fewer semantic errors and gain higher BLEU/NIST scores. However, with the reranker, the string-based model is able to reduce the number of semantic errors while producing outputs significantly better in terms of BLEU/NIST (the difference is statistically significant at 99% level according to pairwise bootstrap resampling test [Koehn, 2004]). In addition, the joint setup does not need an external surface realizer. The best results of both setups surpass the best results on this dataset using training data without manual alignments [Dušek and Jurčíček, 2015] in both automatic metrics (the BLEU/NIST differences are statistically significant) and the number of semantic errors.
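For readers unfamiliar with pairwise bootstrap resampling, the idea can be sketched as below. This is a simplified illustration, not the evaluation script used for the reported numbers: the `metric` callback stands in for a corpus-level score such as BLEU or NIST, the toy exact-match metric is a hypothetical placeholder, and the number of samples and the seed are arbitrary choices.

```python
import random
from typing import Callable, List, Sequence


def paired_bootstrap(
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    references: Sequence[str],
    metric: Callable[[List[str], List[str]], float],
    samples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A outscores system B."""
    rng = random.Random(seed)
    size = len(references)
    wins = 0
    for _ in range(samples):
        # Draw a test set of the same size with replacement.
        indices = [rng.randrange(size) for _ in range(size)]
        refs = [references[i] for i in indices]
        score_a = metric([outputs_a[i] for i in indices], refs)
        score_b = metric([outputs_b[i] for i in indices], refs)
        if score_a > score_b:
            wins += 1
    return wins / samples


def exact_match(outputs: List[str], refs: List[str]) -> float:
    """Toy placeholder metric: share of outputs identical to the reference."""
    return sum(o == r for o, r in zip(outputs, refs)) / len(refs)


refs = ["a", "b", "c", "d"]
win_rate = paired_bootstrap(refs, ["x", "x", "x", "x"], refs, exact_match, samples=200)
```

A win rate of at least 0.99 would correspond to significance at the 99% level under this test.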

Related Work

While most recent NLG systems attempt to learn generation from data, the choice of a particular approach – pipeline or joint – is often arbitrary and depends on system architecture or particular generation domain. Works using the pipeline approach in SDS tend to focus on sentence planning, improving a handcrafted generator [Walker et al., 2001, Stent et al., 2004, Paiva and Evans, 2005] or using perceptron-guided A* search [Dušek and Jurčíček, 2015]. Generators taking the joint approach employ various methods, e.g., factored language models [Mairesse et al., 2010], inverted parsing [Wong and Mooney, 2007, Konstas and Lapata, 2013], or a pipeline of discriminative classifiers [Angeli et al., 2010]. Unlike most previous NLG systems, our generator is trainable from unaligned pairs of MR and sentences alone.

Recent RNN-based generators are most similar to our work. [?] combined two RNN with a convolutional network reranker; [?] later replaced basic sigmoid cells with an LSTM. [?] present the only seq2seq-based NLG system known to us. We extend the previous works by generating deep syntax trees as well as strings and directly comparing pipeline and joint generation. In addition, we experiment with an order-of-magnitude smaller dataset than other RNN-based systems.

Conclusions and Future Work

We have presented a direct comparison of two-step generation via deep syntax trees with a direct generation into strings, both using the same NLG system based on the seq2seq approach. While both approaches offer decent performance, their outputs are quite different. The results show the direct approach as more favorable, with significantly higher n-gram based scores and a similar number of semantic errors in the output.

We also showed that our generator can learn to produce meaningful utterances using a much smaller amount of training data than what is typically used for RNN-based approaches. The resulting models had virtually no problems with producing fluent, coherent sentences or with generating valid structure of bracketed deep syntax trees. Our generator was able to surpass the best BLEU/NIST scores on the same dataset previously achieved by a perceptron-based generator of Dušek and Jurčíček (2015) while reducing the amount of irrelevant information on the output.

Our generator is released on GitHub at the following URL:

We intend to apply it to other datasets for a broader comparison, and we plan further improvements, such as enhancing the reranker or including a bidirectional encoder [Bahdanau et al., 2015, Mei et al., 2015, Jean et al., 2015] and sequence level training [Ranzato et al., 2015].

Acknowledgments

This work was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221 and core research funding, SVV project 260 333, and GAUK grant 2058214 of Charles University in Prague. It used language resources stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071). We thank our colleagues and the anonymous reviewers for helpful comments.

References