Multi-domain Neural Network Language Generation for Spoken Dialogue Systems

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Steve Young

Introduction

Modern Spoken Dialogue Systems (SDS) are typically developed according to a well-defined ontology, which provides a structured representation of the domain data that the dialogue system can talk about, such as searching for a restaurant or shopping for a laptop. Unlike conventional approaches employing a substantial amount of handcrafting for each individual processing component [Ward and Issar, 1994, Bohus and Rudnicky, 2009], statistical approaches to SDS promise a domain-scalable framework which requires a minimal amount of human intervention [Young et al., 2013]. ?) showed improved performance in belief tracking by training a general model and adapting it to specific domains. A similar benefit can be observed in ?), in which a Bayesian committee machine [Tresp, 2000] was used to model policy learning in a multi-domain SDS regime.

In past decades, adaptive NLG has been studied from linguistic perspectives, such as systems that learn to tailor output to user preferences [Walker et al., 2007], convey a specific personality trait [Mairesse and Walker, 2008, Mairesse and Walker, 2011], or align with their conversational partner [Isard et al., 2006]. Domain adaptation was first addressed by ?) using a generator based on the Lexical Functional Grammar (LFG) f-structures [Kaplan and Bresnan, 1982]. Although these approaches can model rich linguistic phenomena, they are not readily adaptable to data since they still require many handcrafted rules to define the search space. Recently, RNN-based language generation has been introduced [Wen et al., 2015a, Wen et al., 2015b]. This class of statistical generators can learn generation decisions directly from dialogue act (DA)-utterance pairs without any semantic annotations [Mairesse and Young, 2014] or hand-coded grammars [Langkilde and Knight, 1998, Walker et al., 2002]. Many existing adaptation approaches [Wen et al., 2013, Shi et al., 2015, Chen et al., 2015] can be directly applied due to the flexibility of the underlying RNN language model (RNNLM) architecture [Mikolov et al., 2010].

Discriminative training (DT) has been successfully used to train RNNs for various tasks. By optimising directly against the desired objective function such as BLEU score [Auli and Gao, 2014] or Word Error Rate [Kuo et al., 2002], the model can explore its output space and learn to discriminate between good and bad hypotheses. In this paper we show that DT can enable a generator to learn more efficiently when in-domain data is scarce.

The paper presents an incremental recipe for training multi-domain language generators based on a purely data-driven, RNN-based generation model. Following a review of related work in section 2, section 3 describes the detailed RNN generator architecture. The data counterfeiting approach for synthesising an in-domain dataset is introduced in section 4, where it is compared to the simple model fine-tuning approach. In section 5, we describe our proposed DT procedure for training natural language generators. Following a brief review of the data sets used in section 6, corpus-based evaluation results are presented in section 7. In order to assess the subjective performance of our system, a quality test and a pairwise preference test are presented in section 8. The results show that the proposed adaptation recipe improves not only the objective scores but also the user’s perceived quality of the system. We conclude with a brief summary in section 9.

Related Work

Domain adaptation problems arise when we have a sufficient amount of labelled data in one domain (the source domain), but have little or no labelled data in another related domain (the target domain). Domain adaptability for real world speech and language applications is especially important because both language usage and the topics of interest are constantly evolving. Historically, domain adaptation has been less well studied in the NLG community. The most relevant work was done by ?). They showed that an LFG f-structure based generator could yield better performance when trained on in-domain sentences paired with pseudo parse tree inputs generated from a state-of-the-art, but out-of-domain parser. The SPoT-based generator proposed by ?) has the potential to address domain adaptation problems. However, their published work has focused on tailoring output to user preferences [Walker et al., 2007] and mimicking personality traits [Mairesse and Walker, 2011]. ?) proposed a Reinforcement Learning (RL) framework in which policy and NLG components can be jointly optimised and adapted based on online user feedback. In contrast, ?) proposed using active learning to mitigate the data sparsity problem when training data-driven NLG systems. Furthermore, ?) trained statistical surface realisers from unlabelled data by an automatic slot labelling technique.

In general, feature-based adaptation is perhaps the most widely used technique [Blitzer et al., 2007, Pan and Yang, 2010, Duan et al., 2012]. By exploiting correlations and similarities between data points, it has been successfully applied to problems like speaker adaptation [Gauvain and Lee, 1994, Leggetter and Woodland, 1995] and various tasks in natural language processing [Daumé III, 2009]. In contrast, model-based adaptation is particularly useful for language modeling (LM) [Bellegarda, 2004]. Mixture-based topic LMs [Gildea and Hofmann, 1999] are widely used in N-gram LMs for domain adaptation. Similar ideas have been applied to applications that require adapting LMs, such as machine translation (MT) [Koehn and Schroeder, 2007] and personalised speech recognition [Wen et al., 2012].

Domain adaptation for Neural Network (NN)-based LMs has also been studied in the past. A feature augmented RNNLM was first proposed by ?), and was later applied to multi-genre broadcast speech recognition [Chen et al., 2015] and personalised language modeling [Wen et al., 2013]. These methods are based on fine-tuning existing network parameters on adaptation data. However, careful regularisation is often necessary [Yu et al., 2013]. In a slightly different area, ?) applied curriculum learning to RNNLM adaptation.

Discriminative training (DT) [Collins, 2002] is an alternative to the maximum likelihood (ML) criterion. For classification, DT can be split into two phases: (1) decoding training examples using the current model and scoring them, and (2) adjusting the model parameters to maximise the separation between the correct target annotation and the competing incorrect annotations. It has been successfully applied to many research problems, such as speech recognition [Kuo et al., 2002, Voigtlaender et al., 2015] and MT [He and Deng, 2012, Auli et al., 2014]. Recently, ?) trained an RNNLM with a DT objective and showed improved performance on an MT task. However, their RNN probabilities only served as input features to a phrase-based MT system.

The Neural Language Generator

After the hidden layer state is obtained, computing the next word distribution and sampling from it is straightforward.
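The step from hidden state to next word can be illustrated with a minimal sketch. It assumes a softmax output layer over the vocabulary, which is the usual choice for an RNNLM; the names `w_out`, `hidden`, and the toy vocabulary are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(hidden, w_out):
    """Project the hidden state onto the vocabulary and normalise.

    hidden: hidden layer state as a list of floats.
    w_out:  hypothetical output matrix, one row of weights per word.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)


def sample_word(vocab, probs, rng):
    """Draw the next word from the predicted distribution."""
    return rng.choices(vocab, weights=probs, k=1)[0]


# Toy values, for illustration only.
vocab = ["the", "hotel", "is", "</s>"]
hidden = [0.2, -0.1, 0.4]
w_out = [
    [0.5, 0.1, -0.3],
    [0.0, 0.2, 0.9],
    [-0.4, 0.3, 0.1],
    [0.1, 0.1, 0.1],
]
probs = next_word_distribution(hidden, w_out)
word = sample_word(vocab, probs, random.Random(0))
```

Sampling, rather than always taking the most probable word, is what lets the generator produce several different realisations of the same dialogue act.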

Training Multi-domain Models

A straightforward way to adapt NN-based models to a target domain is to continue training or fine-tuning a well-trained generator on whatever new target domain data is available. This training procedure is as follows:

Although this method can benefit from parameter sharing of the LM part of the network, the parameters of similar input slot-value pairs are not shared. In other words, realisation of any unseen slot-value pair in the target domain can only be learned from scratch. Adaptation offers no benefit in this case.

Data Counterfeiting

In order to maximise the effect of domain adaptation, the model should be able to (1) generate acceptable realisations for unseen slot-value pairs based on similar slot-value pairs seen in the training data, and (2) continue to distinguish slot-value pairs that are similar but nevertheless distinct. Instead of exploring weight tying strategies in different training stages (which is complex to implement and typically relies on ad hoc tying rules), we propose a data counterfeiting approach to synthesise target domain data from source domain data. The procedure is shown in Figure 1 and described as follows:

Categorise slots in both source and target domain into classes, according to some similarity measure. In our case, we categorise them based on their functional type to yield three classes: informable, requestable, and binary. The informable class includes all non-binary informable slots, while the binary class includes all binary informable slots.

This approach allows the generator to share realisations among slot-value pairs that have similar functionalities, and therefore facilitates the transfer learning of rare slot-value pairs in the target domain. Furthermore, the approach also preserves the co-occurrence statistics of slot-value pairs and their realisations. This allows the model to learn the gating mechanism even before adaptation data is introduced.
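The idea of synthesising target domain data from source domain data can be sketched as follows. Only the slot categorisation step is spelled out in the text above, so the rest is an assumption: the sketch supposes that utterances are delexicalised into placeholders and that each source slot is replaced by a randomly chosen target slot of the same class. The slot names, the `[slot-x]` / `[value-x]` placeholder format, and the function `counterfeit` are all hypothetical.

```python
import random
import re

# Hypothetical class assignments; only the three class names come from the text.
SOURCE_SLOT_CLASS = {
    "family": "informable",
    "isforbusiness": "binary",
    "price": "requestable",
}
TARGET_SLOTS_BY_CLASS = {
    "informable": ["screensize", "ecorating"],
    "binary": ["hasusbport"],
    "requestable": ["resolution"],
}
# Assumed delexicalised placeholder format: [slot-name] and [value-name].
PLACEHOLDER = re.compile(r"\[(slot|value)-([a-z]+)\]")


def counterfeit(da_slots, utterance, rng):
    """Swap every source slot for a random target slot of the same class."""
    mapping, used = {}, set()
    for slot in da_slots:
        pool = TARGET_SLOTS_BY_CLASS[SOURCE_SLOT_CLASS[slot]]
        # Prefer target slots not yet used, so distinct slots stay distinct.
        unused = [t for t in pool if t not in used]
        mapping[slot] = rng.choice(unused or pool)
        used.add(mapping[slot])

    def swap(match):
        kind, name = match.group(1), match.group(2)
        return f"[{kind}-{mapping.get(name, name)}]"

    # A single regex pass avoids re-replacing an already swapped placeholder.
    return [mapping[s] for s in da_slots], PLACEHOLDER.sub(swap, utterance)


source_slots = ["family", "isforbusiness"]
source_utt = "the [value-family] [slot-family] is [value-isforbusiness] for business use"
new_slots, new_utt = counterfeit(source_slots, source_utt, random.Random(0))
```

Because the dialogue act and the utterance are rewritten with the same mapping, the pairing between each slot and its realisation is kept intact in the synthetic example.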

Discriminative Training

In contrast to the traditional ML criterion (Equation 1) whose goal is to maximise the log-likelihood of correct examples, DT aims at separating correct examples from competing incorrect examples. Given a training instance $(d_{i},\Omega_{i})$, the training process starts by generating a set of candidate sentences $Gen(d_{i})$ using the current model parameter $\theta$ and DA $d_{i}$. The discriminative cost function can therefore be written as

where $L(\Omega,\Omega_{i})$ is the scoring function evaluating candidate $\Omega$ by taking ground truth $\Omega_{i}$ as reference. $p_{\theta}(\Omega|d_{i})$ is the normalised probability of the candidate and is calculated by

$\gamma\in[0,\infty)$ is a tuned scaling factor that flattens the distribution for $\gamma<1$ and sharpens it for $\gamma>1$. The unnormalised candidate likelihood $\log p(\Omega|d_{i},\theta)$ is produced by summing token likelihoods from the RNN generator output,

The scoring function $L(\Omega,\Omega_{i})$ can be further generalised to take several scoring functions into account

where $\beta_{j}$ is the weight for the $j$-th scoring function. Since the cost function presented here (Equation 5) is differentiable everywhere, back propagation can be applied to calculate the gradients and update parameters directly.
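The quantities described in this section can be put together in a small numerical sketch. It rests on two assumptions that are not stated explicitly above: that the cost is the negative expected score $-\sum_{\Omega\in Gen(d_{i})}p_{\theta}(\Omega|d_{i})L(\Omega,\Omega_{i})$, and that the normalised probability is a softmax over $\gamma\log p(\Omega|d_{i},\theta)$ across the candidate set. The function names and toy numbers are hypothetical.

```python
import math


def normalised_probs(log_likes, gamma):
    """Softmax over gamma-scaled candidate log-likelihoods."""
    scaled = [gamma * ll for ll in log_likes]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combined_score(scores, betas):
    """Weighted sum of several scoring functions for one candidate."""
    return sum(b * s for b, s in zip(betas, scores))


def dt_cost(token_log_probs, candidate_scores, betas, gamma):
    """Negative expected score over the candidate set (assumed cost form)."""
    # Candidate log-likelihood: sum of its token log-probabilities.
    log_likes = [sum(tokens) for tokens in token_log_probs]
    probs = normalised_probs(log_likes, gamma)
    expected = sum(
        p * combined_score(s, betas) for p, s in zip(probs, candidate_scores)
    )
    return -expected


# Two toy candidates with per-token log-probabilities and (BLEU, ERR) scores.
token_log_probs = [[-0.1, -0.2], [-0.5, -0.9]]
candidate_scores = [(0.8, 0.0), (0.3, 0.5)]
betas = (1.0, -1.0)  # reward BLEU, penalise slot error rate

sharp_cost = dt_cost(token_log_probs, candidate_scores, betas, gamma=5.0)
flat_cost = dt_cost(token_log_probs, candidate_scores, betas, gamma=0.0)
```

With $\gamma=0$ every candidate receives the same weight, so the cost is the plain average of the candidate scores; a larger $\gamma$ concentrates the weight on the candidates the model already prefers. In a real system the gradient of this cost would come from an automatic differentiation framework rather than from plain Python.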

Datasets

In order to test our proposed recipe for training multi-domain language generators, we conducted experiments using four different domains: finding a restaurant, finding a hotel, buying a laptop, and buying a television. Datasets for the restaurant and hotel domains have been previously released by ?). These were created by workers recruited via Amazon Mechanical Turk (AMT) by asking them to propose an appropriate natural language realisation corresponding to each system dialogue act actually generated by a dialogue system. However, the number of actually occurring DA combinations in the restaurant and hotel domains was rather limited ($\sim$200) and since multiple references were collected for each DA, the resulting datasets are not sufficiently diverse to enable the assessment of the generalisation capability of the different training methods over unseen semantic inputs.

In order to create more diverse datasets for the laptop and TV domains, we enumerated all possible combinations of dialogue act types and slots based on the ontology shown in Table 1. This yielded about 13K distinct DAs in the laptop domain and 7K distinct DAs in the TV domain. We then used AMT workers to collect just one realisation for each DA. Since the resulting datasets have a much larger input space but only one training example for each DA, the system must learn partial realisations of concepts and be able to recombine and apply them to unseen DAs. Also note that the number of act types and slots of the new ontology is larger, which makes NLG in both laptop and TV domains much harder.
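The enumeration of dialogue acts can be illustrated with a toy example. The act types, slot names, and the cap on slots per act below are hypothetical; the real ontology is the one in Table 1, and any constraints it places on which combinations are valid are not modelled here.

```python
from itertools import combinations

# Hypothetical toy ontology, for illustration only.
ACT_TYPES = ["inform", "recommend"]
SLOTS = ["family", "price", "battery"]


def enumerate_das(act_types, slots, max_slots):
    """List every (act type, slot combination) pair up to max_slots slots."""
    das = []
    for act in act_types:
        for size in range(1, max_slots + 1):
            for combo in combinations(slots, size):
                das.append((act, combo))
    return das


das = enumerate_das(ACT_TYPES, SLOTS, max_slots=2)
```

Even this toy ontology yields 12 distinct acts, which gives a sense of how quickly the input space grows with the number of act types and slots.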

Corpus-based Evaluation

The generators were implemented using the Theano library [Bergstra et al., 2010, Bastien et al., 2012], and trained by partitioning each of the collected corpora into a training, validation, and testing set in the ratio 3:1:1. All the generators were trained by treating each sentence as a mini-batch. An $l_{2}$ regularisation term was added to the objective function for every 10 training examples. The hidden layer size was set to be 100 for all cases. Stochastic gradient descent and back propagation through time [Werbos, 1990] were used to optimise the parameters. In order to prevent overfitting, early stopping was implemented using the validation set.
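A 3:1:1 partition of a corpus could look like the sketch below. Whether the original split was shuffled, and with which seed, is not stated, so both the shuffling and the function name `split_3_1_1` are assumptions.

```python
import random


def split_3_1_1(examples, seed=0):
    """Partition examples into training, validation, and test sets (3:1:1)."""
    items = list(examples)
    # Shuffling before the split is an assumption, not stated in the text.
    random.Random(seed).shuffle(items)
    n_train = len(items) * 3 // 5
    n_valid = len(items) // 5
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_3_1_1(range(100))
```

The validation portion is the one that early stopping would monitor.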

During decoding, we over-generated 20 utterances and selected the top 5 realisations for each DA according to the following reranking criteria,
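The over-generate and rerank step can be sketched as below. The scoring rule is a stand-in chosen for illustration, not the criterion used in the paper: it assumes each candidate is scored by its log-likelihood minus a large penalty times a slot error rate, with the slot error rate taken as missing plus redundant slots divided by the number of slots in the DA. The function names and the `penalty` value are hypothetical.

```python
def slot_error_rate(required_slots, realised_slots):
    """(missing + redundant) / number of slots in the DA; assumed definition."""
    required = set(required_slots)
    realised = set(realised_slots)
    missing = len(required - realised)
    redundant = len(realised - required)
    return (missing + redundant) / max(len(required), 1)


def rerank(candidates, required_slots, top_k=5, penalty=100.0):
    """Keep the top_k candidates under the assumed score.

    candidates: list of (utterance, log_likelihood, realised_slots) tuples.
    """
    def score(candidate):
        _, log_like, realised = candidate
        return log_like - penalty * slot_error_rate(required_slots, realised)

    return sorted(candidates, key=score, reverse=True)[:top_k]


# Toy candidates: "b" is the most likely but drops the price slot.
candidates = [
    ("a", -2.0, ["name", "price"]),
    ("b", -1.0, ["name"]),
    ("c", -3.0, ["name", "price"]),
]
best = rerank(candidates, ["name", "price"], top_k=2)
```

A large penalty makes semantic correctness dominate fluency, so a fluent utterance that omits a slot ranks below a less likely one that covers the whole dialogue act.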

Data Counterfeiting

We first compared the data counterfeiting (counterfeit) approach with the model fine-tuning (tune) method and models trained from scratch (scratch). Figure 2 shows the result of adapting models between similar domains, from laptop to TV. Because of the parameter sharing in the LM part of the network, model fine-tuning (tune) achieves a better BLEU score than training from scratch (scratch) when target domain data is limited. However, if we apply the data counterfeiting (counterfeit) method, we obtain an even greater BLEU score gain. This is mainly due to the better realisation of unseen slot-value pairs. In addition, data counterfeiting (counterfeit) brings a substantial reduction in slot error rate. This is because it preserves the co-occurrence statistics between slot-value pairs and realisations, which allows the model to learn good semantic alignments even before adaptation data is introduced. Similar results can be seen in Figure 3, in which adaptation was performed on more disjoint domains: restaurant and hotel joint domain to laptop and TV joint domain. The data counterfeiting (counterfeit) method is still superior to the other methods.

Discriminative Training

The generator parameters obtained from data counterfeiting and ML adaptation were further tuned by applying DT. In each case, the models were optimised using two objective functions: BLEU-4 score and slot error rate. However, we used a soft version of BLEU called sentence BLEU as described in ?), to mitigate the sparse n-gram match problem of BLEU at the sentence level. In our experiments, we set $\gamma$ to 5.0 and $\beta_{j}$ to 1.0 and -1.0 for BLEU and ERR, respectively. For each DA, we applied our generator 50 times to generate candidate sentences. Repeated candidates were removed. We treated the remaining candidates as a single batch and updated the model parameters by the procedure described in section 5. We evaluated performance of the algorithm on the laptop to TV adaptation scenario, and compared models with and without discriminative training (ML+DT & ML). The results are shown in Figure 4 where it can be seen that DT consistently improves generator performance on both metrics. Another interesting point to note is that slot error rate is easier to optimise compared to BLEU (ERR $\rightarrow 0$ after DT). This is probably because the sentence BLEU optimisation criterion is only an approximation of the corpus BLEU score used for evaluation.
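The candidate collection step described above (sample 50 times, drop repeats, treat the rest as one batch) can be sketched as follows. The sampler here is a toy stand-in for the RNN generator, and `collect_candidates` is a hypothetical name.

```python
import random


def collect_candidates(sample_fn, da, n_samples=50):
    """Sample n_samples utterances for one DA and drop repeated ones."""
    seen = set()
    unique = []
    for _ in range(n_samples):
        utterance = sample_fn(da)
        if utterance not in seen:
            seen.add(utterance)
            unique.append(utterance)
    return unique


# Toy sampler standing in for the generator; it ignores the DA.
rng = random.Random(0)
options = [
    "the tv has 2 hdmi ports",
    "it has 2 hdmi ports",
    "there are 2 hdmi ports",
]
batch = collect_candidates(lambda da: rng.choice(options), "inform(hdmiport=2)")
```

Removing repeats keeps a frequently sampled sentence from being counted several times when the candidate probabilities are normalised over the batch.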

Human Evaluation

Since automatic metrics may not consistently agree with human perception [Stent et al., 2005], human testing is needed to assess subjective quality. To do this, a set of judges was recruited using AMT. We tested our models on two adaptation scenarios: laptop to TV and TV to laptop. For each task, two systems among the four were compared: training from scratch using full dataset (scrALL), adapting with DT training but only 10% of target domain data (DT-10%), adapting with ML training but only 10% of target domain data (ML-10%), and training from scratch using only 10% of target domain data (scr-10%). In order to evaluate system performance in the presence of language variation, each system generated 5 different surface realisations for each input DA and the human judges were asked to score each of them in terms of informativeness and naturalness (rating out of 3), and also asked to state a preference between the two. Here informativeness is defined as whether the utterance contains all the information specified in the DA, and naturalness is defined as whether the utterance could plausibly have been produced by a human. In order to decrease the amount of information presented to the judges, utterances that appeared identically in both systems were filtered out. We tested about 2000 DAs for each scenario, distributed uniformly between contrasts except that we allowed 50% more comparisons between ML-10% and DT-10% because they were close.

Table 2 shows the subjective quality assessments, which exhibit the same general trend as the objective results. If a large amount of target domain data is available, training everything from scratch (scrALL) achieves very good performance and adaptation is not necessary. However, if only a limited amount of in-domain data is available, efficient adaptation is critical (DT-10% & ML-10% $>$ scr-10%). Moreover, judges also preferred the DT trained generator (DT-10%) over the ML trained generator (ML-10%), especially for informativeness. In the laptop to TV scenario, the informativeness score of the DT method (DT-10%) was considered indistinguishable from that of the model trained on the full training set (scrALL). The preference test results are shown in Table 3. Again, adaptation methods (DT-10% & ML-10%) are crucial to bridge the gap between domains when the target domain data is scarce (DT-10% & ML-10% $>$ scr-10%). The results also suggest that the DT training approach (DT-10%) was preferred over ML training (ML-10%), even though the preference in this case was not statistically significant.

Conclusion and Future Work

In this paper we have proposed a procedure for training multi-domain, RNN-based language generators, by data counterfeiting and discriminative training. The procedure is general and applicable to any data-driven language generator. Both corpus-based evaluation and human assessment were performed. Objective measures on corpus data have demonstrated that by applying this procedure to adapt models between four different dialogue domains, good performance can be achieved with much less training data. Subjective assessment by human judges confirms the effectiveness of the approach.

The proposed domain adaptation method requires a small amount of annotated data to be collected offline. In our future work, we intend to focus on training the generator on the fly with real user feedback during conversation.

Acknowledgments

Tsung-Hsien Wen and David Vandyke are supported by Toshiba Research Europe Ltd, Cambridge Research Laboratory.
