Continuously Learning Neural Dialogue Management

Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, Steve Young

Introduction

Developing a robust Spoken Dialogue System (SDS) traditionally requires a substantial amount of hand-crafted rules combined with various statistical components. In a task-oriented SDS, teaching a system how to respond appropriately is non-trivial. More recently, this dialogue management task has been formulated as a reinforcement learning (RL) problem which can be automatically optimised through human interaction [Levin and Pieraccini, 1997, Roy et al., 2000, Williams and Young, 2007, Jurčíček et al., 2011, Young et al., 2013]. In this framework, the system learns by a trial and error process governed by a potentially delayed learning objective, a reward function, that determines dialogue success [El Asri et al., 2014, Su et al., 2015, Vandyke et al., 2015, Su et al., 2016]. To enable the system to be trained on-line, sample-efficient learning algorithms have been proposed [Gašić and Young, 2014, Daubigney et al., 2014] which can learn policies from a minimal number of dialogues. However, even with such methods, performance is still poor in the early training stages, and this can impact negatively on the user experience. For these and other reasons, most commercial systems still hand-craft the dialogue management to ensure its stability.

Supervised learning (SL) has also been used in dialogue research, where a policy is trained to produce an example response given the dialogue state. Wizard-of-Oz (WoZ) methods [Kelley, 1984, Dahlbäck et al., 1993] have been widely used for collecting domain-specific training corpora. Recently, an emerging line of research has focused on training a network-based dialogue model, mostly in text-input schemes [Vinyals and Le, 2015, Serban et al., 2015, Wen et al., 2016, Bordes and Weston, 2016]. These systems were directly trained on past dialogues without detailed specification of the internal dialogue state. However, there are two key limitations of using the SL approach for SDS. Firstly, the effects of selecting an action on the future course of the dialogue are not considered. Secondly, there may be a very large number of dialogue states for which an appropriate response must be generated. Hence, the SL training set may lack sufficient coverage. Another issue is that there is no reason to suppose a human wizard is acting optimally, especially at high noise levels. These problems are exacerbated in larger domains where multi-step planning is needed. Thus, learning to mimic a human wizard does not necessarily lead to optimal behaviour.

To get the best of both SL- and RL-based dialogue management, this paper describes a network-based model which is initially trained with a supervised spoken dialogue dataset. Since the training data may be mismatched to the deployment environment, the model is further improved by RL in interaction with a simulated user or human users. The advantage of the proposed framework is that a single model can be trained using both SL and RL without modifying the system architecture. This resembles the training process used in AlphaGo [Silver et al., 2016] for the game of Go. In addition, unlike most state-of-the-art RL-based dialogue systems [Gašić and Young, 2014, Cuayáhuitl et al., 2015], which operate on a constrained set of summary actions to limit the policy space and minimise expensive training costs, our model operates on a full action set.

Neural Dialogue Management

The proposed framework addresses the dialogue management component in a modular SDS. As depicted in Figure 1, the input to the model is the belief state $s$ which encodes the understood user intents along with the dialogue history [Henderson et al., 2014b, Mrkšić et al., 2015], and the output is the master dialogue action $a$ that decides the semantic reply. This is subsequently passed to the natural language generator [Wen et al., 2015].

Dialogue management is represented as a Policy Network: a neural network with one hidden layer exploiting tanh non-linearities, and an output layer consisting of two softmax partitions and six sigmoid partitions. Of the softmax outputs, one predicts DiaAct, a multi-class label over five dialogue acts: {request, offer, confirm, select, bye}, and the other predicts Query, which contains four options for the search constraint: {food, pricerange, area, none}. Query options only matter when the predicted dialogue act is one of {request, confirm, select}. The sigmoid partitions are for Offer; each makes a binary prediction for one system-offer slot when the system makes an offer. System-offer slots are slots the system can mention, such as area, phone number and postcode.
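The output layout described above can be sketched as a single forward pass. This is a minimal NumPy illustration, not the authors' Theano implementation: the input size `belief_dim = 65`, the helper names (`init_params`, `policy_forward`) and the parameter layout are assumptions made for the example; only the hidden size of 32, the tanh hidden layer, the 5 + 4 + 6 output partitions and the uniform initialisation range come from the text.

```python
import numpy as np

DIA_ACTS = ["request", "offer", "confirm", "select", "bye"]
QUERIES = ["food", "pricerange", "area", "none"]
N_OFFER = 6  # six binary system-offer slots


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(belief_dim, hidden=32, seed=0):
    """Weights drawn uniformly from [-0.1, 0.1]; belief_dim is hypothetical."""
    rng = np.random.default_rng(seed)
    n_out = len(DIA_ACTS) + len(QUERIES) + N_OFFER
    return {
        "W1": rng.uniform(-0.1, 0.1, size=(hidden, belief_dim)),
        "b1": rng.uniform(-0.1, 0.1, size=hidden),
        "W2": rng.uniform(-0.1, 0.1, size=(n_out, hidden)),
        "b2": rng.uniform(-0.1, 0.1, size=n_out),
    }


def policy_forward(params, s):
    """Map a belief-state vector s to the three output partitions."""
    h = np.tanh(params["W1"] @ s + params["b1"])
    z = params["W2"] @ h + params["b2"]
    n_d, n_q = len(DIA_ACTS), len(QUERIES)
    return {
        "dia_act": softmax(z[:n_d]),          # distribution over 5 dialogue acts
        "query": softmax(z[n_d:n_d + n_q]),   # distribution over 4 query options
        "offer": sigmoid(z[n_d + n_q:]),      # 6 independent binary predictions
    }


params = init_params(belief_dim=65)
out = policy_forward(params, np.zeros(65))
n_params = sum(p.size for p in params.values())
```

Under the assumed 65-dimensional input this layout has 2,607 parameters, which is of the same order as the roughly 2600 weights reported in the experiments.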

Given the system’s understanding of the user, the model’s role is to determine what the intent of the system response should be and which slot to talk about. The exact value in each slot is decided by a separate database parser, where the query is the top prediction of each user-informable slot from the dialogue state tracker and the output is a matched entity. User-informable slots are slots used by the user to constrain the search, such as area and price range. The parser output forms the system’s semantic reply, the master dialogue action.

In the first phase, the policy network is trained on corpus data. This data may come from WoZ collection or from interactions between users and an already existing policy, such as a hand-crafted baseline. The objective here is to ‘mimic’ the response behaviour within the supervised data.

The training objective for each sample is to minimise a joint cross-entropy loss $\mathcal{L}(\theta)$ between the model action labels $y$ defined in §2 and the predictions $p$:

$$\mathcal{L}(\theta)=-\sum_{k\in\{d_{a},q,O_{s}\}}y^{k}\log(p^{k})\qquad(1)$$

where DiaAct $d_{a}$ and Query $q$ outputs are categorical distributions, and the Offer set $O_{s}$ contains six binary offer slots. $\theta$ are the network parameters.
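As an illustration of the joint objective, the sketch below sums a categorical cross-entropy for DiaAct and Query with a binary cross-entropy for each of the six Offer slots. The two-term binary form and the dictionary layout of `labels` and `preds` are assumptions of this example; the text specifies only that the loss is a joint cross-entropy over the three output groups.

```python
import math


def categorical_ce(y_index, p):
    """Cross-entropy of a one-hot label (given by its index) against p."""
    return -math.log(p[y_index])


def binary_ce(y, p):
    """Cross-entropy for one binary offer slot with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def joint_loss(labels, preds):
    """Sum of the DiaAct, Query and per-slot Offer cross-entropies."""
    loss = categorical_ce(labels["dia_act"], preds["dia_act"])
    loss += categorical_ce(labels["query"], preds["query"])
    loss += sum(binary_ce(y, p)
                for y, p in zip(labels["offer"], preds["offer"]))
    return loss


# Hypothetical turn: act "offer" (index 1), query "none" (index 3),
# two offer slots switched on; predictions are uninformative (uniform).
labels = {"dia_act": 1, "query": 3, "offer": [1, 0, 0, 1, 0, 0]}
preds = {"dia_act": [0.2] * 5, "query": [0.25] * 4, "offer": [0.5] * 6}
loss = joint_loss(labels, preds)  # equals ln 5 + ln 4 + 6 ln 2
```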

2.2 Phase II: Reinforcement Learning

The policy trained in phase I on a fixed dataset may not generalise well. In spoken dialogue, the noise level may vary across conditions and thus can significantly affect performance. Hence, the second phase of the training pipeline aims at improving the SL-trained policy network by further training using policy-gradient-based RL. The model is given the freedom to select any combination of master actions. The training objective is to find a parametrised policy $\pi_{\theta}$ that maximises the expected reward $J(\theta)$ of a dialogue with $T$ turns: $J(\theta)=E\left[\sum_{t=1}^{T}\gamma^{t}r(s_{t},a_{t})\middle|\pi_{\theta}\right]$, where $\gamma$ is the discount factor and $r(s_{t},a_{t})$ is the reward when taking master action $a_{t}$ in dialogue state $s_{t}$. Note that the structure and initial weights of $\pi_{\theta}$ are fixed by the SL pre-training phase, since the RL training aims to improve the SL-trained model.

Here a batch algorithm is adopted, and all transitions are sampled under the current policy. At each update iteration, $N$ episodes are generated, where the $i$th episode consists of a set of transition tuples $\{(s_{t}^{i},a_{t}^{i},r_{t}^{i})\}_{t=0}^{T_{i}}$. The gradient is estimated using the likelihood ratio trick:

$$\nabla_{\theta}J(\theta)\approx\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T_{i}}\nabla_{\theta}\log\pi_{\theta}(a_{t}^{i}|s_{t}^{i})\,R_{t}^{i}\qquad(2)$$

where $R_{t}^{i}=\sum_{t^{\prime}=t}^{T_{i}}\gamma^{t^{\prime}-t}r_{t^{\prime}}^{i}$ is the cumulative return from time-step $t$ to $T_{i}$. Plain gradient ascent is, however, slow and has poor convergence properties.
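The cumulative return $R_{t}^{i}$ can be computed with a single backward pass over the rewards of an episode, and a likelihood-ratio gradient estimate then weights each score vector $\nabla_{\theta}\log\pi_{\theta}(a_{t}^{i}|s_{t}^{i})$ by its return. The sketch below assumes the score vectors have already been obtained by backpropagation and are passed in as plain lists; the function names and the averaging over $N$ episodes are choices made for this example.

```python
def discounted_returns(rewards, gamma):
    """R_t = sum over t' >= t of gamma^(t' - t) * r_t', computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def likelihood_ratio_gradient(episodes, gamma):
    """Average over episodes of sum_t score_t * R_t.

    episodes: list of (scores, rewards) pairs, where scores[t] is the
    gradient of log pi(a_t | s_t) as a list of floats (assumed given).
    """
    dim = len(episodes[0][0][0])
    grad = [0.0] * dim
    for scores, rewards in episodes:
        for score, ret in zip(scores, discounted_returns(rewards, gamma)):
            for j in range(dim):
                grad[j] += score[j] * ret
    return [g / len(episodes) for g in grad]


# Toy episode with two turns and a two-dimensional parameter vector.
toy = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 2.0])]
grad = likelihood_ratio_gradient(toy, gamma=0.5)
```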

Natural gradient [Amari, 1998] improves upon the above ‘vanilla’ gradient method by computing an ascent direction that approximately ensures a small change in the policy distribution. This direction is $w=F(\theta)^{-1}\nabla_{\theta}J(\theta)$, where $F(\theta)$ is the Fisher information matrix (FIM). Based on this, Peters and Schaal developed the Natural Actor-Critic (NAC). In its episodic case (eNAC), the FIM does not need to be explicitly computed to obtain the natural gradient $w$. eNAC uses a least-squares method:

$$R_{n}=\left[\sum_{t=0}^{T_{n}}\nabla_{\theta}\log\pi_{\theta}(a_{t}^{n}|s_{t}^{n})^{\top}\right]\cdot w+C\qquad(3)$$

where $C$ is a constant. With one such equation for each episode $n\in\{1,\dots,N\}$, an analytical solution can be obtained. For larger models with more parameters, a truncated variant [Schulman et al., 2015] can also be used to calculate the natural gradient in practice.
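In the usual episodic formulation, each episode contributes one linear equation that regresses its return on the sum of its score vectors plus a constant offset, and the natural gradient $w$ is the least-squares solution. The following sketch assumes that form; the function name `enac_natural_gradient` and the toy data are hypothetical, and `numpy.linalg.lstsq` stands in for whatever solver the authors used.

```python
import numpy as np


def enac_natural_gradient(score_sums, returns):
    """Solve returns[n] ~= score_sums[n] . w + C in the least-squares sense.

    score_sums: (N, D) array; row n is the sum over turns of
                grad log pi(a_t | s_t) for episode n.
    returns:    (N,) array of episode returns R_n.
    Returns the natural-gradient estimate w (length D) and the constant C.
    """
    X = np.asarray(score_sums, dtype=float)
    y = np.asarray(returns, dtype=float)
    # Append a column of ones so the constant C is fitted jointly with w.
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(X, y, rcond=None)
    return solution[:-1], solution[-1]


# Toy check: returns generated exactly from w = [0.5, -1.0] and C = 2.
score_sums = [[1, 0], [0, 1], [1, 1], [2, 1]]
returns = [2.5, 1.0, 1.5, 2.0]
w, C = enac_natural_gradient(score_sums, returns)
```

Because the Fisher information matrix never appears explicitly, the cost is dominated by a regression with one row per sampled episode.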

Experience replay [Lin, 1992] is utilised to randomly sample mini-batches of experiences from a replay pool $\mathcal{P}$. This increases data efficiency by re-using experience samples in multiple updates and reduces the data correlation. As the gradient is highly correlated with the return $R$, a unity-based reward normalisation is adopted to ensure stable training: the total return $R_{n}$ is normalised to lie between 0 and 1.
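A replay pool with random mini-batch sampling, together with unity-based (min-max) normalisation of returns, might look as follows. Several details here are assumptions rather than facts from the text: the pool evicts its oldest entry when full, the minimum and maximum are taken over the values passed in, and a batch of identical returns is mapped to 0.5.

```python
import random


def normalise_returns(returns):
    """Min-max scale returns to [0, 1]."""
    lo, hi = min(returns), max(returns)
    if hi == lo:
        # Assumed fallback: no spread to scale, so use the midpoint.
        return [0.5 for _ in returns]
    return [(r - lo) / (hi - lo) for r in returns]


class ReplayPool:
    """Fixed-capacity pool of experiences with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, experience):
        self.items.append(experience)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # assumed policy: drop the oldest experience

    def sample(self, batch_size):
        """Draw a mini-batch without replacement."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


pool = ReplayPool(capacity=3, seed=0)
for episode_id in range(5):
    pool.add(episode_id)
batch = pool.sample(2)
scaled = normalise_returns([-30, -5, 19])
```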

Experimental Results

The target application is a live telephone-based SDS providing restaurant information for the Cambridge (UK) area. The domain consists of approximately 150 venues, each having 6 slots out of which 3 can be used by the system to constrain the search (food-type, area and price-range) and 3 are informable properties (phone-number, address and postcode) available once a database entity has been found.

The model was implemented using the Theano library [Theano Development Team, 2016]. The size of the hidden layer was set to 32 and all the weights were randomly initialised between -0.1 and 0.1.

A corpus consisting of 720 user dialogues in the Cambridge restaurant domain was split 4:1:1 into training, validation and test sets. This corpus was collected via the Amazon Mechanical Turk (AMT) service, where paid subjects interacted through speech with a well-behaved dialogue system as proposed in [Su et al., 2016]. The raw data contains the top-N automatic speech recognition (ASR) results, which were passed to a rule-based semantic decoder and the focus belief state tracker [Henderson et al., 2014a] to obtain the belief state that serves as the input feature to the proposed policy network. The turn-level labels were tagged according to §2. Adagrad [Duchi et al., 2011] per dialogue was used during backpropagation to train each model based on the objective in Equation 1. To prevent over-fitting, early stopping was applied based on the held-out validation set.
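For reference, a single Adagrad update scales each parameter's step by the inverse square root of its accumulated squared gradients. The sketch below shows the standard rule only; the learning rate, the `eps` stabiliser and the function name are assumptions, since the text does not report these settings.

```python
import math


def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Apply one Adagrad update in place and return theta.

    accum holds the running sum of squared gradients per parameter.
    lr and eps are illustrative defaults, not values from the paper.
    """
    for j, g in enumerate(grad):
        accum[j] += g * g
        theta[j] -= lr * g / (math.sqrt(accum[j]) + eps)
    return theta


theta, accum = [1.0], [0.0]
adagrad_step(theta, [2.0], accum, lr=0.1)
after_first = theta[0]
adagrad_step(theta, [2.0], accum, lr=0.1)
after_second = theta[0]
```

With a constant gradient, the second step is smaller than the first because the accumulated squared gradient has grown.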

Table 1 shows the weighted F-1 scores computed on the test set for each label. We can clearly see that the model accurately determines the type of reply (DiaAct) and generally provides the right information (Offer). The hypothesised reason for the lower accuracy of Query is that the SL training data contains robust ASR results, and thus the system examples contain more offers and fewer queries. This can be mitigated with a larger dataset covering more diverse situations, or improved via an RL approach.
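A weighted F-1 score is commonly computed as the per-class F-1 averaged with weights proportional to each class's support in the reference labels. The sketch below implements that standard definition for a single-label output such as DiaAct; whether the reported scores used exactly this averaging is an assumption, and the toy labels are invented for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F-1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator > 0 because n >= 1
        score += (n / total) * f1
    return score


y_true = ["request", "request", "offer", "offer", "offer", "bye"]
y_pred = ["request", "offer", "offer", "offer", "offer", "bye"]
f1 = weighted_f1(y_true, y_pred)
```

In the toy example the per-class scores are 2/3 (request), 6/7 (offer) and 1 (bye), weighted by supports of 2, 3 and 1 out of 6.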

3.2 Policy Network in Simulation

The policy network was tested with a simulated user [Schatzmann et al., 2006] which provides the interaction at the semantic level. As shown in Figure 2, the first grid points labelled ‘SL:0’ represent the performance of the SL model under various semantic error rates (SER), averaged over 500 dialogues.

The SL model was then further trained using RL at different SERs. As the SL model is already performing well, the exploration parameter $\epsilon$ was set to 0.1. The size of the experience replay pool $\mathcal{P}$ was 2,000, and the mini-batch size was set to 32. For each update, the natural gradient was calculated by eNAC to update the $\sim$2600 model weights. The total return given to each dialogue was set to $20\times\mathds{1}(\mathcal{D})-T$, where $T$ is the dialogue turn number and $\mathds{1}(\mathcal{D})$ is the success indicator for dialogue $\mathcal{D}$. Maximum dialogue length was set to 30 turns. Return normalisation was used to stabilise training.
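The total return defined above is simple enough to state directly in code. The function name and the range check are additions for this example; the success bonus of 20, the per-turn penalty and the 30-turn limit come from the text.

```python
def dialogue_return(success, num_turns, success_reward=20, max_turns=30):
    """Total return 20 * 1(D) - T for a dialogue of num_turns turns."""
    if not 1 <= num_turns <= max_turns:
        raise ValueError("num_turns must lie between 1 and max_turns")
    return success_reward * int(success) - num_turns


short_success = dialogue_return(True, 5)     # 20 - 5
long_failure = dialogue_return(False, 30)    # 0 - 30
```

A successful dialogue therefore scores higher the sooner it ends, while a failed dialogue that runs to the turn limit receives the lowest possible return of -30.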

The success rate of the SL model can be seen to increase for all SERs during 6,000 training dialogues, with improvements ranging from 1% to 8%. Generally speaking, the greatest improvement occurs when the SER differs most from that of the SL training set, which here means the higher SER conditions. In this case, as the semantic hypotheses were more corrupted, the model learned to double-check more often what the user really wanted. This indicates the model’s ability to refine its own behaviour via RL.

3.3 Policy Network with Real Users

Starting from the same SL policy network as in §3.2, the model was improved via RL using human subjects recruited via AMT. The policy network was plugged into a modular SDS comprising Microsoft’s Bing speech recogniser (www.microsoft.com/cognitive-services/en-us/speech-api), a rule-based semantic decoder, the focus belief state tracker, and a template-based natural language generator.

To ensure dialogue quality, only those dialogues whose objective system check matched the user rating were considered [Gašić et al., 2013]. Based on this, two parallel policies were trained with 200 dialogues. To evaluate the resulting policies, policy learning was disabled and a further 110 dialogues were collected with both the SL-only and SL+RL models. The AMT users were asked to rate the dialogue quality by answering the question “Do you think this dialogue was successful?” on a 6-point Likert scale and also to provide a binary rating on dialogue success. The average quality rating (scaled from 0 to 5) is shown in Table 2 with one standard error. The results indicate that the SL model could work quite well with humans, but was improved by RL on the 200 training dialogues. This demonstrates that on-line RL is a viable approach to adapt a dialogue system to changing environmental conditions.

Conclusion

This paper has presented a two-step development for dialogue management in SDS using a unified neural network framework, where the model can be trained on a fixed dialogue dataset using SL and subsequently improved via RL through simulated or spoken interactions. The experiments demonstrated the efficiency of the proposed model with only a few hundred supervised dialogue examples. The model was further tested in simulated and real user settings. In a mismatched environment, the model was capable of modifying its behaviour via a delayed reward signal and achieved a better success rate.

Acknowledgments

Pei-Hao Su is supported by Cambridge Trust and the Ministry of Education, Taiwan.