A User Simulator for Task-Completion Dialogues

Xiujun Li, Zachary C. Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, Yun-Nung Chen

Introduction

Practical dialogue systems consist of several components. The natural language understanding (NLU) module maps free texts to structured semantic frames of utterances. The natural language generation (NLG) module maps the structured representations back into a natural-language form. Knowledge bases (KBs) and state trackers provide access to side information and track the evolving state of the dialogue, respectively. The dialogue policy is a central component of the system that chooses an action given the current state of the dialogue.

In traditional systems, dialogue policies might be programmed explicitly with rules. However, rule-based approaches have several weaknesses. First, for complex systems, it may not be easy to design a reasonable rule-based policy. Second, the optimal policy might change over time, as user behavior changes. A rule-based system cannot cope with such non-stationarity. Thus, reinforcement learning, in which policies are learned automatically from experience, offers an appealing alternative.

Typically, researchers seek to optimize dialogue policies with either supervised learning (SL) or reinforcement learning (RL) methods. In SL approaches, a policy is trained to imitate the observed actions of an expert. Supervised learning approaches often require a large amount of expert-labeled data for training. For task-specific domains, intensive domain knowledge is usually required for collecting and annotating actual human-human or human-machine conversations, which is often expensive and time-consuming. Additionally, even with a large amount of training data, some regions of the dialogue state space may not be explored sufficiently in the training data, preventing a supervised learner from finding a good policy.

In contrast, RL approaches allow an agent to learn without any expert-generated example. Given only a reward signal, the agent can optimize a dialogue policy through interaction with users. Unfortunately, RL can require many samples from an environment, making learning from scratch with real users impractical. To overcome this limitation, many researchers in the dialogue systems community train RL agents using simulated users.

The goal of user simulation is to generate natural and reasonable conversations, allowing the RL agent to explore the policy space. The simulation-based approach allows an agent to explore trajectories which may not exist in previously observed data, overcoming a central limitation of imitation-based approaches. Dialogue agents trained on these simulators can then serve as an effective starting point, after which they can be deployed against real humans to improve further via reinforcement learning.

Related Work

Given the reliance of the research community on user simulations, it seems important to assess the quality of the simulator. How best to assess a user simulator remains an open issue, and there is no universally accepted metric. One important feature of a good user simulator is coherent behavior throughout the dialogue. Ideally, a good metric should measure the correlation between the behavior of the user simulation and that of real humans, but no such metric is widely accepted. Consequently, to the best of our knowledge, there is no standard way to build a user simulator. Here, we summarize the literature on user simulation along different aspects:

At the granularity level, the user simulator can operate either at the dialog-act level or at the utterance level. (Here, a dialog-act consists of one intent, as well as zero, one or multiple slot-value pairs. In the rest of the paper, we will use dialog-acts and dialog actions interchangeably.)

At the methodology level, the user simulator could use a rule-based approach, or a model-based approach where the model is learned from training data.

Many models have been introduced for user modeling in different dialogue systems. Early work employed a simple, naive bi-gram model $P(a_{u}|a_{m})$ to predict the next user-act $a_{u}$ based on the last system-act $a_{m}$. The parameters of this model are simple, but it cannot produce coherent user behaviors, for two reasons: (1) this model can only look at the last system action, and (2) if the user changes its goal, this bi-gram model might produce some illogical behavior since it does not consider the user goal when generating the next user-act. Much of the follow-up work on user simulators has tried to address these issues. The first issue can be addressed by looking at longer dialogue histories to select the next user action; the second issue can be attacked by explicitly incorporating the user goal into user state modeling.
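To make the limitation concrete, the following is a minimal sketch of such a bi-gram user model, estimated by counting. The function names and the toy training pairs are illustrative assumptions, not taken from the cited work.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(pairs):
    """Estimate P(a_u | a_m) from (system_act, user_act) pairs by counting."""
    counts = defaultdict(Counter)
    for a_m, a_u in pairs:
        counts[a_m][a_u] += 1
    model = {}
    for a_m, ctr in counts.items():
        total = sum(ctr.values())
        model[a_m] = {a_u: n / total for a_u, n in ctr.items()}
    return model


def sample_user_act(model, a_m, rng):
    """Sample the next user act given only the last system act."""
    acts, probs = zip(*sorted(model[a_m].items()))
    return rng.choices(acts, weights=probs, k=1)[0]


# Toy, invented training pairs (illustrative only).
pairs = [
    ("request(date)", "inform(date)"),
    ("request(date)", "inform(date)"),
    ("request(date)", "request(theater)"),
    ("inform(theater)", "thanks()"),
]
model = fit_bigram(pairs)
next_act = sample_user_act(model, "request(date)", random.Random(0))
```

Note that `sample_user_act` sees neither the dialogue history nor the user goal, which is exactly the source of the two weaknesses listed above.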

The recently proposed sequence-to-sequence approach has inspired end-to-end trainable user simulators. This approach treats user-turn dialogue to agent-turn dialogue as a source-to-target sequence generation problem, which might be suitable for chatbot-like systems, but may not work well for domain-specific, task-completion dialogue systems, which require the ability to interact with databases and aggregate useful information into the system responses. The benefit of such model-based approaches is that they do not need intensive feature engineering, but they typically require a large amount of labeled data to generalize well and deal with user states not included in the training data. On the other hand, agenda-based user simulation provides a convenient mechanism to explicitly encode the dialogue history and user goal. The user goal consists of slot-value pairs describing the user’s requests and constraints. A stack-like format models the state transitions and user action generation as a sequence of simple push and pop operations, which ensures the consistency of user behavior over the course of the conversation.

In this paper, we combine the benefits of both model-based and rule-based approaches. Our user simulation for the task-completion dialogue setting follows an agenda-based approach at the dialog-act level, and a sequence-to-sequence natural language generation (NLG) component is used to convert the selected dialog-act into natural language.

Dialogue Systems for Task-Completion

We consider a dialogue system that helps users book movie tickets or look up the movies they want by interacting with them in natural language. Over the course of the conversation, the agent gathers information about the customer’s desires and ultimately books the movie tickets or identifies the movie of interest. The environment then assesses a binary outcome (success or failure) at the end of the conversation, based on (1) whether a movie is booked, and (2) whether the movie satisfies the user’s constraints.

The data we used in the paper was collected via Amazon Mechanical Turk, and the annotation was done internally using our own schema. There are $11$ intents (e.g., inform, request, confirm_question, confirm_answer), and $29$ slots (e.g., moviename, starttime, theater, numberofpeople). Most of the slots are informable slots, which users can use to constrain the search, and some are requestable slots, whose values users can ask the agent for. For example, numberofpeople cannot be a requestable slot, since arguably the user knows how many tickets he or she wants to buy. In total, we labeled $280$ dialogues in the movie domain, and the average number of turns per dialogue is approximately $11$.

User Simulator

In this work, we follow the agenda-based user simulation approach, in which a stack-like representation of the user state provides a convenient mechanism to explicitly encode the dialogue history and the user’s goal, and the user state update (state transition and user action generation) can be modeled as sequences of push and pop operations on the stack. Here, we describe the rule-based user simulator in detail.

In the task-oriented dialogue setting, the first step of user simulation is to generate a user goal. The agent knows nothing about the user goal, but its objective is to help the user accomplish it. Hence, the entire conversation implicitly revolves around this goal. Generally, the definition of a user goal contains two parts:

inform_slots contain a number of slot-value pairs which serve as constraints from the user.

request_slots contain a set of slots whose values the user does not know but wants to obtain from the agent during the conversation.

To make the user goal more realistic, we add some constraints to the user goal. Slots are split into two groups. For the movie-booking scenario, some slots must appear in the user goal; we call these Required slots, which include moviename, theater, starttime, date, and numberofpeople. The remaining slots are Optional slots. ticket is a default slot that always appears in the request_slots part of the user goal.
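As an illustration, a user goal and the two constraints above might be represented as follows. This is a sketch only: the dictionary layout, the "UNK" marker for unknown values, and the reading that a Required slot may sit in either part of the goal are assumptions, and the example values are invented.

```python
REQUIRED_SLOTS = {"moviename", "theater", "starttime", "date", "numberofpeople"}


def is_valid_goal(goal):
    """Check that all Required slots appear and that ticket is requested."""
    # Assumption: a Required slot may appear in either part of the goal.
    slots = set(goal["inform_slots"]) | set(goal["request_slots"])
    has_required = REQUIRED_SLOTS <= slots
    has_ticket = "ticket" in goal["request_slots"]
    return has_required and has_ticket


# Hypothetical goal; "UNK" marks a value the user wants to learn.
goal = {
    "inform_slots": {"moviename": "zootopia", "date": "tomorrow", "numberofpeople": "2"},
    "request_slots": {"theater": "UNK", "starttime": "UNK", "ticket": "UNK"},
}

# The same goal without the default ticket slot violates the constraint.
goal_without_ticket = {
    "inform_slots": dict(goal["inform_slots"]),
    "request_slots": {"theater": "UNK", "starttime": "UNK"},
}
```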

We generated the user goals from the labeled dataset, using two mechanisms. One mechanism is to extract all the slots (known and unknown) from the first user turn (excluding the greeting user turn) in each dialogue, since the first turn usually contains some or all of the required information from the user. The other mechanism is to extract all the slots (known and unknown) at their first appearance across all the user turns, and then aggregate them into one user goal. We dump these user goals into a file that serves as the user-goal database for the simulator. Each time we run a dialogue, we randomly sample one user goal from this database.
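The two extraction mechanisms can be sketched as below. The turn format (an intent plus inform_slots and request_slots dictionaries) is an assumption made for illustration; the labeled data may be stored differently.

```python
import random


def goal_from_first_turn(user_turns):
    """Mechanism 1: use only the first non-greeting user turn."""
    for turn in user_turns:
        if turn["intent"] != "greeting":
            return {"inform_slots": dict(turn["inform_slots"]),
                    "request_slots": dict(turn["request_slots"])}
    return {"inform_slots": {}, "request_slots": {}}


def goal_from_all_turns(user_turns):
    """Mechanism 2: aggregate slots over all user turns, keeping first appearances."""
    goal = {"inform_slots": {}, "request_slots": {}}
    for turn in user_turns:
        for part in ("inform_slots", "request_slots"):
            for slot, value in turn[part].items():
                goal[part].setdefault(slot, value)  # later repeats are ignored
    return goal


# Invented example dialogue (user turns only).
turns = [
    {"intent": "greeting", "inform_slots": {}, "request_slots": {}},
    {"intent": "request", "inform_slots": {"moviename": "zootopia"},
     "request_slots": {"starttime": "UNK"}},
    {"intent": "inform", "inform_slots": {"date": "tomorrow"}, "request_slots": {}},
]
goal_db = [goal_from_first_turn(turns), goal_from_all_turns(turns)]
sampled_goal = random.Random(0).choice(goal_db)  # one goal per simulated dialogue
```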

2 User Action

The work focuses on user-initiated dialogues, so we randomly generated a user goal as the first turn (a user turn). To make the user-act more reasonable, we add further constraints in the generation process. For example, the first user turn is usually a request turn; it has at least one informable slot; if the user knows the movie name, moviename will appear in the first user turn; etc.

During the course of a dialogue, the user simulator maintains a compact stack-like representation called the user agenda, where the user state $s_{u}$ is factored into an agenda $A$ and a goal $G$, which consists of constraints $C$ and requests $R$. At each time-step $t$, the user simulator generates the next user action $a_{u,t}$ based on its current status $s_{u,t}$ and the last agent action $a_{m,t-1}$, and then updates the current status to $s^{\prime}_{u,t}$. Here, when training or testing a policy without natural language understanding (NLU), an error model is introduced to simulate the noise from the NLU component and the noisy communication between the user and the agent. There are two types of noise channels in the error model: one is at the intent level, the other is at the slot level. Furthermore, at the slot level, there are three kinds of possible noise:

slot deletion: to simulate the scenario that the slot was not recognized by the NLU;

incorrect slot value: to simulate the scenario that the slot name was recognized correctly, but the slot value was not recognized correctly, e.g., wrong word segmentation;

incorrect slot: to simulate the scenario that both the slot and its value were not recognized correctly.

When training or testing a policy with natural language understanding (NLU), it is not necessary to use the error model because the NLU component itself introduces noise.
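The error model described above can be sketched as a function that corrupts a dialog act before it reaches the agent. The signature, the error-rate parameters, and the uniform choice among the three slot-level noise types are assumptions for illustration.

```python
import random


def apply_error_model(user_act, intents, slot_values, intent_err, slot_err, rng):
    """Return a noisy copy of a dialog act (intent-level and slot-level noise)."""
    noisy = {"intent": user_act["intent"], "inform_slots": {}}
    # Intent-level noise: replace the intent with a different one.
    if rng.random() < intent_err:
        noisy["intent"] = rng.choice([i for i in intents if i != user_act["intent"]])
    for slot, value in user_act["inform_slots"].items():
        if rng.random() >= slot_err:
            noisy["inform_slots"][slot] = value  # transmitted correctly
            continue
        kind = rng.choice(["deletion", "wrong_value", "wrong_slot"])
        if kind == "deletion":
            continue  # slot deletion: the slot was not recognized
        if kind == "wrong_value":
            # incorrect slot value: right slot name, wrong value
            others = [v for v in slot_values[slot] if v != value]
            noisy["inform_slots"][slot] = rng.choice(others) if others else value
        else:
            # incorrect slot: both the slot name and its value are wrong
            new_slot = rng.choice([s for s in slot_values if s != slot])
            noisy["inform_slots"][new_slot] = rng.choice(slot_values[new_slot])
    return noisy


# Invented vocabulary for the example.
intents = ["inform", "request", "confirm_question"]
slot_values = {"moviename": ["zootopia", "deadpool"], "date": ["today", "tomorrow"]}
user_act = {"intent": "inform", "inform_slots": {"moviename": "zootopia"}}

clean = apply_error_model(user_act, intents, slot_values, 0.0, 0.0, random.Random(0))
noisy = apply_error_model(user_act, intents, slot_values, 1.0, 1.0, random.Random(0))
```

With both error rates at zero the act passes through unchanged; with both at one, the intent and the single slot are always corrupted.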

If the agent action is inform(taskcomplete), the agent is signaling that it has gathered all the information and is ready to book the movie ticket. The user simulator then checks whether the current stack is empty, and also performs constraint checking to make sure that the agent is trying to book the right movie tickets. This guarantees that the user behaves in a consistent, goal-oriented manner.
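A much-simplified sketch of the agenda mechanics and the final check is given below. The agenda is a Python list used as a stack; the act format, the "I do not care" fallback value, and the helper names are hypothetical and cover only two agent intents.

```python
def update_agenda(agenda, goal, agent_act):
    """Update the agenda (a list used as a stack) after the agent's last act."""
    if agent_act["intent"] == "request":
        # Push an answer to the agent's question on top of the stack.
        slot = agent_act["slot"]
        value = goal["inform_slots"].get(slot, "I do not care")
        agenda.append({"intent": "inform", "slot": slot, "value": value})
    elif agent_act["intent"] == "inform":
        # The agent answered a question: drop the matching pending request.
        agenda[:] = [a for a in agenda
                     if not (a["intent"] == "request" and a["slot"] == agent_act["slot"])]


def next_user_act(agenda):
    """Pop the next user act from the top of the agenda."""
    return agenda.pop() if agenda else {"intent": "thanks"}


def accepts_booking(agenda, goal, booked):
    """On inform(taskcomplete): the agenda must be empty and all constraints met."""
    constraints_ok = all(booked.get(s) == v for s, v in goal["inform_slots"].items())
    return not agenda and constraints_ok


# Invented goal with one pending question on the agenda.
goal = {"inform_slots": {"moviename": "zootopia", "date": "tomorrow"},
        "request_slots": {"theater": "UNK"}}
agenda = [{"intent": "request", "slot": "theater"}]

update_agenda(agenda, goal, {"intent": "request", "slot": "date"})
reply = next_user_act(agenda)  # the answer about the date is popped first
```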

3 Dialogue Status

There are three statuses for a dialogue: no_outcome_yet, success and failure. The status is no_outcome_yet if the agent has not issued the inform(taskcomplete) action and the number of turns in the conversation has not exceeded the maximum value; otherwise, the dialogue is finished with either a success or a failure outcome. For a dialogue to be a success, the agent must answer all the questions (i.e., the requestable slots of the user) and book the right movie tickets within the maximum number of turns. All other cases are failures, for example, when the dialogue exceeds the maximum number of turns, or when the agent books the wrong movie tickets for the user.

There is a special case, where the user’s constraints are not satisfiable in our movie database, and the agent correctly informs that no ticket can be booked. One can argue this is a successful outcome, as the agent does what is correct. Here, we choose to treat it as a failure, as no ticket is booked. It should be noted that this choice does not affect algorithm comparison much.
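These rules, including the special case, can be summarized in a small function. The argument names and the convention that `booked` is `None` when no ticket is booked are assumptions made for this sketch.

```python
NO_OUTCOME_YET, SUCCESS, FAILURE = "no_outcome_yet", "success", "failure"


def dialogue_status(turn, max_turns, task_complete, unanswered, booked, constraints):
    """Classify a dialogue according to the three statuses described above."""
    if turn > max_turns:
        return FAILURE  # the conversation exceeded the turn limit
    if not task_complete:
        return NO_OUTCOME_YET  # the agent has not issued inform(taskcomplete)
    if booked is None or unanswered:
        # No ticket booked (including unsatisfiable constraints) or open questions.
        return FAILURE
    matches = all(booked.get(s) == v for s, v in constraints.items())
    return SUCCESS if matches else FAILURE


constraints = {"moviename": "zootopia", "date": "tomorrow"}
ticket = {"moviename": "zootopia", "date": "tomorrow", "theater": "amc"}
status = dialogue_status(10, 40, True, [], ticket, constraints)
```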

4 Natural Language Understanding (NLU)

The natural language understanding (NLU) component is a recurrent neural network model with long short-term memory (LSTM) cells. This single NLU model performs intent prediction and slot filling simultaneously. For joint modeling of intent and slots, the predicted tag set is the concatenation of the IOB-format slot tags and the intent tags, and an additional token is introduced at the end of each utterance: its supervised label is an intent tag, while the supervised label of every preceding word is an IOB tag. In this way, we can still use the sequence-to-sequence training approach. The last hidden layer of the sequence is supposed to be a condensed semantic representation of the whole input utterance, so it can be used for intent prediction at the utterance level. This model is trained using all available pairs of dialogue actions and utterances in our labeled dataset.
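The joint labeling scheme can be illustrated by constructing the target sequence for one utterance. This is only a sketch of the label layout; the end-token name "EOS", the span format, and the example utterance are assumptions.

```python
def joint_labels(tokens, slot_spans, intent, eos="EOS"):
    """Build IOB slot tags per word plus an intent tag on an appended end token.

    slot_spans is a list of (start, end_exclusive, slot_name) over the tokens.
    """
    tags = ["O"] * len(tokens)
    for start, end, slot in slot_spans:
        tags[start] = "B-" + slot
        for i in range(start + 1, end):
            tags[i] = "I-" + slot
    # The extra final token carries the utterance-level intent label.
    return tokens + [eos], tags + [intent]


tokens = "book two tickets for zootopia tomorrow".split()
spans = [(1, 2, "numberofpeople"), (4, 5, "moviename"), (5, 6, "date")]
inputs, labels = joint_labels(tokens, spans, "request")
```

Every position now has exactly one label from the combined tag set, so a single sequence tagger can be trained on both tasks.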

5 Natural Language Generation (NLG)

The user simulator is designed at the dialog-act level, but it can also work at the utterance level; for this purpose, we provide a natural language generation (NLG) component in the framework. Because the labeled dataset is limited, our empirical tests found that a purely model-based NLG might not generalize well, which would introduce a lot of noise into policy training. Thus, we use a hybrid approach which consists of:

Template-based NLG: outputs some predefined rule-based templates for dialog acts

Model-based NLG: trained on our labeled dataset in a sequence-to-sequence fashion. It takes dialog-acts as input, and generates template-like sentences with slot placeholders via an LSTM decoder. Then, a post-processing scan is performed to replace the slot placeholders with their actual values. In the LSTM decoder, we apply beam search, which iteratively considers the top $k$ best sentences up to time step $t$ when generating the token at time step $t+1$. As a trade-off between speed and performance, we use a beam size of $3$ in our experiments.
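The beam search step can be sketched independently of the LSTM. In the toy example below, a hand-written next-token table stands in for the decoder, and the `$slot$` placeholder syntax and the start and end tokens are assumptions.

```python
import math


def beam_search(step_fn, start, end, beam_size=3, max_len=10):
    """Keep the top-k partial sequences at each step.

    step_fn(seq) returns {token: probability} for the next token.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]


# Toy next-token distribution standing in for the LSTM decoder.
table = {
    "<s>": {"i": 0.6, "we": 0.4},
    "i": {"want": 0.5, "need": 0.5},
    "we": {"want": 0.9, "need": 0.1},
    "want": {"$numberofpeople$": 1.0},
    "need": {"$numberofpeople$": 1.0},
    "$numberofpeople$": {"tickets": 1.0},
    "tickets": {"</s>": 1.0},
}
best = beam_search(lambda seq: table[seq[-1]], "<s>", "</s>", beam_size=3)
```

In this toy table a greedy decoder would start with "i" (probability 0.6) and end with a total probability of 0.3, whereas the beam keeps "we" alive and finds the sequence with probability 0.36.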

In our hybrid model, if the dialog act can be found in the predefined rule-based templates, we use the template-based NLG for generating the utterance; otherwise, the utterance is generated by the model-based NLG.
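The fallback logic and the placeholder post-processing can be sketched as follows. The template key (intent plus sorted slot names), the `$slot$` placeholder syntax, and the stub standing in for the trained model are all assumptions.

```python
def fill_placeholders(sentence, slots):
    """Post-processing scan: replace each $slot$ placeholder with its value."""
    for slot, value in slots.items():
        sentence = sentence.replace("$" + slot + "$", str(value))
    return sentence


def hybrid_nlg(dialog_act, templates, model_nlg):
    """Use a predefined template when one exists, else fall back to the model."""
    key = (dialog_act["intent"], tuple(sorted(dialog_act["inform_slots"])))
    if key in templates:
        sentence = templates[key]
    else:
        sentence = model_nlg(dialog_act)
    return fill_placeholders(sentence, dialog_act["inform_slots"])


# One hand-written template and a stub in place of the trained seq2seq model.
templates = {("inform", ("moviename",)): "I want to watch $moviename$."}


def stub_model(dialog_act):
    return "I need $numberofpeople$ tickets for $moviename$."


from_template = hybrid_nlg(
    {"intent": "inform", "inform_slots": {"moviename": "zootopia"}},
    templates, stub_model)
from_model = hybrid_nlg(
    {"intent": "inform", "inform_slots": {"moviename": "zootopia", "numberofpeople": 2}},
    templates, stub_model)
```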

Usages

We conduct experiments training agents with our user simulator for the following two tasks. The first is a task-completion dialogue setting on the movie-booking domain. Here, the agent’s job is to engage with the user in a dialogue with the ultimate goal of helping the user to successfully book a movie. To measure the quality of the agent, there are three metrics: success rate (sometimes known as task completion rate: the fraction of dialogues that finish successfully), average reward, and average turns. Each of them provides different information about the quality of agents. There exists a strong correlation among them: generally, a good policy should have a higher success rate, higher average reward and lower average turns. Here, we choose success rate as our major evaluation metric to report for the quality of agents. In the appendix, Table 1 demonstrates some example dialogues for this task.
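Given a batch of finished dialogues, the three metrics reduce to simple averages. The episode record format and the reward values below are invented for illustration.

```python
def evaluate(episodes):
    """Compute success rate, average reward and average turns over episodes."""
    n = len(episodes)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "average_reward": sum(e["reward"] for e in episodes) / n,
        "average_turns": sum(e["turns"] for e in episodes) / n,
    }


# Invented episode records (rewards are placeholders, not the paper's values).
episodes = [
    {"success": True, "reward": 60, "turns": 12},
    {"success": False, "reward": -50, "turns": 40},
    {"success": True, "reward": 70, "turns": 8},
    {"success": True, "reward": 64, "turns": 10},
]
metrics = evaluate(episodes)
```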

The second task pertains to training a KB-InfoBot. The setting is a simplified version of the previous goal-oriented dialogues, in which an agent and a user communicate with only two intents (request and inform). Accordingly, for this task the experiments in KB-InfoBot use a simplified version of the simulator described in this paper, with the two aforementioned intents and six slots. In this paper, the knowledge base is drawn from the IMDB dataset. In the appendix, Table 2 demonstrates some example dialogues for KB-InfoBot.

Discussion

In this paper, we demonstrated that rule-based user simulation can be a safe way to train reinforcement learning agents for task-completion dialogues. Because rule-based user simulation requires application-specific domain knowledge to curate the hand-crafted rules, building one is usually a time-consuming process. One improvement for the current user simulation in the task-completion dialogue setting is to include user goal changes, which make the dialogue more complex but also more realistic. Another potential direction for future improvement is model-based user simulation for task-completion dialogues. The advantage of model-based user simulation is that it can be adapted to other domains easily as long as there is enough labeled data. Since model-based user simulation is data-driven, one drawback is that it requires a large amount of labeled data to train a good simulator, and it might be risky to use such a simulator to train RL agents due to the uncertainty of the model. When trained with such a user simulator, RL agents can easily learn to exploit errors or loopholes in the model-based user simulator and have failed dialogues counted as successes. In this case, the measured quality of the learned RL policy can be misleadingly high. Nevertheless, model-based user simulation for the task-completion dialogue setting is still a good direction to investigate.

Acknowledgments

We thank Asli Celikyilmaz, Alex Marin, Paul Crook, Dilek Hakkani-Tür, Hisami Suzuki, Ricky Loynd and Li Deng for their insightful comments and discussion in the project.

References

Appendix A Recipes

This framework provides you a way to develop and compare different algorithms/models (i.e., agents in the dialogue setting). The dialogue system consists of two parts: agent and user simulator. Here, we walk through some examples to show how to build and plug in your own agents and user simulators.

All agents inherit from the Agent class (agent.py), which provides some common interfaces for users to implement their agents. In the agent_baseline.py file, five basic rule-based agents are implemented:

InformAgent informs all the slots one by one in every turn; it cannot request any information/slot.

RequestAllAgent requests all the slots one by one in every turn; it cannot inform any information/slot.

RandomAgent issues a random request in every turn; it cannot inform any information/slot.

EchoAgent informs the slot in the request slots of last user action; it cannot request any information/slot.

RequestBasicsAgent requests all basic slots in a subset one by one, then chooses inform(taskcomplete) at the last turn; it cannot inform any information/slot.

All the agents re-implement just two functions: initialize_episode and state_to_action. The state_to_action function makes no assumption about the structure of the agent; it is an interface for implementing the mapping from state to action, which is the core part of the agent. Here is an example of RequestBasicsAgent:
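A minimal sketch of such an agent is shown below. The base class here is a stand-in for the one in agent.py, and the action dictionary layout, the "UNK" and "PLACEHOLDER" markers, and the closing thanks action are assumptions rather than the framework's actual code.

```python
class Agent:
    """Hypothetical stand-in for the base class in agent.py."""

    def initialize_episode(self):
        raise NotImplementedError

    def state_to_action(self, state):
        raise NotImplementedError


class RequestBasicsAgent(Agent):
    """Requests a fixed subset of slots one by one, then informs taskcomplete."""

    def __init__(self, request_set=("moviename", "starttime", "theater",
                                    "date", "numberofpeople")):
        self.request_set = list(request_set)

    def initialize_episode(self):
        self.current_slot_id = 0
        self.task_complete_sent = False

    def state_to_action(self, state):
        # This rule-based agent ignores the state and follows a fixed script.
        if self.current_slot_id < len(self.request_set):
            slot = self.request_set[self.current_slot_id]
            self.current_slot_id += 1
            return {"diaact": "request", "inform_slots": {},
                    "request_slots": {slot: "UNK"}}
        if not self.task_complete_sent:
            self.task_complete_sent = True
            return {"diaact": "inform",
                    "inform_slots": {"taskcomplete": "PLACEHOLDER"},
                    "request_slots": {}}
        return {"diaact": "thanks", "inform_slots": {}, "request_slots": {}}


agent = RequestBasicsAgent()
agent.initialize_episode()
actions = [agent.state_to_action({}) for _ in range(7)]
```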

Each of the above rule-based agents supports only the inform action or only the request action. As an exercise, you can implement a more sophisticated rule-based agent that supports multiple actions, including inform, request, confirm_question, confirm_answer, deny, etc.

agent_dqn.py provides an RL agent (agt=9), which wraps a DQN model. Besides the two functions above, there are two major functions in the RL agent: run_policy and train. run_policy implements an $\epsilon$-greedy policy, and train calls the batch training function of DQN.
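For reference, an $\epsilon$-greedy action choice over a vector of Q-values can be written in a few lines. This is a generic sketch, not the body of run_policy; the function name and arguments are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


q_values = [0.1, 0.7, 0.3]  # invented Q-values for three actions
greedy_action = epsilon_greedy(q_values, 0.0, random.Random(0))
random_action = epsilon_greedy(q_values, 1.0, random.Random(0))
```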

agent_cmd.py provides a command line agent (agt=0), with which you, acting as the agent, can interact with the user simulator. The command line agent supports two types of input: natural language (cmd_input_mode=0) and dialog act (cmd_input_mode=1). Listing 3 shows an example of the command line agent interacting with the user simulator in natural language; Listing 4 shows an example of the command line agent interacting with the user simulator in dialog act form. Note:

When the last user turn is a request action, the system shows a line of suggested answers available in the database for the agent, as in turn 0 of Listing 4. Both the rule-based agents and the RL agent answer the user with slot values from the database. The command line agent is a special case: a human (acting as the command line agent) might type an arbitrary answer to the user’s request. When the typed answer is not in the database, the state tracker corrects it and forces the agent to use a value from the database in its response. For example, in turn 1 of Listing 4, if you input inform(theater=amc pacific), the answer actually received by the user is inform(theater=carmike summit 16), because amc pacific does not exist in the database. To avoid the weird behavior of the agent informing the user of an unavailable value, we restrict the agent to values from the suggested list.

The second-to-last turn of the agent is usually an inform(taskcomplete) in dialog act form or something like “Okay, your tickets are booked.” in natural language, which informs the user simulator that the agent has nearly completed the task and is ready to book the movie tickets.

To end a conversation, the last turn of the agent is usually a thanks() in dialog act form or “thanks” in natural language.

A.2 How to build your own user simulator?

Similarly, there is one user simulator class (usersim.py), which provides a few common interfaces for users to implement their user simulators. All user simulators inherit from this class and should re-implement two functions: initialize_episode and next. The usersim_rule.py file implements a rule-based user simulator. Its next function implements all the rules and mechanisms for issuing the next user action based on the last agent action. Here is the example of usersim_rule.py:
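The sketch below shows the shape of such a simulator in heavily reduced form. The base class is a stand-in for the one in usersim.py; the action format, the turn counting, the return value of next (an action and an episode-over flag), and the handful of rules are assumptions and omit most of what a full rule-based simulator needs.

```python
import random


class UserSimulator:
    """Hypothetical stand-in for the base class in usersim.py."""

    def initialize_episode(self):
        raise NotImplementedError

    def next(self, agent_action):
        raise NotImplementedError


class RuleSimulator(UserSimulator):
    def __init__(self, goals, max_turns=20, seed=0):
        self.goals = goals
        self.max_turns = max_turns
        self.rng = random.Random(seed)

    def initialize_episode(self):
        # Sample a goal and open with a request that carries the constraints.
        self.goal = self.rng.choice(self.goals)
        self.pending = [s for s in self.goal["request_slots"] if s != "ticket"]
        self.turn = 0
        return self._act("request", dict(self.goal["inform_slots"]),
                         self.pending[:1] or ["ticket"])

    def next(self, agent_action):
        self.turn += 2  # one agent turn plus one user turn
        if self.turn > self.max_turns:
            return self._act("closing"), True
        intent = agent_action["diaact"]
        if intent == "request":
            slot = list(agent_action["request_slots"])[0]
            value = self.goal["inform_slots"].get(slot, "I do not care")
            return self._act("inform", {slot: value}), False
        if intent == "inform":
            if "taskcomplete" in agent_action["inform_slots"]:
                return self._act("deny" if self.pending else "thanks"), True
            answered = agent_action["inform_slots"]
            self.pending = [s for s in self.pending if s not in answered]
            return self._act("request", {}, self.pending[:1] or ["ticket"]), False
        return self._act("thanks"), True

    @staticmethod
    def _act(intent, inform_slots=None, request_slots=()):
        return {"diaact": intent, "inform_slots": inform_slots or {},
                "request_slots": {s: "UNK" for s in request_slots}}


goal = {"inform_slots": {"moviename": "zootopia", "date": "tomorrow"},
        "request_slots": {"theater": "UNK", "ticket": "UNK"}}
sim = RuleSimulator([goal])
first_turn = sim.initialize_episode()
```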

Appendix B Training Details

To train an RL agent, you can either start with experience tuples from a rule-based policy to initialize the experience replay buffer pool, or start with an empty experience replay buffer pool. We recommend using a rule-based or supervised policy to initialize the experience replay buffer pool; many works have demonstrated the benefit of this strategy as a good initialization that speeds up RL training. Here, we use a very simple rule-based policy to initialize the experience replay buffer pool.

The RL agent is a DQN network. In training, we use the $\epsilon$-greedy policy and a dynamic experience replay buffer pool, whose size changes over the course of training. One important trick of DQN is to introduce the target network, which is updated slowly and is held fixed over a short period to compute the target values.

The training procedure is as follows: at each simulation epoch, we simulate $N$ dialogues, add the resulting state transition tuples ( $s_{t},a_{t},r_{t},s_{t+1}$ ) to the experience replay buffer pool, and train and update the current DQN network. Within one simulation epoch, the current DQN network is updated multiple times, depending on the batch size and the current size of the experience replay buffer. At the end of the simulation epoch, the target network is replaced by the current DQN network, so the target network is updated only once per simulation epoch. The experience replay strategy is critical for training. Our experience replay buffer update strategy is as follows: at the beginning, we accumulate all the experience tuples from the simulation and do not flush the experience replay buffer pool until the current RL agent reaches a success rate threshold (i.e., success_rate_threshold = 0.30); we then flush the pool and re-fill it with experience tuples from the current RL agent. The intuition is that the initial performance of the DQN is not good enough to generate useful experience replay tuples, so we do not flush the pool until the current RL agent reaches a success rate we consider good, for example, that of a rule-based agent. In the subsequent training process, at every simulation epoch, we estimate the success rate of the current DQN agent; if the current DQN agent is sufficiently better (i.e., better than the target network), the experience replay buffer pool is flushed and re-filled. Figure 1 shows a learning curve for the RL agent without NLU and NLG, and Figure 2 shows a learning curve for the RL agent with NLU and NLG; it takes longer to train the RL agent to adapt to the errors and noise from NLU and NLG.
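The buffer update strategy can be sketched as a small class. This is one reading of the description above: "better than the target network" is approximated here by comparing against the best success rate seen so far, and the class and method names are hypothetical.

```python
class ReplayPool:
    """Experience replay pool that is flushed once the agent is good enough."""

    def __init__(self, success_rate_threshold=0.30):
        self.pool = []
        self.best_success_rate = 0.0
        self.threshold = success_rate_threshold

    def end_of_epoch(self, new_tuples, success_rate):
        """Add this epoch's tuples, flushing first if the agent has improved.

        Returns True when the pool was flushed and re-filled.
        """
        flush = (success_rate > self.best_success_rate
                 and success_rate >= self.threshold)
        if flush:
            self.pool = []  # keep only experience from the better agent
            self.best_success_rate = success_rate
        self.pool.extend(new_tuples)
        return flush


# Integers stand in for (s_t, a_t, r_t, s_{t+1}) tuples in this toy run.
pool = ReplayPool()
history = [
    pool.end_of_epoch([1, 2], 0.10),   # below threshold: accumulate
    pool.end_of_epoch([3], 0.20),      # below threshold: accumulate
    pool.end_of_epoch([4, 5], 0.35),   # reaches threshold: flush and re-fill
    pool.end_of_epoch([6], 0.33),      # not better than best so far: accumulate
]
```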

Appendix C Sample Dialogues

Table 1 shows one successful and one failed example dialogue generated by the rule-based agent and the RL agent interacting with the user simulator in the movie-booking domain. To be informative, we also explicitly show the user goal at the head of each dialogue, but the agent knows nothing about the user goal; its objective is to help the user accomplish this goal and book the right movie tickets.

C.2 KB-InfoBot

Table 2 shows some sample dialogues between the user simulator and the SimpleRL-SoftKB and End2End-RL agents. The value of the critic_rating slot is a common source of error in the user simulator, and hence all learned policies tend to ask for this value multiple times.