Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings

He He, Anusha Balakrishnan, Mihail Eric, Percy Liang

Introduction

Current task-oriented dialogue systems (Young et al., 2013; Wen et al., 2017; Dhingra et al., 2017) require a pre-defined dialogue state (e.g., slots such as food type and price range for a restaurant searching task) and a fixed set of dialogue acts (e.g., request, inform). However, human conversation often requires richer dialogue states and more nuanced, pragmatic dialogue acts. Recent open-domain chat systems (Shang et al., 2015; Serban et al., 2015b; Sordoni et al., 2015; Li et al., 2016a; Lowe et al., 2017; Mei et al., 2017) learn a mapping directly from previous utterances to the next utterance. While these models capture open-ended aspects of dialogue, the lack of structured dialogue state prevents them from being directly applied to settings that require interfacing with structured knowledge.

In order to bridge the gap between the two types of systems, we focus on a symmetric collaborative dialogue setting, which is task-oriented but encourages open-ended dialogue acts. In our setting, two agents, each with a private list of items with attributes, must communicate to identify the unique shared item. Consider the dialogue in Figure 1, in which two people are trying to find their mutual friend. By asking “do you have anyone who went to columbia?”, B is suggesting that she has some Columbia friends, and that they probably work at Google. Such conversational implicature is lost when interpreting the utterance as simply an information request. In addition, it is hard to define a structured state that captures the diverse semantics in many utterances (e.g., defining “most of”, “might be”; see details in Table 1).

To model both structured and open-ended context, we propose the Dynamic Knowledge Graph Network (DynoNet), in which the dialogue state is modeled as a knowledge graph with an embedding for each node (Section 3). Our model is similar to EntNet (Henaff et al., 2017) in that node/entity embeddings are updated recurrently given new utterances. The difference is that we structure entities as a knowledge graph; as the dialogue proceeds, new nodes are added and new context is propagated on the graph. An attention-based mechanism (Bahdanau et al., 2015) over the node embeddings drives generation of new utterances. Our model’s use of knowledge graphs captures the grounding capability of classic task-oriented systems and the graph embedding provides the representational flexibility of neural models.

The naturalness of communication in the symmetric collaborative setting enables large-scale data collection: we were able to crowdsource around 11K human-human dialogues on Amazon Mechanical Turk (AMT) in less than 15 hours. The dataset is publicly available at https://stanfordnlp.github.io/cocoa/. We show that the new dataset calls for more flexible representations beyond fully-structured states (Section 2.2).

In addition to conducting the third-party human evaluation adopted by most work (Liu et al., 2016; Li et al., 2016b, c), we also conduct partner evaluation (Wen et al., 2017) where AMT workers rate their conversational partners (other workers or our models) based on fluency, correctness, cooperation, and human-likeness. We compare DynoNet with baseline neural models and a strong rule-based system. The results show that DynoNet can perform the task with humans efficiently and naturally; it also captures some strategic aspects of human-human dialogues.

The contributions of this work are: (i) a new symmetric collaborative dialogue setting and a large dialogue corpus that pushes the boundaries of existing dialogue systems; (ii) DynoNet, which integrates semantically rich utterances with structured knowledge to represent open-ended dialogue states; (iii) multiple automatic metrics based on bot-bot chat and a comparison of third-party and partner evaluation.

Symmetric Collaborative Dialogue

We begin by introducing a collaborative task between two agents and describe the human-human dialogue collection process. We show that our data exhibits diverse, interesting language phenomena.

In the symmetric collaborative dialogue setting, there are two agents, A and B, each with a private knowledge base— $\text{KB}_{\text{A}}$ and $\text{KB}_{\text{B}}$ , respectively. Each knowledge base includes a list of items, where each item has a value for each attribute. For example, in the MutualFriends setting, Figure 1, items are friends and attributes are name, school, etc. There is a shared item that A and B both have; their goal is to converse with each other to determine the shared item and select it. Formally, an agent is a mapping from its private KB and the dialogue thus far (sequence of utterances) to the next utterance to generate or a selection. A dialogue is considered successful when both agents correctly select the shared item. This setting has parallels in human-computer collaboration where each agent has complementary expertise.

2.2 Data collection

We created a schema with 7 attributes and approximately 3K entities (attribute values). To elicit linguistic and strategic variants, we generate a random scenario for each task by varying the number of items (5 to 12), the number of attributes (3 or 4), and the distribution of values for each attribute (skewed to uniform). See Appendix A and B for details of schema and scenario generation.

We crowdsourced dialogues on AMT by randomly pairing up workers to perform the task within 5 minutes. (If the workers exceed the time limit, the dialogue is marked as unsuccessful but still logged.) Our chat interface is shown in Figure 2. To discourage random guessing, we prevent workers from selecting more than once every 10 seconds. Our task was very popular and we collected 11K dialogues over a period of 13.5 hours. (Tasks are put up in batches; the total time excludes intervals between batches.) Of these, over 9K dialogues are successful. Unsuccessful dialogues are usually the result of one of the workers leaving the chat prematurely.

2.3 Dataset statistics

We show the basic statistics of our dataset in Table 3. An utterance is defined as a message sent by one of the agents. The average utterance length is short due to the informality of the chat; however, an agent usually sends multiple utterances in one turn. Some example dialogues are shown in Table 6 and Appendix I.

We categorize utterances into coarse types (inform, ask, answer, greeting, and apology) by pattern matching (Appendix E). Of all utterances, 7.4% are multi-type and 30.9% contain more than one entity. In Table 1, we show example utterances with rich semantics that cannot be sufficiently represented by traditional slot-values. Some of the standard ones are also non-trivial due to coreference and logical compositionality.

Our dataset also exhibits some interesting communication phenomena. Coreference occurs frequently when people check multiple attributes of one item. Sometimes mentions are dropped, as an utterance simply continues from the partner’s utterance. People occasionally use external knowledge to group items with out-of-schema attributes (e.g., gender based on names, location based on schools). We summarize these phenomena in Table 2. In addition, we find that 30% of utterances involve cross-talk, where the conversation does not progress linearly (e.g., italic utterances in Figure 1), a common characteristic of online chat (Ivanovic, 2005).

One strategic aspect of this task is choosing the order of attributes to mention. We find that people tend to start from attributes with fewer unique values, e.g., “all my friends like morning” given the $\text{KB}_{\text{B}}$ in Table 6, as intuitively it would help exclude items quickly given fewer values to check. (Our goal is to model human behavior; thus we do not discuss the optimal strategy here.) We provide a more detailed analysis of strategy in Section 4.2 and Appendix F.

Dynamic Knowledge Graph Network

The diverse semantics in our data motivates us to combine unstructured representation of the dialogue history with structured knowledge. Our model consists of three components shown in Figure 3: (i) a dynamic knowledge graph, which represents the agent’s private KB and shared dialogue history as a graph (Section 3.1), (ii) a graph embedding over the nodes (Section 3.2), and (iii) an utterance generator (Section 3.3).

The knowledge graph represents entities and relations in the agent’s private KB, e.g., item-1’s company is google. As the conversation unfolds, utterances are embedded and incorporated into node embeddings of mentioned entities. For instance, in Figure 3, “anyone went to columbia” updates the embedding of columbia. Next, each node recursively passes its embedding to neighboring nodes so that related entities (e.g., those in the same row or column) also receive information from the most recent utterance. In our example, jessica and josh both receive new context when columbia is mentioned. Finally, the utterance generator, an LSTM, produces the next utterance by attending to the node embeddings.

Given a dialogue of $T$ utterances, we construct graphs $(G_{t})_{t=1}^{T}$ over the KB and dialogue history for agent A. It is important to differentiate perspectives of the two agents as they have different KBs. Thereafter we assume the perspective of agent A, i.e., accessing $\text{KB}_{\text{A}}$ for A only, and refer to B as the partner. There are three types of nodes: item nodes, attribute nodes, and entity nodes. Edges between nodes represent their relations. For example, (item-1, hasSchool, columbia) means that the first item has attribute school whose value is columbia. An example graph is shown in Figure 3. The graph $G_{t}$ is updated based on utterance $t$ by taking $G_{t-1}$ and adding a new node for any entity mentioned in utterance $t$ but not in $\text{KB}_{\text{A}}$ . We use a rule-based lexicon to link text spans to entities. See details in Appendix D.
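The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `KnowledgeGraph` class, the function names, and the `isValueOf` edge label are hypothetical, and entity linking is assumed to have already produced `(entity, attribute)` pairs.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # node name -> node type
    edges: set = field(default_factory=set)    # (head, relation, tail) triples

    def add_node(self, name, node_type):
        # Keep the first type recorded for a node; never overwrite it.
        self.nodes.setdefault(name, node_type)


def build_initial_graph(kb):
    """Build G_0 from a private KB given as a list of {attribute: value} dicts."""
    graph = KnowledgeGraph()
    for index, item in enumerate(kb, start=1):
        item_node = f"item-{index}"
        graph.add_node(item_node, "item")
        for attribute, value in item.items():
            graph.add_node(attribute, "attribute")
            graph.add_node(value, "entity")
            # e.g. (item-1, hasSchool, columbia)
            graph.edges.add((item_node, "has" + attribute.capitalize(), value))
            # Hypothetical label tying an entity to its attribute node.
            graph.edges.add((value, "isValueOf", attribute))
    return graph


def update_graph(graph, linked_mentions):
    """Turn G_{t-1} into G_t by adding nodes for mentioned entities not yet present."""
    for entity, attribute in linked_mentions:
        if entity not in graph.nodes:
            graph.add_node(entity, "entity")
            graph.add_node(attribute, "attribute")
            graph.edges.add((entity, "isValueOf", attribute))
    return graph
```

Entities already in $\text{KB}_{\text{A}}$ are left untouched by `update_graph`; only their embeddings change as the dialogue proceeds.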

3.2 Graph Embedding

Given a knowledge graph, we are interested in computing a vector representation for each node $v$ that captures both its unstructured context from the dialogue history and its structured context in the KB. A node embedding $V_{t}(v)$ for each node $v\in G_{t}$ is built from three parts: structural properties of an entity defined by the KB, embeddings of utterances in the dialogue history, and message passing between neighboring nodes.

Simple structural properties of the KB often govern what is talked about; e.g., a high-frequency entity is usually interesting to mention (consider “All my friends like dancing.”). We represent this type of information as a feature vector $F_{t}(v)$ , which includes the degree and type (item, attribute, or entity type) of node $v$ , and whether it has been mentioned in the current turn. Each feature is encoded as a one-hot vector and they are concatenated to form $F_{t}(v)$ .
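As a rough sketch of the feature vector $F_{t}(v)$, the snippet below concatenates three one-hot encodings. The list of node types and the degree cap `MAX_DEGREE` are assumptions made for illustration; the text does not specify how large degrees are bucketed.

```python
# Hypothetical inventory of node types (item, attribute, or an entity type).
NODE_TYPES = ["item", "attribute", "name", "school", "company"]
MAX_DEGREE = 12  # assumed cap; larger degrees share the last bucket


def one_hot(index, size):
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def feature_vector(degree, node_type, mentioned_this_turn):
    """Concatenate one-hot encodings of degree, node type, and the mention flag."""
    degree_part = one_hot(min(degree, MAX_DEGREE), MAX_DEGREE + 1)
    type_part = one_hot(NODE_TYPES.index(node_type), len(NODE_TYPES))
    mention_part = one_hot(int(mentioned_this_turn), 2)
    return degree_part + type_part + mention_part
```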

Mention Vectors.

Here, $\sigma$ is the sigmoid function and $W^{\text{inc}}$ is a parameter matrix.
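One gated update that is consistent with the sigmoid $\sigma$ and the matrix $W^{\text{inc}}$ named here can be sketched as follows. The exact form is an assumption rather than a definition taken from the text: a mentioned node mixes its previous mention vector with the current utterance representation through an element-wise gate, and an unmentioned node keeps its previous vector. The function names and the requirement that both vectors have the same dimension are illustrative choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def update_mention_vector(m_prev, u_repr, w_inc, mentioned):
    """Assumed gated update of a node's mention vector.

    m_prev:  previous mention vector of the node
    u_repr:  representation of the current utterance (same length as m_prev)
    w_inc:   gate parameters, shape len(m_prev) x (2 * len(m_prev))
    """
    if not mentioned:
        return list(m_prev)  # unmentioned nodes inherit their old vector
    gate = [sigmoid(z) for z in matvec(w_inc, m_prev + u_repr)]
    return [g * m + (1.0 - g) * u for g, m, u in zip(gate, m_prev, u_repr)]
```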

Recursive Node Embeddings.

We propagate information between nodes according to the structure of the knowledge graph. In Figure 3, given “anyone went to columbia?”, the agent should focus on her friends who went to Columbia University. Therefore, we want this utterance to be sent to item nodes connected to columbia, and one step further to other attributes of these items because they might be mentioned next as relevant information, e.g., jessica and josh.

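We compute the node embeddings recursively, analogous to belief propagation:

$$V_{t}^{k}(v)=\max_{v^{\prime}\in N_{t}(v)}\tanh\left(W^{\text{mp}}\left[V_{t}^{k-1}(v^{\prime}),R(e_{v\rightarrow v^{\prime}})\right]\right),$$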

where $V_{t}^{k}(v)$ is the depth- $k$ node embedding at turn $t$ and $N_{t}(v)$ denotes the set of nodes adjacent to $v$ . The message from a neighboring node $v^{\prime}$ depends on its embedding at depth- $(k-1)$ , the edge label $e_{v\rightarrow v^{\prime}}$ (embedded by a relation embedding function $R$ ), and a parameter matrix $W^{\text{mp}}$ . Messages from all neighbors are aggregated by $\max$ , the element-wise max operation. (Using sum or mean slightly hurts performance.) Example message passing paths are shown in Figure 3.
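One step of this message passing can be sketched in plain Python as below. The sketch assumes a tanh nonlinearity on each message and gives isolated nodes a zero vector; both choices, along with the function and argument names, are illustrative assumptions.

```python
import math


def message_passing_step(prev, neighbors, relation_emb, w_mp):
    """Compute depth-k embeddings from depth-(k-1) embeddings.

    prev:         node -> depth-(k-1) embedding (list of floats)
    neighbors:    node -> list of (neighbor, edge label) pairs
    relation_emb: edge label -> relation embedding
    w_mp:         matrix with one row per output dimension
    """
    out = {}
    dim = len(w_mp)
    for node, adjacent in neighbors.items():
        messages = []
        for other, label in adjacent:
            inputs = prev[other] + relation_emb[label]  # concatenation
            messages.append(
                [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in w_mp]
            )
        if messages:
            # Element-wise max over the messages from all neighbors.
            out[node] = [max(m[d] for m in messages) for d in range(dim)]
        else:
            out[node] = [0.0] * dim  # assumed default for isolated nodes
    return out
```

Applying the step $K$ times and concatenating the per-depth outputs gives each node a view of everything within $K$ hops of it.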

The final node embedding is the concatenation of embeddings at each depth:

$$V_{t}(v)=\left[V_{t}^{0}(v),\dots,V_{t}^{K}(v)\right],$$

where $K$ is a hyperparameter (we experiment with $K\in\{0,1,2\}$ ) and $V_{t}^{0}(v)=\left[F_{t}(v),M_{t}(v)\right]$ .

3.3 Utterance Embedding and Generation

We embed and generate utterances using Long Short Term Memory (LSTM) networks that take the graph embeddings into account.

On turn $t$ , upon receiving an utterance consisting of $n_{t}$ tokens, $x_{t}=(x_{t,1},\dots,x_{t,n_{t}})$ , the LSTM maps it to a vector as follows:

$$h_{t,j}=\text{LSTM}_{\text{enc}}\left(h_{t,j-1},A_{t}(x_{t,j})\right),$$

where $h_{t,0}=h_{t-1,n_{t-1}}$ , and $A_{t}$ is an entity abstraction function, explained below. The final hidden state $h_{t,n_{t}}$ is used as the utterance embedding $u_{t}$ , which updates the mention vectors as described in Section 3.2.

In our dialogue task, the identity of an entity is unimportant. For example, replacing google with alphabet in Figure 1 should make little difference to the conversation. The role of an entity is determined instead by its relation to other entities and relevant utterances. Therefore, we define the abstraction $A_{t}(y)$ for a word $y$ as follows: if $y$ is linked to an entity $v$ , then we represent an entity by its type (school, company etc.) embedding concatenated with its current node embedding: $A_{t}(y)=[E_{\text{type}(y)},V_{t}(v)]$ . Note that $V_{t}(v)$ is determined only by its structural features and its context. If $y$ is a non-entity, then $A_{t}(y)$ is the word embedding of $y$ concatenated with a zero vector of the same dimensionality as $V_{t}(v)$ . This way, the representation of an entity only depends on its structural properties given by the KB and the dialogue context, which enables the model to generalize to unseen entities at test time.
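The abstraction $A_{t}(y)$ can be illustrated with a small sketch. The dictionaries standing in for the embedding tables and the `entity_links` mapping (token to node and entity type) are hypothetical; in the model these would come from learned parameters and the entity linker.

```python
def abstraction(token, entity_links, type_emb, word_emb, node_emb, node_dim):
    """Return the LSTM input for a token.

    Entities are represented by [type embedding, node embedding];
    other words by [word embedding, zero vector of length node_dim].
    """
    if token in entity_links:
        node, entity_type = entity_links[token]
        return type_emb[entity_type] + node_emb[node]
    return word_emb[token] + [0.0] * node_dim
```

Because the surface form never enters the entity branch, two different company names attached to the same node embedding produce identical inputs, which is what allows generalization to unseen entities.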

Generation.

Now, assuming we have embedded utterance $x_{t-1}$ into $h_{t-1,n_{t-1}}$ as described above, we use another LSTM to generate utterance $x_{t}$ . Formally, we carry over the last utterance embedding $h_{t,0}=h_{t-1,n_{t-1}}$ and define:

$$h_{t,j}=\text{LSTM}_{\text{dec}}\left(h_{t,j-1},\left[A_{t}(x_{t,j}),c_{t,j}\right]\right),$$

where $c_{t,j}$ is a weighted sum of node embeddings in the current turn: $c_{t,j}=\sum_{v\in G_{t}}\alpha_{t,j,v}V_{t}(v)$ , where $\alpha_{t,j,v}$ are the attention weights over the nodes. Intuitively, high weight should be given to relevant entity nodes as shown in Figure 3. We compute the weights through the standard attention mechanism (Bahdanau et al., 2015):

$$\alpha_{t,j}=\text{softmax}(s_{t,j}),\qquad s_{t,j,v}=w^{\text{attn}}\cdot\tanh\left(W^{\text{attn}}\left[h_{t,j-1},V_{t}(v)\right]\right),$$

where the vector $w^{\text{attn}}$ and the matrix $W^{\text{attn}}$ are parameters.

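Finally, we define a distribution over both words in the vocabulary and nodes in $G_{t}$ using the copying mechanism of Jia and Liang (2016), in which the copy score of a node is its unnormalized attention score $s_{t,j,v}$ :

$$p(x_{t,j+1}=y\mid G_{t},x_{t,\leq j})\propto\exp\left(\left[W^{\text{vocab}}h_{t,j}+b\right]_{y}\right),$$

$$p(x_{t,j+1}=r(v)\mid G_{t},x_{t,\leq j})\propto\exp\left(s_{t,j,v}\right),$$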

where $y$ is a word in the vocabulary, $W^{\text{vocab}}$ and $b$ are parameters, and $r(v)$ is the realization of the entity represented by node $v$ , e.g., google is realized to “Google” during copying. We realize an entity by sampling from the empirical distribution of its surface forms found in the training data.
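A joint distribution over vocabulary words and graph nodes of the kind described above can be sketched as a single softmax over the concatenated scores. The function name and the dictionary inputs are illustrative assumptions; the vocabulary logits stand for the entries of $W^{\text{vocab}}h+b$ and the node scores for the unnormalized attention scores.

```python
import math


def output_distribution(vocab_logits, node_scores):
    """Softmax over words and copyable nodes together.

    vocab_logits: word -> logit
    node_scores:  node -> unnormalized attention score
    Returns a dict keyed by ("word", w) or ("node", v).
    """
    keys = [("word", w) for w in vocab_logits] + [("node", v) for v in node_scores]
    logits = list(vocab_logits.values()) + list(node_scores.values())
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return {key: e / total for key, e in zip(keys, exps)}
```

When a node is sampled, its realization $r(v)$ is emitted in place of a vocabulary word.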

Experiments

We compare our model with a rule-based system and a baseline neural model. Both automatic and human evaluations are conducted to test the models in terms of fluency, correctness, cooperation, and human-likeness. The results show that DynoNet is able to converse with humans in a coherent and strategic way.

We randomly split the data into train, dev, and test sets (8:1:1). We use a one-layer LSTM with 100 hidden units and 100-dimensional word vectors for both the encoder and the decoder (Section 3.3). Each successful dialogue is turned into two examples, each from the perspective of one of the two agents. We maximize the log-likelihood of all utterances in the dialogues. The parameters are optimized by AdaGrad (Duchi et al., 2010) with an initial learning rate of 0.5. We trained for at least 10 epochs; after that, training stops if there is no improvement on the dev set for 5 epochs. By default, we perform $K=2$ iterations of message passing to compute node embeddings (Section 3.2). For decoding, we sequentially sample from the output distribution with a softmax temperature of 0.5. Since selection is a common ‘utterance’ in our dataset and neural generation models are susceptible to over-generating common sentences, we halve its probability during sampling. Hyperparameters are tuned on the dev set.
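The decoding step can be sketched as follows. The sketch assumes that halving the selection probability is followed by renormalization, and it uses a hypothetical `<select>` token to stand for the selection action; neither detail is specified in the text.

```python
import math
import random


def decoding_probs(logits, temperature=0.5, select_token="<select>"):
    """Temperature-scaled softmax with the selection probability halved."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(value - top) for tok, value in scaled.items()}
    if select_token in weights:
        weights[select_token] *= 0.5  # halve, then renormalize below
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def sample_token(logits, rng, **kwargs):
    """Draw one token from the adjusted distribution."""
    probs = decoding_probs(logits, **kwargs)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
```

A temperature below 1 sharpens the distribution, so sampling stays close to the most likely tokens while still allowing some variation.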

We compare DynoNet with its static cousin (StanoNet) and a rule-based system (Rule). StanoNet uses $G_{0}$ throughout the dialogue, thus the dialogue history is completely contained in the LSTM states instead of being injected into the knowledge graph. Rule maintains weights for each entity and each item in the KB to decide what to talk about and which item to select. It has a pattern-matching semantic parser, a rule-based policy, and a templated generator. See Appendix G for details.

4.2 Evaluation

We test our systems in two interactive settings: bot-bot chat and bot-human chat. We perform both automatic evaluation and human evaluation.

For language variation, we report the average utterance length $L_{u}$ and the unigram entropy $H$ in Table 4. Compared to Rule, the neural models tend to generate shorter utterances (Li et al., 2016b; Serban et al., 2017b). However, they are more diverse; for example, questions are asked in multiple ways such as “Do you have …”, “Any friends like …”, “What about …”.
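The two language-variation metrics can be computed along the following lines. Whitespace tokenization and a base-2 logarithm are assumptions made for this sketch; the text does not state either.

```python
import math
from collections import Counter


def average_utterance_length(utterances):
    """Mean number of whitespace-separated tokens per utterance."""
    return sum(len(u.split()) for u in utterances) / len(utterances)


def unigram_entropy(utterances):
    """Entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(token for u in utterances for token in u.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A bot that repeats a few templates yields a low entropy even when its utterances are long, which is why the two numbers are reported together.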

At the discourse level, we expect the distribution of a bot’s utterance types to match that of humans. We show percentages of each utterance type in Table 4. For Rule, the decision about which action to take is written in the rules, while StanoNet and DynoNet learned to behave in a more human-like way, frequently informing and asking questions.

To measure effectiveness, we compute the overall success rate ( $C$ ) and the success rate per turn ( $C_{T}$ ) and per selection ( $C_{S}$ ). As shown in Table 4, humans are the best at this game, followed by Rule which is comparable to DynoNet.

Next, we investigate the strategies leading to these results. An agent needs to decide which entity/attribute to check first to quickly reduce the search space. We hypothesize that humans tend to first focus on a majority entity and an attribute with fewer unique values (Section 2.3). For example, in the scenario in Table 6, time and location are likely to be mentioned first. We show the average frequency of first-mentioned entities (#Ent1) and the average number of unique values for first-mentioned attributes ( $|\text{Attr}_{1}|$ ) in Table 4. Both numbers are normalized to $[0,1]$ with respect to all entities/attributes in the corresponding KB. Both DynoNet and StanoNet successfully match the human starting strategy by favoring entities of higher frequency and attributes of smaller domain size.

To examine the overall strategy, we show the average number of attributes (#Attr) and entities (#Ent) mentioned during the conversation in Table 4. Humans and DynoNet strategically focus on a few attributes and entities, whereas Rule needs almost twice as many entities to achieve similar success rates. This suggests that the effectiveness of Rule mainly comes from large amounts of unselective information, which is consistent with comments from its human partners.

Partner Evaluation.

We generated 200 new scenarios and put up the bots on AMT using the same chat interface that was used for data collection. The bots follow simple turn-taking rules explained in Appendix H. Each AMT worker is randomly paired with Rule, StanoNet, DynoNet, or another human (but the worker doesn’t know which), and we make sure that all four types of agents are tested in each scenario at least once. At the end of each dialogue, humans are asked to rate their partner in terms of fluency, correctness, cooperation, and human-likeness from 1 (very bad) to 5 (very good), along with optional comments.

We show the average ratings (with significance tests) in Table 5 and the histograms in Appendix J. In terms of fluency, the models have similar performance since the utterances are usually short. Judgment on correctness is a mere guess since the evaluator cannot see the partner’s KB; we will analyze correctness more meaningfully in the third-party evaluation below.

Noticeably, DynoNet is more cooperative than the other models. As shown in the example dialogues in Table 6, DynoNet cooperates smoothly with the human partner, e.g., replying with relevant information about morning/indoor friends when the partner mentioned that all her friends prefer morning and most like indoor. StanoNet starts well but doesn’t follow up on the morning friend, presumably because the morning node is not updated dynamically when mentioned by the partner. Rule follows the partner poorly. In the comments, the biggest complaint about Rule was that it was not ‘listening’ or ‘understanding’. Overall, DynoNet achieves better partner satisfaction, especially in cooperation.

Third-party Evaluation.

We also created a third-party evaluation task, where an independent AMT worker is shown a conversation and the KB of one of the agents; she is asked to rate the same aspects of the agent as in the partner evaluation and provide justifications. Each agent in a dialogue is rated by at least 5 people.

The average ratings and histograms are shown in Table 5 and Appendix J. For correctness, we see that Rule has the best performance since it always tells the truth, whereas humans can make mistakes due to carelessness and the neural models can generate false information. For example, in Table 6, DynoNet ‘lied’ when saying that it has a morning friend who likes outdoor.

Surprisingly, there is a discrepancy between the two evaluation modes in terms of cooperation and human-likeness. Manual analysis of the comments indicates that third-party evaluators focus less on the dialogue strategy and more on linguistic features, probably because they were not fully engaged in the dialogue. For example, justifications for cooperation often mention frequent questions and timely answers, while less attention is paid to what is asked about.

For human-likeness, partner evaluation is largely correlated with coherence (e.g., not repeating or ignoring past information) and task success, whereas third-party evaluators often rely on informality (e.g., usage of colloquialisms like “hiya”, capitalization, and abbreviation) or intuition. Interestingly, third-party evaluators noted most phenomena listed in Table 2 as indicators of a human speaker, e.g., correcting oneself and making chit-chat beyond simply finishing the task. See example comments in Appendix K.

4.3 Ablation Studies

Our model has two novel designs: entity abstraction and message passing for node embeddings. Table 7 shows what happens if we ablate these. When the number of message passing iterations, $K$ , is reduced from 2 to 0, the loss consistently increases. Removing entity abstraction—meaning adding entity embeddings to node embeddings and the LSTM input embeddings—also degrades performance. This shows that DynoNet benefits from contextually-defined, structural node embeddings rather than ones based on a classic lookup table.

Discussion and Related Work

There has been a recent surge of interest in end-to-end task-oriented dialogue systems, though progress has been limited by the size of available datasets (Serban et al., 2015a). Most work focuses on information-querying tasks, using Wizard-of-Oz data collection (Williams et al., 2016; Asri et al., 2016) or simulators (Bordes and Weston, 2017; Li et al., 2016d). In contrast, collaborative dialogues are easy to collect as natural human conversations, and are also challenging enough given the large number of scenarios and diverse conversation phenomena. There are some interesting strategic dialogue datasets, namely Settlers of Catan (Afantenos et al., 2012) (2K turns) and the Cards corpus (Potts, 2012) (1.3K dialogues), as well as work on dialogue strategies (Keizer et al., 2017; Vogel et al., 2013), though no full dialogue system has been built for these datasets.

Most task-oriented dialogue systems follow the POMDP-based approach (Williams and Young, 2007; Young et al., 2013). Despite their success (Wen et al., 2017; Dhingra et al., 2017; Su et al., 2016), the requirement for handcrafted slots limits their scalability to new domains and burdens data collection with extra state labeling. To go past this limit, Bordes and Weston (2017) proposed a Memory-Networks-based approach without domain-specific features. However, the memory is unstructured and interfacing with KBs relies on API calls, whereas our model embeds both the dialogue history and the KB structurally. Williams et al. (2017) use an LSTM to automatically infer the dialogue state, but as they focus on dialogue control rather than the full problem, the response is modeled as a templated action, which restricts the generation of richer utterances. Our network architecture is most similar to EntNet (Henaff et al., 2017), where memories are also updated by input sentences recurrently. The main difference is that our model allows information to be propagated between structured entities, which is shown to be crucial in our setting (Section 4.3).

Our work is also related to language generation conditioned on knowledge bases (Mei et al., 2016; Kiddon et al., 2016). One challenge here is to avoid generating false or contradicting statements, which is currently a weakness of neural models. Our model is mostly accurate when generating facts and answering existence questions about a single entity, but will need a more advanced attention mechanism for generating utterances involving multiple entities, e.g., attending to items or attributes first, then selecting entities, or generating high-level concepts before composing them into natural tokens (Serban et al., 2017a).

In conclusion, we believe the symmetric collaborative dialogue setting and our dataset provide unique opportunities at the interface of traditional task-oriented dialogue and open-domain chat. We also offered DynoNet as a promising means for open-ended dialogue state representation. Our dataset facilitates the study of pragmatics and human strategies in dialogue—a good stepping stone towards learning more complex dialogues such as negotiation.

This work is supported by DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462. Mike Kayser worked on an early version of the project while he was at Stanford. We also thank members of the Stanford NLP group for insightful discussions.

Reproducibility.

All code, data, and experiments for this paper are available on the CodaLab platform: https://worksheets.codalab.org/worksheets/0xc757f29f5c794e5eb7bfa8ca9c945573.

References

Appendix A Knowledge Base Schema

The attribute set $\mathcal{A}$ for the MutualFriends task contains name, school, major, company, hobby, time-of-day preference, and location preference. Each attribute $a$ has a set of possible values (entities) $\mathcal{E}_{a}$ . For name, school, major, company, and hobby, we collected a large set of values from various online sources (names: https://www.ssa.gov/oact/babynames/decades/century.html; schools: http://doors.stanford.edu/~sr/universities.html; majors: http://www.a2zcolleges.com/majors; companies: https://en.wikipedia.org/wiki/List_of_companies_of_the_United_States; hobbies: https://en.wikipedia.org/wiki/List_of_hobbies). We used three possible values (morning, afternoon, and evening) for the time-of-day preference, and two possible values (indoors and outdoors) for the location preference.

Appendix B Scenario Generation

We generate scenarios randomly to vary task complexity and elicit linguistic and strategic variants. A scenario $S$ is characterized by the number of items ( $N_{S}$ ), the attribute set ( $\mathcal{A}_{S}$ ) whose size is $M_{S}$ , and the values for each attribute $a\in\mathcal{A}_{S}$ in the two KBs.

Sample $N_{S}$ and $M_{S}$ uniformly from $\{5,\dots,12\}$ and $\{3,4\}$ respectively.

Generate $\mathcal{A}_{S}$ by sampling $M_{S}$ attributes without replacement from $\mathcal{A}$ .

For each attribute $a\in\mathcal{A}_{S}$ , sample the concentration parameter $\alpha_{a}$ uniformly from the set $\{0.3,1,3\}$ .

Generate two KBs by sampling $N_{S}$ values for each attribute $a$ from a Dirichlet-multinomial distribution over the value set $\mathcal{E}_{a}$ with the concentration parameter $\alpha_{a}$ .

We repeat the last step until the two KBs have one unique common item.
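The steps above can be sketched with the standard library as follows. This is an illustration under stated assumptions: the `schema` argument (attribute name to list of values) is hypothetical, each KB draws its own multinomial from the Dirichlet prior, and "one unique common item" is read as exactly one distinct row shared by the two KBs.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_scenario(schema, rng):
    n_items = rng.randint(5, 12)                   # step 1: N_S
    n_attrs = rng.choice([3, 4])                   # step 1: M_S
    attrs = rng.sample(sorted(schema), n_attrs)    # step 2: attribute set
    alphas = {a: rng.choice([0.3, 1, 3]) for a in attrs}  # step 3
    while True:                                    # step 4, repeated as needed
        kbs = []
        for _ in range(2):
            columns = {}
            for a in attrs:
                values = schema[a]
                probs = sample_dirichlet(alphas[a], len(values), rng)
                columns[a] = rng.choices(values, weights=probs, k=n_items)
            kbs.append([tuple(columns[a][i] for a in attrs) for i in range(n_items)])
        if len(set(kbs[0]) & set(kbs[1])) == 1:    # exactly one shared item
            return attrs, alphas, kbs
```

Small values of $\alpha_{a}$ concentrate the mass on a few values, which produces the skewed columns that people tend to mention first.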

Appendix C Chat Interface

In order to collect real-time dialogue between humans, we set up a web server and redirect AMT workers to our website. Visitors are randomly paired up as they arrive. For each pair, we choose a random scenario, and randomly assign a KB to each dialogue participant. We instruct people to play intelligently, to refrain from brute-force tactics (e.g., mentioning every attribute value), and to use grammatical sentences. To discourage random guessing, we prevent users from selecting a friend (item) more than once every 10 seconds. Each worker was paid $0.35 for a successful dialogue within a 5-minute time limit. We log each utterance in the dialogue along with timing information.

Appendix D Entity Linking and Realization

We use a rule-based lexicon to link text spans to entities. For every entity in the schema, we compute different variations of its canonical name, including acronyms, strings with a certain edit distance, prefixes, and morphological variants. Given a text span, a set of candidate entities is returned by string matching. A heuristic ranker then scores each candidate (e.g., considering whether the span is a substring of a candidate, the edit distance between the span and a candidate etc.). The highest-scoring candidate is returned.
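A much-simplified version of such a lexicon might look like the sketch below. It covers only acronyms and prefixes, and it swaps in `difflib.SequenceMatcher` similarity as the ranker instead of the edit-distance and substring heuristics described above; the minimum prefix length of 4 and all function names are assumptions.

```python
from difflib import SequenceMatcher


def variants(name):
    """Generate surface variants of a canonical (lowercase) entity name."""
    words = name.split()
    forms = {name}
    if len(words) > 1:
        forms.add("".join(word[0] for word in words))  # acronym
    # Prefixes of at least 4 characters (assumed threshold).
    forms.update(name[:k].strip() for k in range(4, len(name)))
    return forms


def build_lexicon(entities):
    """Map each variant to the set of entities it may refer to."""
    lexicon = {}
    for entity in entities:
        for form in variants(entity):
            lexicon.setdefault(form, set()).add(entity)
    return lexicon


def link(span, lexicon):
    """Return the best-scoring candidate entity for a text span, or None."""
    key = span.lower().strip()
    candidates = lexicon.get(key, set())
    if not candidates:
        return None
    return max(sorted(candidates), key=lambda e: SequenceMatcher(None, key, e).ratio())
```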

A linked entity is considered as a single token and its surface form is ignored in all models. At generation time, we realize an entity by sampling from the empirical distribution of its surface forms in the training set.

Appendix E Utterance Categorization

We categorize utterances into inform, ask, answer, greeting, apology heuristically by pattern matching.

An ask utterance asks for information regarding the partner’s KB. We detect these utterances by checking for the presence of a ‘?’ and/or a question word like “do”, “does”, “what”, etc.

An inform utterance provides information about the agent’s KB. We define it as an utterance that mentions entities in the KB and is not an ask utterance.

An answer utterance simply provides a positive/negative response to a question, containing words like “yes”, “no”, “nope”, etc.

A greeting utterance contains words like “hi” or “hello”; it often occurs at the beginning of a dialogue.

An apology utterance contains the word “sorry”, which is typically associated with corrections and wrong selections.

See Table 2 and Table 1 for examples of these utterance types.
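The heuristics above can be sketched as a small pattern matcher. The word lists contain only the examples given in the text, so the real lists are presumably longer; entity detection by substring match is a simplification made for this sketch.

```python
import re

QUESTION_WORDS = {"do", "does", "what"}
ANSWER_WORDS = {"yes", "no", "nope"}
GREETING_WORDS = {"hi", "hello"}


def categorize(utterance, kb_entities):
    """Return the set of coarse types matched by an utterance."""
    text = utterance.lower()
    tokens = set(re.findall(r"[a-z']+", text))
    types = set()
    if "?" in text or tokens & QUESTION_WORDS:
        types.add("ask")
    if "ask" not in types and any(entity in text for entity in kb_entities):
        types.add("inform")
    if tokens & ANSWER_WORDS:
        types.add("answer")
    if tokens & GREETING_WORDS:
        types.add("greeting")
    if "sorry" in tokens:
        types.add("apology")
    return types
```

Returning a set rather than a single label accommodates multi-type utterances.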

Appendix F Strategy

During scenario generation, we varied the number of attributes, the number of items in each KB, and the distribution of values for each attribute. We find that as the number of items and/or attributes grows, the dialogue length and the completion time also increase, indicating that the task becomes harder. We also anticipated that varying the value of $\alpha$ would impact the overall strategy (for example, the order in which attributes are mentioned) since $\alpha$ controls the skewness of the distribution of values for an attribute.

On examining the data, we find that humans tend to first mention attributes with a more skewed (i.e., less uniform) distribution of values. Specifically, we rank the $\alpha$ values of all attributes in a scenario (see step 3 in Section B), and bin them into 3 distribution groups (least_uniform, medium, and most_uniform) according to the ranking, where higher $\alpha$ values correspond to more uniform distributions. For scenarios with 3 attributes, each group contains one attribute. For scenarios with 4 attributes, we put the two attributes with rankings in the middle into medium. In Figure 4, we plot the histogram of the distribution group of the first-mentioned attribute in a dialogue, which shows that skewed attributes are mentioned much more frequently.
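The binning can be sketched as below. How ties between equal $\alpha$ values are ranked is not stated in the text; this sketch simply keeps the input order for ties, which is an assumption.

```python
def distribution_groups(alphas):
    """Assign each attribute to a group by the rank of its alpha value.

    alphas: attribute -> concentration parameter (3 or 4 attributes).
    """
    ranked = sorted(alphas, key=alphas.get)  # lowest alpha = least uniform
    groups = {attr: "medium" for attr in ranked[1:-1]}
    groups[ranked[0]] = "least_uniform"
    groups[ranked[-1]] = "most_uniform"
    return groups
```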

Appendix G Rule-based System

The rule-based bot takes the following actions: greeting, informing or asking about a set of entities, answering a question, and selecting an item. The set of entities to inform/ask is sampled randomly given the entity weights. Initially, each entity is weighted by its count in the KB. We then increment or decrement weights of entities mentioned by the partner and its related entities (in the same row or column), depending on whether the mention is positive or negative. A negative mention contains words like “no”, “none”, “n’t” etc. Similarly, each item has an initial weight of 1, which is updated depending on the partner’s mention of its attributes.

If there exists an item with weight larger than 1, the bot selects the highest-weighted item with probability 0.3. If a question is received, the bot informs facts of the entities being asked, e.g., “anyone went to columbia?”, “I have 2 friends who went to columbia”. Otherwise, the bot samples an entity set and randomly chooses between informing and asking about the entities.
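The action policy described above can be sketched as follows. For brevity the sketch samples a single entity rather than a set, and it clips non-positive entity weights to a small constant so that sampling stays well defined; both are assumptions, as are the function and argument names.

```python
import random


def choose_action(item_weights, entity_weights, asked_entities, rng, select_prob=0.3):
    """Pick the rule-based bot's next action.

    item_weights:   item -> weight (initially 1)
    entity_weights: entity -> weight (initially its count in the KB)
    asked_entities: entities in the partner's question, if any
    """
    best_item = max(item_weights, key=item_weights.get)
    if item_weights[best_item] > 1 and rng.random() < select_prob:
        return ("select", best_item)
    if asked_entities:
        return ("inform", list(asked_entities))
    names = list(entity_weights)
    weights = [max(entity_weights[n], 1e-6) for n in names]  # assumed clipping
    entity = rng.choices(names, weights=weights, k=1)[0]
    return (rng.choice(["inform", "ask"]), [entity])
```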

All utterances are generated by sentence templates, and parsing of the partner’s utterance is done by entity linking and pattern matching (Section E).

Appendix H Turn-taking Rules

Turn-taking is universal in human conversations and the bot needs to decide when to ‘talk’ (send an utterance). To prevent the bot from generating utterances continuously and forming a monologue, we allow it to send at most one utterance if the utterance contains any entity, and two utterances otherwise. When sending more than one utterance in a turn, the bot must wait for 1 to 2 seconds in between. In addition, after an utterance is generated by the model (almost instantly), the bot must hold on for some time to simulate message typing before sending. We used a typing speed of 7 chars / sec and added an additional random delay between 0 to 1.5s after ‘typing’. The rules are applied to all models.
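The timing rules can be sketched with two small helpers. The function names are hypothetical, and the extra delay is assumed to be drawn uniformly from the stated range.

```python
import random


def send_delay(utterance, rng, chars_per_sec=7.0, max_extra=1.5):
    """Seconds to wait before sending: simulated typing time plus a random pause."""
    return len(utterance) / chars_per_sec + rng.uniform(0.0, max_extra)


def max_utterances_per_turn(contains_entity):
    """At most one utterance if it mentions an entity, two otherwise."""
    return 1 if contains_entity else 2
```

For example, a 14-character message is held for between 2.0 and 3.5 seconds before it is sent.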

Appendix I Additional Human-Bot Dialogue

We show another set of human-bot/human chats in Table 8. In this scenario, the distributions of values are more uniform than in Table 6. Nevertheless, we see that StanoNet and DynoNet still learned to start from relatively high-frequency entities. They also appear more cooperative and mention relevant entities in the dialogue context compared to Rule.

Appendix J Histograms of Ratings from Human Evaluations

The histograms of ratings from partner and third-party evaluations are shown in Figure 5 and Figure 6, respectively. As these figures show, there are some obvious discrepancies between the ratings made by agents who chatted with the bot and those made by an ‘objective’ third party. These ratings provide some interesting insights into how dialogue participants in this task setting perceive their partners, and what constitutes a ‘human-like’ or a ‘fluent’ partner.

Appendix K Example Comments from Partner and Third-party Evaluations

In Table 9, we show several pairs of ratings and comments on human-likeness for the same dialogue from both the partner evaluation and the third-party evaluation. As a conversation participant, the dialogue partner often judges from the cooperation and strategy perspective, whereas the third-party evaluator relies more on linguistic features (e.g., length, spelling, formality).