Towards a Human-like Open-Domain Chatbot

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le

cs.CL cs.LG cs.NE stat.ML

Introduction

The ability to converse freely in natural language is one of the hallmarks of human intelligence, and is likely a requirement for true artificial intelligence. In order to explore this aspect of intelligence, many researchers are working on open-domain chatbots. Unlike closed-domain chatbots, which respond to keywords or intents to accomplish specific tasks, open-domain chatbots can engage in conversation on any topic. Some open-domain chatbots such as MILABOT (Serban et al., 2017), XiaoIce (Zhou et al., 2018; https://www.msxiaobing.com/), Gunrock (Chen et al., 2018), Mitsuku (Worswick, 2018; https://www.pandorabots.com/mitsuku/) and Cleverbot (by Rollo Carpenter; https://www.cleverbot.com/) display human-like attributes, but rely on complex frameworks, such as dialog managers with knowledge-based, retrieval-based, or rule-based systems. End-to-end neural network approaches (Shang et al., 2015; Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2016; Zhang et al., 2019), on the other hand, offer the simplicity of a single learned model. Despite much research, open-domain chatbots still have weaknesses that prevent them from being generally useful: they often respond to open-ended input in ways that do not make sense, or with replies that are vague and generic.

Here we present Meena, a generative chatbot model that was trained end-to-end on 40B words mined and filtered from public domain social media conversations. With Meena, we push the limits of the end-to-end approach and show that a large-scale low-perplexity model can be a good conversationalist. We use a seq2seq model (Sutskever et al., 2014; Bahdanau et al., 2015) with the Evolved Transformer (So et al., 2019) as the main architecture. The model is trained on multi-turn conversations where the input sequence is all turns of the context (up to 7) and the output sequence is the response. Our best model has 2.6B parameters and achieves a test perplexity of 10.2 based on a vocabulary of 8K BPE subwords Sennrich et al. (2016).

To measure the quality of Meena and other chatbots, we propose a simple human evaluation metric. Sensibleness and Specificity Average (SSA) combines two fundamental aspects of a human-like chatbot: making sense and being specific. We ask human judges to label every model response on these two criteria. The first part of the metric, sensibleness, is a basic requirement. To converse properly with a human, a bot’s responses have to make sense in context; humans typically take this for granted when conversing with one another, and our evaluations find that 97% of human-produced statements meet this criterion (see Section 4.2). However, making sense is not enough. If a model is designed with sensibleness as its only objective, its responses could be vague and boring, since that is a safe strategy to avoid being penalised for not making sense. For example, closed-domain chatbots typically respond with a generic apology when a human asks something outside their domain; some end-to-end learned chatbots respond “I don’t know” to many inputs (Li et al., 2016a); and Turing Test contest entrants often try to avoid detection by being strategically vague (Venkatesh et al., 2018). They succeed in not generating gibberish or contradicting themselves, but at the cost of not really saying anything of substance. To mitigate this, we add a second dimension to the SSA metric, which asks our evaluators whether a response is specific given the context. This prevents bots from hiding behind vague replies, allowing us to more openly examine what they are capable of. As discussed in Section 2.1, this successfully distinguishes between generic and lively responses, while also being simple and easy for crowd workers to understand.

We compare Meena, humans, and other open-domain chatbots using the SSA metric with two types of human evaluation: static and interactive. For static evaluation, we curated a dataset with 1,477 multi-turn conversations. For interactive evaluation, humans could chat about anything they wanted. We were surprised, but pleased, to discover that the SSA metric shows strong correlation with Meena’s perplexity, both in static and interactive evaluation. In other words, the better that Meena fit its training data, the more sensible and specific its chat responses became. At first glance, this result may seem intuitive, but it surprised us because recent research found a poor correlation between human evaluation scores and automatic metrics such as BLEU (Liu et al., 2016; Lowe et al., 2017).

Our best end-to-end learned model has an average of 72% SSA. The full version of Meena scores 79% by incorporating a filtering mechanism and tuned decoding (Section 5). This is still below the 86% SSA achieved by an average human, but is far closer than the other chatbots we tested. We note that humans have very high sensibleness, but significantly lower specificity, as detailed in Section 4.2.

We will also discuss weaknesses of our methodology. For example, our static evaluation dataset is too restricted to capture all aspects of human conversations. Nevertheless, the fact that Meena achieves such a high SSA score and that there is a correlation between SSA and perplexity means that a human-like chatbot, in terms of sensibleness and specificity, could be in sight if we can attain better perplexity.

Our contributions are: (1) proposing a simple human evaluation metric for multi-turn open-domain chatbots that captures basic, but important, attributes of human conversation; (2) showing evidence that perplexity is an automatic metric that correlates with human judgment, in contrast to recent findings on other automatic metrics mentioned above; (3) demonstrating that an end-to-end neural model with sufficiently low perplexity can surpass the sensibleness and specificity of existing chatbots that rely on complex, handcrafted frameworks developed over many years.

Evaluating chatbots

Evaluating chatbots and natural language generation is a well-known challenge (Liu et al., 2016; Lowe et al., 2017; Novikova et al., 2017; Hashimoto et al., 2019), which we aim to address in this paper. First, we propose a human evaluation metric that captures key elements of human-likeness of conversational responses (Section 2.1). We then describe two human-evaluation setups: static, in which we benchmark models on a fixed set of multi-turn contexts to generate responses (Section 2.2); and interactive, where we allow humans to chat freely with chatbots (Section 2.3). Lastly, we detail our automatic evaluation metric for fast development and end-to-end optimization (Section 2.7).

To measure the quality of a response given a context, we propose a sequence of two questions. We first ask whether the response, given the context, makes sense. Sensibleness arguably covers some of the most basic aspects of conversational human-likeness, such as common sense and logical coherence. Sensibleness also captures other important aspects of a chatbot, such as consistency. The crowd worker is asked to use common sense to judge if a response is completely reasonable in context. If anything seems off — confusing, illogical, out of context, or factually wrong — then it should be labeled as, “does not make sense”.

However, being sensible is not enough. A generic response (e.g., I don’t know) can be sensible, but it is also boring and unspecific. Such responses are frequently generated by bots that are evaluated according to metrics like sensibleness alone (Li et al., 2016a; Venkatesh et al., 2018). To illustrate this, we create GenericBot: a trivial bot that always replies to questions with “I don’t know” and to statements with “ok” (examples in Appendix Table B). On static evaluation (using a fixed set of prompts and bot-generated responses), 70% of GenericBot’s responses are labeled sensible, surpassing even DialoGPT (62%), even though DialoGPT is clearly more human-like than GenericBot. To overcome this issue, we need our evaluation to separate more fully human-like conversation from bland and generic statements. Therefore, if a response is labeled as sensible, we further ask the crowd worker to determine if it is specific to the given context. For example, if A says, “I love tennis,” and B responds, “That’s nice,” then the utterance should be marked, “not specific”. That reply could be used in dozens of different contexts. However, if B responds, “Me too, I can’t get enough of Roger Federer!” then it is marked as “specific”, since it relates closely to what is being discussed. Responses labeled not sensible are considered not specific. In GenericBot’s case, none of the responses are specific, whereas 39% of DialoGPT’s responses are specific.

This sequence of two questions is designed to start with the most concrete and basic human quality (sensibleness) and then progress to the arguably more subjective human quality (specificity). The degree of subjectivity is somewhat quantified in the crowd worker agreement. We measure crowd worker consistency for every model benchmark using agreement and Krippendorff’s alpha Krippendorff (2011), shown in Table 1. The agreement is reasonable considering the questions are subjective and the final results are always aggregated labels (e.g., average sensibleness across all chatbot responses).

Given a set of responses labeled as described above, we can calculate sensibleness and specificity as the percentage of responses labeled as sensible and specific, respectively. To combine these two into one metric, we take a simple average of the two, which we call SSA (sensibleness and specificity average). SSA is a proxy for human likeness, which also penalizes chatbots that consistently produce generic responses. For example, GenericBot’s SSA is 35% and DialoGPT’s SSA is 51%, providing a much more fair separation and ranking than sensibleness alone.
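As a minimal sketch of this calculation (not the evaluation code used for the paper), the Python snippet below computes sensibleness, specificity, and SSA from per-response labels. The function name `ssa` and the toy label list are illustrative assumptions; the label counts are chosen only to reproduce GenericBot's reported rates of 70% sensible and 0% specific.

```python
from typing import Iterable, Tuple


def ssa(labels: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Return (sensibleness, specificity, SSA) as fractions in [0, 1].

    Each label is a (sensible, specific) pair for one response. Following the
    rule in the text, a response labeled not sensible is counted as not
    specific, whatever its second flag says.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    sensible = sum(1 for s, _ in labels if s)
    specific = sum(1 for s, p in labels if s and p)
    sensibleness = sensible / n
    specificity = specific / n
    return sensibleness, specificity, (sensibleness + specificity) / 2


# Toy labels (assumed, for illustration): 70 sensible-but-generic responses
# and 30 responses that do not make sense.
generic_bot = [(True, False)] * 70 + [(False, False)] * 30
print(ssa(generic_bot))  # sensibleness 0.7, specificity 0.0, SSA 0.35
```

Because specificity is only counted for sensible responses, specificity can never exceed sensibleness under this scheme.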

Before arriving at SSA, and before any of the chatbots were tested, the authors of this paper conducted several rounds of pilot studies on what to ask crowd workers and how to best phrase the instructions. We settled on the two-question SSA for several reasons: it was easy for crowd workers to understand; alternative additional questions did not add extra information; and more subjective questions result in lower agreement between crowd workers.

As an additional check on the SSA metric, we reran a static evaluation, this time asking crowd workers to assess whether or not a response is “humanlike”. We find that there is a high correlation between those labels and the two components of the SSA metric (Figures 2, 9, 10). Compared to a direct evaluation of what crowd workers consider to be “humanlike”, SSA has significant advantages for large-scale evaluation tasks: it is more objective, easier for crowd workers to understand, and penalizes boring and vague responses. Nevertheless, these findings give us confidence that SSA is indeed capturing important aspects of human likeness.

2.2 Static Evaluation

In order to have a common benchmark to easily compare models, we create a collection of 1,477 conversational contexts with between 1 and 3 conversation turns, which we call the Mini-Turing Benchmark (MTB). We started this dataset by compiling single-turn contexts (e.g., “How are you?”) from multiple sources, such as the work of Vinyals and Le (2015) (http://ai.stanford.edu/~quocle/QAresults.pdf) and the transcripts of the Loebner Prize contests (https://aisb.org.uk/events/loebner-prize; years 2014-2018). In total, there were 315 single-turn contexts, which we then extended to include 500 two-turn and 662 three-turn contexts.

The MTB also contains contexts with personality questions (e.g. “Do you like cats?”), some of which expect responses with personality consistency. For example, the context “A: Do you like movies?; B: Yeah. I like sci-fi mostly; A: Really? Which is your favorite?” expects a consistent response such as I love Back to the Future. On the other hand, a response like I don’t like movies would be a contradiction, and thus not considered sensible.

When evaluating chatbots, all MTB contexts are fed to the models or presented to humans to obtain responses. We send the resulting $(context,response)$ pairs to crowd workers and ask whether each response, given the context, is sensible and specific as defined in Section 2.1. We call this static evaluation because the contexts are fixed.

2.3 Interactive Evaluation

Static evaluation may be suitable for comparing models, but it is biased by how the static evaluation dataset was constructed. To address this, we create an additional evaluation mode where the crowd workers can chat 1:1 with a chatbot about anything they want. As with static evaluation, workers are also asked to decide whether each response from the chatbot is sensible and specific as defined in 2.1. Conversations start with “Hi!” from the chatbot to mark the beginning of the conversation and crowd workers have no expectation or instructions about domain or topic of the conversation. A conversation is required to last at least 14 turns (7 from chatbot) and at most 28 turns. We collected 100 such conversations for each model (i.e., at least 700 labeled turns per model). We then measure the percentage of labeled turns that are sensible and specific.

Unlike a typical Turing test (Turing, 1950), we tell the human judges upfront that they are about to chat with an experimental chatbot and ask them to label what the chatbot says in terms of sensibleness and specificity. This shifts the focus of the judges and chatbot creators from optimizing for deception detection to optimizing for detecting and maximizing human-like qualities (e.g., sensibleness). Similar to our approach, Ghandeharioun et al. (2019) also conduct interactive evaluation by allowing humans to chat freely with bots. Their setup, however, focuses on evaluating conversations as a whole (as opposed to at the level of individual turns) and judges evaluate for quality, fluency, diversity, relatedness, and empathy.

2.4 Estimate of Human Performance

To estimate static SSA of humans we ask crowd workers to respond to MTB contexts. Additionally, to estimate human interactive SSA, we leveraged the help of internal company volunteers to collect 100 human-human conversations following mostly the same instructions as crowd workers for every other chatbot. Labeling of sensibleness and specificity was conducted by independent crowd workers with majority voting of 5 workers per human turn. The difference from the rest of the evaluations is that, in this case, participants knew they were chatting with another human. In contrast, when humans chat with a chatbot they will occasionally say unusual things to test the chatbot’s limits. Hill et al. (2015) describe differences in human behavior when talking to a chatbot. That said, we never incentivize humans to chat adversarially with chatbots in any of our evaluations.
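The majority-vote aggregation mentioned above could look like the following sketch. It assumes that each question (sensible, specific) is voted on separately, with one boolean per worker; the text does not spell out the exact aggregation procedure, and `majority_label` is a hypothetical helper.

```python
from typing import Sequence


def majority_label(votes: Sequence[bool]) -> bool:
    """Return True if more than half of the workers voted True.

    An odd number of votes (5 in the text) rules out ties.
    """
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    return 2 * sum(votes) > len(votes)


# Five hypothetical worker votes on whether one turn is sensible.
print(majority_label([True, True, False, True, False]))  # True
```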

2.5 Evaluation of Cleverbot and DialoGPT

To integrate with Cleverbot, we leverage its API. For DialoGPT, we use its open sourced 762M parameter model (https://github.com/microsoft/DialoGPT). It is worth mentioning that we initially tried the 345M parameter DialoGPT model, because it was reported to perform best on single-turn human evaluation. However, the 345M parameter model seemed to perform noticeably worse than the 762M one in preliminary evaluations of multi-turn conversations. Our human evaluation is multi-turn, so we select the 762M model.

The DialoGPT authors were unable to release their decoding script at the time of writing. Therefore, following their published description, we use top-K decoding with $K=10$. We adapt the decoding implementation by Wolf et al. (2019). Moreover, since the backward model was also not released, we were not able to try their MMI re-ranking (Li et al., 2016a).

Both Cleverbot and DialoGPT were evaluated using the same crowd sourcing setup as for Meena.

2.6 Evaluation of Mitsuku and XiaoIce

Because we chose to use the free Mitsuku web app (Pandorabots offers a paid enterprise package, which includes the Mitsuku API), and there is no public API for XiaoIce, we called on the help of internal company volunteers and only conducted interactive evaluation. Volunteers collectively had 100 conversations with Mitsuku, and 119 with XiaoIce on their publicly available web apps. The volunteers conversed with the chatbots following mostly the same instructions that crowd workers follow for every other chatbot. The difference is that humans would say “Hi!” for the first turn, instead of the chatbot, in order to keep the first turn the same as other cases. Labeling of sensibleness and specificity in all cases was conducted by independent crowd workers with majority voting of 5 workers per chatbot turn. Also note that both XiaoIce and Mitsuku sometimes include an image in their reply and, occasionally, volunteers include text descriptions of the images they see. The presence of the image may in some cases change the sensibleness of the response for better or worse.

XiaoIce interacts in Mandarin so both the volunteers and the independent crowd workers were native Mandarin speakers. The group of volunteers for XiaoIce, Mitsuku, and human-human conversations were mostly disjoint. Other than requiring a knowledge of Mandarin for XiaoIce conversations, volunteer selection was arbitrary. We had 29 volunteers for XiaoIce, 43 for Mitsuku, and 21 for human-human.

To reset Mitsuku state between conversations, volunteers refreshed the web page. During the writing of this paper there was no clear way to reset the state of XiaoIce. The XiaoIce team have informed us that not resetting the state negatively affects the model’s control of the context (from personal communication with the XiaoIce team, after the writing of the paper). Also, most XiaoIce volunteers shared the same Weibo account (Weibo is a microblogging service mostly used in China, which also allows users to chat with XiaoIce: https://www.weibo.com/). The XiaoIce team confirmed that account reuse negatively impacts the internal profile constructed by XiaoIce for a user. The XiaoIce team further suggested that, if the same Weibo account needs to be reused, we should wait at least one hour between volunteers using the account. In our experiments, we may have sometimes waited less than that amount of time between volunteers, although we made sure the account was only used by one volunteer at a time. Finally, the XiaoIce team mentioned that in the past few months (as of this writing), a limited version of XiaoIce with the smallest index has been served on Weibo. This version is expected to produce less satisfactory responses.

Direct comparisons between XiaoIce and other chatbots come with a caveat: XiaoIce can be seen as a product that optimizes for long-term user engagement, of which dialog generation is just one component. In other words, Meena is arguably at an advantage when comparing SSA scores.

2.7 Automatic Evaluation

For quick research iterations, we focus on perplexity. Unlike the previous two evaluation types, perplexity is an automatic metric. A seq2seq model outputs a probability distribution over possible next response tokens. Perplexity measures how well the model predicts the test set data; in other words, how accurately it anticipates what people will say next. When interpreting perplexity scores, bear in mind that lower is better and that the theoretical minimum is one.
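For concreteness, perplexity can be computed as the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log probabilities of the reference tokens are already available from some model; the function name and the toy inputs are illustrative, not taken from the Meena codebase.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each reference token in the test set.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that is always certain of the right token reaches the minimum of 1.
print(perplexity([0.0, 0.0, 0.0]))
# A model that guesses uniformly among 8 tokens has perplexity of about 8.
print(perplexity([math.log(1 / 8)] * 5))
```

Since the mean negative log-likelihood is the per-token cross-entropy, minimizing the standard cross-entropy loss also minimizes perplexity.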

As shown in Section 4, this commonly used metric correlates with human judgement of sensibleness and specificity. This is encouraging, because it is both automatic and directly optimizable with the standard cross-entropy loss function.

Meena chatbot

As described above, recent work on end-to-end dialog models has fallen into two broad categories: (1) complex models with human-designed components, and (2) large neural network models (known as end-to-end models) that are closer to generic learning frameworks. End-to-end models have shown promise, but clear limitations Gao et al. (2019a). An open question has been: in order to reach a point where a model can carry out high-quality, multi-turn conversations with humans, could we simply take an end-to-end model and make it bigger—by adding more training data and increasing its parameter count—or is it necessary to combine such a model with other components? In this section we describe the Meena model, the largest end-to-end model to enter the field so far. We believe it answers the open research question, by showing that a large end-to-end model can generate almost humanlike chat responses in an open-domain setting.

In this section, we will describe the training data, architecture, and decoding algorithm. We will also provide a few sample conversations that Meena has had with humans.

The dataset used to train Meena is mined and filtered from public domain social media conversations. The source data are essentially message trees involving multiple speakers: the very first message is the root; replies to a message are its child nodes. Any path along the tree induces a conversation where each message is a conversation turn. By treating each turn in a conversation path as a response and all the previous turns (up to 7) as a context, we create a training example of the form (context, response) pair.
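The extraction of training examples from a message tree can be sketched as follows. This is an illustration under assumed data structures (a `children` map from message id to reply ids and a `text` map from message id to message body), not the actual data pipeline; only the limit of 7 context turns comes from the text.

```python
from typing import Dict, Iterator, List, Tuple

MAX_CONTEXT_TURNS = 7  # from the text: up to 7 previous turns form the context


def extract_pairs(
    root: str, children: Dict[str, List[str]], text: Dict[str, str]
) -> Iterator[Tuple[List[str], str]]:
    """Yield one (context, response) pair for every non-root message."""
    # Each stack entry is (message id, turns on the path from the root to it).
    stack = [(root, [text[root]])]
    while stack:
        node, path = stack.pop()
        for child in children.get(node, []):
            # The response is the child; the context is the path leading to it.
            yield path[-MAX_CONTEXT_TURNS:], text[child]
            stack.append((child, path + [text[child]]))


# Hypothetical three-message thread: a root with one reply and one follow-up.
text = {"m0": "I love tennis.", "m1": "Me too!", "m2": "Who is your favorite?"}
children = {"m0": ["m1"], "m1": ["m2"]}
for context, response in extract_pairs("m0", children, text):
    print(context, "->", response)
```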

We also filter the data to improve the generation quality. A message is removed if any of the following conditions holds: (1) the number of subwords is less than 2 or more than 128; (2) the percentage of alphabetic characters is less than 70%; (3) the message contains a URL; (4) the author’s username contains “bot”; (5) the message is repeated more than 100 times; (6) the message has a high $n$-gram overlap with the parent’s text; (7) the message is potentially unsafe or offensive with respect to a commercial text classifier. In addition, we remove copies of the parent’s text quoted in a message.
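A few of these conditions can be expressed directly as a per-message predicate. The sketch below covers only the length, alphabetic-ratio, URL, and username rules; the repetition, $n$-gram overlap, and safety rules need corpus statistics or a classifier and are omitted. The URL regular expression, the choice to count all characters (including spaces) in the denominator of the alphabetic ratio, and the helper name `keep_message` are assumptions made for illustration.

```python
import re
from typing import Sequence

# Assumed heuristic for spotting URLs; the actual rule is not specified.
URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)


def keep_message(text: str, subwords: Sequence[str], username: str) -> bool:
    """Return False if the message trips one of the four simple filters."""
    if not 2 <= len(subwords) <= 128:
        return False  # fewer than 2 or more than 128 subwords
    alphabetic = sum(ch.isalpha() for ch in text)
    if not text or alphabetic / len(text) < 0.70:
        return False  # less than 70% alphabetic characters
    if URL_RE.search(text):
        return False  # message contains a URL
    if "bot" in username.lower():
        return False  # author's username contains "bot"
    return True


# Whitespace splitting stands in for BPE subword tokenization here.
message = "I love tennis"
print(keep_message(message, message.split(), "alice"))
```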

For simplicity, when a message is removed, we drop all sub-trees rooted under it. After these filtering steps, the number of $(context,response)$ pairs extracted is 867M. The text is tokenized using byte-pair-encoding (BPE) (Sennrich et al., 2016) with the sentencepiece library (https://github.com/google/sentencepiece). We use a vocabulary of 8K BPE subwords, which we found in our early experiments to be sufficient for generating specific responses while still allowing us to fit larger models in memory.

The final Meena dataset contains 341GB of text (40B words). In comparison, GPT-2 Radford et al. (2019) has been trained on 40GB of Internet text (8 million web pages).

3.2 Model Architecture

The best performing Meena model is an Evolved Transformer (ET) (So et al., 2019) seq2seq model with 2.6B parameters, which includes 1 ET encoder block and 13 ET decoder blocks. The Evolved Transformer is an evolutionary NAS architecture (Real et al., 2017, 2018) based on the Transformer (Vaswani et al., 2017). Our largest (i.e., maximum memory usage) Evolved Transformer scored 10.2 perplexity and our largest vanilla Transformer scored perplexity 10.7 for the same number of training steps (738k). The largest vanilla Transformer had 32 decoder layers with other architectural hyperparameters held constant. (An Evolved Transformer block is about twice as deep as a Transformer layer.)

For comparison, the extra-large GPT-2 model Radford et al. (2019) has 1.5B parameters and is a language model (i.e., decoder only); whereas the large conversational model from the recent DialoGPT work Zhang et al. (2019) has 762M parameters.

Meena’s hidden size is 2,560 and the number of attention heads is 32. We share the embeddings across the encoder, the decoder, and the softmax layer. The encoder and decoder each have a maximum length of 128 tokens (i.e., 256 combined). The hyperparameters of our best model were found via manual coordinate-descent search.

3.3 Training Details

We trained our best model for 30 days on a TPU-v3 Pod (2,048 TPU cores) on the Meena dataset containing 40B words (or 61B BPE tokens). Interestingly, the 2.6B-parameter model can overfit (in the sense that validation loss increases as train loss decreases) on a 61B-token dataset, which suggests a surprisingly large model capacity. Therefore, we add a small amount (0.1) of attention and feed-forward layer dropout. Additionally, to save memory, we chose the Adafactor optimizer (Shazeer and Stern, 2018) with 0.01 as the initial learning rate, keeping it constant for the first 10k steps and then decaying with the inverse square root of the number of steps. We use the Tensor2Tensor codebase (Vaswani et al., 2018) for training Meena (https://github.com/tensorflow/tensor2tensor).
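The learning-rate schedule described above can be written as a small function. The exact functional form is an assumption: the sketch scales the decay so that it is continuous at step 10k, which may differ in detail from the schedule implemented in Tensor2Tensor. The function and parameter names are illustrative.

```python
import math


def learning_rate(step: int, base_lr: float = 0.01, constant_steps: int = 10_000) -> float:
    """Constant for the first `constant_steps`, then inverse-square-root decay.

    Assumed form: base_lr * sqrt(constant_steps / step) after the constant
    phase, so the two phases meet at step == constant_steps.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step <= constant_steps:
        return base_lr
    return base_lr * math.sqrt(constant_steps / step)


for step in (1_000, 10_000, 40_000, 738_000):
    print(step, learning_rate(step))
```

Under this assumed form, the rate halves each time the step count quadruples past 10k steps.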

A TPU-v3 core has 16GB of high-bandwidth memory. We maximized memory usage for model parameters and stored only 8 training examples per core. Each training step took about 1 second. In the full TPU-v3 Pod, this meant we learned over 4M tokens per training second. Therefore, by the end of training, the model had traversed the full training set 164 times (or epochs) and observed a total of about 10T tokens (including repeated ones).
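These figures can be cross-checked with simple arithmetic. The sketch below assumes that every training example fills all 256 token positions (128 encoder plus 128 decoder) and that each step takes exactly 1 second, so it gives an upper bound rather than the exact token count; the constant names are illustrative.

```python
CORES = 2048                    # TPU-v3 Pod cores
EXAMPLES_PER_CORE = 8           # training examples stored per core
TOKENS_PER_EXAMPLE = 128 + 128  # encoder + decoder maximum lengths
TRAINING_DAYS = 30
DATASET_TOKENS = 61e9           # 61B BPE tokens
REPORTED_TOTAL_TOKENS = 10e12   # "about 10T tokens" from the text

# Tokens processed per step, which is also tokens per second at ~1 s/step.
tokens_per_step = CORES * EXAMPLES_PER_CORE * TOKENS_PER_EXAMPLE
print(tokens_per_step)  # 4194304, i.e. "over 4M tokens per training second"

# Upper bound on tokens seen in 30 days at one step per second.
upper_bound_tokens = tokens_per_step * TRAINING_DAYS * 24 * 60 * 60
print(upper_bound_tokens / 1e12)  # roughly 10.9 (trillion tokens)

# Epochs implied by the reported total of about 10T tokens.
print(REPORTED_TOTAL_TOKENS / DATASET_TOKENS)  # roughly 164
```

The reported total of about 10T tokens sits slightly below the upper bound, which is consistent with "about 1 second" per step being an approximation.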

3.4 Decoding

Generating generic (i.e., not specific) and bland responses (Li et al., 2016a) has always been a major challenge in existing neural conversational models. A common approach to mitigating this problem is to use more sophisticated decoding algorithms, for instance with different forms of re-ranking (Li et al., 2016a; Shao et al., 2017) or conditioning on profiles, topics, and styles (Li et al., 2016b; Wang et al., 2017; Xing et al., 2017; Zhang et al., 2018b). Recent works also explore new frameworks such as adversarial learning (Li et al., 2017; Zhang et al., 2018c), variational autoencoding (Zhao et al., 2017; Gu et al., 2019), or both (Gao et al., 2019b) at the cost of added complexity and less scalability.

In contrast, we show that given a model with sufficiently low perplexity, a simple sample-and-rank decoding strategy achieves both diverse and high-quality responses. Sample-and-rank works as follows: first, we sample $N$ independent candidate responses using plain random sampling with temperature $T$. Second, we select the candidate response with the highest probability to use as the final output.

Temperature $T>0$ is a hyper-parameter that regulates the probability distribution $p_{i}$ of the next token during decoding. We divide the logits $z_{i}$ by $T$ before computing the “softmax” as in Hinton et al. (2015):

$$p_{i}=\frac{\exp(z_{i}/T)}{\sum_{j}\exp(z_{j}/T)}$$

$T=1$ yields the unmodified distribution. We observe that large values of $T$ favor contextually rare tokens, such as relevant entity names, but might also assign too much probability to incorrect tokens depending on the model’s predictions. Meanwhile, smaller values of $T$ favor more common words such as articles or prepositions, which are safer but less specific.
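The effect of temperature can be illustrated with a short, self-contained sketch that divides the logits by $T$ before a softmax. The function name and the example logits are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

Larger temperatures flatten the distribution and give low-logit tokens more probability mass, while smaller temperatures concentrate mass on the highest-logit token.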

Tables 2 and 3 show responses for the arbitrary probing input “Why do you like the ocean?” under sample-and-rank and beam-search, respectively. As we can see, beam-search decoding generates repetitive and uninteresting responses. On the other hand, sample-and-rank provides us with diverse and content-rich responses. The key here is to have a model with low perplexity so samples can be taken at high temperature to produce human-like content.

For all the results in Section 4, we use sample-and-rank with $N=20$ and $T=0.88$. Additionally, as shown in Figure 1, for this fixed decoding strategy, sensibleness and specificity improve as model test set perplexity falls. For additional decoding results and samples, see Section 5.
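A minimal sketch of sample-and-rank is given below, with a toy next-token function standing in for a trained model. Several details are assumptions: candidates are ranked by their total log-probability under the unmodified ($T=1$) distribution without length normalization, and the names `sample_and_rank`, `toy_logits`, and `eos_id` are hypothetical. The text only states that the highest-probability candidate is selected.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Log-probabilities of a softmax over logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def sample_and_rank(
    next_logits: Callable[[List[int]], Sequence[float]],
    eos_id: int,
    n: int = 20,
    temperature: float = 0.88,
    max_len: int = 128,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], float]:
    """Sample n candidates at the given temperature and return the most likely."""
    rng = rng or random.Random()
    best_tokens: List[int] = []
    best_score = -math.inf
    for _ in range(n):
        tokens: List[int] = []
        score = 0.0
        while len(tokens) < max_len:
            logits = next_logits(tokens)
            # Sample from the temperature-scaled distribution.
            weights = [math.exp(lp) for lp in log_softmax(logits, temperature)]
            tok = rng.choices(range(len(logits)), weights=weights)[0]
            # Score under the unmodified distribution (an assumption).
            score += log_softmax(logits)[tok]
            tokens.append(tok)
            if tok == eos_id:
                break
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score


def toy_logits(prefix: List[int]) -> List[float]:
    """Toy stand-in for a model: 3-token vocabulary, token 0 ends the response."""
    return [5.0, 0.0, 0.0] if len(prefix) >= 2 else [0.0, 1.0, 2.0]


tokens, score = sample_and_rank(toy_logits, eos_id=0, rng=random.Random(0))
print(tokens, score)
```

In a real system, `next_logits` would wrap a decoder step conditioned on the conversation context.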

3.5 Sample conversations

Below are cherry-picked conversations that Meena has had with humans. We selected these conversations after they were completed. That is, the Meena responses within the conversations were not cherry-picked; they were produced automatically using sample-and-rank (Section 3.4). Conversations B and C are excerpts from conversations number 43 and 48, respectively, of the conversations dataset published on GitHub (https://github.com/google-research/google-research/tree/master/meena/).

Appendix A shows random samples of conversations.

Results

In this section, we will first demonstrate the correlation between test perplexity and the human evaluation metric, SSA, defined earlier. We also include human-level upper-bound estimates for both static and interactive evaluations, alongside the performance of other chatbots, such as XiaoIce, Mitsuku, DialoGPT, and Cleverbot. Lastly, we provide sample responses for different models given the same contexts to understand how Meena qualitatively compares to others.

We trained models with different hyper-parameter settings and architectures on the dataset described in Section 3.1. We vary the number of layers, attention heads, total training steps, whether we use Evolved Transformer or regular Transformer and whether we train with hard labels or soft labels/distillation Hinton et al. (2015). The trained models are then measured with an automatic metric, test perplexity (Section 2.7), and also with human metrics (Sections 2.2 and 2.3). Our results indicate most of the variance in the human metrics can be explained by the test perplexity. The end-to-end trained Meena model with lowest perplexity is referred to as Meena (base). In addition, we also include an improved version of Meena (detailed in Section 5) and refer to this as the Meena (full) model, or just Meena model for short.

The correlation was $R^{2}=0.93$ for static sensibleness vs perplexity and $R^{2}=0.94$ for static specificity vs perplexity, indicating that perplexity might be a good automatic metric for measuring sensibleness and specificity. Static SSA vs perplexity has $R^{2}=0.94$. The static evaluation results are shown in Figure 5. The correlation is close to linear, but it is unclear whether the trend will continue for even lower values of perplexity.
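For readers who want to reproduce this kind of analysis on their own models, the coefficient of determination of a least-squares line can be computed as sketched below. The text does not state exactly how $R^{2}$ was obtained, so fitting a straight line directly to (perplexity, score) pairs is an assumption, and the sample points are synthetic placeholders rather than values from this paper.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 of the ordinary least-squares line fitted to the points (x, y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot


# Synthetic placeholder points (not results from the paper).
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```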

In interactive evaluation (Section 2.3) crowd workers could chat about anything they wanted. We observe similarly strong correlation with perplexity (see Figures 4, 4 and 1) and very similar sensibleness and specificity values as the static evaluation. This indicates that the static evaluation correlation with perplexity is not due to dataset bias.

Regarding consistency, the lowest perplexity model was evaluated 7 times with static evaluations and also 7 times with interactive evaluations. Each time, we obtained a different set of randomly sampled responses. Across the evaluations the standard deviation is $2\%$ for static SSA and is $1\%$ for interactive SSA, indicating that both metrics are consistent enough for our purposes.

4.2 Human-level Estimates

As expected, human sensibleness is very high, but it is not perfect. Human sensibleness was estimated at 94% static and 97% interactive. People have misunderstandings, miss attempts at humor and sometimes lack shared context or background. Also aligned with intuition, humans are sometimes not specific due to momentary lack of ideas, interest or knowledge. The human specificity scores are 69% static and 75% interactive. The resulting SSAs are 82% static and 86% interactive.

5.2 Addressing Cross-turn Repetitions

In interactive evaluation, about one third of the conversations with Meena (base) contain cross-turn repetitions toward the end. Cross-turn repetition means that one turn somewhat repeats an earlier turn. For illustration, we cherry-picked particularly problematic examples of cross-turn repetition, shown in Tables 5 and 6.

It is worth mentioning that there also exist in-turn contradictions and repetitions, where the contradiction or repetition is contained in the response turn itself (e.g., “I like pizza, but I don’t like it”). This type of artifact is often observed in Meena versions with worse perplexities, but is far less frequent in Meena (base), which has the lowest perplexity, as reflected in the samples shared in the appendix and the higher sensibleness scores.

We wrote a rule that detects whether any two turns contain long common sub-sequences, and we automatically remove candidates that are detected as repetitions. This rule seems to have addressed most of the cross-turn repetition. We therefore further improve on the above interactive SSA of $74\%\pm 1\%$ to $79\%\pm 1\%$.
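The text does not spell out the rule, so the following is only a minimal sketch of one way such a filter could look. It assumes whitespace tokenization, treats a "long common sub-sequence" as a contiguous run of shared tokens, and uses a hypothetical threshold `min_run`; all function names are ours and none of this is taken from the actual Meena implementation.

```python
from typing import List


def longest_common_run(a: List[str], b: List[str]) -> int:
    """Length of the longest contiguous token run that appears in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the run that ended at the previous token of both sequences.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_cross_turn_repetition(candidate: str, earlier_turns: List[str], min_run: int = 4) -> bool:
    """Flag a candidate sharing a run of at least min_run tokens with an earlier turn.

    min_run is a hypothetical threshold chosen for illustration only.
    """
    cand_tokens = candidate.lower().split()
    return any(
        longest_common_run(cand_tokens, turn.lower().split()) >= min_run
        for turn in earlier_turns
    )


def filter_candidates(candidates: List[str], earlier_turns: List[str], min_run: int = 4) -> List[str]:
    """Keep only the response candidates that are not flagged as repetitions."""
    return [c for c in candidates if not is_cross_turn_repetition(c, earlier_turns, min_run)]
```

For example, with the earlier turn "I love watching old movies on weekends", the candidate "I love watching old movies too" shares a five-token run and would be dropped, while an unrelated candidate would be kept.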

3 Safety Layer

It is important to mention that the evaluation and conversation collection for the full Meena version were conducted with an additional classifier layer at serving time, as part of the filtering mechanism, to automatically filter out potentially sensitive or toxic response candidates for publication.

Finding a good automatic metric that correlates with human evaluation has been an important goal of open-domain conversational modeling. BLEU Papineni et al. (2002), ROUGE Lin (2004), or other related metrics in translation and summarization, while popular and easy to compute, have been shown to be unsuitable for dialog (Liu et al., 2016) or more broadly language generation systems Novikova et al. (2017).

Past works have attempted to build learnable metrics, either in a supervised fashion Lowe et al. (2017), which requires human labels, or with unsupervised approaches Tao et al. (2017); Ghazarian et al. (2019), that are more complex and need separate training, e.g., of a ranking system. In our work, we show that perplexity, which is readily available to any neural seq2seq model, exhibits a strong correlation with human evaluation. Our work is therefore also related to past attempts to correlate perplexity with other automatic metrics in other tasks, e.g., perplexity vs. BLEU in translation Luong et al. (2015).
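To make "readily available" concrete: perplexity is the exponential of the average negative log-likelihood per token, so it can be computed directly from the log-probabilities a seq2seq model assigns to the reference tokens. The sketch below uses this standard definition; the function name and the input list are hypothetical, and in the paper the tokens are BPE subwords from a held-out test set.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Per-token perplexity from natural-log probabilities of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# Sanity check: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

A lower value means the model assigns higher probability to the held-out responses, which is the quantity the text relates to human judgments.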

Another interesting line of work is to combine human evaluation with either automatic metrics Chaganty et al. (2018) or with model likelihood Hashimoto et al. (2019). While theoretically motivated, these metrics are too complex to be practical, requiring both human judgments and training separate models, e.g., an estimator Chaganty et al. (2018) to reduce bias in automatic evaluation or a discriminator Hashimoto et al. (2019) to distinguish between human- and model-generated samples.

In terms of designing human evaluation metrics, existing literature differs in what attributes are used to assess the quality of a neural conversational model. Many works, e.g., Zhao et al. (2017); Xu et al. (2018); Ippolito et al. (2019b), have focused solely on the diversity aspect to counter the commonly observed problem of models generating generic responses Li et al. (2016a). Others have attempted to improve and evaluate multiple aspects at once. For example, Venkatesh et al. (2018) aim to unify many metrics, such as diversity, engagement, and user experience; Gao et al. (2019b) jointly optimize for both diversity and relevance; See et al. (2019) control decoding attributes (such as repetition, specificity, response-relatedness, and question-asking) to improve engagingness and interestingness; and Hashimoto et al. (2019) design metrics to capture human likeness and diversity.

In contrast, we focus on sensibleness and specificity for our human evaluation. While human likeness and relevance used in the aforementioned works are related to sensibleness, we specifically use sensibleness as it leads to better agreement among crowd workers (see $\S$ 2.1). Similar reasoning applies to specificity, which is related to other attributes such as engagingness and interestingness, as measured in previous works. It is worth pointing out that we do not explicitly measure diversity, as it requires judging a set of responses, whereas for conversation what is most important is the first reply that a chatbot produces. As our decoding method is sampling, our generation is diverse. However, there remains a question of whether the sampled response is of high quality. The fact that our model has low perplexity and achieves a high SSA score indicates that the generation is meaningful. A limitation of our work is that it does not cover aspects such as empathy Zhou et al. (2018); Rashkin et al. (2018).

While we do not explicitly control for specificity, existing works, such as Zhang et al. (2018a); Ko et al. (2019), attempted to do so by augmenting the decoder of seq2seq models with specificity-control components. These added complexities sometimes lead to implausible responses as analyzed by Ko et al. (2019). In contrast, the specificity of our model improves as perplexity decreases.

Recent work on DialoGPT Zhang et al. (2019) compares the conversation quality of chatbots with that of humans but their evaluation settings are limited to single-turn dialogs. We instead conduct our evaluation on conversations of up to 3 turns in the static MTB benchmark and 14 turns in the interactive setup.

Our results suggest that perplexity on public domain social media conversations might be a good automatic proxy for human judgement of fundamental attributes of human-likeness, such as sensibleness and specificity. The results also suggest that optimizing the probability of the next token on larger volumes of social media conversations could lead to human-like sensibleness in an open-domain setting. However, our static evaluation dataset only contains one- to three-turn contexts and is biased by the sources of the first turn and the fact that the two-turn and three-turn contexts build on the shorter contexts. Moreover, the contexts in this dataset are predominantly Turing test and social conversation style, including common sense, basic knowledge, asking/sharing about personality, likes/dislikes, opinions, feelings, hobbies, pleasantries, etc. This dataset does not include contexts like deeper question answering (e.g., how fast is a cheetah), basic math (e.g., how much is 1+1) and common sense tests designed to challenge machines, but not humans Levesque et al. (2011). Human-likeness is an incredibly broad and abstract concept. The interactive evaluation addresses some of the bias and scope limitations in static evaluation while still providing a consistent score to quantify a given chatbot. Nevertheless, unlike static evaluation, it does not allow for granular comparison between different chatbot responses. In addition, it may be too short (14 to 28 turns), and may assign too much weight to the typical beginning and ending of conversations. It may also be too short to cover deeper topics and exercise longer-term memory.

Furthermore, it may be necessary to expand the set of basic human-like conversation attributes being measured beyond sensibleness and specificity. Some directions could include humor, empathy, deep reasoning, question answering and knowledge discussion skills. One could also break down sensibleness into its implicit sub-components: logical and personality consistency, common sense, relevance, basic factual correctness and so on. Future work may also explore the continued optimization of sensibleness via the optimization of test set perplexity.

Thanks to the people who gave feedback on drafts of the paper: Anna Goldie, Abigail See, Yizhe Zhang, Lauren Kunze, Steve Worswick, Jianfeng Gao, Daphne Ippolito, Scott Roy, Ilya Sutskever, Tatsu Hashimoto, Dan Jurafsky, Dilek Hakkani-tur, Noam Shazeer, Gabriel Bender, Prajit Ramachandran, Rami Al-Rfou, Michael Fink, Mingxing Tan, Maarten Bosma and Adams Yu. Also thanks to the many volunteers who helped collect conversations with each other and with various chatbots. Finally thanks to Samy Bengio, Noam Shazeer, Anna Goldie, Rami Al-Rfou, Khoa Vo, Trieu H. Trinh, Ni Yan, Kyu Jin Hwang and the Google Brain team for the help with the project.

Appendix A Additional Sample Conversations

With the help of many internal company volunteers, we collected a total of about 100 conversations with Mitsuku, XiaoIce and Meena (full). The conversations are available on GitHub https://github.com/google-research/google-research/tree/master/meena/. This section contains samples obtained by randomly shuffling these sets and taking the first 10. Conversations were collected following the standard instructions for interactive evaluation where the human starts. Therefore, conversations are supposed to start with “Hi!”, contain between 16 and 32 turns in total, and are open-domain with no particular topic. Nevertheless, some participants did not follow the first-turn rule strictly, so some conversations may start with, for instance, “hi there” instead of “Hi!”. Also, a few conversations are under or over the length limits.

Unlike in Section 3.5, which contains cherry-picked samples, we present random samples of everything that was collected after a few potentially sensitive conversations were removed from the original sets. We also redacted potential personally identifiable information and indicated that with the word “REDACTED”. Finally, please note that both XiaoIce and Mitsuku sometimes include an image in their reply, and occasionally volunteers include text descriptions of the images they see.

A.1 Meena

The following is a sample of the conversations with Meena (full) ($79\%\pm 1\%$ interactive SSA).

A.2 Mitsuku

The following is a sample of the conversations with Mitsuku.

A.3 XiaoIce

The following is a sample of the conversations with XiaoIce.

A.4 Human

The following is a sample of the conversations between humans only.