Don't Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training

Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, Jason Weston

Introduction

Open-ended tasks such as dialogue reveal a number of issues with current neural text generation methods. In more strongly grounded tasks such as machine translation and image captioning, current encoder-decoder architectures perform well, and the (mostly word-level) decisions they make are usually correct. However, critical failings are exposed in less constrained generation: reliance on repetitive copying and overuse of frequent words, and an inability to maintain logical coherence. The former shows the learning objective is faulty in that it cannot match simple statistics of the training data, while the latter goes to the heart of artificial intelligence: these models do not understand what they are saying. For example, Figure 1 shows how the 345M-parameter GPT2 model (Radford et al., 2019) can give high probability to contradictory generations.

In this work, we show how the recently introduced unlikelihood objective (Welleck et al., 2019a) can be generalized to remedy these problems. Unlikelihood is a technique developed for removal of repetition in language model completions, and works by adding an extra term to the objective that forces repetitions to have low probability, alleviating the degenerative problems highlighted in Holtzman et al. (2019). In fact, unlikelihood can be seen as a much more general framework, as we will see.

We first generalize unlikelihood to a different domain: dialogue, where we measure statistics of the training distribution in terms of contextual copies, within-utterance repeats, and vocabulary usage. We then develop loss functions that control these statistics, providing improved metrics on several tasks. Secondly, we show how the same tools can be used to address deeper semantic issues in such models. By leveraging existing natural language inference (NLI) data (Welleck et al., 2019b) as supervision against poor quality generations, we train models that assign low probability to generating incoherent and contradictory text. Overall, our approach yields more consistent dialogue models across several axes, and provides a promising framework for further advances.

Code and pre-trained models will be made available at https://parl.ai/projects/dialogue_unlikelihood/

Dialogue Unlikelihood Training

Dialogue generation consists in predicting an utterance $\mathbf{y}=(y_{1},\ldots,y_{|y|})$ given a context $\mathbf{x}=\{s_{1},\ldots,s_{k},u_{1},\ldots,u_{t}\}$ that consists of initial context sentences $s_{1:k}$ (e.g., scenario, knowledge, personas, etc.) followed by dialogue history utterances $u_{1:t}$ from speakers who take consecutive turns.

Likelihood Training

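Given a dataset $\mathcal{D}=\{(\mathbf{x}^{(i)},\mathbf{y}^{(i)})\}$ derived from a collection of human-human interactions, the standard approach to generative training for dialogue tasks is maximum likelihood estimation (MLE), which minimizes:

$$\mathcal{L}^{(i)}_{\text{MLE}}(p_{\theta},\mathbf{x}^{(i)},\mathbf{y}^{(i)})=-\sum_{t=1}^{|y^{(i)}|}\log p_{\theta}(y^{(i)}_{t}\mid\mathbf{x}^{(i)},y^{(i)}_{<t}),$$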

where $\mathbf{x}^{(i)}$ is a gold context (dialogue history and initial context sentences) and $\mathbf{y}^{(i)}$ is a gold next-utterance, and $y^{(i)}_{t}$ is the $t$ -th token of $\mathbf{y}^{(i)}$ .

Likelihood-based (greedy or beam) decoding applied after training a model with this objective yields sequences with statistics that do not match the original human training sequence distribution.

Unlikelihood Training

To control for such distribution mismatches, we employ the unlikelihood loss (Welleck et al., 2019a), generalizing it to our setting, and developing a particular form of the loss function for each type of mismatch.

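The general form of the unlikelihood loss penalizes a set of tokens $\mathcal{C}_{t}$ at each time-step:

$$\mathcal{L}^{(i)}_{\text{UL}}(p_{\theta},\mathcal{C}_{1:T},\mathbf{x},\mathbf{y})=-\sum_{t=1}^{|y|}\sum_{y_{c}\in\mathcal{C}_{t}}\beta(y_{c})\log\left(1-p_{\theta}(y_{c}\mid\mathbf{x},y_{<t})\right),$$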

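where $\mathcal{C}_{t}\subseteq\mathcal{V}$ is a subset of the vocabulary, and $\beta(y_{c})$ is a candidate-dependent scale that controls how much the candidate token should be penalized. The overall objective in unlikelihood training then consists of mixing the likelihood and unlikelihood losses,

$$\mathcal{L}^{(i)}_{\text{ULE}}=\mathcal{L}^{(i)}_{\text{MLE}}+\alpha\,\mathcal{L}^{(i)}_{\text{UL}},\tag{1}$$

where $\alpha$ is the mixing hyperparameter.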

Likelihood tries to model the overall sequence probability distribution, while unlikelihood corrects for known biases. It does this via the set of negative candidates $\mathcal{C}_{t}$ calculated at each step $t$ , where we are free to select candidate generation functions depending on the biases to be mitigated. Likelihood pushes up the probability of a gold token $y^{(i)}_{t}$ while unlikelihood pushes down the probability of negative candidate tokens $y_{c}\in\mathcal{C}_{t}$ .
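
To make the interplay of the two terms concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model's per-step output distributions are already available as plain dictionaries, that the unlikelihood term has the form $-\sum_{t}\sum_{y_{c}\in\mathcal{C}_{t}}\beta(y_{c})\log(1-p_{\theta}(y_{c}\mid\mathbf{x},y_{<t}))$, and that the two terms are mixed with a weight `alpha`. The function names and the toy probabilities are hypothetical.

```python
import math


def mle_loss(step_probs, gold_tokens):
    """Negative log-likelihood of the gold tokens.

    step_probs[t] maps each token to p_theta(token | x, y_<t).
    """
    return -sum(math.log(step_probs[t][tok]) for t, tok in enumerate(gold_tokens))


def unlikelihood_loss(step_probs, candidates, beta=lambda tok: 1.0):
    """Penalize probability mass placed on the negative candidates C_t."""
    total = 0.0
    for t, cands in enumerate(candidates):
        for tok in cands:
            total -= beta(tok) * math.log(1.0 - step_probs[t][tok])
    return total


def mixed_loss(step_probs, gold_tokens, candidates, alpha=1.0):
    """Likelihood term plus alpha times the unlikelihood term."""
    return mle_loss(step_probs, gold_tokens) + alpha * unlikelihood_loss(step_probs, candidates)


# Toy two-step example (made-up numbers): at step 2 the token "i" is a
# negative candidate, e.g. because emitting it would repeat an earlier token.
step_probs = [
    {"i": 0.5, "like": 0.25, "cats": 0.25},
    {"i": 0.25, "like": 0.5, "cats": 0.25},
]
gold = ["i", "like"]
candidates = [set(), {"i"}]
loss = mixed_loss(step_probs, gold, candidates, alpha=2.0)
```

With an empty candidate set at every step the unlikelihood term vanishes and the objective reduces to plain MLE, which is why the choice of candidates determines which bias is corrected.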

In Welleck et al. (2019a) the context $\mathbf{x}$ consists of a ground-truth sequence ( $\mathbf{x}=\mathbf{x}^{(i)}$ ), the target $\mathbf{y}$ is either a ground-truth sequence ( $\mathbf{y}=\mathbf{y}^{(i)}$ ) or a model-generated sequence ( $\mathbf{y}=\hat{\mathbf{y}}$ ), and the per-token scale parameter $\beta(y_{c})$ is $1$ .

In this paper, we demonstrate how unlikelihood can be used as a general framework by applying it to the dialogue domain. We show how varying the contexts $\mathbf{x}$ , targets $\mathbf{y}$ , candidates $\mathcal{C}$ and scaling $\beta$ can be used to improve the coherence and language modeling quality of dialogue models. To do this, we now consider the different biases we wish to mitigate, and construct a specific unlikelihood loss for each in turn.

1 Repetition and Copying

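Generative dialogue models are known to both (i) rely too much on copying existing context knowledge or dialogue history; and (ii) repeat themselves within individual utterances. To address this with unlikelihood, we define two types of negative candidate tokens which either appear in a repeating n-gram from the context or from the generated label itself,

$$\mathcal{C}_{t}^{\text{context-copy}}=\begin{cases}\{y_{t}\}&y_{t}\in\text{repeating context n-gram}\\ \emptyset&\text{otherwise,}\end{cases}$$

$$\mathcal{C}_{t}^{\text{label-repeat}}=\begin{cases}\{y_{t}\}&y_{t}\in\text{repeating label n-gram}\\ \emptyset&\text{otherwise,}\end{cases}$$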

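where $y_{t}$ is a token in a repeating context n-gram when $y_{t}$ is part of an n-gram that already appeared in the context tokens $\mathbf{x}$ , and is in a repeating label n-gram when $y_{t}$ is part of an n-gram that already appeared in $y_{<t}$ . Given a ground-truth context $\mathbf{x}^{(i)}$ , we apply these two forms of unlikelihood to a model-generated sequence $\hat{\mathbf{y}}^{(i)}$ . In summary, we either apply the per-example loss

$$\mathcal{L}^{(i)}_{\text{UL}}(p_{\theta},\mathcal{C}^{\text{context-copy}}_{1:|y|},\mathbf{x}^{(i)},\hat{\mathbf{y}}^{(i)})$$

for controlling context copies, or

$$\mathcal{L}^{(i)}_{\text{UL}}(p_{\theta},\mathcal{C}^{\text{label-repeat}}_{1:|y|},\mathbf{x}^{(i)},\hat{\mathbf{y}}^{(i)})$$

for controlling label repeats. We also consider mixing the two losses to mitigate both issues.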
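
As an illustration of how the two candidate sets might be built, here is a hedged sketch operating on token lists. The n-gram order `n=3` and the function names are assumptions for illustration; the text above does not fix them.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_copy_candidates(context, label, n=3):
    """C_t = {y_t} when y_t lies in a label n-gram that also occurs in the context."""
    context_ngrams = set(ngrams(context, n))
    cands = [set() for _ in label]
    for i, gram in enumerate(ngrams(label, n)):
        if gram in context_ngrams:
            for j in range(i, i + n):
                cands[j] = {label[j]}
    return cands


def label_repeat_candidates(label, n=3):
    """C_t = {y_t} when y_t lies in an n-gram that already started earlier in the label."""
    seen = set()
    cands = [set() for _ in label]
    for i, gram in enumerate(ngrams(label, n)):
        if gram in seen:
            for j in range(i, i + n):
                cands[j] = {label[j]}
        seen.add(gram)
    return cands


# Hypothetical example: the generated label copies "i have two" from the
# context and then repeats the same trigram later on.
context = "i have two dogs".split()
label = "i have two cats and i have two birds".split()
copy_cands = context_copy_candidates(context, label)
repeat_cands = label_repeat_candidates(label)
```

In this toy case the context-copy candidates cover both occurrences of the copied trigram, while the label-repeat candidates cover only the second occurrence, since the first one is not yet a repeat.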

2 Vocabulary Usage

Neural sequence models trained with maximum likelihood generate sequences with token distributions that differ from those of human text (Dinan et al., 2020; Holtzman et al., 2019). In particular, these models tend to produce high frequency tokens too often and low frequency tokens too rarely, where frequency is defined by the human token distribution.

We address this with unlikelihood by penalizing tokens according to the mismatch between the model and ground-truth unigram distributions. Specifically, we first maintain an empirical estimate of the model’s unigram distribution $p_{\text{model}}(y_{t})$ and the human distribution $p_{*}(y_{t})$ :

$$p_{\text{model}}(y_{t})=\frac{\text{count}(y_{t})}{|Y|},$$

where $Y$ is a collection of token predictions on a subset of training data $\mathcal{D}^{\prime}$ (e.g. the preceding $k~{}=~{}256$ batches), and $\text{count}(y_{t})$ is the number of occurrences of $y_{t}$ in $Y$ . This is computed using model sequences $(\mathbf{y}=\hat{\mathbf{y}})$ , defining $Y$ as the collection of all tokens in all $\hat{\mathbf{y}}$ .

We wish to push down the probability of tokens appearing too often, i.e. when $p_{\text{model}}(y_{t})>p_{*}(y_{t})$ . For the unlikelihood loss, each step’s candidate is thus the current token, $\mathcal{C}_{t}^{\text{identity}}=\{y_{t}\}$ , and each token’s unlikelihood loss is scaled according to the mismatch between the approximated model and human distributions,

$$\beta(y_{c})=p_{\text{model}}(y_{c})\log\left(\frac{p_{\text{model}}(y_{c})}{p_{*}(y_{c})}\right).$$

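The unlikelihood loss for a token $y_{c}$ is non-zero when the token occurs more often in the model’s estimated unigram distribution. In summary, the resulting per-example loss is

$$\mathcal{L}^{(i)}_{\text{UL}}(p_{\theta},\mathcal{C}^{\text{identity}}_{1:|y|},\mathbf{x}^{(i)},\mathbf{y}),$$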

where $\mathbf{y}$ is a model-generated sequence.
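
The scaling step can be sketched as follows. This is an illustrative example only: it assumes the scale takes the form $p_{\text{model}}\log(p_{\text{model}}/p_{*})$, and it clamps the value at zero so that only over-produced tokens are penalized, in line with the statement that the loss is non-zero for tokens the model uses too often. The clamp, the smoothing constant `eps`, and the toy sequences are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def unigram_distribution(sequences):
    """Empirical unigram distribution count(y) / |Y| over a collection of sequences."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def vocab_beta(p_model, p_human, eps=1e-8):
    """Assumed scale p_model * log(p_model / p_human), clamped at zero."""
    beta = {}
    for tok, pm in p_model.items():
        ph = p_human.get(tok, eps)  # eps guards against tokens unseen in human text
        beta[tok] = max(0.0, pm * math.log(pm / ph))
    return beta


# Made-up model generations and human references.
model_out = [["i", "like", "the", "the"], ["the", "the", "cat", "the"]]
human_ref = [["i", "like", "the", "cat"], ["a", "cat", "sat", "down"]]
p_model = unigram_distribution(model_out)
p_human = unigram_distribution(human_ref)
beta = vocab_beta(p_model, p_human)
```

Here the over-used token "the" receives a positive scale, while "cat", which the toy model under-produces, receives none.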

3 Contradictions

Neural generation models appear fluent, especially when pre-trained on large datasets, but are still poor at understanding the language they produce. That is, they can produce logically or factually inaccurate, or contradicting statements (Welleck et al., 2019b; Zhang et al., 2018; Hayashi et al., 2019; Petroni et al., 2019). Here, we show how the unlikelihood objective can be used to train such models to assign low probability to inconsistent and contradictory utterances.

To do so, we assume the existence of training data of both positive and negative examples of coherent behavior. There is a raft of recent large-scale, high quality data that can be massaged into this form, from natural language inference (NLI) tasks (Bowman et al., 2015; Williams et al., 2018; Welleck et al., 2019b) to commonsense reasoning tasks (Zellers et al., 2019; Qin et al., 2019). Two collections of data can be derived from the labels of such a supervised task:

$${\cal D}^{+}=\{(\mathbf{x}^{(i)},\mathbf{y}^{(i)+})\},\qquad{\cal D}^{-}=\{(\mathbf{x}^{(i)},\mathbf{y}^{(i)-})\},$$

where ${\cal D}^{+}$ is coherent behavior, e.g. neutral or entailing data in NLI, and ${\cal D}^{-}$ is incoherent behavior, e.g. contradictions. In general, many forms of this type of data can be collected, not just NLI, and it is also not necessary for the contexts $\mathbf{x}^{(i)}$ to overlap as we have written here.

Standard likelihood training can then be performed on coherent data ${\cal D}^{+}$ , while the unlikelihood objective is applied to ${\cal D}^{-}$ as we wish to push down the probability of generating the incoherent response $\mathbf{y}^{-}$ given a context $\mathbf{x}$ . That is, given an incoherent pair $(\mathbf{x},\mathbf{y}^{-})$ we use the loss

$$\mathcal{L}_{\text{UL}}(p_{\theta},\mathcal{C}^{\text{identity}}_{1:|y|},\mathbf{x},\mathbf{y}^{-}),$$

where we penalize each token in the target ( $\mathcal{C}_{t}^{\text{identity}}=\{y^{-}_{t}\}$ ). Hence, the loss makes generating the contradicting sentences less likely.
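
With identity candidates, the unlikelihood term for a negative utterance reduces to a sum of $-\log(1-p)$ over the probabilities the model assigns to that utterance's own tokens. The sketch below combines this with likelihood on coherent examples. It assumes a per-token scale of 1 and a mixing weight `alpha`; the function names and probabilities are hypothetical.

```python
import math


def nll(token_probs):
    """Likelihood term: negative log-probability of a coherent target."""
    return -sum(math.log(p) for p in token_probs)


def identity_unlikelihood(token_probs):
    """Unlikelihood term with C_t = {y_t^-}: penalize each token of the negative target."""
    return -sum(math.log(1.0 - p) for p in token_probs)


def coherence_objective(pos_batch, neg_batch, alpha=1.0):
    """Likelihood on D+ examples plus alpha-weighted unlikelihood on D- examples.

    Each batch entry lists the model's probabilities for the target's tokens.
    """
    pos = sum(nll(tp) for tp in pos_batch)
    neg = sum(identity_unlikelihood(tp) for tp in neg_batch)
    return pos + alpha * neg


# Made-up numbers: the objective is lower when the contradicting
# continuation already has low probability under the model.
low_neg = coherence_objective([[0.5]], [[0.1]])
high_neg = coherence_objective([[0.5]], [[0.9]])
```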

Related Work

Our work provides new applications of unlikelihood training (Welleck et al., 2019a), showing that unlikelihood offers a general framework for improving generative models, and in particular dialogue models. Outside of that work, the use of negative training in dialogue retrieval, rather than generation, has been studied extensively; see e.g. Humeau et al. (2019) and Nugmanova et al. (2019). In the area of generative dialogue, a number of works have focused on improving the standard likelihood training approach. Closer to our work is that of He and Glass (2019), which developed the approach of negative training to prevent generic and malicious responses in dialogue models. In terms of improving repetition and specificity, a recent alternative approach is that of control (Fan et al., 2018; Ficler and Goldberg, 2017; Ghazvininejad et al., 2017; See et al., 2019). Nucleus sampling (Holtzman et al., 2019) can help to remove generic or repetitive utterances at the expense of accuracy, but was shown to be inferior to beam blocking, which in turn was shown to be inferior to unlikelihood in Welleck et al. (2019a).

In terms of dialogue coherence, Welleck et al. (2019b) showed that retrieval, but not generative models, could be improved with NLI as a re-scorer, while Yang et al. (2018) multi-tasked with NLI. The work of Gabriel et al. (2019) has also studied improving narrative flow with a discriminative rescorer, but in that case for generated language. In our work, the improvements are tightly integrated into the training of the model itself.

Experiments

In all of our experiments we employ a large pre-trained seq2seq Transformer (Vaswani et al., 2017) as our base model, which we then fine-tune for particular tasks with the objectives outlined in Section 2 and specified in each experiment below. Following previous work (Humeau et al., 2019), we pre-train our model on dialogue data, using a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io, training to generate a comment conditioned on the full thread leading up to the comment, spanning $\sim 2200M$ training examples. Our Transformer model consists of an 8 layer encoder, 8 layer decoder with 512-dimensional embeddings and 16 attention heads, and is based on the ParlAI implementation of Miller et al. (2017). The model was trained with a batch size of 3072 sequences for approximately 3M updates using a learning rate of 5e-4, and an inverse square root scheduler. This pre-training took approximately two weeks using 64 NVIDIA V100s.

We use the ConvAI2 persona-based dialogue (Zhang et al., 2018), Wizard of Wikipedia knowledge-grounded dialogue (Dinan et al., 2019) and ELI5 long-form question answering (Fan et al., 2019) datasets to evaluate the effect of using unlikelihood to reduce copying and repetition in model generated utterances. On each dataset, we fine-tune the pre-trained pushshift.io Reddit model, then evaluate by generating next-utterances for dialogue contexts from the test set (or validation in ConvAI2, as the test set is hidden). We use greedy decoding in our main experiments for simplicity and scalability, but we also obtained similar results with beam search, shown in Appendix A.

To measure label repetition in a sequence $\mathbf{y}$ , we use the portion of duplicate n-grams:

$$1.0-\frac{|\text{unique n-grams}(\mathbf{y})|}{|\text{n-grams}(\mathbf{y})|},$$

and report the metric averaged over the examples. Label repetition increases from zero as the model generates more repeated n-grams. To measure context repetition, we measure the fraction of generated n-grams that appear in the original context:

$$\frac{|\text{n-grams}(\mathbf{y})\cap\text{n-grams}(\mathbf{x})|}{|\text{n-grams}(\mathbf{y})|},$$

and report the metric averaged over the examples. Context repetition increases when the model ‘copies’ n-grams from the context. To quantify language modeling quality, we use standard perplexity and F1 metrics.
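
The two repetition metrics can be computed per example as in the following sketch. The n-gram order `n=3`, the handling of sequences shorter than `n`, and the choice to count generated n-grams with multiplicity are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def label_repetition(label, n=3):
    """Portion of duplicate n-grams: 1 - |unique n-grams| / |n-grams|."""
    grams = ngrams(label, n)
    if not grams:
        return 0.0  # assumed convention for sequences shorter than n
    return 1.0 - len(set(grams)) / len(grams)


def context_repetition(context, label, n=3):
    """Fraction of generated n-grams (counted with multiplicity) found in the context."""
    grams = ngrams(label, n)
    if not grams:
        return 0.0
    context_ngrams = set(ngrams(context, n))
    return sum(g in context_ngrams for g in grams) / len(grams)


# Hypothetical example with 7 generated trigrams, one of them a duplicate
# and two of them copied from the context.
context = "i have two dogs".split()
label = "i have two cats and i have two birds".split()
label_rep = label_repetition(label)
context_rep = context_repetition(context, label)
```

Averaging these per-example values over a dataset gives corpus-level numbers of the kind reported for each model.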

We use the pre-trained model fine-tuned with MLE as the baseline, and compare it against the pre-trained model fine-tuned with copy and repetition unlikelihood (§2.1).

Results for ConvAI2 are shown in Table 2. We see that training unlikelihood using only-contexts or only-labels reduces their corresponding metrics dramatically compared to the MLE baseline. Training with both context- and label-repetition unlikelihood reduced both context repetitions (by 69%, .0352 vs. .1131) and label repetitions (by 89%, .0023 vs .0210) compared to the MLE baseline, much closer to human levels, while keeping perplexity essentially constant.

Comparatively, the Wizard of Wikipedia MLE baseline experiences a much larger problem with context repetition, due to its tendency to copy grounded knowledge verbatim (Table 2).

Results for ELI5, shown in Table 3, show that it has an especially large problem with label repetition, and that label-unlikelihood is able to reduce the repetitions by 91% (.055 vs .617), while significantly boosting F1 (.130 to .182).

Figures 2 and 3 show perplexity as a function of label and context repeats respectively using unlikelihood on ELI5. The parameter $\alpha$ can clearly control repeats smoothly, with only very high values resulting in increased perplexity.

Human Evaluation

Finally, we perform a human evaluation on ELI5 using the same pairwise evaluation scheme as Fan et al. (2019), comparing the MLE baseline to UL (Label only). Evaluators are asked: Which response answers the question better? They are told to consider both the readability and accuracy of the answer. Results are given in Figure 4 (left), showing a statistically significant improvement over the baseline (150 trials, two-tailed binomial test, $p<0.01$ ). Further details are given in Appendix C.
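
For readers unfamiliar with the significance test, the sketch below shows one common exact formulation of a two-tailed binomial test: sum the probabilities of all outcomes that are no more likely than the observed one under a null win rate of 0.5. The function name is hypothetical and this is not necessarily the exact procedure used for the reported results.

```python
from math import comb


def binomial_two_tailed_p(wins, trials, p0=0.5):
    """Exact two-tailed binomial p-value under a null win probability p0."""
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[wins]
    # Small relative tolerance so outcomes tied with the observed one are included.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Illustrative call with a made-up win count (not a result from the paper).
p_value = binomial_two_tailed_p(wins=95, trials=150)
```

A preference split close to half of the trials yields a p-value near 1, while a strongly lopsided split yields a small one.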

2 Vocabulary Usage

We evaluate the ability of vocabulary unlikelihood (§2.2) to reduce the mismatch between model and human token distributions.

We use the ConvAI2 dataset, where our baseline is again trained using maximum likelihood. Starting with the baseline model, we then fine-tune several models using vocab unlikelihood at logarithmically interpolated values of the mixing weight $\alpha$ .

We partition the vocabulary into ‘frequent’, ‘medium’, ‘rare’, and ‘rarest’ using the human unigram distribution computed with the ConvAI2 training set, corresponding to the sorted token sets whose cumulative mass accounts for the top 40%, the next 30%, the next 20% and the final 10% of usage, respectively. We evaluate a model by generating utterances given contexts from the ConvAI2 validation set, and compute the fraction of tokens within each class.
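
The partitioning and the per-class usage measurement can be sketched as below. Assigning a token by the cumulative mass accumulated before it, and treating tokens unseen in the human data as 'rarest', are assumed conventions for illustration; the function names are hypothetical.

```python
from collections import Counter

CLASS_NAMES = ("frequent", "medium", "rare", "rarest")


def partition_vocabulary(human_tokens, cutoffs_percent=(40, 70, 90)):
    """Assign each token a class by cumulative mass in the human unigram distribution."""
    counts = Counter(human_tokens)
    total = sum(counts.values())
    classes = {}
    seen = 0  # number of occurrences covered by more frequent tokens
    for token, count in counts.most_common():
        # Integer arithmetic avoids floating-point edge cases at the cutoffs.
        bucket = sum(seen * 100 >= cut * total for cut in cutoffs_percent)
        classes[token] = CLASS_NAMES[bucket]
        seen += count
    return classes


def class_fractions(generated_tokens, classes):
    """Fraction of generated tokens that fall in each frequency class."""
    counts = Counter(classes.get(tok, "rarest") for tok in generated_tokens)
    return {name: counts[name] / len(generated_tokens) for name in CLASS_NAMES}


# Toy human corpus of 10 tokens and a toy generation of 4 tokens.
human = ["the"] * 5 + ["i"] * 3 + ["cat", "zebra"]
classes = partition_vocabulary(human)
fractions = class_fractions(["the", "the", "i", "cat"], classes)
```

Comparing the resulting fractions against the 40/30/20/10 human split shows how far a model's usage drifts toward frequent tokens.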

Figure 5 shows how the vocabulary distribution obtained after unlikelihood training is affected by the choice of mixing hyperparameter $\alpha$ (Eq. 1): it can smoothly transition between the human training distribution and the MLE trained distribution (‘Baseline’), which is far from the human one.

Table 4 compares the MLE baseline with unlikelihood with increasing $\alpha$ values in terms of distribution and F1 score. The vocabulary unlikelihood fine-tuning shifts probability mass from the over-represented frequent words towards under-represented medium and rare words, with the effect strengthening as $\alpha$ increases. At a small cost to perplexity and F1, the unlikelihood tuning reduced the overuse of common tokens by 9 points, matching the human rate, while improving the production of rare tokens by 3 percentage points.

Human Evaluation

Finally, we perform a human evaluation using the ACUTE-EVAL framework (Li et al., 2019), comparing the MLE baseline to UL for various $\alpha$ . First, 252 human-bot conversations (8 turns each) are collected, and then models are compared pairwise by asking the question: Who would you prefer to talk to for a long conversation? In these experiments, both methods generate with beam search and context blocking of trigrams. Results are given in Figure 4 (right), showing a statistically significant improvement over the baseline according to humans (two-tailed binomial test, $p<0.01$ ). Further details are given in Appendix C.

3 Contradictions

We use the dialogue natural language inference (NLI) task of Welleck et al. (2019b) to obtain labeled non-contradicting and contradicting dialogue sentence pairs to use in unlikelihood training (§2.3). Dialogue NLI contains utterances labeled as entailing (E), neutral (N) or contradiction (C), given a premise that is either a persona sentence (an initial context sentence describing a dialogue agent’s personality) or another dialogue utterance from the Persona-Chat dialogue task (Zhang et al., 2018). We show examples from Dialogue NLI in Figure 6. The original data consists of sentence pairs $(s_{1},s_{2})$ along with a label (E, N, or C), and was constructed by developing a schema and employing crowdworkers to label utterances with relation triples. The labels are then inferred from the triple representation.

We first transform the original classification dataset into a form useful for unlikelihood training of a generative dialogue model. We consider two setups: (i) a two utterance generation task; and (ii) a full dialogue generation task.

We adapt the initial dialogue NLI dataset by using entailing and neutral training sentence pairs as plausible positive utterances, and contradicting pairs as negatives. That is, if a pair $(s_{1},s_{2})$ from Dialogue NLI has label E or N, the example $(\mathbf{x},\mathbf{y})=(s_{1},s_{2})$ is added to $\mathcal{D}^{+}$ , otherwise (label C) it is added to $\mathcal{D}^{-}$ .
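
The label-based split can be written down directly, as in this small sketch. The function name and the example sentence pairs are invented for illustration and are not taken from Dialogue NLI.

```python
def split_nli_pairs(pairs):
    """Route (s1, s2, label) triples into positive and negative (x, y) examples."""
    d_pos, d_neg = [], []
    for s1, s2, label in pairs:
        if label in ("E", "N"):
            d_pos.append((s1, s2))  # entailing or neutral: plausible continuation
        elif label == "C":
            d_neg.append((s1, s2))  # contradiction: target for unlikelihood
        else:
            raise ValueError(f"unknown label: {label!r}")
    return d_pos, d_neg


# Invented sentence pairs.
pairs = [
    ("i have two dogs .", "i love my pets .", "E"),
    ("i have two dogs .", "i work as a teacher .", "N"),
    ("i have two dogs .", "i do not own any pets .", "C"),
]
d_pos, d_neg = split_nli_pairs(pairs)
```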

We consider two types of entailment: entailing sentence pairs that appear together in a dialogue in the original Persona-Chat dataset and are therefore natural (‘entailment’), and those that only entail via their triple relations (‘triple-entailment’). The latter are more challenging, noisier targets. Evaluation is performed by measuring the test set perplexity over the four target label types, where contradictions should have relatively higher perplexity. We additionally evaluate a selection accuracy task, where for each test example there are two candidate responses: a positive and a negative (contradicting) statement. The candidate response with the lowest perplexity is considered to be the model’s selection, and we measure the selection success rate. Evaluation is broken down by positive type (entailment, triple-entailment, neutral). Dataset statistics are given in Table 5.
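
The selection accuracy task can be illustrated with the sketch below, which assumes the model's per-token probabilities for each candidate are already available. The function names and the numbers are hypothetical.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of a sequence's tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def selection_accuracy(examples):
    """Fraction of examples where the positive candidate has lower perplexity.

    Each example is a pair (positive_token_probs, negative_token_probs).
    """
    correct = sum(perplexity(pos) < perplexity(neg) for pos, neg in examples)
    return correct / len(examples)


# Made-up probabilities: the first example is selected correctly, the
# second is not because the contradicting candidate is more probable.
examples = [
    ([0.5, 0.5], [0.25, 0.25]),
    ([0.1, 0.1], [0.5, 0.5]),
]
accuracy = selection_accuracy(examples)
```

Because perplexity normalizes by length, candidates of different lengths can be compared on an equal footing.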

Full Dialogue Task

To evaluate in a more realistic setup that involves full dialogue rather than a single utterance, we take full Persona-Chat dialogues (Zhang et al., 2018) similar to Figure 6, and map back the dialogue NLI data to provide positive and negative continuations of the dialogue. We consider continuations as either triple entailing utterances, neutral utterances or contradictions – where the relation triple is used to match the existing persona or dialogue turns by the same speaker to induce the label. That is, an example $(\mathbf{x},\mathbf{y})$ consists of a dialogue history $\mathbf{x}=\{p_{1},\ldots,p_{k},u_{1},\ldots,u_{t}\}$ and utterance $\mathbf{y}=s_{2}$ , where $(s_{1},s_{2})$ is a sentence pair from Dialogue NLI, and at least one sentence in $\mathbf{x}$ has the same relation triple as $s_{1}$ . When the pair $(s_{1},s_{2})$ is labeled as E or N in Dialogue NLI, the example $(\mathbf{x},\mathbf{y})$ is added to $\mathcal{D}^{+}$ , and otherwise it is added to $\mathcal{D}^{-}$ .

Results

Our MLE baseline obtains a perplexity of 11.4, in line with current best systems on this task (Lewis et al., 2019). Unfortunately, despite being good on such standard metrics, our baseline models fail at our coherence task. As seen in Table 6 for the two utterance task, the perplexity of contradicting utterances (12.5) is on average lower than for neutral (36.7) or triple-entailing utterances (17.5), although it is higher than entailing utterances. We believe this is due to contradicting utterances having high word overlap with the premise utterance, coupled with an inability to judge incoherence. Viewed as a selection task between utterances, picking the utterance with the lowest perplexity, this means the selection rates of non-contradicting utterances are very low, e.g. picking neutral utterances over contradicting utterances only 18% of the time. Even fully entailing utterances are only picked 73% of the time. Similar results are found on the full dialogue task as well, see Table 7.

Unlikelihood training brings large improvements in coherence metrics, whilst minimally impacting overall dialogue perplexity. After applying unlikelihood, perplexity for contradicting utterances has a clear signature, with very large average values compared to entailing or neutral utterances, e.g. 248.9 vs. 9.1 for contradict vs. entail on the two utterance task. This converts to corresponding large increases in selection accuracy across all types on both tasks, e.g., an increase from 18% to 78% on neutral statements on the two utterance task, and from 37.4% to 69.8% on the full dialogue task.

Some example model predictions are given in Figure 7, comparing the MLE baseline and unlikelihood model perplexities of generating the given hypotheses. The likelihood model cannot differentiate between contradicting and entailing statements easily, while there are large perplexity differences for the unlikelihood model in these cases.

Conclusion

Generating consistent and coherent human-like dialogue is a core goal of natural language research. We studied several aspects that contribute to that goal, defined metrics to measure them, and proposed algorithms that improve them, mitigating some of the failings of maximum likelihood training, the current dominant approach. Our method defines objective functions under the umbrella of unlikelihood: during training, we wish to make inconsistent dialogue unlikely by lowering the probability of such events occurring. This makes generative models repeat themselves less, copy the context less, and use more rare words from the vocabulary – closer to matching human statistics. Further, utilizing supervised datasets with labeled coherent and incoherent utterances and applying unlikelihood yields measurably improved levels of coherence with respect to the aspect measured, in this case contradiction. Future work could apply this same technique with other supervised data, e.g. correcting causal or commonsense reasoning errors (Zellers et al., 2019; Qin et al., 2019).

References

Appendix A Repetition Control with Beam Search

The experiments on repetition and copying in the main paper were carried out with greedy decoding for simplicity. In this section we show that similar results hold with beam decoding as well. Using a beam size of 5, we take the same 4 models from Table 2 and compute metrics with beam search instead. The results are given in Table 8, which shows similar trends to before, except that the baseline model using beam search tends to suffer more from repetition, which is a known result (Holtzman et al., 2019). Note that we simply evaluated the same unlikelihood models as before, but we expect that better results could be obtained by performing sequence level unlikelihood training with beam search in the training loop, as well as choosing hyperparameters specifically with this kind of decoding being used to measure validation performance.

Appendix B Nucleus Sampling for Vocabulary Control

Table 9 compares the MLE baseline, unlikelihood with increasing $\alpha$ values, and Nucleus sampling (Holtzman et al., 2019) with hyperparameter $p$ in terms of distribution and F1 score. The vocabulary unlikelihood fine-tuning shifts probability mass from the over-represented frequent words towards under-represented medium and rare words, with the effect strengthening as $\alpha$ increases. At a small cost to perplexity and F1, the unlikelihood tuning reduced the overuse of common tokens by 9 points, matching the human rate, while improving the production of rare tokens by 3 percentage points.

Nucleus sampling is a popular method that can also produce generations closer to the human vocabulary distribution. It does this by sampling from the model’s probability distribution rather than using beam search, where the sampler restricts to the smallest set of tokens with total mass above a threshold $p\in[0,1]$ . Small values of $p$ are similar to greedy decoding. Increasing $p$ yields distributions closer to human, but with large losses in F1 score, e.g. $p=0.5$ has a similar distribution to unlikelihood with $\alpha=10^{2}$ but the F1 scores are $0.160$ vs. $0.190$ . This can be understood because maximizing likelihood during decoding yields better token accuracy than sampling (Welleck et al., 2019a), so the unlikelihood training approach, which both uses likelihood decoding and matches the human distribution, can obtain the best of both worlds.
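
For reference, a single step of nucleus sampling can be sketched as follows. This is a generic illustration over a dictionary of next-token probabilities, not the implementation used in the experiments; the function names and the toy distribution are assumptions.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}


def nucleus_sample(probs, p, rng=random):
    """Sample one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution.
probs = {"the": 0.5, "a": 0.25, "cat": 0.125, "dog": 0.125}
small_nucleus = nucleus_filter(probs, 0.3)   # only the top token survives
larger_nucleus = nucleus_filter(probs, 0.7)  # top two tokens survive
token = nucleus_sample(probs, 0.3, rng=random.Random(0))
```

With a small threshold only the most probable token survives, so sampling behaves like greedy decoding; larger thresholds admit rarer tokens at the cost of token accuracy.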

Appendix C Human Evaluation

We follow Li et al. (2019) and perform a pairwise comparison with full-length model conversations. We first collected 252 model-human conversations with each of the models (the MLE baseline and unlikelihood models at several values of $\alpha$ ; examples in Figure 8). We then set up a pairwise comparison using the software of Li et al. (2019), using the same question (“Who would you prefer to talk to for a long conversation?”) and the exact same quality control question (a baseline greedy model without repetition control, versus a human). We collected approximately 200 preferences per model comparison and filtered out annotators who failed quality control.

Description of ELI5 repetition setup

We follow Fan et al. (2019) and perform a pairwise evaluation where human annotators were asked “which response answers the question better?” A screenshot of the UI is shown in Figure 9. Human evaluators were asked to rate a total of 5 questions, two of which were quality control annotations. The quality control examples contained the real human responses, along with model predictions: one question contained a baseline model, and one contained an unlikelihood model. Annotators who did not pick the human responses in the quality control questions were removed from the final results. We collected 200 annotations comparing the baseline and the unlikelihood model.

Results

Evaluation results from all evaluated matchups are shown in Figure 10. We find that our repetition-controlled ELI5 model significantly outperforms the MLE baseline. We also find that two of the vocabulary unlikelihood models significantly outperform the MLE baseline. We compute significance with a two-tailed binomial test ( $p<.01$ ).