Controlling Style in Generated Dialogue

Eric Michael Smith, Diana Gonzalez-Rico, Emily Dinan, Y-Lan Boureau

Introduction

Conversational models have shown vastly improved performance through large scaling efforts (Zhang et al., 2019; Adiwardana et al., 2020; Boyd et al., 2020; Roller et al., 2020b), paralleling trends observed in non-conversational text generation (Radford et al., 2019; Keskar et al., 2019; Shoeybi et al., 2019). A challenge of language generation in general, and dialogue in particular, is that there is more than one valid way to respond to a given context, depending, e.g., on the unobserved goal of the speaker, the usual tone of their language, or their current mood. Training models over vast amounts of data pooled from millions of users with a wide range of opinions and styles means that the resulting generations seem to come from a chameleon rather than from a single, consistent conversational agent. To address this, researchers have explored ways to give generation stable grounding, e.g., a persona (Zhang et al., 2018; Dinan et al., 2020), knowledge (Dinan et al., 2019; Ghazvininejad et al., 2018), personal situations (Rashkin et al., 2019), internet websites (Keskar et al., 2019), or previous conversations from the same actor (Boyd et al., 2020), which provide the model with a specific set of talking points drawn from a potentially huge set. Forms of generation control that are less specific in terms of content and have a much smaller dimension, e.g., sentiment, tone, or style, have also been proposed (Keskar et al., 2019; Shuster et al., 2018; Dathathri et al., 2020).

In this work, we aim to achieve control over a medium-sized (217) set of styles from Shuster et al. (2018), which still allows for much response variety. We train a classifier from the data in Shuster et al. (2018) and show how to adapt three previously proposed promising approaches to this task with large-scale state-of-the-art conversational models: (1) a retrieve-and-style-transfer approach modified from Weston et al. (2018), (2) an inference-time iterative refinement procedure proposed in Dathathri et al. (2020), and (3) a conditioned generation approach that fine-tunes the model with augmented inputs tagged with the target style, similar to Keskar et al. (2019). Comparing trade-offs in terms of performance, cost at training and inference time, and complexity of use, we find that fine-tuned conditioned generation yields the strongest performance, in terms of successful detection of the target style in the generation. Its inference speed is also considerably more tractable compared to the inference-time iterative refinement procedure. Automated and human evaluations show that the resulting conversational models can convincingly match target tones, while largely preserving other conversational metrics. This work thus makes the following contributions: (1) we adapt three different approaches for style control to state-of-the-art conversational architectures with a mid-size (217) style space and compare their trade-offs; (2) we propose a practical pipeline combining style-labelled data and unlabelled in-domain conversational data that can be generalized to any style space for which a reasonable classifier can be trained, and empirically validate that the resulting model can convincingly alter the style of conversations without substantially damaging other conversational metrics. 
Our best style control model and code have been made available through the ParlAI framework (https://github.com/facebookresearch/ParlAI/blob/master/parlai/zoo/style_gen/c75_labeled_dialogue_generator.py); additional models and classifiers mentioned in this paper will be made available soon, also through ParlAI.

The remainder of the paper is organized as follows. Sec. 2 lists related work. Sec. 3 details the datasets we use and the building blocks of the systems we compare. Sec. 4 shows the results of our comparisons. Sec. 5 summarizes the takeaways from this work and proposes future directions.

Related work

We first present open-domain conversational architectures that form the foundation of our models, then review previous work for controlling styles.

Conversational models can be based on generation, retrieval, or a combination of both (e.g., see Roller et al. (2020b)). Models including a retrieval component had previously been found to perform better in human evaluation (e.g., Weston et al. (2018); Rashkin et al. (2019)). A recent development has been the dramatic scaling of transformer-based architectures for language and dialogue generation to billions of parameters (Radford et al., 2019; Keskar et al., 2019; Shoeybi et al., 2019; Adiwardana et al., 2020; Roller et al., 2020b; Boyd et al., 2020; Brown et al., 2020). Combining such very large models with optimized beam search, Roller et al. (2020b) have obtained higher engagingness and humanness ratings with generative models, compared to retrieval or retrieve-and-refine approaches (Weston et al., 2018). Approaches that might help smaller models may become moot when higher-capacity models trained on large amounts of data are used instead. In this work, we use variants of the 2.7B-parameter generative models released by Roller et al. (2020b), to ensure that the methods work well with state-of-the-art conversational models and produce generations that are as fluent as possible.

2.2 Styles in conversation

The evaluation of an open-domain conversational model often relies on asking humans to rate whether the way the model responds to a given conversational context is fluent, relevant, specific, and human-like (Adiwardana et al., 2020; Roller et al., 2020b), which is a fairly minimal set of very generic desirable qualities in generated dialogue. There are many other attributes that could describe a given response, e.g., whether it is polite (Niu and Bansal, 2018), formal (Rao and Tetreault, 2018), empathetic (Rashkin et al., 2019), knowledgeable (Dinan et al., 2019), or associated with a variety of styles (Shuster et al., 2018). Controlling finer aspects of generation could enable a more consistent experience when interacting with a model, and provide ways to focus on styles that tend to produce generations with desirable qualities such as being pleasant to talk to, less toxic, more empathetic, or generally better-behaved (See et al., 2019; Roller et al., 2020a; Rashkin et al., 2019). In this work, we use the set of styles proposed in Shuster et al. (2019, 2018), since it has been shown to result in engaging chats when focusing on positive or neutral styles and it is a set of relatively large size (217). Shuster et al. (2019) propose a dataset of images captioned with a target style, while Shuster et al. (2018) provide short conversations with target styles.

2.3 Controlling style

A style control method that has appealing advantages is the plug-and-play method (PPLM) from Dathathri et al. (2020), an iterative generation method using a classifier on top of a pre-trained generation model. For each token, the mean hidden representation of all tokens so far is fed into a style classifier. A backward pass through the classifier and generator is performed, and the gradients are used to update the activations in the generator’s attention layers. These forward and backward passes are repeated several times per time step, and the following token is then sampled (Dathathri et al., 2020). This work has impressive results and the desirable property that it can be used without having to fine-tune the underlying base model. We adapt this approach for our purpose, and compare it with conditioned generation approaches. Another work that achieves fine-grained control using a very large architecture is the CTRL model (Keskar et al., 2019). Its style conditioning relies on control codes obtained from the training data (meta-data). However, this work is not tailored to dialogue. We previously proposed a style transfer architecture using noisy encoders and decoders and style conditioning through an additional token (Lample et al., 2019), and adapted it for use with reasonably large style spaces (Smith et al., 2019). The style control through an additional context token is similar to the best-performing model in this paper. However, the models underlying both these works are much smaller, non-conversational architectures for which generations are considerably less fluent than those of the models we consider here, and the task of rewriting with a given style is more constrained than the conversational generation task that this work focuses on.
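To make the iterative refinement idea concrete, here is a minimal toy sketch in plain Python. It is an illustration under strong simplifying assumptions, not the PPLM implementation: the real method backpropagates through the generator and perturbs its attention activations, whereas this sketch only nudges a single pooled hidden vector using the analytic gradient of a linear softmax classifier. All names, dimensions, and weights (`pplm_style_step`, `toy_weights`, the step size) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def style_probs(hidden, weights):
    """Linear classifier head applied to a (mean-pooled) hidden vector."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)


def pplm_style_step(hidden, weights, target, step_size=0.5, num_steps=5):
    """Nudge `hidden` toward the target style with a few gradient-ascent steps.

    For a linear head followed by softmax, the gradient of log p(target)
    with respect to `hidden` is weights[target] - sum_k p_k * weights[k].
    """
    hidden = list(hidden)
    for _ in range(num_steps):
        probs = style_probs(hidden, weights)
        grad = [
            weights[target][j] - sum(p * row[j] for p, row in zip(probs, weights))
            for j in range(len(hidden))
        ]
        hidden = [h + step_size * g for h, g in zip(hidden, grad)]
    return hidden


# Toy setup (assumption): 3 hypothetical styles, 2-dimensional hidden state.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
start = [0.0, 0.0]
before = style_probs(start, toy_weights)[1]
after = style_probs(pplm_style_step(start, toy_weights, target=1), toy_weights)[1]
```

In this toy setting, the probability assigned to the target style increases after the refinement steps. In the actual method, the updated activations are then used to sample the next token, and the loop is repeated at every time step, which is what makes inference costly.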

Controlling styles with state-of-the-art conversation architectures

Pieces from previous work can be combined into new architectures that can reply to a dialogue context to match a target style. We present the conversational datasets we use, then introduce the three methods we adapt and compare in this work, and their advantages and shortcomings. They differ on whether they use retrieval, and whether they require fine-tuning of the whole architecture.

We use different datasets for providing the style space and fine-tuning most models.

Image-Chat (IC; Shuster et al., 2018) is a dataset of 3-turn conversations discussing an image, totalling about 400k utterances. Each partner in each conversation conveys a given style (e.g., Curious, Casual) from among a set of 217. These styles are split into “positive,” “neutral,” and “negative” (see Table 10 in the Appendix). The distribution of styles in the dataset is reasonably balanced and the set of styles results in colorful, diverse conversation (Shuster et al., 2018). However, this dataset is not purely conversational, because the conversations refer to an image that both conversation partners are seeing. The dataset can be used to teach a model to condition on a style, but produces conversations that are not self-contained (e.g., “the dog next to the statue seems bored”). Therefore, we also use purely textual datasets to ensure natural conversations without reference to images (see next paragraph). Unfortunately, these textual datasets were collected without providing target styles from the IC style space. We thus also use Image-Chat to train a classifier that assigns style labels to utterances. We then use this classifier to augment the purely conversational datasets with style labels (see Sec. 4.2).

Dialogue datasets (D).

Following Roller et al. (2020b), we start from models pre-trained on a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io (Baumgartner et al., 2020), then fine-tune our models on four public open-domain conversational datasets from previous work, collectively denoted by D: (1) ConvAI2, comprising about 140k conversations in which two partners discuss themselves based on a given persona, e.g., “I have four sisters.” (Zhang et al., 2018; Dinan et al., 2020); (2) Wizard of Wikipedia (WoW), a dataset of 22k conversations displaying knowledge grounded on Wikipedia articles (Dinan et al., 2019); (3) EmpatheticDialogues (ED), comprising 25k conversations in which one speaker talks about a situation associated with a target sentiment (e.g., Surprised, Proud, Angry), and the other responds empathetically (Rashkin et al., 2019); and (4) BlendedSkillTalk (BST), comprising 5k conversations blending skills of the three previous datasets: conveying a consistent persona, displaying knowledge, and responding empathetically (Smith et al., 2020).

3.2 Retrieve-and-style-transfer (RnST)

Based on results in Weston et al. (2018) showing a superiority of retrieve-and-refine models over either retrieval or generation methods for dialogue, we considered an intuitive approach combining retrieval and style-controlled generation.

The best retrieve-and-refine model from Roller et al. (2020b) first uses a retrieval model to retrieve a response from the training set, then appends that retrieved response to the dialogue context (after a separator), to generate a better response. Our retrieve-and-style-transfer method (RnST) additionally appends a target style after a second separator. The model is then trained to generate the gold truth response from that augmented input. The retriever is far from perfect, which creates enough noise to prevent the model from simply copying the retrieved response. The two elements of (1) noise, and (2) pairing of a noisy un-styled first guess with a target style to generate the desired response, are both present in our recent style transfer approaches based on added noise and back-translation (Lample et al., 2019; Smith et al., 2019). Another approach that we did not try would consist in first training a retriever conditioned on style, and then training a vanilla retrieve-and-refine model using that style-conditioned retriever to provide the first guess.

3.3 Iteratively refining generations to match a target style during inference (PPLM)

The second method (hereafter PPLM) adapts the plug-and-play, minimal-training approach from Dathathri et al. (2020) to dialogue and a different set of styles. The approach of Dathathri et al. (2020) is based on GPT2, which is not a dialogue model and would therefore be at a disadvantage when evaluated on dialogue. In order to provide a fairer comparison between methods, we replace GPT2 by our C0 model, which has been pre-trained on pushshift.io Reddit and fine-tuned on several dialogue datasets (see Sec. 4.3), so that all models we compare share a common base. We also change the guiding classifier head to accommodate the style space from Image-Chat. Given a base model, the PPLM generative method requires no fine-tuning beyond that of the classifier. Additionally, controlling the style through iterative steps affords direct, fine-grained, gradual control over the style at inference time. Lastly, the use of a classifier to directly guide the refinement makes it possible to move not only “towards” a desired style, but also “away” from it, which is not straightforward with the other conditioning methods. However, inference is also much more costly and might be prohibitive with very large models. It is also unclear whether the good results demonstrated over a small number of classes in text generation would generalize to a much larger set of styles, and to dialogue.

3.4 Training a conditioned generator on inputs appended with style tags (C)

The last family of methods that we include in our comparison simply relies on conditioning tokens appended to the dialogue context. We hereafter denote these models by C to reflect their conditioned nature. We fine-tune the 2.7B pushshift.io Reddit pre-trained generative model from Roller et al. (2020b), appending target styles to the dialogue context (after a separator). While purely generative models had long been inferior to retrieval variants in dialogue (Weston et al., 2018; Rashkin et al., 2019), very recent generative models have been shown to perform better when combined with beam search with a minimum output length (Roller et al., 2020b), making them an attractive base. This method requires whole-architecture fine-tuning to learn to use the augmented input, but inference is then straightforward. Although we do not test this here, fine-grained control over the degree of intensity of the target style could be achieved by qualifying the appended style with a degree (e.g., a numerical output from a style classifier capturing intensity of the style of the training example), as in Keskar et al. (2019), with the limitation that the degree of control would rely on the available training data and might not directly generalize to other intensities the way the iterative inference PPLM method promises.

Experiments

All experiments are conducted using the ParlAI toolkit (Miller et al., 2017). In order to fairly compare the approaches in Sec. 3, we build them as enhancements over the same underlying model. That model is a pushshift.io Reddit pretrained 2.7B model from Roller et al. (2020b), hereafter denoted by R, which was pretrained on a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io (Baumgartner et al., 2020).

As described in the previous section, Retrieve-and-style-transfer combines a retriever and a generator. Having a retriever take care of finding a relevant response might free up more model capacity to focus on good style conditioning. We do not change the retriever system from Roller et al. (2020b), but we modify the generator.

In order to teach the model to condition generations on styles, we fine-tune the generator R on IC with the ground-truth style appended to the dialogue context. However, conversations in IC are all grounded on an image, which is not provided to the model in our architecture (this architecture does not have image features). Fine-tuning solely on IC results in conversations that clearly refer to some unknown image, rather than self-contained conversations. To avoid this problem, we also fine-tune on D, which does not contain style labels. No styles from IC were given to workers at data collection time for the datasets in D. ED dialogues were collected with a grounding in a set of 32 sentiments; however, these are different from the styles used in IC, and pertain to situations rather than tones as in IC. We do not make use of the sentiment labels from ED in this work. They might be somewhat predictive of IC styles, given that the style spaces are related (e.g., Anxious appears in both sets), but relying on these labels would lead to treating ED differently from the other datasets in D.

We compare models fine-tuned either with a retrieved reply appended to the input (RnST-IC+D), or without (C100-IC+D). The C100-IC+D notation captures the fine-tuning on labeled IC and unlabeled D, and the fact that the architecture and training are the same as for C100 in Sec. 4.3. Full experimental details and an example conversation are given in Appendix B.

Automatic evaluation of the accuracy of style control is conducted for generation using either IC or BST contexts, by running a trained style classifier on the model generations and reporting the percentage that get classified into the target style. The classifier is trained on IC conversations (classifier details are given in the Appendix, Sec. C). Average accuracy on the IC test set itself on turns 2 and 3 is 13.0% across the 217 classes. This classifier uses both the utterance to be classified and the previous utterance as context, since something might only be, e.g., “sarcastic” in the context of what was said before. A classifier using only the utterance itself achieves 12.6%.
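The style-control accuracy described above amounts to a simple hit rate. The following minimal sketch shows the computation under stated assumptions: `classify` is a hypothetical stand-in for the trained style classifier (here taking only the utterance, whereas the classifier in this paper also sees the previous utterance), and the toy rule inside `toy_classify` is purely illustrative.

```python
def style_accuracy(generations, target_styles, classify):
    """Percentage of generations whose predicted style equals the target style."""
    if len(generations) != len(target_styles):
        raise ValueError("need exactly one target style per generation")
    if not generations:
        return 0.0
    hits = sum(
        1
        for text, target in zip(generations, target_styles)
        if classify(text) == target
    )
    return 100.0 * hits / len(generations)


def toy_classify(text):
    # Hypothetical rule standing in for the trained 217-way style classifier.
    return "Curious" if text.rstrip().endswith("?") else "Casual"


example_accuracy = style_accuracy(
    ["What breed is your dog?", "Cool, sounds fun.", "I like it."],
    ["Curious", "Casual", "Curious"],
    toy_classify,
)
```

Here two of the three toy generations are classified into their target style. Because the classifier itself is imperfect (13.0% accuracy on IC), this metric is best read as a relative measure for comparing models rather than an absolute rate of style success.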

Results in Table 2 show that conditioning on retrieved utterances hurts style control. This weaker style control could still be an acceptable trade-off if the generated reply was of sufficiently higher quality (e.g., more relevant to the dialogue context, which we do not test here), given the superior results long obtained with retrieval over generation in previous work (e.g., Weston et al. (2018); Rashkin et al. (2019)). However, recent results in Roller et al. (2020b) have instead obtained better performance from purely generative models when using larger architectures and optimized decoding, which we adopted here. Therefore, we expect that other conversational metrics would also not favor conditioning on a retrieved reply. A retrieve-and-style-transfer system could still be attractive as a way to use one or several out-of-the-box style transfer models without having to fine-tune the whole model for every style space, by simply forming a pipeline from the retriever followed by the style transfer model.

Another observation is that style control does not transfer very well from IC to BST. We also noted when interacting with the models that the image-grounded nature of the IC training conversations resulted in some conversations referring to an unavailable image, which is jarring, even though the model was also fine-tuned on the imageless datasets from D. In the remainder of this paper, we thus use IC only to train the style classifier, and then use that trained style classifier to label D with styles, as detailed in Sec. 4.2. We denote by D+ the dataset thus augmented. Once D has been labelled, we fine-tune R exclusively on D+ and drop IC from the fine-tuning step.

4.2 Labeling D with styles (D+)

The method we outline here provides a way to use an unlabelled dataset with desirable qualities, by leveraging another labelled dataset purely to train a label classifier. Here, the advantage of the unlabelled dataset is that it is conversational and self-contained, without reliance on an image; in other settings, the advantage could be sheer size, as in semi-supervised learning.

In practice, we augment each utterance from the four datasets in D with style labels, obtained by running the style classifier trained on IC, yielding the weakly labeled dataset D+. D+ is used to provide style conditioning during fine-tuning in the remainder of this paper. A classifier trained on this newly labeled dataset, using only the previous utterance as input, obtains 2.1% accuracy, above the chance level of 0.5%. This confirms the intuition that the previous utterance has some predictive power over the tone of the next utterance in a natural dialogue. This cannot be done on Image-Chat, where the labels were random targets provided to workers instead of organic conversational choices.
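The weak-labeling step can be sketched as a single pass over the dialogues. This is a minimal illustration, not the pipeline used in this work: the `classify(previous_utterance, utterance)` interface, the dictionary layout, and the toy rule are all assumptions. The chance level quoted above follows from the size of the style space.

```python
NUM_STYLES = 217
# Uniform-guess accuracy over 217 classes, in percent (about 0.46%).
CHANCE_ACCURACY_PCT = 100.0 / NUM_STYLES


def label_with_styles(dialogues, classify):
    """Attach a predicted style to every utterance of every dialogue.

    `classify(previous_utterance, utterance)` is a hypothetical interface
    for a style classifier trained on a separate labeled dataset.
    """
    labeled = []
    for dialogue in dialogues:
        previous = ""
        turns = []
        for utterance in dialogue:
            turns.append(
                {"text": utterance, "style": classify(previous, utterance)}
            )
            previous = utterance
        labeled.append(turns)
    return labeled


def toy_classify(previous, utterance):
    # Hypothetical rule standing in for the classifier trained on IC.
    return "Curious" if utterance.rstrip().endswith("?") else "Casual"


toy_dialogues = [["Do you have any pets?", "Yes, two cats."]]
weak_labels = label_with_styles(toy_dialogues, toy_classify)
```

The resulting labels are only as good as the classifier, which is why the augmented dataset is described as weakly labeled.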

The empirical distribution of style types fed to models during training consists of 51% positive, 20% neutral, and 29% negative styles (see Table 10 in the Appendix for a breakdown of style types).

The top 12 styles in each of the datasets in D+ are shown in Table 3 (utterances of ED are split by whether they were said by the Speaker, who talks about an emotional situation, or the Listener, who responds with empathy). Top styles show patterns that reflect the intended focus of each dataset. For instance, the ConvAI2 dataset instructed workers to get to know each other (“Curious”, “Questioning”, “Open”); the Speakers in ED were instructed to talk about emotional situations (“Emotional”, “Appreciative”, “Miserable”, “Anxious”, etc.), and the Listeners to respond with empathy (“Sympathetic”, “Empathetic”, “Kind”, “Caring”, etc.); and the WoW utterances were grounded on knowledge from Wikipedia (“Knowledgeable”, “Scholarly”, “Intelligent”, etc.). The BST dataset, designed to incorporate the conversational skills of the above three datasets, contains top styles from all of them (“Open” from ConvAI2, “Curious” and “Questioning” from ConvAI2 and ED, “Fickle” and “Sympathetic” from ED, “Knowledgeable” and “Obsessive” from WoW, etc.). This provides some empirical validation of the intended focus of each of these datasets, and shows that the trained classifier can usefully tease apart styles that correlate with specific conversational qualities, despite the overall relatively low accuracy on IC itself.

4.3 Conditioned generator fine-tuning (C)

We fine-tune R on D+ with a kind of “style drop-out”: the style label of each example is sometimes joined to the end of the example’s context with a STYLE string, similar to the conditioning in Weston et al. (2018). Starting from the same pre-trained model, we fine-tune three versions, C0, C75, and C100, which are given the appended style for 0%, 75%, and 100% of the training examples, respectively. We generate with beam search with the best setting from Roller et al. (2020b). We do not alter the natural empirical distribution of styles in D+ (e.g., by upsampling under-represented styles) in order to better match natural unconstrained dialogue; however, upsampling could be used for better performance on less frequent styles. Appendices D and E give more details. A random sample of generations is shown in Table 4, with many more generations shown in Appendix I. The examples show that style can be controlled with a clear differentiation between different styles, while keeping the responses both fluent and relevant to the dialogue context. As for what the “style” qualitatively captures, it appears to be a mixture of persona traits, moods, tones, and word choice biases.
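The style drop-out described above can be sketched as a per-example coin flip. This is a minimal illustration under assumptions: the function name and the use of Python's `random` module are hypothetical, and the exact spacing of the separator follows the “ STYLE ” string given in Appendix B. Setting `keep_prob` to 0.0, 0.75, or 1.0 corresponds to the C0, C75, and C100 training regimes.

```python
import random

STYLE_SEPARATOR = " STYLE "  # spacing assumed from Appendix B


def add_style_with_dropout(context, style, keep_prob, rng):
    """Append the style tag to the context with probability `keep_prob`."""
    if rng.random() < keep_prob:
        return context + STYLE_SEPARATOR + style
    return context


rng = random.Random(0)
c0_input = add_style_with_dropout("I just got a puppy!", "Curious", 0.0, rng)
c100_input = add_style_with_dropout("I just got a puppy!", "Curious", 1.0, rng)

# Fraction of examples that keep their style tag under the C75 setting.
num_examples = 10000
kept = sum(
    add_style_with_dropout("hi", "Kind", 0.75, rng) != "hi"
    for _ in range(num_examples)
)
kept_fraction = kept / num_examples
```

Training on a mix of tagged and untagged contexts lets a single model be used both with and without style conditioning at inference time.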

4.4 PPLM inference

Dathathri et al. (2020) exclusively present results on a binary sentiment generation task for demonstrating how PPLM can steer GPT2, using very positive and very negative classes trained on movie reviews from the SST5 dataset (Socher et al., 2013). In order to check that our implementation performs similarly to the original implementation in Dathathri et al. (2020), we first run experiments using that 2-class sentiment space. We then run experiments with our space of 217 IC styles.

The PPLM approach requires a generative model to plug in, with a classifier head on top. We use R fine-tuned on unlabelled data in the relevant domain space – i.e., on SST5 when working on the binary sentiment generation task, or on D when working in the space of open-domain conversations. The classifier head is a linear layer with an input dimension of 2560, and as many output units as classes, fine-tuned either on SST5 or on turns 2 and 3 of IC, with the decoder output averaged across time as in Dathathri et al. (2020). We also fine-tune C75 on SST5 for comparison in the SST5 space (C75-SST5). Additional details and more extensive results are given in Appendix F.
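The classifier head described above (decoder outputs averaged across time, followed by a linear layer) can be sketched in plain Python. This is an illustrative toy, not the trained head: the real input dimension is 2560 and the number of output units equals the number of classes, whereas the dimensions, weights, and function names below are hypothetical and chosen to keep the arithmetic easy to follow.

```python
def mean_pool(decoder_outputs):
    """Average a list of per-token vectors across time."""
    length = len(decoder_outputs)
    dim = len(decoder_outputs[0])
    return [sum(step[j] for step in decoder_outputs) / length for j in range(dim)]


def linear_head(pooled, weights, biases):
    """One logit per class: weights has shape (num_classes, hidden_dim)."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]


# Toy example (assumption): 2 time steps, hidden size 2, 3 classes.
toy_outputs = [[1.0, 2.0], [3.0, 4.0]]
toy_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_biases = [0.0, 0.0, 0.0]

pooled = mean_pool(toy_outputs)
logits = linear_head(pooled, toy_weights, toy_biases)
predicted_class = max(range(len(logits)), key=lambda k: logits[k])
```

Because only this small linear layer is trained, fitting the head is far cheaper than fine-tuning the full generator.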

Table 5 shows that the PPLM approach is much more attractive in terms of resource requirements at fine-tuning time, especially for the binary SST5 space, with much faster convergence and lower memory use. Table 6 shows generation times and percentages of the time when the generation of the model is classified as having the target style. Dathathri et al. (2020) measure accuracies of matching the target style for SST5 using an external classifier fine-tuned on the Large Movie Review Dataset (Maas et al., 2011). Therefore this section provides our experimental results using a classifier fine-tuned on that same dataset, solely for comparison purposes. When conditioning generation on SST5 movie-review ratings, our PPLM results are comparable to the accuracy in Dathathri et al. (2020), while our C75 results are slightly above. In the larger space of styles from the Image-Chat dataset, PPLM inference results in accuracies closer to chance and considerably longer inference time.

Based on this performance differential for our style space and base models, we only consider our C models in the rest of the paper.

4.5 Automated metrics evaluation for C

Table 7 displays the accuracies of C models’ generations at matching target styles, and the perplexities of those generations. We test generations with contexts from the test sets of both BST and IC, and for each generation we condition on one of the IC styles present in the training set of D+. See Table 10 in the Appendix for more details. We choose the distribution of target styles for these generations in two ways: matching the empirical distribution of styles that the models were fine-tuned on, and uniformly across all styles. For both distributions, we produced roughly 21,500 generations, or roughly 100 generations per target style on average.
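The two ways of choosing target styles can be sketched with a single sampling helper. This is a minimal illustration under assumptions: the style names, counts, and sample size below are toy values rather than the statistics of D+, and `sample_target_styles` is a hypothetical name.

```python
import random
from collections import Counter


def sample_target_styles(styles, num_samples, rng, weights=None):
    """Draw target styles uniformly, or proportionally to `weights` if given."""
    return rng.choices(styles, weights=weights, k=num_samples)


toy_styles = ["Curious", "Kind", "Sarcastic"]
toy_training_counts = [8, 1, 1]  # hypothetical empirical style frequencies

rng = random.Random(0)
uniform_targets = sample_target_styles(toy_styles, 3000, rng)
empirical_targets = sample_target_styles(
    toy_styles, 3000, rng, weights=toy_training_counts
)

uniform_counts = Counter(uniform_targets)
empirical_counts = Counter(empirical_targets)
```

Under uniform sampling, rare styles are evaluated as often as common ones, which makes that setting a harder test of style control than sampling from the training distribution.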

For C75 and C100 conditioned on style, accuracies of matching the target style range from 23% to 32% on the training distribution of styles and from 11% to 19% uniformly across styles. C0 performs at chance on the uniform distribution, and a bit over chance when following the empirical style distribution. Note that the 11.7% accuracy result on BST for C75 tested on a uniform style distribution differs from the 7.1% result in the comparison with PPLM (Table 6). The generation settings are different: in particular, in the comparison with PPLM, the first few words of the generated text are directly copied from the gold response, which has a random, arbitrary style (and not the target style), before the model starts generating from C75.

Perplexity of generations.

For the C100 and C75 models, we report accuracies and perplexities both with and without style conditioning during generation. Perplexities of generations were computed using a separate 90M-parameter generative model pretrained on pushshift.io Reddit and fine-tuned on the four dialogue datasets listed in Section 3.1 (Roller et al., 2020b). Perplexity gets slightly worse from training with style conditioning, but this effect is mitigated by the style drop-out used for training C75, for which perplexities are very close to C0 when no style conditioning is used.
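For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that computation under assumptions: the per-token natural-log probabilities would come from the separate scoring model, and here they are replaced by hypothetical toy values.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example (assumption): every token is assigned probability 0.25.
toy_log_probs = [math.log(0.25)] * 4
toy_perplexity = perplexity(toy_log_probs)
```

A model that assigns probability 0.25 to every token has a perplexity of 4, i.e., it is as uncertain as a uniform choice among four options; lower values indicate generations that the scoring model finds more predictable.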

Perplexity of BST test set.

To gauge whether predicting the style of an utterance can help with generation, we compare perplexities of the BST test set as measured by our models, as a function of whether generation is conditioned on a label, and what classifier was used to produce that label. Results are shown in Table 8. For the Previous utterance result, we fine-tune R on D+ as a style classifier. The styles used to train this classifier were obtained as described in Sec. 4.2, and the classifier uses only the previous utterance as context. BST test-set examples labeled with the styles predicted by this classifier have higher perplexities than examples using no styles at all, reflecting the fact that a single utterance is only a weak predictor of the style of the following utterance. However, perplexities are lower when style labels are predicted using classifiers trained on (utterance, style) pairs from turns 2 and 3 of Image-Chat, implying that these style labels convey meaningful information about the utterances. The perplexities drop slightly lower when the classifier uses both the current and previous utterances, indicating that the previous utterance may contain a bit of contextual information that is useful for predicting the appropriate style label.

4.6 Human evaluation

Table 9 gives the results of crowdsourced human ratings of our models. In line with our automated metrics from Section 4.5 showing our models’ ability to effectively use style labels during generation, evaluators correctly identify the target style of our models 34% to 42% of the time when the model is conditioned on that style label, but only 14% to 19% of the time when the style label is not used during generation. Scores on other metrics (empathy, relevance of response, humanness, and engagingness) are largely unchanged when conditioning on styles or not, except for humanness, which decreases somewhat when conditioning. Accuracy differences are statistically significant for every possible pairing of a style-conditioned and an unconditioned model. The difference in humanness score between C75 with and without conditioning is significant, as is the difference in humanness between C75 with conditioning and C0 without. All other differences are not significant. Additional experimental details can be found in Section G of the Appendix.

Discussion

This work explored ways to combine state-of-the-art open-domain conversational architectures with style control for a reasonably large set of styles (217). These methods have different advantages. The retrieve-and-style-transfer approach we tried yielded weaker style control compared to conditioned generation without retrieval; however, combining retrieval with style transfer would allow the use of out-of-the-box style transfer methods without fine-tuning, and transfer into many different style spaces. The PPLM-style approach is considerably cheaper at train time; however, it does not perform very well for larger style spaces, and inference is much slower. The conditioned generation approaches we tested can convincingly generate sets of varied conversational replies that display the desired style, with barely any cost in terms of other conversational metrics, as shown through automatic and human evaluation, and as is evident in sample generations. While we focused on a specific set of styles, our approach should generalize to any set of styles for which a classifier is available, by following the procedure of labeling dialogue datasets with the classifier and fine-tuning on that weakly labeled set. Future work will extend this approach to unsupervised style spaces and styles directly inferred from a conversational partner. Another promising direction is to investigate whether certain utterance-level style trajectories in conversations are particularly appealing in a conversational agent or maximize a specific conversational goal, for example by using reinforcement learning techniques to learn optimal policies in the space of style sequences.

References

Appendix A Image-Chat styles

The partition of Image-Chat styles by type is given in Table 10.

Appendix B Retrieve-and-style-transfer architecture and training

The retrieve-and-style-transfer architecture we use is the retrieve-and-refine architecture from Roller et al. (2020b), but fine-tuned on Image-Chat examples together with their style tags. The architecture consists of (1) a retriever model used to select an appropriate response given candidates, and (2) a generator model in which the retriever’s response and the attribute of the gold response are appended to the context string during training. The retriever model is a 660M-parameter Poly-encoder, consisting of two Transformer encoders for context strings and candidate responses, whose outputs are attended over to produce a ranking of candidates (Humeau et al., 2020). The model has 24-layer encoders, 16 attention heads, an embedding size of 1024, a feed-forward size of 4096, and 64 Poly-encoder context codes. The model is pretrained on a previously existing third-party Reddit dump that was hosted by pushshift.io (Baumgartner et al., 2020) and fine-tuned on ConvAI2, ED, WoW, BST, and turns 2 and 3 of Image-Chat. For ConvAI2, ED, and WoW, we fine-tune on versions of the datasets to which persona strings (like in ConvAI2) and conversational topics (like in WoW) have been added if they are not already present, as in Smith et al. (2020). This is done to better match the contexts of these three datasets to each other and to those of BST, which includes persona strings and often WoW topics in its contexts.

To fine-tune the retriever, we tune both the learning rate and the relative training weights of the datasets, and we use accuracy at retrieving the gold response as our validation metric. After retriever fine-tuning, we cache retriever responses for all datasets that we wish to fine-tune our generator on, to speed up generator fine-tuning.

Our generator model uses the same architecture as in Section 4.3. During training, the cached retriever response for each example is joined to the end of the context string with a “ RETPRED ” string, and the attribute of the gold response is joined to the end of that with a “ STYLE ” string, similar to how retriever responses are handled in Weston et al. (2018). We fine-tune the generator on the same five datasets as the retriever; however, among these datasets only Image-Chat contains attribute labels, and so the model did not see attribute strings appended to contexts when training on examples from the other four datasets.

To fine-tune the generator, we sweep the learning rate and the relative training weights of the datasets, as well as the fraction of the time that the gold response is appended to the context in place of the retrieved response in order to teach the generator to sometimes copy that response. During generation, candidates from ConvAI2, ED, WoW, and BST are ranked by the retriever, and the top retrieval candidate and target attribute are appended to the end of the context. An example conversation from our retrieve-and-style-transfer generator is shown in Table 11.
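The context construction described in this appendix can be sketched as follows. This is a minimal illustration rather than the training code itself: the function name `build_generator_context` and the parameter `gold_fraction` are hypothetical, and only the “ RETPRED ” and “ STYLE ” separator strings come from the text.

```python
import random

# Separator strings as quoted in the text.
RETPRED = " RETPRED "
STYLE = " STYLE "


def build_generator_context(context, retrieved, gold, style=None,
                            gold_fraction=0.0, rng=random):
    """Append a retrieved (or gold) response, then an optional style tag.

    gold_fraction is a hypothetical name for the swept fraction of the time
    that the gold response replaces the retrieved response during training.
    """
    # rng.random() lies in [0, 1), so 0.0 never swaps and 1.0 always swaps.
    appended = gold if rng.random() < gold_fraction else retrieved
    out = context + RETPRED + appended
    # Only Image-Chat examples carry a style; other datasets pass style=None.
    if style is not None:
        out += STYLE + style
    return out
```

At generation time the same function would be called with `gold_fraction=0.0`, the top retrieval candidate as `retrieved`, and the target attribute as `style`.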

Appendix C Attribute classifier architecture

The attribute classifier trained on the Image-Chat attribute space consists of R (a 2.7B-parameter Transformer model from Roller et al. (2020b), pre-trained on a previously existing third-party Reddit dump that was hosted by pushshift.io (Baumgartner et al., 2020)), with an added linear layer with a hidden dimension of 2560 on top of the decoder output. We fine-tune all weights on turns 2 and 3 of the Image-Chat training set, using the provided labels. (We do not train on turn 1, which relies more centrally on the image, according to Shuster et al. (2018).)

Appendix D Conversation context given to the model during dialogue fine-tuning

For ConvAI2, ED, and WoW, we fine-tune on versions of the datasets in which persona strings and conversational topics have been added to all contexts, as in Section B. These contexts are better matches to the contexts used in the human evaluations of Section 4.6, in which two persona strings are assigned to both the human and to the bot during conversation. During training, examples are sampled from the ConvAI2, ED, WoW, and BST datasets with a ratio of 1:2:1:1, adopted from models trained on these datasets in Smith et al. (2020).
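As a small illustration of the 1:2:1:1 sampling ratio, the sketch below draws the source dataset for each training example with weighted sampling. The function name `sample_dataset` is hypothetical, and per-example weighted sampling is an assumption about how such a ratio could be realized.

```python
import random

# Relative sampling weights from the 1:2:1:1 ratio stated in the text.
DATASET_WEIGHTS = {"ConvAI2": 1, "ED": 2, "WoW": 1, "BST": 1}


def sample_dataset(rng=random):
    """Pick the dataset that the next training example is drawn from."""
    names = list(DATASET_WEIGHTS)
    weights = [DATASET_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Under this scheme, ED examples would make up about 2/5 of the training stream and each of the other three datasets about 1/5.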

Appendix E Training the style-conditioned model

We fine-tune three models, for which the style label is randomly appended to the context string 100%, 75%, and 0% of the time during training. For each training example, a random number in the unit interval is drawn to determine whether to append that example’s style label to its context string, given the specified probability. The 0%-probability model (C0) serves as a baseline for the 100%-probability model (C100), and the 75%-probability model (C75) allows for generation in which a style label is useful but not required, because the C75 model has been exposed to both cases during fine-tuning. Models were trained with a batch size of 128 and 8 GPUs, and the learning rate was swept in the range of 3e-6 to 7e-5, with perplexity used as the validation metric. The C100 model converged in 8.7 hours, the C75 in 9.6 hours, and the C0 in 22.5 hours; however, the C0 model had a slightly lower learning rate, which likely resulted in the longer training time.
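The random label-appending step can be sketched as below. The function name `maybe_append_style` is hypothetical, and reusing the “ STYLE ” separator from Appendix B for these models is an assumption; only the draw in the unit interval and the three probabilities come from the text.

```python
import random


def maybe_append_style(context, style, prob, rng=random):
    """Append the style label to the context with probability `prob`."""
    # Draw a number in the unit interval [0, 1) and compare it to prob:
    # prob=1.0 (C100) always appends, prob=0.0 (C0) never appends.
    if rng.random() < prob:
        # The " STYLE " separator is assumed here, borrowed from Appendix B.
        return context + " STYLE " + style
    return context
```

With `prob=0.75`, roughly three quarters of training contexts carry a label, which is what exposes the C75 model to both the labeled and the unlabeled case.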

For style-controlled generation with our fine-tuned models, we use beam search with a beam size of 10, a minimum beam length of 20, and $n$-gram blocking of size 3 in both the beams and the context, following Roller et al. (2020b). Generations take roughly 2.0 seconds per generation, with a batch size of 32 across 4 GPUs, and generation speeds are roughly equivalent with and without style conditioning.
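To make the $n$-gram blocking constraint concrete, the sketch below computes which next tokens would be disallowed for a partial hypothesis. It is a simplified stand-alone illustration, not the beam-search implementation used for the experiments; the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, context, n=3):
    """Tokens that would complete an n-gram already seen in the
    partial generation (beam) or in the context."""
    if n < 2:
        raise ValueError("n must be at least 2 in this sketch")
    if len(generated) < n - 1:
        return set()
    # The last n-1 generated tokens form the prefix of the next n-gram.
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for seq in (generated, context):
        for i in range(len(seq) - n + 1):
            if tuple(seq[i:i + n - 1]) == prefix:
                banned.add(seq[i + n - 1])
    return banned
```

During beam search, the scores of the returned tokens would be set to negative infinity before the next expansion step, so that no trigram from the beam or the context is repeated.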

Appendix F PPLM comparison

The metrics and evaluation datasets in this section follow Dathathri et al. (2020). Since the SST-5 dataset (Socher et al., 2013) consists of review/rating pairs without any context strings, only a “__SILENCE__” string is passed into the encoder during fine-tuning of the generators and during classifier-head tuning. The 15 2-to-5-word prefixes in Dathathri et al. (2020) are used at the beginnings of generations, as was done in that work. The learning rate is swept from 2e-6 to 3e-5 during generator fine-tuning and from 2e-3 to 3e-1 during classifier-head tuning. Like Dathathri et al. (2020), we pick tokens at each timestep for C75 and PPLM by sampling the token distribution with top-$k$ filtering ($k=10$, Fan et al. (2018)); unlike Dathathri et al. (2020), however, we stop a generation when it hits an end-of-sentence token, as in Roller et al. (2020b). For PPLM, we find that varying the step size of gradient updates leads to a trade-off between increased attribute control and degeneration of the output utterances; we tune the step size in the range of 0 to 0.1 (where step size is defined by $\alpha$ in Dathathri et al. (2020)), and we find that a step size of 0.07 leads to maximum average accuracy of the target attribute. Final numbers come from re-running generations with that step size and a different seed.
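Top-$k$ sampling at a single timestep can be sketched in plain Python as follows. This assumes the model's next-token scores are available as a list of logits; the function name `top_k_sample` is hypothetical and the code is illustrative only.

```python
import math
import random


def top_k_sample(logits, k=10, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

Tokens outside the top $k$ receive zero probability, which trims the low-probability tail of the distribution while still allowing varied generations.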

More complete results comparing our C75 and PPLM models, using the SST-5 movie-reviews dataset, are shown in Table 12. We report metrics under different experimental conditions, taken from Dathathri et al. (2020) (with the exception of B*):

B: take the mean over 10 generations for each target attribute (implemented here by “perturbing” attention activations with a step size of 0)

B*: produce 10 groups of 10 generations for each target attribute. For each generation, calculate the mean of the Dist-1, Dist-2, and Dist-3 scores, which measure token diversity (Li et al., 2016); throw out the generation if this mean (Dist) is below a certain threshold (here, 0.75, in order to retain at least one generation per group); and average over the first remaining generation from each group. (Dathathri et al. (2020) used a threshold of 0.9 to filter generations by Dist score. One hypothesis for why our generations tended to have lower Dist scores is that our generations’ average token length is much shorter than that found in Dathathri et al. (2020), and the Dist-$n$ metric is weakly length-dependent: it consists of a numerator enumerating unique $n$-grams and a denominator counting the total number of generated tokens (Li et al., 2016).)

BR: after Dist filtering, rank the remaining generations in each group according to classifier-head loss (for PPLM), and average over the lowest of each group

BC: use iterative tweaking of latent activations (Section 2.3, for PPLM only) to produce 10 generations per target attribute

BCR: produce 10 groups of 10 generations per target attribute, all using tweaking of latent activations; filter by Dist score; and pick the generation with the lowest classifier loss score in each group
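The Dist filtering and ranking steps used in the B*, BR, and BCR conditions can be sketched as below. This follows the description above (unique $n$-grams over total generated tokens, averaged over $n = 1, 2, 3$); the function names and the tokenized-list input format are assumptions made for illustration.

```python
def dist_n(tokens, n):
    """Unique n-grams divided by the total number of tokens."""
    if not tokens:
        return 0.0
    ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(ngrams) / len(tokens)


def mean_dist(tokens):
    """Mean of Dist-1, Dist-2, and Dist-3 (the 'Dist' score)."""
    return sum(dist_n(tokens, n) for n in (1, 2, 3)) / 3


def pick_from_group(group, threshold=0.75, losses=None):
    """Filter a group of generations by Dist, then pick one.

    Without losses: the first remaining generation (as in B*).
    With losses: the remaining generation with the lowest
    classifier loss (as in BR and BCR).
    Returns None if every generation is filtered out.
    """
    kept = [i for i, g in enumerate(group) if mean_dist(g) >= threshold]
    if not kept:
        return None
    if losses is None:
        return group[kept[0]]
    return group[min(kept, key=lambda i: losses[i])]
```

Because the Dist-2 and Dist-3 numerators are capped at one and two fewer than the token count, a four-token generation with no repetition scores only 0.75, which illustrates how short generations can fall under a 0.9 threshold without being repetitive.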

Following Dathathri et al. (2020), we compute the mean across 90 generations for each row of the table: 3 generations each for 15 possible generation prefixes, for both target attributes (“very positive” and “very negative”). As in Dathathri et al. (2020), we measure accuracies of matching the target attribute using an external classifier fine-tuned on the Large Movie Review Dataset (Maas et al., 2011), which we use solely for comparison purposes. The model conditioned on attribute labels during fine-tuning (C75) achieves higher accuracies and smaller generation times than the model employing generation-time modification of activations (PPLM), but ranking generations by classifier-head loss improves PPLM accuracies quite a bit.

F.2 Experiment with Image-Chat attributes and BST contexts

Dathathri et al. (2020) use GPT-2 (Radford et al., 2019) as their base generator model, and because GPT-2 has no encoder, there is no context string passed to the model during inference (Radford et al., 2019, 2018). However, our encoder/decoder-based Transformer generator was pretrained with Reddit context strings always passed into the encoder (Roller et al., 2020b). Thus, during generation, we pass BST test-set context strings into the encoder of our C0-based generator that we use for our PPLM-style baseline (PPLM), as well as into the encoder of our generator fine-tuned with attribute conditioning (C75). When performing inference-time attribute-controlled generation, Dathathri et al. (2020) prefix all generation strings with one of 15 phrases, each consisting of a few words (“Once upon a time”, “The painting”, etc.); however, since such phrases would typically be unexpected given context strings from the BST dataset, we instead use the first three words of the gold BST response as a prefix to generate the rest of the utterance from, for both C75 and PPLM. For both models, we loop through the same randomly shuffled list of 217 Image-Chat styles as our target attributes, so that both models see the same combinations of BST context and target attribute. The step size $\alpha$ is swept from 0 to 0.24, and 0.06 has the maximum average accuracy of the target attribute. We then re-generate with a different seed in order to get final numbers.

Table 13 gives the results of the comparison between the C75 and PPLM models when generating using target attributes from Image-Chat and contexts from the BST test set, and when starting generations with the first three words from the gold BST response. We see that the C75 model exhibits higher accuracies at matching the target attribute and a much faster mean generation time, due to not iteratively shifting activations during generation, at the cost of having to label dialogue datasets with an attribute classifier and then fine-tune on those datasets. However, accuracies for the PPLM model are improved when ranking generations by classifier loss, matching the analogous results found in Dathathri et al. (2020). Here, a threshold of 0.85 is used to filter generations by Dist score. The mean total number of tokens per generation is fairly similar for both models, as are the mean Dist scores, implying that both sets of generations have roughly the same amount of repetition.

Appendix G Details of human evaluations

For our human evaluations, shown in Table 14, human evaluators are asked to answer the following questions:

“Which of the following personalities do you think your partner was trying to emulate?” (Evaluators are shown 5 style labels, one of which is the one that the model is conditioned on.)

“Did the responses of your partner show understanding of your feelings?”

“How relevant were your partner’s responses to the conversation?”

“How human did your conversation partner seem?”

“Overall, how much would you like to have a conversation with this partner?”

For all questions other than the first one, evaluators answer on a Likert scale from 1 to 5. Target styles are randomly selected from the following list of 10 styles, 5 from the “positive” category and 5 from the “neutral” category: Knowledgeable, Sympathetic, Businesslike, Rustic (Rural), Absentminded, Complex, Appreciative (Grateful), Youthful, Emotional, and Casual. These 10 styles were chosen because they are very frequent in the generator training data, are not synonymous, and cannot simply be understood as capturing question-asking (Curious, Questioning). When asking evaluators to identify the correct target style out of a list of 5 options, “Knowledgeable” and “Complex” are never shown together because they were judged to not be sufficiently distinguishable. Between 110 and 130 HITs were run per model. Ratings have standard errors of the mean in the range of 0.07 to 0.11. Accuracy differences between each of the two style-conditioned models and each of the two non-style-conditioned models are statistically significant ($p<0.05$, two-tailed Fisher’s exact test), as are differences in being human-like between the C75 model with style labels and the C75 and C0 models without style labels ($p<0.05$, $t$-test for the means of two independent samples). Differences in being human-like between the C100 model and other models are not significant, nor are any differences in the empathy, relevance, and engagingness metrics among models.
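For readers unfamiliar with the accuracy comparison, a two-tailed Fisher’s exact test on a 2×2 table of correct and incorrect style identifications for two models can be sketched as follows. This is a generic textbook implementation and not the analysis script used for the paper; the function name is hypothetical, and the two-tailed convention assumed here sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed p-value for the 2x2 table [[a, b], [c, d]].

    For example, rows could be two models and columns the counts of
    correct and incorrect target-style identifications.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table at most as probable as the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

A difference between two models would be reported as significant at the level used above when the returned value is below 0.05.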

Appendix H Results for positive, neutral, negative styles

When cutting style accuracies and perplexities of the models’ generations by the category of the target style (Table 15), we see that “positive” styles in aggregate have higher accuracies than “neutral” or “negative” ones, likely owing to the positive styles’ slim majority in the distribution of styles seen during fine-tuning (Section 4.2).

Appendix I Random model generations

The following pages show generations for randomly selected contexts from the BST test set.