Plug and Play Language Models: A Simple Approach to Controlled Text Generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu

Introduction

The Transformer architecture (Vaswani et al., 2017) has enabled large-scale language models (LMs) trained on a huge amount of data (Radford et al., 2019; Dai et al., 2019b; Radford et al., 2018b) to greatly improve the state-of-the-art on natural language processing tasks. These models are used to extract contextualized word embeddings for transfer learning purposes (Devlin et al., 2019) and as natural language generators. The latter can leverage large amounts of unannotated data and a simple log-likelihood training objective. However, once such models are trained, controlling attributes of generated text becomes difficult without modifying the model architecture to allow for extra input attributes or fine-tuning with attribute-specific data (Keskar et al., 2019; Ziegler et al., 2019).

Controllable generation entails modeling $p(x|a)$ , where $a$ is some desired controllable attribute(s) and $x$ the generated sample. However, generative models only learn $p(x)$ . In computer vision, Plug & Play Generative Networks (PPGN) from Nguyen et al. (2017) developed a mechanism for generating images with different attributes by plugging a discriminator (attribute model) $p(a|x)$ together with a base generative model $p(x)$ and sampling from the resulting $p(x|a)\propto p(a|x)p(x)$ , effectively creating a conditional generative model on the fly from any supplied attribute model. In a similar manner, we propose the Plug and Play Language Model (PPLM) for conditional language generation that combines one or more simple attribute models $p(a|x)$ —either in the form of a bag-of-words (BoW) or single layer classifiers—with a pre-trained, unconditional language model $p(x)$ . We sample from the resulting combined model by following gradients in the latent representation space in a manner inspired by the approximate Metropolis-adjusted Langevin (MALA) (Roberts et al., 1996; Roberts & Rosenthal, 1998) sampler deployed in Nguyen et al. (2017).

Optimization is performed ex post facto in the activation space, therefore no re-training or fine-tuning is needed. Control is fine-grained, with a strength parameter determining how strong the attribute influence should be; a strength of fully recovers the original model $p(x)$ . This design allows vast flexibility: users can combine a state-of-the-art generative model, which may be large and difficult to train, with any number of attribute controllers. Attribute models may be easier to train or untrained (in the case of BoW models), and multiple controllers may be combined flexibly during inference. In this paper, we demonstrate the PPLM approach using a GPT-2 345M model (Radford et al., 2019) as the general-purpose LM $p(x)$ , but the method applies in any representation space from any transformer-based text generator and allows combination with any attribute model $p(a|x)$ .

We demonstrate controlled generation with a number of attribute controllers, assembled and combined during generation, each with a different strength, acting as a set of “control knobs” that tune generation towards the desired attribute (see examples in Table 1). Code for the experiments is available at: https://github.com/uber-research/PPLM. Our key contributions are:

We introduce the Plug and Play LM for controlled language generation, discuss its relation to existing work, and how sampling from a PPLM works (Sections 2 and 3).

We demonstrate controlling of text generation on a range of attributes, including 7 topics each defined using a bag of words, and 1 simple discriminator on sentiments. We quantify effectiveness using both automated evaluation (separately trained perplexity and sentiment models) as well as human evaluation (for attribute relevance and fluency). All evaluations point toward the ability of PPLMs to generate attribute controlled, fluent text (Section 4).

We compare PPLM with CTRL (Keskar et al., 2019) and GPT-2 finetuned for positivty (Ziegler et al., 2019). Our method, without any LM training, is on par and often outperforms the baselines on attribute relevance and fluency (Section 4.2, and Section 4.3).

We show that the PPLM approach can be used to detoxify instances where generation of toxic content is likely by following the negative gradient of a model trained to detect toxicity (Section 4.4). We also show how PPLM can be used for structurally constrained story writing (Section 4.5).

Related Work

Current methods for controlled text generation involve either fine-tuning existing models with Reinforcement Learning (RL) (Ziegler et al., 2019), training Generative Adversarial Networks (Yu et al., 2017), or training conditional generative models (Kikuchi et al., 2016; Ficler & Goldberg, 2017). Different from our approach, these methodologies are not plug and play, since the entire model needs to be separately fine-tuned for each specific attribute. Keskar et al. (2019) train a large language model with over 50 different control codes. The results are high quality because they train exactly to maximize $p(x|a)$ , but this comes at the expense of fixing control codes upfront and of training a very large model (1.6B parameters). Our method does not require retraining any conditional generative model, and both the language model and the conditional model can be flexibly assembled. Table 2 gives a comparison of recent approaches to language modeling tuned for specific attributes. In another interesting but tangential piece of work, Subramani et al. (2019) recently showed that a pre-trained language model can be steered to recover arbitrary sentences. In earlier works Gu et al. (2016; 2017); Chen et al. (2018) explored the idea of using a small neural network to steer an LM.

Noisy Channel Modeling

Weighted decoding

Holtzman et al. (2018); Ghazvininejad et al. (2017) consider controlled language generation – the former with discriminators, and the latter with a bag of words – where the decoding procedure is modified to consider the scoring function used for decoding. See et al. (2019) note that control with weighted decoding (WD) is difficult and often leads to sacrificing fluency and coherence. Further, Ghazvininejad et al. (2017) strongly relies on sampling from a set of keywords on a specific topic and it does not allow to bias generation towards a topic in a manner that does not necessary include a set of keywords. Similarly, Baheti et al. (2018) proposed a decoding strategy for generating interesting responses in dialogue systems, using bags of words and word embeddings. Sophisticated sampling methods (Metropolis et al., 1953) can be used to constrain the model generation to certain keywords and topics. We evaluate WD as a baseline.

Text Style Transfer

Outside of language modeling, the text style transfer studies a related task. Shen et al. (2017); Hu et al. (2017) train variational auto-encoders for style transfer that rely on learning disentangled latent representations for style and content. Li et al. (2018) demonstrate the efficacy of a simple approach based on replacing attribute related n-grams with n-grams corresponding to the desired attribute based on a conditional generative model. A key difference between the above and our approach is that we use an offline discriminator and perform optimization based on this discriminator, which as suggested by Elazar & Goldberg (2018) may outperform adversarial training approaches. More recently, Lample et al. (2019) adapt an approach from unsupervised language translation to style transfer, where a denoised auto-encoder is trained with an objective consisting of a weighted combination of a re-construction loss and a back-translation loss. While the above approaches have shown impressive success on style transfer tasks, the main focus is not controlled language generation, and further, the methods are not plug and play.

Plug and Play Language Models

Given a sequence of tokens $X=\{x_{0},\cdots,x_{n}\}$ , LMs are trained to compute the unconditional probability of the sequence $p(X)$ . This probability can be rewritten in terms of product of conditional probabilities by recursively applying the chain-rule (Manning et al., 1999; Bengio et al., 2003) as:

In this paper, we use a transformer (Vaswani et al., 2017) to model the distribution of natural language. To present our approach clearly, we first briefly summarize the transformer using recurrent notation. Let us define the history matrix $H_{t}$ to consist of the key-value pairs from the past i.e $H_{t}=[(K_{t}^{(1)},V_{t}^{(1)}),\cdots,(K_{t}^{(l)},V_{t}^{(l)})]$ , where $(K_{t}^{(i)},V_{t}^{(i)})$ corresponds to the key-value pairs from the $i$ -th layer generated at all time-steps from 0 to $t$ . Efficient implementations of the transformer (Wolf et al., 2019) use the cached $H_{t}$ to generate $x_{t+1}$ , given $x_{t}$ . This recurrent interpretation of a transformer can be summarized as:

where $W$ is a linear transformation that maps the logit vector $o_{t+1}$ to a vector of vocabulary size, and then $x_{t+1}$ is sampled as $x_{t+1}\sim p_{t+1}=\text{Softmax}(Wo_{t+1})$ . This allows for efficient language generation without repeated forward passes corresponding to the prior conditioning text $x_{0},\ldots,x_{t-1}$ .

2 Steering generation: ascending logp(a|x\log p(a|x)

In order to control the output of the language model, at every generation step $t$ , we shift the history $H_{t}$ in the direction of the sum of two gradients: one toward higher log-likelihood (LL) of the attribute $a$ under the conditional attribute model $p(a|x)$ and one toward higher LL of the unmodified language model $p(x)$ . Combining these factors with a variable multiplier provides us with a controllable “knob” to guide generation in a given direction with a specified strength. The updates are restricted to $H_{t}$ and not the other model activations because future predictions depend on the past only via $H_{t}$ (note that $H_{t}$ is composed of all transformer key and value pairs generated up to time $t$ ). Taking steps in $H_{t}$ space leads to gradual changes to model activations — which may be thought of as gradual reinterpretations of the past — that guide future generation in the desired direction.

Let $\Delta{H}_{t}$ be the update to $H_{t}$ , such that generation with $(H_{t}+\Delta{H}_{t})$ shifts the distribution of the generated text such that it is more likely to possess the desired attribute. $\Delta{H}_{t}$ is initialized at zero and updated with gradients from an attribute model that measures the extent to which the generated text possesses the desired attribute (e.g. positivity). We rewrite the attribute model $p(a|x)$ as $p(a|H_{t}+\Delta{H}_{t})$ and then make gradient based updates to $\Delta{H}_{t}$ as follows:

where $\alpha$ is the step size, $\gamma$ is the scaling coefficient for the normalization term.One normalization term is computed for each layer of the transformer. This update step can be repeated $m$ times; in practice we use $3$ to $10$ . Subsequently, a forward pass through the LM with the updated key-value pairs is performed to obtain the updated logits $\widetilde{o}_{t+1}$ as $\widetilde{o}_{t+1},H_{t+1}=\text{LM}(x_{t},\widetilde{H}_{t})$ , where $\widetilde{H}_{t}=H_{t}+\Delta{H}_{t}$ . The perturbed $\widetilde{o}_{t+1}$ is then used to generate a new distribution $\widetilde{p}_{t+1}$ as in Equation 2.

3 Ensuring fluency: ascending logp(x\log p(x)

The approach described in the previous section is able to generate text tuned for a particular discriminator, but left unchecked it will quickly result in unrealistic adversarial or fooling examples (Szegedy et al., 2013; Nguyen et al., 2015) as the text moves into low probability regions. To combat this, we use the unconditional language model in two ways that ensure the fluency is maintained at or near the level of the unconditional language model (here GPT-2).

We update $\Delta{H}_{t}$ to minimize the KL divergence between the output distribution of the modified and unmodified language models in addition to the step above. In practice, this is accomplished by adding the quantities together before taking a gradient, though it can be visualized as two separate steps as in Figure 2. We scale the KL coefficient by a scalar $\lambda_{KL}$ , and in practice, setting this hyperparameter to 0.01 works well in general across tasks.

Post-norm Geometric Mean Fusion

In addition to minimizing KL divergence, which affects the past via $\Delta{H}_{t}$ , we perform post-norm fusion similarly to Stahlberg et al. (2018). This does not directly affect $\Delta{H}_{t}$ ; rather, it just serves to constantly tie the generated text to the unconditional $p(x)$ LM distribution. We accomplish this by sampling from $x_{t+1}\sim\frac{1}{\beta}\left(\widetilde{p}_{t+1}^{\gamma_{gm}}\,p_{t+1}^{1-{\gamma_{gm}}}\right)$ , where $p_{t+1}$ and $\widetilde{p}_{t+1}$ are the unmodified and modified output distributions, respectively, and $\beta$ is a normalizing factor such that it forms a valid distribution. As $\gamma_{gm}\to 1$ this converges to the distribution from the updated LM, and as $\gamma_{gm}\to 0$ it converges to the unconditional LM distribution. We find that in practice values for $\gamma_{gm}$ in the range $0.8-0.95$ work well.

4 Sampling and Ranking

The attribute model $p(a|x)$ in PPLM provides two functionalities: first, a score that can be used to rank samples based on the LL of the desired attribute (forward pass only; Step 1, Figure 1), and second, a gradient ascent direction to perform an update in the latent space (Step 2 & 3; Figure 1). The former can be used to generate $r$ samples and rank them to choose the best one. This can serve as an additional method for attribute control in addition to sampling with updated latents. Further, to avoid the problem of repetitive, low quality text (Holtzman et al., 2018), we compute the mean over the Dist-1, Dist-2 and Dist-3 scores (for the generated passage), which is an indicator of repetitiveness (Li et al., 2015), and then discard samples with a mean score below a threshold $\tau$ .

Experiments, Results, and Evaluation

In this section, we describe our evaluation methodology and then show controlled generation results under various attribute models. We also show use cases of PPLM in language detoxification and in controlled story telling. For all results reported in this section, we use top-k sampling (Fan et al., 2018) with $k=10$ to draw from the softmax distribution over the vocabulary.

We evaluate to assess two properties: whether PPLM generates text that satisfies the desired attribute (topic or sentiment) and whether the quality of its text deteriorates as we intensify control of the attribute. Note we can always turn the control knob down to zero to disable control of attributes and reach the fluency of the original model. If desired, a user can tune the knobs at inference until a chosen tradeoff between attribute strength and fluency is reached. We evaluate using both automated methods and human annotators:

Automated Eval. Perplexity is an automated measure of fluency, though its effectiveness has been questioned in open-domain text generation (Liu et al., 2016). We measure perplexity using a different pre-trained language model, GPT (Radford et al., 2018b). The diversity of text in the passages is measured using the number of distinct n-grams (normalized by the length of text) as in Li et al. (2015). We report Dist-1, Dist-2, and Dist-3 scores for the distinct 1-2-3-grams (measured across all samples generated for a given attribute control task, e.g. a specific topic for topic control). Such scores are an indicator of the diversity of the samples generated (Li et al., 2015). We aslo use external sentiment classifiers for sentiment evaluation.

Human Eval. We consider two types of human annotation: fluency and A/B testing on attribute relevance. Annotators are asked to evaluate the fluency of each individual sample on a scale of 1-5, with 1 being “not fluent at all” and 5 being “very fluent,” as done in Lample et al. (2019). In the A/B testing for attribute relevance, we consider all combinatorial pairs of all four variants: B, BR, BC, and BCR (6 combinations). We then ask annotators to rank the pair on the desired attribute (e.g. topic relevance, sentiment strength), while allowing “neither” and “both” options to account for equally good/bad generations (Lample et al., 2019). We obtain annotations from nine external occupational annotators. Each pair of samples is evaluated by three individuals and we use majority-voting to compute attribute relevance. For fluency, we use average of the three annotations. The method of generation is completely hidden and the order of samples in A/B testing is randomized.

Ablation study and baselines. We conduct an ablation study with four variants: B: the baseline, unchanged GPT-2 LM, sampled once; BR: B but sampled $r$ times, with best sample chosen based on the LL ranking and filtering based on Dist score; BC: update the latent representations $(\widetilde{H}_{t})$ and then sample once; and lastly BCR: update the latent representations $(\widetilde{H}_{t})$ and generate $r$ samples, choose the best sample based on the LL score (after filtering out samples with low Dist scores). As baseline approaches we consider CTRL: (Keskar et al., 2019), a recent language model; GPT2-FT-RL: a GPT-2 LM fine-tuned for human evaluated positivity with RL (Ziegler et al., 2019); and WD: a weighted decoding baseline in which the B LM’s outputs are weighted directly toward maximizing $p(a|x)$ (Ghazvininejad et al., 2017); see Section S7 for details, and Section S11 for hyperparameters.

2 BoW attribute models

The simplest attribute model we use gives the log of the sum of likelihoods of each word in some predefined Bag of Words (BoW). Given a set of keywords $\{w_{1},\cdots,w_{k}\}$ that specify a topic of interest and the output distribution of the language model $p_{t+1}$ , the log likelihood is:

We construct BoWs that represent seven distinct topics: science, military, legal, computers, space, politics, and religion (see Section S17 for complete word lists). Samples are shown in Table 3, generated from a single prefix, while being controlled towards each topic. Interestingly, we find that increasing the probability of generating the words in the bag also increases the probability of generating related topical words not in the BoW (e.g. in the [Science] sample shown in Table 3, note that question and philosophers are sampled before the first BoW word, laws). Table S17 shows the gradual change of topic intensity under fine-grained control. We found that the optimization procedure works better with updating representations from the past over a finite window and using an adaptive normalization scheme (see Section S11.3).

For automatic and human evaluation, we generate 420 samples evenly distributed among seven BoW attribute models and 20 prefixes (see the full list in Section S15), for each of the four variants described in the ablation study. See Section S8 for further details on evaluation and results. Table 4 shows that human annotators find text from BCR (51.7%) and BC (46.9%) to be significantly more on topic than B (15.8%) and BR (11.1%). With only a slight degradation in fluency scores, passages generated with manipulated latents (BCR and BR) are significantly on topic, demonstrating the desired attribute control on this task. The Dist-1, Dist-2 and Dist-3 scores, which accounts for diversity of text across the generated passages, are similar across all four ablation approaches. Further, BCR slightly outperforms CTRL (51.7% & 50.0%), and significantly outperforms WD (36 %). BC itself outperforms WD (36 %). BCR, CTRL and WD all score similarly on the fluency metric.

We note that gradient-based latent updates have significantly greater influence on topic relevance (R with or without C) than reranking based on the score (C with or without R), showing that shifting meaning in latent space is more effective than shifting the output distribution directly through reweighting. The effectiveness of shifting latents is further corroborated by the WD’s relatively worse performance. WD directly controls the output distribution, which will not lead to increased probability of sampling words from outside the bag that are related to the topic.

Finally, there is a large variance in the extent of controllability across topics (Table S8). We find that some topics (religion, science, politics) are easier to control for compared to others (computers, space). Section S9 considers unusual or nonsensical combinations of prefixes and attributes (e.g. prefix ‘potato’ and topic ’religion’), and we find that even for these settings PPLM is able to successfully control for the desired attribute, often with hilarious twists!

3 Discriminator attribute models

While BoW models have been demonstrated to be able to control text attributes such as sentiment (e.g., Li et al. (2018) rely on extracting a set of attribute-based phrases to control the sentiment during style transfer), being able to control attributes using more sophisticated discriminators is desirable when it is difficult to express the attribute with a simple bag of words.

We train a discriminator on a dataset with input sentences $x$ and corresponding labels $y_{x}$ . For an input $x$ of length $t$ , we compute $o^{x}_{:t}$ and train $f$ on the mean ( $\bar{o}^{t}$ ) of the embeddings across time. All discriminators in this work consist of a single layer classifier that predicts the target label from $\bar{o}^{x}_{t}$ . The number of parameters in this layer is (embedding-dimension ( $e$ ) $\times$ number of attributes ( $a$ ) + number of attributes ( $a$ )), which is negligible compared to the number of parameters in the LM model itself (Table 2). Although the loss is a function of the entire sequence, here we adopt a greedy approach, similar to Ebrahimi et al. (2018); Wallace et al. (2019), in which we optimize for a higher-probability of the sequence having a specific attribute by considering changes only to the next token to be generated. This objective can be described as follows, where $f$ is the discriminator:

The sentiment discriminator here distinguishes sentiment between Positive and Negative and is trained on the SST-5 dataset (Socher et al., 2013). Table 5 shows PPLM-Discrim generated samples in triplets: uncontrolled, controlled for positive sentiment, controlled for negative sentiment. For automatic and human evaluation, we use 15 prefixes (see the full list in Section S15) to generate 45 samples for each of two sentiment classes: very positive and very negative. Note that even though the sentiment discriminator is trained with movie review data, the prefixes (e.g. “The painting”, “The potato”, “The country”) we used are not necessarily associated with movie reviews. This supports the generality of our approach: an attribute model trained with data from a different domain can still provide meaningful gradients.

Table 6 shows evaluation results. For human evaluation, we obtain 1620 annotations for the ablation study and 495 for baseline comparisons from the annotators distributed across the samples and sentiments. Unlike the topic control setting, sampling and ranking results in a considerable increase in attribute accuracy ( $19.3\%\rightarrow 41.5\%$ ), because the prior probability of sampling, say, a negative sentence, is relatively high. BC results in a decrease in fluency when compared to B, while being significantly more consistent with the desired attribute ( $19.3\%\rightarrow 39.6\%$ ). With latent manipulation and ranking (BCR), we see a significant increase in attribute control accuracy ( $73.7\%$ ) while retaining fluency similar to B and BR. Further, the gain in sentiment accuracy from re-sampling is larger in the case of manipulated latents vs non-manipulated ( $34.1\%$ increase from BC to BCR $>$ $22.2\%$ increase from B to BR), indicating that these two approaches may be profitably combined. We also evaluate attribute control with an external sentiment classifier trained on IMDB movie reviews (Maas et al., 2011), which is a different dataset from the one used to train the attribute model (Socher et al., 2013), and the same rough story holds, albeit with smaller gaps between approaches. We compare to baselines CTRL, GPT2-FT-RL, and WD. BCR performs comparably to CTRL (73.7% and 80.0%), and BR, BC and BCR all outperform GPT2-FT-RL, the GPT-2 LM fine tuned for positivity, and WD.

4 Language Detoxification

Language models trained with large corpora of Internet data reflect biases and discrimination existing in the data. A recent paper by Wallace et al. (2019) conducted adversarial attacks that make GPT-2 produce racist output when given a carefully optimized trigger string as prefix. They also find that when simply using “Blacks” as prefix, 2% of GPT-2 samples contain explicit racism. Other prefixes (e.g., “Asians” or “Jews”) are mentioned but no percentage is reported. We conduct experiments and report the baseline toxicity percentages to be 10% (“Asians”), 12% (“Jews”) and 8% (“Blacks”). With adversarial triggers generated from the released codebase by Wallace et al. (2019) the average toxicity percentage is 63.6%. Further details can be found in Section S13.

PPLMs can be easily adapted for language detoxification by plugging in a toxicity classifier as the attribute control model and update latents with the negative gradient. We train a single layer classifier on the toxicity data from the Toxic Comment Classification Challenge (Jigsaw, ) and show that with a similar hyper-parameter setting as other PPLM-Discrim methods, it works well on both natural prompts and adversarial triggers. For natural prompts percentages of toxicity are 6%, 4% and 10%, respectively, and for adversarial triggers it drastically dropped to 4.6% on average, with statistical significance. Details on the annotation procedure and full table of percentage and p-values can be found in Table S23 and Section S13. Note that a model for detoxifying language can also potentially be maliciously used for generating toxic language, a topic we briefly discuss in Section S6.

5 Controlled Story Writing

We explore controlled generation for assistive story writing (Peng et al., 2018; Luo et al., 2019; Yao et al., 2019; Fan et al., 2018). Using uncontrolled LMs for assistive art creation can be difficult. To help with the structure, we use predefined story skeletons often used in improvisation (Adams, ). We fill in the blank between these prefixes with a PPLM. See examples in Table S20 and Table S21.

Conclusion

We have presented PPLM, a plug and play method for controlled language generation that flexibly combines a large, pre-trained LM and a BoW or a small, easy-to-train discriminator. In Section S6 we discuss the ethics of controlled LMs. PPLM achieves fine-grained control of attributes via a simple gradient-based sampling mechanism. Because PPLMs can flexibly control generation while maintaining fluency, they hold great promise for enabling the next generation of language models.

Acknowledgements

The authors are grateful to Bryan McCann for providing samples for the CTRL baseline, Joel Lehman for discussion regarding the ethical implications for this work, Jiale Zhi for help with the computational framework, Colan Chen for creating associated artwork for the blog, Avishek Joey Bose for helpful discussions, Julien Chaumond, Lysandre Debut, Thomas Wolf, and the Hugging Face team for co-producing the PPLM demo and helping integrate the code into their transformers repository, all the annotators at Uber, HKUST and Caltech for their labeling, and members of the Deep Collective research group for helpful discussion, ideas, and feedback on experiments.

References

S6 Ethics of Controlled Language Models

There has recently been a substantial discussion around the ethics of capable language models (Radford et al., 2019; Keskar et al., 2019), both in their potential to recapitulate problematic social biases and for them to be directly abused for societal harm (e.g. to generate disinformation). While one aim of this paper is to suggest a mechanism to detoxify language models (Section 4.4), we also acknowledge that nearly the same mechanism could be exploited to instead create more toxic language. Such possibilities are inherent to general-purpose technologies such as machine learning, and we believe that on balance this work creates more value than risks.

S7 Details on Baseline methods

We consider three baselines: CTRL, GPT2-FT-RL, and WD. The first two are strong baselines where large language models are trained (or fine-tuned) specifically to generate texts conditioned on certain attributes, while WD is considered a weak baseline based on a direct integration of the conditioning into the decoding.

For each baseline, we generate data from their method, and conduct the same human and automated evaluations. For human evaluation of attribute relevance, we match baseline data with our method (BCR in the ablation study), and pass to human annotators for an A/B testing style annotation. As in the ablation study, human annotators are given a pair of texts, one from baseline, one from ours, with orders randomized and source hidden, and asked to rank which one is more topic or sentiment relevant, while having the options of “both” and “neither”.

On top of that, we have human annotators to give the fluency score of each text sample under each method individually. And automated evaluations of perplexity, sentiment, etc. are also done individually.

The recent conditional language model, CTRL, from Keskar et al. (2019), trains a 1.6B LM conditioned on around 50 control codes. We use the official released codebase CTRL codebase: https://github.com/salesforce/ctrl and their open-sourced model to generate samples for the CTRL baseline. Out of the 7 topics considered in PPLM-BoW, we found that 5 can be matched with a specific control code in CTRL. We append a secondary code "Text:" to each primary control code, per the author’s suggestion, to encourage more fluent and longer passages. The 2 topics missing a match with CTRL are: Military, Space. For positive and negative sentiments in PPLM-Discrim, we match with the Reviews control code and append a high and low rating score.

The matched attributes and control codes are listed in Table S7.

Under this setting, for each control code we generate texts prompted by the same prefixes used for corresponding PPLM attribute model (20 for PPLM-BoW, 15 for PPLM-Discrim). For example, “In summary” and “To review,” for PPLM-BoW, and “The chicken”, “The lake” for PPLM-Discrim.

Due to the near-greedy sampling method CTRL uses, for each prefix it generates one sample. Hence we have 20 samples for each matching topic with PPLM-BoW, and 15 samples for positive and 15 for negative.

S7.2 GPT2-FT-RL

A recently released GPT-2 model fine-tuned using human feedback, from Ziegler et al. (2019), showed success in summarization and text continuation in desired styles. To compare with PPLM, we run GPT2-FT-RL GPT2-FT-RL codebase: https://github.com/openai/lm-human-preferences to generate positive texts on the same prefixes used in our PPLM-Discrim experiment. For each prefix, we generate three GPT2-FT-RL samples, and pair them with those generated from PPLM (BCR in the ablation study) randomly.

S7.3 Weighted decoding (WD)

We consider a simple baseline based on a direct integration of the conditioning into the decoding procedure, similar to the approach from Ghazvininejad et al. (2017).

In Ghazvininejad et al. (2017), the authors consider increasing the likelihood of sampling from a bag of key-words by performing beam-search with a modified scoring function.

Sentiment Control with Discriminator

$p(a=\hat{a}|x_{0:t},w_{i})$ is the probabilty of the sequence $x_{0:t},w_{i}$ possessing attribute $\hat{a}$ as assigned by the attribute model. By Bayes’ rule, $p(a=\hat{a};w_{i}|x_{0:t})=p(a=\hat{a}|x_{0:t},w_{i})p_{t+1}(w_{i})$ , and we do top-5 sampling from this distribution. Recall that $p_{t+1}(w_{i})=p(w_{i}|x_{0:t})$ under the language model.

S8 Further details on human and automated evaluation

We conduct evaluations on attribute relevance and language fluency, both including human and automated evaluation.

For topic relevance (a.k.a attribute relevance where the attribute is a topic, in our case represented by a BoW), we rely entirely on human annotation. For sentiment relevance, we rely on human annotation as well as a separately trained sentiment classifier. We also performed a “clickbait” style control, for which the effectiveness relies on human annotation.

For fluency, we use human annotations (between 1 to 5) and automated methods: perplexity, Dist-1, Dist-2, and Dist-3 scores.

The number of human evaluations are as below:

PPLM-BoW. For the ablation study, we have 20 prefixes $\times$ 7 topics $\times$ 6 combinations $\times$ 3 samples $\times$ 3 labels each, resulting in 7560 total annotations. For baseline comparisons, we have (20 prefixes $\times$ 5 topics) for CTRL and (20 prefixes $\times$ 7 topics $\times$ 3 samples) for WD, each then with 3 labels, resulting in 1560 total annotations.

PPLM-Discrim, sentiments. For the ablation study, we have 15 prefixes $\times$ 2 sentiments $\times$ 6 combinations $\times$ 3 samples $\times$ 3 labels each, resulting in 1620 total annotations. For baseline comparisons, we have (15 prefixes $\times$ 2 sentiments) for CTRL and (15 prefixes $\times$ 3 samples) for GPT2-FT-RL and (15 prefixes $\times$ 3 samples $\times$ 2 sentiments) for WD which each have 3 labels, resulting in 495 total annotations.

PPLM-Discrim, clickbait. We include in this section an additional discriminator attribute model, clickbait classifier. For this we use the same setting as sentiment, 15 prefixes $\times$ 6 combinations $\times$ 3 samples $\times$ 3 labels each, resulting in 810 annotations.

In ablation studies, the generation procedure for BCR, BR and BC is always initiated from the same random seeds. The same set of random seeds that lead to the samples chosen with BCR are stored and used to generate the samples with B.

The full table of all these measures, human and automated, on PPLM-BoW, seperated by sentiment and style, is in Table S8. Included also are strong baselines (CTRL and WD) for each sentiment. The human annotated topic relevance is further visualized in Figure S3. The fluency scores, while being across {B, BC,BR, BCR,} methods in the table, when shown in distribution are very similar, as seen in Figure S5.

The full table of all these measures, human and automated, on PPLM-discrm sentiments, is in Table S9. Included also are strong baselines (CTRL, WD and GPT2-FT-RL) for each topic. The human annotated sentiment and style (e.g. “Clickbait”) relevance is further visualized in Figure S4, along with congregated measures: all sentiments, all discriminators, all topics. The fluency scores again have similar distributions across {B, BC,BR, BCR,} methods, as seen in Figure S6.

S9 Odd combination of topics and prefixes

It is interesting to see how PPLM can steer the text generation when the topic and prefix combination appears odd or illogical. For example, will “The potato” still prompt sensible text generation under the topic Religion? In this study we design a set of odd combinations, as bellow.

Prefixes of {“The chicken”, “The horse”, “The pizza”, “The potato”, “The lake”}, each controlled by topics of {Military, Legal, Computers, Politics, Religion};

Prefixes of {“My dog died”, “The food is awful”}, each controlled by the sentiment of Positive;

Prefixes of “The food is amazing”, controlled by the sentiment of Negative.

We found that PPLM control is easy even under those scenarios. We had to increase the strength $\alpha$ two or three fold (to $0.02$ or $0.03$ as opposed to $0.01$ in most studies) to allow for a stronger influence of attribute, but this is as expected: the strength parameter is a knob that user can tune to reach fine-grained control. The resulting generation is included in Table S10 - Table S16.

S10 Fine-grained Control with PPLM-BoW

Table S17 shows the subtle effect when you turn the step size $\alpha$ up, while keeping everything else (hyperparameters, text prefix) the same.

S11 Hyperparameters

We list, in Table S18, the full set of hyperparameters used in each task in the experiments section, corresponding to results in Table 4 and Table 6, as well as in Section 4.4. In addition, we explain in details three hyperparameters and their effect, below.

Degeneration (the occurrence of repetitive words) is a known issue with language generation (Holtzman et al., 2019), and we found it to be a case in PPLM-BoW when the update step size $\alpha$ is too large. The model tends to degenerate towards repeating certain keywords targeted in the optimization (e.g. words in the BoW). In this case, we can either reduce $\alpha$ , or use the trick of early stopping latent updates.

Examples shown in Table S19. With the exact same setting, but just stopping latent updates after 20 time steps, the samples show much less degeneration.

S11.2 Finite Horizon Update

As opposed to updating the entire vector $H_{t}$ , which consists of key-value pairs corresponding to every token in the prefix, we consider modifying the key-value pairs corresponding to the most recent $w$ tokens. At each time-step $t$ , we only modify $H_{t}{[t-w:t]}$ . This means that we modify $H_{i}$ at most $w$ times, and requires lesser computation that updating the whole past. We find that $w=5$ produces more fluent passages for control with the bag of words. For control with the neural attribute model, we update the entire latent history.

S11.3 Adaptive Gradient Normalization

For the bag-of-words based attribute model, what we wish to enforce is that a word from the bag appears at least once in the generated passage and not at every time-step. To account for this, instead of normalizing directly by the gradient norm as in Equation 3, we normalize by the maximum gradient norm over time. This implies that we make smaller updates when it is less likely for a word from the bag of words to appear. Formally, the normalization constant at time-step $t$ is: $\underset{i=0\ldots t}{\max}\|\nabla_{H^{(i)}}\mathcal{L}(o_{i+1})\|$ .

S12 Additional details and samples of controlled story telling

Table S20 and Table S21 show examples of the controlled story telling with skeleton. Table S22 shows samples of combinging multiple (three to four) attribut models, across different types (PPLM-BoW and PPLM-Discrim).

S13 Further Details on Language Detoxification

We ask a group of four volunteers to assign binary toxicity labels to texts generated from both the original GPT-2 and toxicity-controlled GPT-2 and to texts prompted with natural words as well as adversarial triggers. Natural prompts include “Blacks”, “Asians”, “Jews” as suggested in Wallace et al. (2019). 9 Adversarial triggers are generated using the released codebase from the same work. The percentages of toxicity, before (with uncontrolled GPT-2) and after (with toxicity-controlled GPT-2), are reported in Table S23.

S14 More examples

We include more PPLM controlled generation examples in Table S24 – Table S27.

S15 Prefixes used in PPLM evaluation

We consider 20 prefixes as sentence starters for evaluating PPLM-BoW generation, chosen randomly from www2.eit.ac.nz/library/ls_guides_sentencestarters.html. For PPLM-Discrim, we use 15 prefixes. The entire set is below.

‘‘In summary’’, ‘‘This essay discusses’’, ‘‘Views on’’, ‘‘The connection’’, ‘‘Foundational to this is’’, ‘‘To review,’’, ‘‘In brief,’’, ‘‘An illustration of’’, ‘‘Furthermore,’’, ‘‘The central theme’’, ‘‘To conclude,’’, ‘‘The key aspect’’, ‘‘Prior to this’’, ‘‘Emphasised are’’, ‘‘To summarise’’, ‘‘The relationship’’, ‘‘More importantly,’’, ‘‘It has been shown’’, ‘‘The issue focused on’’, ‘‘In this essay’’.

PPLM-Discrim

‘‘Once upon a time’’, ‘‘The book’’, ‘‘The chicken’’, ‘‘The city’’, ‘‘The country’’, ‘‘The horse’’, ‘‘The lake’’, ‘‘The last time’’, ‘‘The movie’’, ‘‘The painting’’, ‘‘The pizza’’, ‘‘The potato’’, ‘‘The president of the country’’, ‘‘The road’’, ‘‘The year is 1910.’’ .

S16 Combining multiple controllers for inspiration

Earlier we demonstrated attribute control using a single attribute model or two attribute models of the same type (e.g. BoW from two separate topics). Here we mix different types of attribute models (BoW and discriminator). For example, we can control the generation toward a mixed topic about Winter, Politics, Kitchen, while turning positive. See examples in Table S22.

S17 Word lists for Bag of Words approaches

We curate word lists from www.enchantedlearning.com/wordlist.

astronomy, atom, biology, cell, chemical, chemistry, climate, control, data, electricity, element, energy, evolution, experiment, fact, flask, fossil, funnel, genetics, gravity, hypothesis, lab, laboratory, laws, mass, matter, measure, microscope, mineral, molecule, motion, observe, organism, particle, phase, physics, research, scale, science, scientist, telescope, temperature, theory, tissue, variable, volume, weather, weigh

Fantasy/Magic:

beast, Cerberus, demon, dragon, fairy, Frankenstein, ghost, Godzilla, giant, horror, hydra, imp, monster, mummy, ogre, orc, savage, spirit, sprite, titan, troll, undead, unicorn, vampire, witch, zombie

Space:

planet, galaxy, space, universe, orbit, spacecraft, earth, moon, comet, star, astronaut, aerospace, asteroid, spaceship, starship, galactic, satellite, meteor

Politics:

affirm, appropriation, aristocracy, authoritarian, authority, authorization, brief, capitalism, communism, constitution, conservatism, court, deficit, diplomacy, direct, democracy, equality, exports, fascism, federation, government, ideology, imports, initiative, legislature, legitimacy, liberalism, liberty, majority, order, political, culture, politics, power, primary, property, ratification, recall, referendum, republic, socialism, state, subsidy, tariff, imports, tax, totalitarian

Military:

academy, advance, aircraft, ally, ammo, ammunition, armor, arms, army, arrow, arsenal, artillery, attack, attention, ballistic, barracks, base, battalion, battery, battle, battlefield, bomb, bombard, bombardment, brig, brigade, bullet, camouflage, camp, cannon, captain, capture, carrier, casualty, catapult, cavalry, colonel, combat, command, commander, commission, company, conflict, conquest, convoy, corps, covert, crew, decode, defeat, defend, defense, destroyer, division, draft, encode, enemy, engage, enlist, evacuate, explosive, fight, fire, fleet, force, formation, fort, front, garrison, general, grenade, grunt, guerrilla, gun, headquarters, helmet, honor, hospital, infantry, injury, intelligence, invade, invasion, jet, kill, leave, lieutenant, major, maneuver, marines, MIA, mid, military, mine, missile, mortar, navy, neutral, offense, officer, ordinance, parachute, peace, plane, platoon, private, radar, rank, recruit, regiment, rescue, reserves, retreat, ribbon, sabotage, sailor, salute, section, sergeant, service, shell, shoot, shot, siege, sniper, soldier, spear, specialist, squad, squadron, staff, submarine, surrender, tactical, tactics, tank, torpedo, troops, truce, uniform, unit, veteran, volley, war, warfare, warrior, weapon, win, wound

Religion:

Absolute, Affect, Aid, Angel, Anthem, Apostle, Archangel, Archbishop, Balance, Ban, Belief, Benefit, Bible, Bishop, Bless, Blessing, Bliss, Bond, Bow, Buddhism, Canon, Cantor, Cathedral, Celestial, Chapel, Charity, Choice, Christianity, Church, Comfort, Community, Conflict, Connection, Conquest, Conservative, Control, Conversion, Convert, Core, Counsel, Courage, Covenant, Creative, Creator, Creed, Cross, Crusade, Darkness, Decision, Deity, Destiny, Devil, Disciple, Discipline, Discussion, Divine, Divinity, Doctrine, Duty, Effect, Elder, Energy, Essence, Eternal, Ethics, Event, Evidence, Exile, Exodus, Faith, Family, Fate, Father, Favor, Fundamental, Gift, Glory, God, Gospel, Grace, Growth, Guru, Habit, Hallow, Halo, Happiness, Harmony, Healing, Heaven, Hebrew, Holy, Honor, Hope, Host, Humane, Immortal, Influence, Insight, Instruction, Issue, Jesuit, Jesus, Joy, Judaism, Judgment, Justice, Karma, Keen, Keystone, Kingdom, Latin, Life, Light, Love, Loving, Marriage, Meaning, Mercy, Messiah, Minister, Miracle, Mission, Mortal, Mosque, Movement, Music, Mystery, Nature, Nun, Official, Oracle, Order, Organ, Orthodox, Outlook, Pacific, Pagan, Parish, Participation, Pastor, Patriarch, Peace, Perception, Personal, Perspective, Petition, Pilgrim, Politics, Power, Practice, Prayer, Prelude, Presence, Priest, Principle, Privacy, Prophet, Protection, Purpose, Query, Quest, Question, Quiet, Radiant, Radical, Rally, Rebirth, Redemption, Refuge, Relationship, Relative, Religion, Religious, Revelation, Ritual, Role, Sacrament, Sacred, Sacrifice, Sage, Saint, Salvation, Sanctuary, Savior, Scripture, Scriptures, Sect, Security, Sense, Serious, Serve, Service, Sharia, Shepherd, Shrine, Silence, Sin, Society, Soul, Source, Spirit, Spiritual, Split, Statue, Sunday, Support, Supreme, Teaching, Temple, Tests, Text, Torah, Tradition, Traditional, Trust, Unique, Unity, Unknown, Value, Vanity, Virtue, Vision, Voice, Voices, Watch, Weight, Whole, Wisdom, Wonder, Yang, Yin, Zeal

Computers:

algorithm, analog, app, application, array, backup, bandwidth, binary, bit, bite, blog, blogger, bookmark, boot, broadband, browser, buffer, bug, bus, byte, cache, caps, captcha, CD, client, command, compile, compress, computer, configure, cookie, copy, CPU, dashboard, data, database, debug, delete, desktop, development, digital, disk, document, domain, dot, download, drag, dynamic, email, encrypt, encryption, enter, FAQ, file, firewall, firmware, flaming, flash, folder, font, format, frame, graphics, hack, hacker, hardware, home, host, html, icon, inbox, integer, interface, Internet, IP, iteration, Java, joystick, kernel, key, keyboard, keyword, laptop, link, Linux, logic, login, lurking, Macintosh, macro, malware, media, memory, mirror, modem, monitor, motherboard, mouse, multimedia, net, network, node, offline, online, OS, option, output, page, password, paste, path, piracy, pirate, platform, podcast, portal, print, printer, privacy, process, program, programmer, protocol, RAM, reboot, resolution, restore, ROM, root, router, runtime, save, scan, scanner, screen, screenshot, script, scroll, security, server, shell, shift, snapshot, software, spam, spreadsheet, storage, surf, syntax, table, tag, template, thread, toolbar, trash, undo, Unix, upload, URL, user, UI, username, utility, version, virtual, virus, web, website, widget, wiki, window, Windows, wireless, worm, XML, Zip

Legal:

affidavit, allegation, appeal, appearance, argument, arrest, assault, attorney, bail, bankrupt, bankruptcy, bar, bench, warrant, bond, booking, capital, crime, case, chambers, claim, complainant, complaint, confess, confession, constitution, constitutional, contract, counsel, court, custody, damages, decree, defendant, defense, deposition, discovery, equity, estate, ethics, evidence, examination, family, law, felony, file, fraud, grievance, guardian, guilty, hearing, immunity, incarceration, incompetent, indictment, injunction, innocent, instructions, jail, judge, judiciary, jurisdiction, jury, justice, law, lawsuit, lawyer, legal, legislation, liable, litigation, manslaughter, mediation, minor, misdemeanor, moot, murder, negligence, oath, objection, opinion, order, ordinance, pardon, parole, party, perjury, petition, plaintiff, plea, precedent, prison, probation, prosecute, prosecutor, proxy, record, redress, resolution, reverse, revoke, robbery, rules, sentence, settlement, sheriff, sidebar, standing, state, statute, stay, subpoena, suit, suppress, sustain, testimony, theft, title, tort, transcript, trial, trust, trustee, venue, verdict, waiver, warrant, will, witness, writ, zoning