GeDi: Generative Discriminator Guided Sequence Generation

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, Nazneen Fatema Rajani

cs.CL cs.LG

Introduction

Natural language generation has seen great progress with the advent of Transformers (Vaswani et al., 2017) and large scale training (Radford et al., 2017; 2018; 2019; Brown et al., 2020). Large language models (LMs) like GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) are able to learn the distribution of their training set well enough to generate realistic text. However, simply imitating the distribution of the training data during generation has many drawbacks; large-scale text training sets are crawled from the web which is imbued with toxicity, bias, hate, and misinformation. Methods for better controlling or filtering generation are valuable for making LMs trained on such data safer and more generally useful for downstream applications.

Existing approaches to controlling LMs have limitations. Class-conditional LMs (CC-LMs) such as CTRL (Keskar et al., 2019) attempt to control text generation by conditioning on a control code, which is an attribute variable representing a data source. However, CTRL is not as useful for controlling what not to generate (e.g., toxicity). Furthermore, using a specific control code can reduce sample diversity across prompts, as samples will generally resemble the data source of the control code. Another approach is to use discriminators to steer generation, but existing methods to do this are very computationally intensive. Weighted decoding (Holtzman et al., 2018) requires feeding candidate next tokens into a discriminator, and thus scales linearly in computation with the number of tokens to be re-weighted. Plug and Play LM (PPLM; Dathathri et al., 2020) applies up to 10 updates to the generating LM’s latent states per time step using gradients from a discriminator, also making it many times slower than generating from the LM directly.

We present GeDi (pronounced “Jedi”) as an algorithm for efficiently guiding generation from large LMs to make them safer and more controllable. Our proposed method uses CC-LMs as generative discriminators (GeDis) to guide language generation towards desired attributes. The methods we develop include:

GeDi-guided contrastive generation: We show how CC-LMs can be used as generative discriminators to compute classification likelihoods for all candidate next tokens during generation using Bayes rule, saving many thousand-fold in computation as compared with using a standard (non-generative) discriminator to compute this for large vocabulary sizes. We then show how these likelihoods can guide generation from large language models via weighted decoding and filtering [Section 3.1].

GeDi training: We train CC-LMs with a hybrid generative-discriminative loss to make them better classifiers, making them more powerful discriminators for GeDi-guided contrastive generation [Section 3.2].

Our experimental results verify the ability of GeDi to control generation in a variety of settings while maintaining linguistic quality on par with strong language models. We apply GeDi (345M parameters) to guide generation from the GPT2-XL model (1.5B parameters), and find that:

GeDi trained on sentiment of movie reviews can generate book text with a positive or negative tone better than state of the art baselines [Section 5.1]. Guiding towards positivity also has potential applications towards making LMs friendlier.

GeDi is able to significantly reduce the toxicity of GPT-2 generation [Section 5.2], without sacrificing linguistic quality as compared with generating from GPT-2 directly, suggesting applications towards safer language modeling.

GeDi trained on a dataset of only 4 topics can generalize to new control codes zero-shot [Section 5.3], allowing them to guide generation towards a wide variety of topics.

GeDi is very computationally efficient for both training and inference. GeDi guided generation in our experiments is more than $30\times$ faster than applying PPLM with GPT2-XL using default settings from Dathathri et al. (2020). Additionally, smaller GeDis fine-tuned for less than a day on a single GPU are effective and computationally efficient for controlling larger language models. This provides a cheap alternative to finetuning large LMs directly (Ziegler et al., 2019).

Background

Language models (LMs) rely on an auto-regressive factorization to perform density estimation and generation of language data. Auto-regressive sequence models with parameters $\theta$ assign a probability to a sequence $x_{1:T}=\{x_{1},\dots,x_{T}\}$ by factorizing it using the chain rule as follows:

$$P_{\theta}(x_{1:T})=\prod_{t=1}^{T}P_{\theta}(x_{t}|x_{<t}) \tag{1}$$

Models can assign probabilities to sequences by iteratively predicting a distribution over the next token given the previous tokens. Generating from language models requires iteratively sampling from $P_{\theta}(x_{t}|x_{<t})$ , and then feeding $x_{t}$ back into the model as input for the next step.
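The two uses of the factorization described above (scoring a sequence and generating one token at a time) can be sketched with a toy model. This is a minimal illustration, not the paper's implementation: `TOY_LM`, `next_token_probs`, `sequence_log_prob`, and `greedy_generate` are hypothetical names, and the toy model conditions only on the previous token.

```python
import math

# Hypothetical toy "language model": the next-token distribution depends
# only on the previous token (an assumption made purely for illustration).
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.7, "b": 0.3},
}


def next_token_probs(prefix):
    """Return P(x_t | x_<t) for the toy model, given the tokens so far."""
    last = prefix[-1] if prefix else "<s>"
    return TOY_LM[last]


def sequence_log_prob(tokens):
    """Chain rule: log P(x_1:T) = sum over t of log P(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[token])
    return total


def greedy_generate(length):
    """Pick the most likely next token, then feed it back in as input."""
    tokens = []
    for _ in range(length):
        probs = next_token_probs(tokens)
        tokens.append(max(probs, key=probs.get))
    return tokens


log_p = sequence_log_prob(["a", "b"])  # log(0.6 * 0.9)
sample = greedy_generate(3)
```

A real LM replaces the lookup table with a forward pass through a neural network, but the loop structure is the same.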

Class-Conditional Language Modeling

Class-conditional language models (CC-LMs) such as CTRL (Keskar et al., 2019) are a way for language models to generate while conditioning on an attribute variable. CC-LMs predict a probability distribution $P_{\theta}(x_{1:T}|c)$ , where $c$ is a class variable or a “control code” that describes an attribute of the text in $x_{1:T}$ , which could, for instance, describe sentiment or topic. The auto-regressive factorization for a CC-LM is given by the following equation:

$$P_{\theta}(x_{1:T}|c)=\prod_{t=1}^{T}P_{\theta}(x_{t}|x_{<t},c) \tag{2}$$

When training a CC-LM on a training set of sequences $\{x^{(1)}_{1:T_{1}},\dots,x^{(i)}_{1:T_{i}},\dots,x^{(N)}_{1:T_{N}}\}$ , each sequence $x^{(i)}_{1:T_{i}}$ is paired with a control code $c^{(i)}$ , which is a label or category of the sequence. The LM is trained to minimize the average negative log-likelihood, which we refer to as $\mathcal{L}_{g}$ :

$$\mathcal{L}_{g}=-\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T_{i}}\sum_{t=1}^{T_{i}}\log P_{\theta}(x^{(i)}_{t}|x^{(i)}_{<t},c^{(i)}) \tag{3}$$

In addition to class-conditional generation, CC-LMs can be used as generative classifiers by applying Bayes rule to compute $P_{\theta}(c|x_{1:T})$ , as is done by Keskar et al. (2019) for source attribution.

GeDi

GeDi assumes we have a CC-LM with desired control code $c$ and an undesired or anti-control code $\bar{c}$ , and uses the contrast between $P_{\theta}(x_{1:t}|c)$ and $P_{\theta}(x_{1:t}|\bar{c})$ to guide sampling from an LM that gives $P_{{LM}}(x_{1:t})$ . Specifically, when predicting the next token during generation, GeDi uses this contrast to compute the probability that every candidate next token $x_{t}$ belongs to the desired class, given by $P_{\theta}(c|x_{t},x_{<t})$ . Our key insight is that this distribution can be computed very efficiently when using CC-LMs as GeDis via application of Bayes rule for partial sequences during generation:

$$P_{\theta}(c|x_{1:t})=\frac{P(c)\prod_{j=1}^{t}P_{\theta}(x_{j}|x_{<j},c)}{\sum_{c^{\prime}\in\{c,\bar{c}\}}P(c^{\prime})\prod_{j=1}^{t}P_{\theta}(x_{j}|x_{<j},c^{\prime})} \tag{4}$$

When computing this online during sequence generation, the model will have already computed $P_{\theta}(x_{j}|x_{<j},c^{\prime})$ for any $j<t$ from the previous time-steps, and it will only need to compute $P_{\theta}(x_{t}|x_{<t},c^{\prime})$ . This can be computed in two parallel forward passes: one conditioning on $c$ and one conditioning on $\bar{c}$ (both conditioning on the same $x_{<t}$ ). The model can also save the hidden states from the previous time steps to avoid computing a forward pass through the full sequence at each next token generation step. Applying a unidirectional classifier such as GPT (Radford et al., 2018) to compute $P_{\theta}(c|x_{t},x_{<t})$ directly (i.e. discriminatively) would require feeding every possible input $x_{t}\in{\mathcal{V}}$ into the classifier, and thus would require $|{\mathcal{V}}|$ forward passes for a vocab set ${\mathcal{V}}$ . A bidirectional classifier such as BERT (Devlin et al., 2018) would require $t\times|{\mathcal{V}}|$ forward passes because it would need to recompute attention states from earlier time-steps. For typical vocab sizes of $20$k+, GeDi’s online classification trick needs only two forward passes rather than $|{\mathcal{V}}|$ , so it can compute $P_{\theta}(c|x_{t},x_{<t})$ for every possible next token $x_{t}$ with on the order of 10k-fold less computation than a unidirectional classifier.

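In practice, we find that normalizing ( $\log$ ) probabilities by the current sequence length $t$ results in more robust generation of variable length sequences. Our GeDi trained models (see next section) also use a learnable scale parameter $\alpha$ . To compute $P_{\theta}(c|x_{1:t})$ for GeDi-guided generation, we use the following equation:

$$P_{\theta}(c|x_{1:t})=\frac{P(c)\,P_{\theta}(x_{1:t}|c)^{\alpha/t}}{\sum_{c^{\prime}\in\{c,\bar{c}\}}P(c^{\prime})\,P_{\theta}(x_{1:t}|c^{\prime})^{\alpha/t}} \tag{5}$$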

The log prior is encoded with bias parameters $b_{c}$ , where $P(c)=\frac{e^{b_{c}}}{\sum_{c^{\prime}}e^{b_{c^{\prime}}}}$ . This bias parameter can be assumed to be zero for uniform classes, learned (see next section on GeDi training), or set manually as a hyper-parameter. In practice, $P_{\theta}(c|x_{1:t})$ is computed with log-probabilities (see Appendix B). With the efficient estimation of $P_{\theta}(c|x_{t},x_{<t})$ , there are many possible heuristics that can be used to guide LM generation, so long as the LM and GeDi share the same tokenization. Heuristics that use $P_{\theta}(c|x_{t},x_{<t})$ inherently contrast predictions conditioned on $c$ and $\bar{c}$ , causing attributes common to $c$ and $\bar{c}$ to be cancelled out, more effectively allowing for the attribute described by $c$ to be transferred across domains, as illustrated in Figure 1.
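The online classification step described above can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function name `class_posteriors` and its dictionary-based inputs are assumptions, and the class-conditional log-probabilities are supplied by hand instead of coming from a CC-LM. It applies Bayes rule with the length normalization $\alpha/t$ and bias terms $b_{c}$ to every candidate next token at once.

```python
import math


def class_posteriors(prev_logp, next_logp, t, alpha=1.0, bias=None):
    """Compute P(class | x_t, x_<t) for every candidate next token x_t.

    prev_logp: {class: log P(x_<t | class)}, accumulated over earlier steps.
    next_logp: {class: {token: log P(x_t | x_<t, class)}} from one forward
               pass per class.
    t:         sequence length including the candidate token.
    bias:      optional {class: b_class} encoding the log prior (zero if None).
    """
    classes = list(next_logp)
    bias = bias or {c: 0.0 for c in classes}
    vocab = next_logp[classes[0]].keys()
    posteriors = {}
    for token in vocab:
        # Length-normalized, scaled log-likelihood plus log-prior bias.
        logits = {
            c: bias[c] + (alpha / t) * (prev_logp[c] + next_logp[c][token])
            for c in classes
        }
        # Softmax over classes, subtracting the max for numerical stability.
        m = max(logits.values())
        z = sum(math.exp(v - m) for v in logits.values())
        posteriors[token] = {c: math.exp(logits[c] - m) / z for c in classes}
    return posteriors


# Toy two-token vocabulary with made-up class-conditional probabilities.
next_logp = {
    "pos": {"good": math.log(0.8), "bad": math.log(0.2)},
    "neg": {"good": math.log(0.2), "bad": math.log(0.8)},
}
posteriors = class_posteriors({"pos": 0.0, "neg": 0.0}, next_logp, t=1)
```

The cost is one pass over the vocabulary per class, regardless of how many candidate tokens are scored, which is the source of the efficiency gain over feeding each candidate into a separate classifier.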

We applied weighted decoding and filtering heuristics to use $P_{\theta}(c|x_{t},x_{<t})$ to guide generation, which worked well in practice in our experiments but are not necessarily optimal; there are many possible ways to use the classification signal given by GeDi to guide generation. Our initial heuristic applies a weighted posterior given by

$$P_{w}(x_{t}|x_{<t},c)\propto P_{LM}(x_{t}|x_{<t})\,P_{\theta}(c|x_{t},x_{<t})^{\omega} \tag{6}$$

where $\omega>1$ biases generation more strongly towards the correct class. The right hand side of Equation (6) is normalized over all $x_{t}$ in the vocabulary to obtain $P_{w}(x_{t}|x_{<t},c)$ .
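The weighted posterior can be sketched as follows. This is an illustrative sketch under stated assumptions: `weighted_posterior` is a hypothetical helper, the probabilities are made up, and a small $\omega$ is used so the arithmetic is easy to check by hand.

```python
def weighted_posterior(lm_probs, class_probs, omega):
    """Re-weight the base LM distribution by the class probability.

    lm_probs:    {token: P_LM(x_t | x_<t)} from the base language model.
    class_probs: {token: P(c | x_t, x_<t)} from the GeDi.
    omega:       exponent > 1 that sharpens the pull towards class c.
    Returns the distribution normalized over the vocabulary.
    """
    scores = {
        token: lm_probs[token] * class_probs[token] ** omega
        for token in lm_probs
    }
    z = sum(scores.values())
    return {token: score / z for token, score in scores.items()}


# Made-up numbers: the base LM prefers "awful", the guide prefers "great".
lm_probs = {"great": 0.3, "awful": 0.5, "the": 0.2}
class_probs = {"great": 0.9, "awful": 0.1, "the": 0.5}
p_w = weighted_posterior(lm_probs, class_probs, omega=2)
```

With larger exponents and very small class probabilities, an implementation would more safely work with log-probabilities to avoid underflow.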

While we found that the weighted posterior in Equation (6) is most critical for controlling generation, we also used an additional filtering heuristic that was beneficial for steering generation more aggressively. This heuristic, inspired by nucleus sampling (Holtzman et al., 2020), removes candidate next word tokens with lower values for $P_{\theta}(c|x_{t},x_{<t})$ while maintaining a minimum of at least $\rho$ in cumulative probability mass in $P_{w}(x_{t}|x_{<t},c)$ . We define ${\mathcal{V}}_{n}$ as the set of $n$ tokens with the highest $P_{\theta}(c|x_{t},x_{<t})$ . We define $m$ as the minimum $n$ such that

$$\sum_{x_{t}\in{\mathcal{V}}_{n}}P_{w}(x_{t}|x_{<t},c)\geq\rho \tag{7}$$

We define ${\mathcal{V}}_{m}$ as ${\mathcal{V}}_{n}$ for $n=m$ , meaning that ${\mathcal{V}}_{m}$ will contain the minimum number of tokens possible at the head of the distribution for $P_{\theta}(c|x_{t},x_{<t})$ to maintain a minimum cumulative probability of $\rho$ in $P_{w}(x_{t}|x_{<t},c)$ .

We define another set of tokens to keep, ${\mathcal{V}}_{p}\subseteq{\mathcal{V}}$ , which maintains all tokens where $P_{\theta}(c|x_{t},x_{<t})>\tau$ . The motivation is that if we are acceptably sure that the resulting sequence from generating a token is in the correct class, there is no need to filter it. The final set of tokens to keep are then given by ${\mathcal{V}}_{k}={\mathcal{V}}_{p}\cup{\mathcal{V}}_{m}$ . We then zero out probabilities of tokens not in ${\mathcal{V}}_{k}$ and re-scale the remaining distribution to sum to $1$ .
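The filtering heuristic can be sketched as below. This is an illustration under assumptions, not the reference implementation: `gedi_filter` is a hypothetical name and the input distributions are invented. It builds ${\mathcal{V}}_{m}$ by walking tokens in order of decreasing class probability until their mass under $P_{w}$ reaches $\rho$ , adds every token whose class probability exceeds $\tau$ , and renormalizes.

```python
def gedi_filter(p_w, class_probs, rho, tau):
    """Zero out tokens outside V_k = V_p ∪ V_m and renormalize.

    p_w:         {token: P_w(x_t | x_<t, c)}, the weighted posterior.
    class_probs: {token: P(c | x_t, x_<t)} from the GeDi.
    rho:         minimum cumulative P_w mass that V_m must retain.
    tau:         class-probability threshold above which tokens are kept.
    """
    # V_m: smallest head of the class-probability ranking with mass >= rho.
    ranked = sorted(p_w, key=lambda token: class_probs[token], reverse=True)
    keep, mass = set(), 0.0
    for token in ranked:
        keep.add(token)
        mass += p_w[token]
        if mass >= rho:
            break
    # V_p: tokens we are already acceptably sure about.
    keep |= {token for token in p_w if class_probs[token] > tau}
    z = sum(p_w[token] for token in keep)
    return {token: (p_w[token] / z if token in keep else 0.0) for token in p_w}


# Made-up distributions over a four-token vocabulary.
p_w = {"great": 0.5, "fine": 0.3, "the": 0.15, "awful": 0.05}
class_probs = {"great": 0.95, "fine": 0.85, "the": 0.5, "awful": 0.05}
filtered = gedi_filter(p_w, class_probs, rho=0.2, tau=0.8)
```

In this example ${\mathcal{V}}_{m}$ contains only "great", the threshold adds "fine", and the remaining two tokens are removed.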

GeDi Training

The previous section presented a method for using a CC-LM as a GeDi to guide the generation of another LM. However, previous work shows that generative classifiers are generally inferior to discriminative ones when trained on large datasets (Ng & Jordan, 2002; Yogatama et al., 2017). For this reason, we propose training CC-LMs discriminatively as classifiers with GeDi training, with the primary goal of making them better discriminators for GeDi-guided generation. We also have a secondary goal of making them better at directly generating; a CC-LM that can correctly classify sequences via Equation (5) may be better at generating sequences in the desired class. The idea of discriminatively training class-conditional generative models has previously been considered for the classification of text (Yakhnenko et al., 2005), and images (Lasserre et al., 2006).

With GeDi training, we combine the standard generative language modeling loss $\mathcal{L}_{g}$ from Equation (3) with a discriminative loss $\mathcal{L}_{d}$ , defined as:

$$\mathcal{L}_{d}=-\frac{1}{N}\sum_{i=1}^{N}\log P_{\theta}(c^{(i)}|x^{(i)}_{1:T_{i}}) \tag{8}$$

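$P_{\theta}(c^{(i)}|x^{(i)}_{1:T_{i}})$ is derived from an offline version of Equation (5) given by

$$P_{\theta}(c^{(i)}|x^{(i)}_{1:T_{i}})=\frac{P(c^{(i)})\,P_{\theta}(x^{(i)}_{1:T_{i}}|c^{(i)})^{\alpha/T_{i}}}{\sum_{c^{\prime}}P(c^{\prime})\,P_{\theta}(x^{(i)}_{1:T_{i}}|c^{\prime})^{\alpha/T_{i}}} \tag{9}$$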

where $c^{\prime}\in\{c^{(i)},\bar{c}^{(i)}\}$ for the binary case (where $c^{(i)}$ is the correct class and $\bar{c}^{(i)}$ is the incorrect class for the $i$th sequence), $P(c)=\frac{e^{b_{c}}}{\sum_{c^{\prime}}e^{b_{c^{\prime}}}}$ (where $b_{c}$ is a learnable class bias which we omit when class distribution is roughly equal), $\alpha$ is a learnable scale parameter, and $P_{\theta}(x_{1:T_{i}}^{(i)}|c^{(i)})$ is given by Equation (2) for CC-LMs. The cost function for GeDi training $\mathcal{L}_{gd}$ is then given by

$$\mathcal{L}_{gd}=\lambda\mathcal{L}_{g}+(1-\lambda)\mathcal{L}_{d} \tag{10}$$

where $\lambda$ is a hyper-parameter. In GeDi training, the discriminative loss $\mathcal{L}_{d}$ is aimed at increasing classification accuracy, whereas the generative loss $\mathcal{L}_{g}$ likely helps the CC-LM have better calibrated token probabilities for guided generation.
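The hybrid objective can be sketched numerically as follows. This is a hedged sketch: `gedi_loss` is a hypothetical function, the per-token log-probabilities are supplied by hand rather than produced by a CC-LM, the class bias is omitted (as for roughly balanced classes), and averaging the generative loss per token within each sequence is an assumption of this example.

```python
import math


def gedi_loss(token_logps, labels, lam, alpha=1.0):
    """Return (L_gd, L_g, L_d) for a batch.

    token_logps: one dict per sequence, {class: [log P(x_t | x_<t, class)]}.
    labels:      the correct class for each sequence.
    lam:         mixing weight; lam = 1 recovers purely generative training.
    """
    n = len(labels)
    loss_g = 0.0
    loss_d = 0.0
    for logps, label in zip(token_logps, labels):
        length = len(logps[label])
        # Generative term: negative log-likelihood under the correct class.
        loss_g -= sum(logps[label]) / length
        # Discriminative term: Bayes rule over length-normalized scores.
        logits = {c: alpha * sum(v) / length for c, v in logps.items()}
        m = max(logits.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        loss_d -= logits[label] - log_z
    loss_g /= n
    loss_d /= n
    return lam * loss_g + (1 - lam) * loss_d, loss_g, loss_d


# One two-token sequence whose correct class is "pos" (made-up numbers).
batch = [{"pos": [math.log(0.5)] * 2, "neg": [math.log(0.25)] * 2}]
total, loss_g, loss_d = gedi_loss(batch, ["pos"], lam=0.6)
```

In a real training loop both terms would be computed from the same two class-conditional forward passes and differentiated with respect to $\theta$ and $\alpha$ .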

Related Work

Methods for controlling text generation can be categorized broadly into two types: training or finetuning a model directly for controllable generation (Keskar et al., 2019; Ziegler et al., 2019; Rajani et al., 2019; Ficler & Goldberg, 2017; Yu et al., 2017; Hu et al., 2017) or using a discriminator to guide generation (Ghazvininejad et al., 2017; Holtzman et al., 2018; Dathathri et al., 2020). Keskar et al. (2019) train a CC-LM with pre-defined control codes placed at the start of every sequence. Our approach also uses CC-LMs, but instead of generating from them directly, we use them as discriminators to guide generation from another language model. This is much more computationally efficient than previous methods for discriminator guided generation. Holtzman et al. (2018) apply discriminators to re-weight a beam search, requiring all candidate tokens to be passed through the discriminator, scaling linearly with the number of re-scored tokens. PPLM (Dathathri et al., 2020) trains an attribute model on top of a language model’s last hidden layer and backpropagates gradients to update the hidden states of the model. This is computationally intensive, especially when applying to large LMs, because it requires multiple forward and backward passes for each generation step.

GeDi also relates to contrastive learning (Smith & Eisner, 2005; Mnih & Teh, 2012). Most existing contrastive learning methods work at the instance level by contrasting one positive pair with $k$ negative pairs, whereas GeDi works at the class level and contrasts a positive class-conditional distribution against a negative one. GeDi also uses the contrast between positive and negative distributions for both training (i.e., GeDi training) and inference (i.e., contrastive generation).

Experiments

Our experiments finetune GPT2-medium (345M parameter) (Radford et al., 2019) with control codes specific to each task to form a class-conditional language model. We consider finetuning using GeDi training ( $\lambda<1$ in Equation (10)) and standard generative training ( $\lambda=1$ in Equation (10)). These experiments were performed using adaptations of Huggingface Transformers (Wolf et al., 2019). We study the trade-offs between GeDi vs generative training for classification, perplexity, and direct generation in depth in Appendix E. We find that GeDi trained CC-LMs have a higher generative classification accuracy at the cost of a higher perplexity. We also find that GeDi-trained CC-LMs are able to achieve a higher label fidelity across generation tasks, meaning that the control code more often corresponds to the true attribute of the generated sample.

In our main experiments, we use these CC-LMs as GeDis to guide generation from GPT2-XL (1.5B parameter). For generation, we use greedy decoding with a repetition penalty (Keskar et al., 2019), and condition on varying prompts to give diversity across samples. Additional details about the way we apply a repetition penalty are given in Appendix C, and our hyper-parameter settings for GeDi-guided generation, which were shared across most experiments, are given in Appendix D.1. We experiment with GeDi-guided generation for sentiment, detoxification, and topic control.

In our sentiment experiments, we compare direct generation from CC-LMs vs. using CC-LMs as GeDis. We refer to direct generation simply as “CC-LM” (using $\lambda=1$ to specify generative training and $\lambda<1$ to specify GeDi training), and we refer to GeDi-guided generation using a CC-LM to guide GPT-2 as “GeDi-guided” (also using $\lambda$ to specify generative/GeDi training).

We experiment with GeDi-guided generation from GPT-2 for sentiment control. For these experiments, we use CC-LMs finetuned on IMDb movie reviews using both GeDi and generative training (reused from Appendix E). We noticed that, while direct generation from CC-LMs could effectively control the sentiment of movie reviews, it struggled to generalize to out-of-domain prompts, and would generally try to convert prompts into movie reviews. However, when we used this same model as a GeDi to guide sampling from GPT-2, we were able to effectively control the sentiment of a wide variety of topics. For instance, in our preliminary experiments, we considered the prompt “I just read this paper on Generative-Discriminative training.” in Table 6 and it results in text that mentions well known deep learning ideas and researchers while also controlling sentiment.

To experimentally verify that GeDi can achieve domain transfer of the concepts of “positivity” and “negativity”, we consider a book text generation task where we conditionally generate text from the start of book chapters from Bookcorpus (Zhu et al., 2015), where each prompt is at least 150 characters and ends on the first word break after the minimum length. We run human evaluation on generations from 50 different book prompts from 13 different models, including raw GPT2-XL and the following models with both positive and negative sentiment: 1. GPT2-XL guided by a GeDi-trained CC-LM (GeDi-guided, $\lambda=0.6$ ), 2. GPT2-XL guided by a generatively-trained CC-LM (GeDi-guided, $\lambda=1.0$ ), 3. direct generation from a GeDi-trained CC-LM (CC-LM, $\lambda=0.6$ ), 4. direct generation from a generatively-trained CC-LM (CC-LM, $\lambda=1.0$ ), 5. CTRL, 6. PPLM applied to GPT2-XL. See Appendices D.2 and D.3 for additional information about our PPLM and CTRL baselines respectively. We found that it was more than $30\times$ faster to guide GPT2-XL with a GeDi as compared with PPLM (assuming 10 update steps as used in Dathathri et al. (2020)), as shown in Table 1.

Amazon Mechanical Turk annotators rated the generated text on sentiment/tone, how book-like the text was, and whether or not the text resembled an Amazon review or movie review (since CTRL was trained on Amazon reviews and GeDi was trained on movie reviews). Each annotator was randomly assigned samples from the set of all generations from all models. The results are given in Table 2. Using a GeDi-trained CC-LM to guide GPT2-XL produced book-like text while strongly controlling the tone. GeDi also gave slightly stronger sentiment control than PPLM, in addition to being more than $30\times$ faster.

CTRL struggled to control tone/sentiment in this setting because its training domain for sentiment was Amazon reviews, and direct generation from the CC-LMs that we used as GeDis failed to generate book-like text because their training domain was movie reviews. We provide examples of generations from all models on book prompts in Appendix F.1. Table 13 specifically shows how CTRL tends to generate Amazon reviews and how the generative and GeDi-trained CC-LMs tend to generate movie reviews. Using these same CC-LMs as GeDis to guide generation led to book-like text, demonstrating domain transfer of the concepts of positivity and negativity.

Detoxifying GPT-2

With the motivation of detoxifying GPT-2, we train a CC-LM as a toxicity classifier on the Jigsaw Toxic Comment Classification Challenge Dataset (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/), which contains text samples labeled as “toxic” or “non-toxic”. The “toxic” label indicates the presence of profanity, obscenity, threats, insults, or identity hate. We train models on an even split of toxic and non-toxic examples. We use toxic examples from the Jigsaw dev set to find prompts to condition on for evaluation. We used prompts that ended on a word break and were at least 30 characters. In order to have prompts that are more likely to trigger aggressive generations but less likely to be explicitly toxic, we pass candidate prompts through a RoBERTa (Liu et al., 2019) model trained to classify toxicity, and only kept prompts where RoBERTa was less confident about the toxicity label. We generate samples from these prompts using GeDi-guided generation with a GeDi-trained guide ( $\lambda=0.6$ ) and a generatively trained guide ( $\lambda=1.0$ ).

We run human evaluation to measure toxicity [1: non-toxic, 2: mildly toxic, 3: toxic] and linguistic quality [1: very low quality, 2: low quality, 3: high quality, 4: very high quality]. Results are given in Table 3. GeDi-guided generation resulted in significantly less toxic text for both values of $\lambda$ , with the GeDi-trained GeDi guide ( $\lambda=0.6$ ) achieving the highest linguistic quality of all models.

Multi-class topic control

We extend GeDi to the multi-class setting by training it to classify whether or not a sequence matches a topic. This can be done with a CC-LM by training it to condition on a “true” and a “false” control code, where sequences have the name of a topic prepended. The “true” control code corresponds to sequences where the prepended topic matches the sequence, and the “false” control code corresponds to sequences where the prepended topic does not match the text. For generation, the desired attribute is set to “true” and the prompt is prepended with the desired topic. Refer to Appendix A for additional details. We use the AG news topic classification data set (Zhang et al., 2015), which has 4 topics (World, Sports, Business, and Science/Tech), to train GeDis with 6 values of $\lambda$ (including $\lambda=1$ ). We only train the CC-LMs on half of the dataset and train a RoBERTa classifier on the other half to measure label fidelity. After training, we applied each GeDi to guide generation from GPT2-XL. We use prompts of at least 50 characters from the multi-news dataset (Fabbri et al., 2019) to condition on for generation. The prompt often will not fit with the desired topic, sometimes creating a challenge for the model to relate the article to the desired attribute. We first measured automatic label fidelity as given by the RoBERTa classifier, and we found the generatively trained GeDi guide ( $\lambda=1$ ) achieved a significantly lower automatic label fidelity (61% for $\lambda=1$ vs. 74% for $\lambda=0.8$ ), suggesting that GeDi training may be important for extending GeDi-guided generation to many control codes using the proposed binarization method.

We ran human evaluation on samples from the 4 news topics comparing our strongest GeDi guide (we chose $\lambda=0.8$ based on automatic label fidelity) and raw GPT2-XL. Annotators were given the topic and asked to rate samples on topic relevance and linguistic quality. The results are given in Table 4. GeDi-guided generation gave text with high relevance for all 4 topics while maintaining a similar level of linguistic quality to GPT2-XL. We give examples of GeDi topic generation in Appendix F.3.

For topic training, we prepended the words “world”, “sports”, “business”, and “science” to sequences. However, at inference, any word could potentially be prepended to the prompts. We observed that the GeDi, trained for only several hours on a single GPU on 4 topics, could guide GPT-2 towards generating text corresponding to a very wide array of topics that included “space”, “history”, “education”, “cars”, “climate” and many more. This zero-shot behavior worked very well for short, topic neutral prompts, as shown for the prompt “In a shocking finding” in Appendix F.4, but did not work as well for longer prompts. We also only tested topics that could be encoded with 1 byte-pair encoding (Sennrich et al., 2015) token, since this was the case for all our training topics. However, this zero-shot behavior could likely apply to longer control codes if trained on longer control codes. We also compare with zero-shot topic generation using CTRL in Table 21 as a baseline and find that despite being trained on significantly more topics, CTRL struggles to generate text corresponding to control codes it has never seen during training.

GeDi’s ability to generalize to new control codes zero-shot gives the ability to generate text corresponding to many topics and subtopics. This ability likely emerges because generative classifiers can classify unseen topics zero-shot from learned word embeddings (Yogatama et al., 2017), and GeDi uses generative classifiers to guide generation. This is another advantage of GeDi over the previous discriminator guided generation approaches.

Future directions

Methods to make large LMs like GPT-3 (Brown et al., 2020) safer and more controllable are becoming especially important as LMs become incorporated into products. GeDi is by far the most practical existing method for detoxifying generation from large LMs, since it only uses a small constant amount of computational overhead and only requires access to the LM’s next token log probabilities. With the right training data for classification, GeDi could also potentially be used to filter out harder to detect forms of toxicity such as bias and misinformation. Extending the methods in this paper, multiple GeDis trained to filter out different undesirable attributes could be combined, for instance by multiplying the attribute classification terms from several different discriminators in Equation (6). In addition to making LMs safer, GeDi could potentially be used to guide generation towards other desirable attributes such as high linguistic quality and improved commonsense reasoning. Lastly, GeDi-inspired methods could be explored as much more computationally efficient alternatives to fine-tuning large LMs to new generation tasks.

Conclusion

We present GeDi as an approach for controllable generation that uses generative discriminators to classify candidate next tokens on the fly during inference, making it far more efficient than previous methods that use discriminators to guide generation. GeDi achieves stronger controllability of sentiment than PPLM while also giving a generation speed more than $30\times$ faster. GeDis trained on 4 topics can also controllably generate new topics zero-shot from just a key word, unlocking a new capability that previous controllable generation methods like PPLM and CTRL do not have. We also show that GeDi is able to significantly reduce the toxicity of GPT-2 without sacrificing linguistic quality. The ethical considerations of language modeling are becoming more important as LMs like GPT-3 become incorporated into products, and GeDi is far more promising than any previous approach for detoxifying large language models while maintaining a fast generation speed. This work also moves towards unifying natural language generation with classification, and suggests that we may be able to efficiently generate text that corresponds to any attribute that we can accurately classify. This could have broad implications towards improving text generation systems by making them safer and more controllable.

Ben thought of the main ideas and designed the research. Ben and Akhilesh coded the implementation. Akhilesh maintained the codebase, set up automatic and human evaluation experiments, and organized results. Nazneen advised on detoxification experiments. All authors contributed to writing and discussions.

Acknowledgments

The authors thank Semih Yavuz and Yu Bai for helpful discussions and feedback on this project.

References

Appendix A Multi-class GeDi

Both GeDi-guided generation and GeDi training use CC-LMs to perform classification. The most straightforward way to extend this to many classes is to have one forward pass conditioned on each control code and normalize over a larger number of classes via Equation (5) (which we in fact do for 3-class MNLI in Appendix E). However, this approach does not scale well computationally to large numbers of classes. As a solution, we propose reframing each classification task as binary classification using control codes and anti-control codes for each class. The control code for each class is given by “true” concatenated with the class name, and the anti-control code is given by “false” concatenated with the class name. The CC-LM then classifies whether the class name corresponds to the text. For instance, the CC-LM would process the following two sequences in parallel:

“true” “science” T-rex achieved its massive size due to an enormous growth spurt during its adolescent years.

“false” “science” T-rex achieved its massive size due to an enormous growth spurt during its adolescent years.

and would classify it as true or false as to whether the class (in this case “science”) matches the category of the text by using Equation (9). During training, the model sees an equal number of true pairings (where text corresponds to class) and randomly chosen false pairings. After the model has been trained, binary GeDi-guided generation can be applied, using $c=$ “true” and $\bar{c}=$ “false”, and using the desired class name as the first token ( $x_{1}$ ) in the sequence. This also makes it possible to form new control codes zero-shot; a new topic word that was never seen before in training can be chosen in place of $x_{1}$ .
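The binarized training data described above could be assembled along these lines. This is a minimal sketch with hypothetical names (`binarize`, the tuple layout, and the plain-string control codes are all assumptions, since the exact token format is not given here): each labeled text yields one true pairing with its own topic and one false pairing with a randomly chosen different topic.

```python
import random


def binarize(examples, topics, rng):
    """Turn (topic, text) pairs into (control_code, topic_word, text) triples.

    Every text appears once with its real topic under the "true" code and
    once with a randomly chosen wrong topic under the "false" code, so the
    two kinds of pairing are equally frequent.
    """
    triples = []
    for topic, text in examples:
        triples.append(("true", topic, text))
        wrong_topic = rng.choice([t for t in topics if t != topic])
        triples.append(("false", wrong_topic, text))
    return triples


topics = ["world", "sports", "business", "science"]
examples = [
    ("science", "T-rex achieved its massive size due to an enormous growth spurt."),
]
pairs = binarize(examples, topics, random.Random(0))
```

At generation time only the "true" and "false" codes vary between the two forward passes, while the topic word is shared, which is what allows an unseen topic word to be substituted.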

Appendix B GeDi with log probabilities

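GeDi-guided generation and GeDi training both use language models discriminatively via Bayes rule by using

$$P_{\theta}(c|x_{1:T})=\frac{P(c)\,P_{\theta}(x_{1:T}|c)^{\alpha/T}}{\sum_{c^{\prime}}P(c^{\prime})\,P_{\theta}(x_{1:T}|c^{\prime})^{\alpha/T}}$$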

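For GeDi-guided generation, this is computed online for partial sequences during generation, whereas for GeDi training, it is computed for full training sequences. For numerical stability, we compute this using log-probabilities. Log-probabilities for each class are given by

$$\log P_{\theta}(x_{1:T}|c)=\sum_{t=1}^{T}\log P_{\theta}(x_{t}|x_{<t},c)$$

and the class probability is given by

$$P_{\theta}(c|x_{1:T})=\frac{\exp\left(b_{c}+(\alpha/T)\log P_{\theta}(x_{1:T}|c)\right)}{\sum_{c^{\prime}}\exp\left(b_{c^{\prime}}+(\alpha/T)\log P_{\theta}(x_{1:T}|c^{\prime})\right)}$$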

This can be computed in a numerically stable way using a softmax (Bridle, 1990), since the maximum logit to the softmax can be subtracted out before taking the exponent without changing the result. For the two class case (all of our experiments except for MNLI, which was 3-class), $c^{\prime}\in\{c,\bar{c}\}$ , meaning that the above equation could have been equivalently computed using a sigmoid of the difference of the logs of the two terms in the denominator sum (but our implementation used softmax as above).
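The stability argument and the two-class equivalence can be checked with a short sketch. The helper names `stable_softmax` and `sigmoid` and the example scores are illustrative assumptions. The scores are large negative log-probabilities whose exponentials would underflow to zero if computed naively.

```python
import math


def stable_softmax(logits):
    """Softmax that subtracts the maximum logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Hypothetical scaled sequence log-probabilities under c and its anti-class.
score_c, score_anti = -1200.0, -1203.0
p_softmax = stable_softmax([score_c, score_anti])[0]
p_sigmoid = sigmoid(score_c - score_anti)
```

Both routes give the same class probability, while `math.exp(-1200.0)` on its own evaluates to `0.0` in double precision and would make the unnormalized ratio undefined.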

Appendix C Generation settings

When comparing the quality of samples from different language models, there is a trade-off between quality and diversity; models that tend to have more sharply peaked distributions for $P_{\theta}(x_{t}|x_{<t},c)$ will tend to have higher quality samples, but will also have less diversity. Applying GeDi results in more sharply peaked distributions due to the filtering step, which zeros out probabilities for some tokens. In order to ensure a fair comparison of models, we only use greedy decoding for our experiments, meaning we always pick the most likely token in the model’s predictive distribution. With greedy decoding, the model would generate the same text sequence every time without any conditioning text. Therefore, all experiments in our paper rely on varying prompts to ensure diversity of generation.

We also apply a repetition penalty (Keskar et al., 2019), which we found necessary for preventing degeneration with greedy decoding. Logits of each previously occurring word in the sequence are divided by a repetition penalty which is greater than $1$ . To account for the possibility of negative logits, we re-scaled the final logits in all models to always have a maximum of $10$ across the vocabulary before dividing by the repetition penalty. We used a repetition penalty of $1.2$ in all models in our experiments.
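A possible reading of this procedure is sketched below. The function `apply_repetition_penalty` is hypothetical, and interpreting "re-scaled" as adding a constant so that the largest logit equals 10 is an assumption of this example (a constant shift leaves the softmax distribution unchanged); the text above does not spell out the exact operation.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.2, target_max=10.0):
    """Shift logits so their maximum is target_max, then penalize repeats.

    logits:          {token: logit} for the next position.
    previous_tokens: tokens already present in the sequence.
    penalty:         divisor (> 1) applied to logits of repeated tokens.
    """
    # Assumed re-scaling: a constant shift, which does not change softmax.
    shift = target_max - max(logits.values())
    shifted = {token: value + shift for token, value in logits.items()}
    seen = set(previous_tokens)
    return {
        token: (value / penalty if token in seen else value)
        for token, value in shifted.items()
    }


# Made-up logits: "the" would win greedy decoding, but it was already used.
logits = {"the": 2.0, "cat": 1.0, "sat": 0.5}
penalized = apply_repetition_penalty(logits, previous_tokens=["the"])
next_token = max(penalized, key=penalized.get)
```

Shifting first matters because dividing a negative logit by a value greater than 1 would raise it rather than lower it; after the shift the most likely tokens all have positive logits.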

Appendix D Additional model and hyper-parameter details

We selected hyper-parameter settings using a combination of manual inspection of generation quality and automatic label fidelity metrics given by a RoBERTa classifier (Liu et al., 2019) trained on an external training set (where a label fidelity of 100% means that RoBERTa always agrees that the class label is the same as the control code). All of our GeDi models except for the GeDi-trained detoxification model use the same generation hyper-parameters ($\omega=30$, $\rho=0.2$, $\tau=0.8$), which we found to work well across tasks and across the values of $\lambda$ used for training.

Using the hyper-parameters above, we initially found that the GeDi-trained detoxification guide would sometimes produce very short samples that cut off mid-sentence. Since GeDi operates discriminatively at the token level, it cannot confidently classify a sequence as non-toxic until the sequence has finished, which was likely causing the model to finish sequences early to ensure that they would not become toxic. To fix this problem, we manually added a bias parameter $b_{c}=2$ as per Equation (5) so that the model would have a prior probability that assumes the sequence is non-toxic. Doing this also required us to increase $\tau$ to $0.97$ to account for $P(c|x_{1:t})$ being higher with the bias parameter, since otherwise far fewer tokens would be filtered and samples could become toxic. All other hyper-parameters remained unchanged.
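To give a rough sense of how much a bias of this size moves the class probability, the sketch below uses the two-class sigmoid form. It is an illustration under stated assumptions, not the paper's Equation (5): the bias is assumed to be added only to the log-domain score of the desired class $c$ (with zero bias for $\bar{c}$), and `log_ratio` is a hypothetical stand-in for the difference between the two classes' log-domain scores.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior_with_bias(log_ratio, bias_c=0.0):
    """Two-class P(c | x_{1:t}) with an additive bias on class c.

    log_ratio: assumed log-score of c minus log-score of c-bar.
    bias_c: assumed to apply to class c only (c-bar has zero bias).
    """
    return sigmoid(log_ratio + bias_c)

# With uninformative evidence (log_ratio = 0) the posterior equals the prior.
no_bias = posterior_with_bias(0.0)               # 0.5
with_bias = posterior_with_bias(0.0, bias_c=2.0)  # about 0.88
```

Under these assumptions, a token for which the evidence is neutral already receives a class probability of roughly 0.88, above the original threshold of 0.8 and below the raised threshold of 0.97. This is consistent with the observation that $\tau$ had to be increased once the bias was added.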

D.2 Baseline details for PPLM

For PPLM, we trained the external classifier (which uses logistic regression on top of representations from GPT-2) on the SST-5 data set. We made this choice after struggling to achieve comparably strong results when training on IMDb (which is what GeDi was trained on), and on the advice of the PPLM authors. For generation, we used greedy decoding with a repetition penalty applied in the same way as described in Appendix C. We applied additional hyper-parameter tuning because we were guiding generation from GPT2-XL (whereas the original PPLM work uses GPT2-medium). Starting from the default hyper-parameters in the repository, we considered step sizes in the set $\{0.04,0.08,0.16,0.25,0.35\}$, and found that $0.25$ gave the best trade-off between sentiment control and generation quality, so we used this value for our experiments.

D.3 Baseline details for CTRL

For CTRL, we prepended prompts with the control codes for negative and positive Amazon reviews, which are “Reviews Rating: 1.0” and “Reviews Rating: 5.0” respectively. We also tried “Books Rating:” as a prompt that mixes the control code for sentiment and books; however, we found that there was very little variation between the samples generated for positive and negative sentiment (generation was usually identical for several sentences before deviating), and no noticeable impact on sentiment, tone, or mood.

Appendix E Experiments with GeDi training

Our initial experiments train and benchmark GeDi-trained CC-LMs for classification, perplexity, and direct generation, in preparation for using them for GeDi-guided generation in Section 5. All our experiments augment GPT2-medium (345M parameters) (Radford et al., 2019) with control codes specific to each task to form a class-conditional language model. We then fine-tune this model on different sequence classification datasets with the hybrid GeDi objective from Equation (10). To understand the trade-offs between generative and discriminative training, we explore $\lambda$ values between $0$ and $1$, where $\lambda=1$ is equivalent to generative training and is the main baseline for these initial experiments. Once fine-tuned, we decode samples from the model by conditioning on the control code corresponding to the required attribute and on prompts from the dev set for each task. We use greedy decoding and a repetition penalty for generation (see Appendix C for details). On each task, we measure the perplexity, classifier accuracy, and label fidelity across all values of $\lambda$. Our task set consists of:

IMDb (Maas et al., 2011): We test the model’s ability to generate movie reviews with positive and negative sentiment when conditioned on the first $\sim$ 100 characters (up to the next word-break after 100 characters) of a review (which may or may not match the control code).

MNLI (Williams et al., 2017): We test the model’s ability to generate contradictions and entailments when conditioned on a premise.

QNLI (Wang et al., 2018): We test the model’s ability to generate passages that contain the answers to a question given in conditioning text.

We include the two NLI tasks because they require a greater degree of logical reasoning, potentially making them more difficult.

To evaluate the label fidelity of direct generation from GeDi-trained CC-LMs in an automatic manner, we use an external classifier trained on the given task to classify conditionally generated samples. This entails splitting each training data set in half, training the generator model on one half (split A), and the external classifier on the other half (split B). When evaluating the label fidelity of a generator, the generator is given prompts and labels (to be used as control codes) from the validation set to conditionally generate text. The prompt and generated text are then given as input to the classifier, which predicts the label. The label fidelity is then the percentage of samples for which the label predicted by the classifier matches the control code that the generator received as input. It is more valid to use a classifier and generator trained on separate splits of the training data because otherwise both models could fit to the same spurious correlations in the training set, which would overestimate label fidelity. For this external model-based evaluation, we use RoBERTa models (Liu et al., 2019) trained on the respective classification tasks, as we found that RoBERTa learned significantly stronger classifiers from the half datasets than BERT (Devlin et al., 2018).
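The label fidelity metric defined above amounts to a simple agreement percentage. The following sketch assumes the control codes and the external classifier's predictions are already available as parallel lists; the function name and the toy labels are hypothetical.

```python
def label_fidelity(control_codes, predicted_labels):
    """Percentage of samples whose predicted label matches the control code.

    control_codes: labels the generator was conditioned on.
    predicted_labels: labels the external classifier assigned to the
        corresponding prompt + generated text.
    """
    if len(control_codes) != len(predicted_labels):
        raise ValueError("inputs must have the same length")
    if not control_codes:
        raise ValueError("at least one sample is required")
    matches = sum(c == p for c, p in zip(control_codes, predicted_labels))
    return 100.0 * matches / len(control_codes)

# Toy example: three of four predictions agree with the control code.
codes = ["entailment", "contradiction", "entailment", "contradiction"]
preds = ["entailment", "neutral", "entailment", "contradiction"]
fidelity = label_fidelity(codes, preds)
```

Any predicted label that differs from the control code counts as a miss, so here `fidelity` is 75.0.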

The label fidelity, classification accuracy, and perplexity for the 3 tasks are reported in Figures 2, 3 and 4 respectively. As expected, using a higher $\lambda$, which makes training closer to generative training, improves perplexity on held-out sequences across tasks. Also as expected, we found that $\lambda<1.0$ (using a partially discriminative loss, i.e. GeDi training) improved classification performance across tasks. We note that PPLM’s attribute classifier struggles on NLI tasks, whereas GeDi-trained CC-LMs can nearly match the performance of BERT. This suggests that GeDi may be applicable to significantly more difficult controllable generation tasks. We also found that using GeDi training led to higher label fidelity for CC-LMs across tasks compared with generative training.

Following up on our automatic evaluation, we perform human evaluation on the generated MNLI contradictions and entailments to verify the observed label fidelity improvements and to compare the generation quality of GeDi training against standard generative training of CC-LMs. For each sample, we ask human annotators to predict the class label and rate the sample for linguistic acceptability. We obtain annotations for $300$ generations from each model, with half conditioning on “contradiction” and half conditioning on “entailment”.

Each annotator is randomly assigned a set of samples from all 5 models. Annotators rate the linguistic acceptability of each sample on a scale from 1 to 4 [1: highly unacceptable, 2: unacceptable, 3: acceptable, 4: highly acceptable]. They also label each pair of premise and generated hypothesis as one of [“contradiction”, “neutral”, “entailment”] (note that since we only generate from “contradiction” and “entailment” control codes, anything marked as “neutral” counts against label fidelity). The results are given in Table 5.

GeDi-trained CC-LMs achieved higher label fidelity than generatively trained models without a noticeable loss in average linguistic acceptability. While the sample quality and label fidelity for individual prompts varied between GeDi and generative training, these results show that on average GeDi-trained models generated samples that matched the label of the control code more often.

Appendix F Generation samples

We provide samples from a variety of prompts and models, where the boldfaced string indicates the context provided to the language model and is followed by the model's generation. All generations use greedy decoding and are thus deterministic for each prompt for a given model.