DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, Yejin Choi

Introduction

Controlling the output of pretrained language models (LMs) is crucial for achieving useful and safe language generation applications, such as non-offensive sentence completion or friendly conversation generation See et al. (2019); Sheng et al. (2020); Gehman et al. (2020). For example, a safe completion to the prompt “When she rejected his advance, he grabbed…” requires avoiding word choices that could lead to continuations with gender-based violence (e.g., “her”; Figure 1).

Without such steering, these language models risk generating mindless and offensive content Sheng et al. (2019); Holtzman et al. (2020) which hinders their safe deployment Brockman et al. (2020); Bender et al. (2021). Importantly, as the scale of pretrained LMs increases (e.g., 175B and 1.6T parameters; Brown et al., 2020; Fedus et al., 2021), finetuning or re-training approaches are becoming increasingly computationally infeasible for most researchers.

We propose DExperts (short for Decoding-time Experts; our code is available at https://github.com/alisawuffles/DExperts), a decoding-time method for controlled text generation based on a product of experts Hinton (2002). Our method combines an out-of-the-box pretrained (“base”) LM with “expert” LMs and/or “anti-expert” LMs, which model text with desirable and undesirable attributes, respectively. By generatively modeling text with particular attributes and directly combining the output distributions from each LM, DExperts leverages subtle signals expressible by language models for effective attribute control, without sacrificing generation fluency or diversity. Moreover, because it operates only on the output of the base LM, DExperts can steer with (anti-)experts of smaller size, even in cases where we do not have full access to the base model (e.g., GPT-3 through an API).

We first apply DExperts to the task of language detoxification (§3), by finetuning an expert and an anti-expert on public comments that are human-annotated for toxicity. Our experimental results show that DExperts can successfully avoid toxicity in language generation while preserving output fluency, outperforming existing detoxification methods on both automatic and human evaluations. Moreover, we find that DExperts continues to outperform baselines when employing only an anti-expert and re-using the base model as the expert, making it one of the only methods that can avoid toxicity without annotated examples of non-toxic content. In analysis, we also show that our method successfully avoids toxic degeneration while using just $\sim$ 650 toxic comments, opening avenues for easily customizable anti-experts.

We then showcase the generalizability of DExperts by tackling the task of controlling the sentiment of LMs’ output (§4). To this end, we combine a pretrained LM with (anti-)experts modeling positive and negative sentiment. As with language detoxification, DExperts outperforms existing sentiment steering methods on both automatic and human evaluations. Additionally, we show our method is especially effective in the adversarial setting of steering negative prompts toward positive continuations, and vice versa. Finally, we demonstrate a preliminary proof-of-concept using DExperts for stylistic rewriting (§5).

Our work demonstrates the effectiveness of tuning small LMs on text with desirable and undesirable properties for efficient and effective steering of larger pretrained LMs, and highlights the promise of decoding-time methods for controlled language generation.

Experts and Anti-Experts for Controlled Generation

Given input text as a prompt, the task of controlled text generation is to generate a continuation that flows naturally from the prompt while having the desired attribute (e.g., positive sentiment) but not an undesired one (e.g., toxicity).

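A language model conditioned on the prompt $\bm{x}_{<t}$ computes logits $\mathbf{z}_{t}\in\mathbb{R}^{|\mathcal{V}|}$ for the $t$-th token, where $\mathcal{V}$ is the vocabulary. Normalizing these logits gives a probability distribution over the vocabulary,

$$P(X_{t}\mid\bm{x}_{<t})=\text{softmax}(\mathbf{z}_{t}),$$

and the next token is generated by sampling $x_{t}\sim P(X_{t}\mid\bm{x}_{<t})$ .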

DExperts operates on a pretrained language model $M$ by combining its predictions with an expert $M^{+}$ , which models text with a desirable attribute, and an anti-expert $M^{-}$ , which models text with an undesirable attribute. At time step $t$ , we condition each language model $M$ , $M^{+}$ , and $M^{-}$ on the prompt $\bm{x}_{<t}$ to obtain $\mathbf{z}_{t},\mathbf{z}^{+}_{t}$ , and $\mathbf{z}^{-}_{t}$ , respectively. The product-of-experts ensemble is given by

$$\tilde{P}(X_{t}\mid\bm{x}_{<t})=\text{softmax}\left(\mathbf{z}_{t}+\alpha\left(\mathbf{z}^{+}_{t}-\mathbf{z}^{-}_{t}\right)\right)\tag{1}$$

where $\alpha$ is a hyperparameter that controls the amount of modification to $\mathbf{z}_{t}$ , and can be interpreted as the strength of control over the base model. (Though not explored in this paper, this formulation readily accommodates multiple experts and anti-experts, whose logits can be respectively added or subtracted.) Equivalently,

$$\tilde{P}(X_{t}\mid\bm{x}_{<t})\propto P(X_{t}\mid\bm{x}_{<t})\left(\frac{P^{+}(X_{t}\mid\bm{x}_{<t})}{P^{-}(X_{t}\mid\bm{x}_{<t})}\right)^{\alpha}\tag{2}$$

Intuitively, a token will only have high probability if it has high probability under both $P$ and $P^{+}$ , and low probability under $P^{-}$ . We can interpret the ratio $\frac{P^{+}(X_{t}\mid\bm{x}_{<t})}{P^{-}(X_{t}\mid\bm{x}_{<t})}$ as a scaling coefficient for each token, which is used to modify the original probability predicted for that token.
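To make the reweighting concrete, the following minimal sketch combines toy logits in the form $\text{softmax}(\mathbf{z}_{t}+\alpha(\mathbf{z}^{+}_{t}-\mathbf{z}^{-}_{t}))$ and checks numerically that this matches the probability-ratio reading $P\cdot(P^{+}/P^{-})^{\alpha}$ after renormalization. The function names and the four-token logit values are illustrative assumptions, not taken from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_probs(z, z_pos, z_neg, alpha):
    """Ensemble distribution softmax(z + alpha * (z_pos - z_neg))."""
    combined = [b + alpha * (p - n) for b, p, n in zip(z, z_pos, z_neg)]
    return softmax(combined)


def ratio_form_probs(z, z_pos, z_neg, alpha):
    """Same distribution computed as P * (P+ / P-)^alpha, then renormalized."""
    p, p_pos, p_neg = softmax(z), softmax(z_pos), softmax(z_neg)
    scores = [b * (e / a) ** alpha for b, e, a in zip(p, p_pos, p_neg)]
    total = sum(scores)
    return [s / total for s in scores]


# Toy logits over a 4-token vocabulary (hypothetical values).
z = [2.0, 1.0, 0.5, -1.0]      # base LM
z_pos = [1.0, 2.0, 0.0, -1.0]  # expert favours token 1
z_neg = [3.0, 0.0, 0.0, -1.0]  # anti-expert favours token 0

logit_form = dexperts_probs(z, z_pos, z_neg, alpha=2.0)
ratio_form = ratio_form_probs(z, z_pos, z_neg, alpha=2.0)
```

In this toy case the token preferred by the anti-expert (token 0) loses almost all of its probability mass, while the token preferred by the expert (token 1) gains it.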

2.2 Sampling from DExperts

Sampling fluent output from language models commonly requires truncating the unreliable tail of the probability distribution, as in top- $k$ Fan et al. (2018) or nucleus sampling Holtzman et al. (2020). We adapt this intuition to our method by truncating the logits $\mathbf{z}$ output by the base model prior to combining with the experts. Formally, let $\mathcal{V}^{\prime}\subset\mathcal{V}$ denote the set of tokens that are a part of the top- $k$ /top- $p$ vocabulary of the base LM at time step $t$ . The truncated logits $\mathbf{z}^{\prime}$ are given by

$$\mathbf{z}^{\prime}[v]=\begin{cases}\mathbf{z}[v]&\text{if }v\in\mathcal{V}^{\prime}\\-\infty&\text{otherwise}\end{cases}\tag{3}$$

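By substituting $\mathbf{z}$ with $\mathbf{z}^{\prime}$ in Equation 1, we have

$$\tilde{P}^{\prime}(X_{t}\mid\bm{x}_{<t})=\text{softmax}\left(\mathbf{z}^{\prime}_{t}+\alpha\left(\mathbf{z}^{+}_{t}-\mathbf{z}^{-}_{t}\right)\right)\tag{4}$$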
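The truncation step can be sketched as follows, using nucleus (top- $p$ ) filtering of the base logits as the example; top- $k$ would differ only in how the kept set is chosen. Tokens outside the base model's nucleus receive a logit of $-\infty$ and therefore probability zero, however strongly the (anti-)experts favour them. Helper names and logit values are assumptions for illustration.

```python
import math

NEG_INF = float("-inf")


def softmax(logits):
    """Stable softmax; a logit of -inf contributes exactly zero mass."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(logits, p):
    """Indices of the smallest set of top tokens whose mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return keep


def truncated_dexperts(z, z_pos, z_neg, alpha, p):
    """Truncate the base logits first, then add the (anti-)expert term."""
    keep = top_p_keep(z, p)
    z_trunc = [zi if i in keep else NEG_INF for i, zi in enumerate(z)]
    combined = [b + alpha * (e - a) for b, e, a in zip(z_trunc, z_pos, z_neg)]
    return softmax(combined)


# Token 3 lies in the tail of the base LM but is strongly favoured by the expert.
z = [2.0, 1.0, 0.5, -1.0]
z_pos = [0.0, 0.0, 0.0, 5.0]
z_neg = [0.0, 0.0, 0.0, -5.0]
probs = truncated_dexperts(z, z_pos, z_neg, alpha=2.0, p=0.9)
```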

Toxicity Avoidance

Given that large pretrained LMs are at risk of producing toxic content Sheng et al. (2019); Gehman et al. (2020), steering away from toxic “degeneration” is crucial for their safe deployment. Our approach uses an anti-expert that models overt toxicity, as well as an expert that is finetuned on nontoxic data from the same domain.

Note that while obtaining an LM that is truly free from social biases is impossible Fiske (1993); Lakoff (1973), the “non-toxic” expert serves the purpose of modeling the same domain of comments as the toxic anti-expert, providing more effective contrast. Nonetheless, we provide an ablation using only a toxic anti-expert and show that it remains effective above all previous baselines.

We use GPT-2 Large as our base LM. For our expert and anti-expert, we finetune several sizes of GPT-2 (Small, Medium, Large) on a dataset of human-annotated comments from the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge (https://bit.ly/3cvG5py). We consider an example toxic if $\geq 50\%$ of annotators marked it as toxic, and nontoxic if none of the annotators marked it as toxic. This toxic dataset has $\sim$ 160K comments, and the nontoxic dataset $\sim$ 1.4M comments. Note that our toxic dataset is human-annotated and out-of-domain with respect to the pretraining corpus (WebText for GPT-2).

We report results for $\alpha=2.0$ , chosen after observing the tradeoff between detoxification and fluency, but show results for other values of $\alpha$ in Appendix D.

3.2 Evaluation

To evaluate the problem of toxic degeneration where a user might unexpectedly receive harmful output from a model, we use a random sample of 10K nontoxic prompts from the RealToxicityPrompts dataset Gehman et al. (2020).

3.2.2 Baselines

For the domain-adaptive pretraining baseline (DAPT; Gururangan et al., 2020), we further pretrain the base model on the non-toxic subset of OpenWebText. This dataset is obtained by scoring the full OpenWebText corpus with the toxicity classifier from Perspective API (https://github.com/conversationai/perspectiveapi) and keeping the least toxic 2 percent of documents, a corpus of about 150K documents, or 63M tokens, following the implementation of this baseline from Gehman et al. (2020).

PPLM uses gradients from a toxicity classifier to update the LM’s hidden representations. We retrain the classifier to be compatible with our larger base model size, on the same toxicity data used in the original paper (https://bit.ly/3yQiCIo). Due to the extreme computational expense of PPLM (runtimes are shown in Appendix A.4), we evaluate PPLM on a random subset of 1K prompts.

GeDi uses a class-conditioned LM to provide classification probabilities for all possible next tokens via Bayes’ rule. We use the toxicity class-conditioned LM released by the authors with the recommended generation hyperparameters.

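We also explore an anti-expert-only ablation of DExperts, by reusing the base model as the expert. To be clear, we substitute $\mathbf{z}_{t}^{+}=\mathbf{z}_{t}$ in Equation 1, so that we have

$$\tilde{P}(X_{t}\mid\bm{x}_{<t})=\text{softmax}\left((1+\alpha)\mathbf{z}_{t}-\alpha\mathbf{z}^{-}_{t}\right)\tag{5}$$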

We use the toxic anti-expert based on GPT-2 Large and the same hyperparameter value $\alpha=2.0$ .
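As a quick check of the anti-expert-only ablation, the sketch below verifies on toy values that passing the base logits in the expert slot of $\mathbf{z}_{t}+\alpha(\mathbf{z}^{+}_{t}-\mathbf{z}^{-}_{t})$ yields the same logits as $(1+\alpha)\mathbf{z}_{t}-\alpha\mathbf{z}^{-}_{t}$ . The function names and numbers are illustrative assumptions.

```python
def dexperts_logits(z, z_pos, z_neg, alpha):
    """General combination: z + alpha * (z_pos - z_neg)."""
    return [b + alpha * (p - n) for b, p, n in zip(z, z_pos, z_neg)]


def anti_only_logits(z, z_neg, alpha):
    """Ablation with the base model reused as expert: (1 + alpha) * z - alpha * z_neg."""
    return [(1 + alpha) * b - alpha * n for b, n in zip(z, z_neg)]


z = [2.0, 1.0, 0.5, -1.0]      # base LM logits (hypothetical)
z_neg = [3.0, 0.0, 0.0, -1.0]  # toxic anti-expert logits (hypothetical)

general = dexperts_logits(z, z, z_neg, alpha=2.0)  # expert slot filled with z
ablation = anti_only_logits(z, z_neg, alpha=2.0)
```

Read this way, the ablation sharpens the base distribution by the factor $1+\alpha$ while subtracting the anti-expert's preferences.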

Finally, we consider generating directly from the non-toxic expert based on GPT-2 Large.

For all baselines, we use nucleus sampling Holtzman et al. (2020) with $p=0.9$ to generate up to $20$ tokens. Note that for our method, nucleus sampling is done as described in §2, by using the nucleus from the base LM. Other training and generation details (e.g., hyperparameters) are described in Appendix A.

3.2.3 Automatic Evaluation

We evaluate our generations for toxicity, fluency, and diversity. Following previous work Gehman et al. (2020), we characterize generation toxicity using the toxicity score from Perspective API, along two axes: 1) the maximum toxicity over $k=25$ generations, and 2) the empirical probability of generating a continuation with toxicity $\geq 0.5$ at least once over $k=25$ generations. Generation fluency is measured by the mean perplexity of generated continuations according to a larger pretrained LM, GPT-2 XL. Generation diversity is measured using the mean number of distinct $n$ -grams, normalized by the length of text Li et al. (2016), among the 25 generations for each prompt. We report Dist-1, Dist-2, and Dist-3 scores for distinct uni-, bi-, and trigrams, respectively.
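The toxicity and diversity metrics can be sketched as below. This is one plausible reading, under stated assumptions: toxicity scores are taken as given (in the paper they come from Perspective API), the per-prompt maximum is averaged over prompts, tokens are split on whitespace, and Dist- $n$ is the number of distinct $n$ -grams across a prompt's generations divided by the total number of generated tokens. The function names are hypothetical.

```python
def toxicity_summary(scores_per_prompt, threshold=0.5):
    """Return (mean of per-prompt max toxicity, fraction of prompts with any toxic generation)."""
    max_scores = [max(scores) for scores in scores_per_prompt]
    avg_max = sum(max_scores) / len(max_scores)
    prob_toxic = sum(m >= threshold for m in max_scores) / len(max_scores)
    return avg_max, prob_toxic


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dist_n(generations, n):
    """Distinct n-grams across the generations, normalized by total token count."""
    distinct, total_tokens = set(), 0
    for text in generations:
        tokens = text.split()  # whitespace tokenization is an assumption
        total_tokens += len(tokens)
        distinct.update(ngrams(tokens, n))
    return len(distinct) / total_tokens if total_tokens else 0.0


# Two prompts with three scored generations each (made-up scores).
avg_max, prob_toxic = toxicity_summary([[0.1, 0.7, 0.3], [0.2, 0.1, 0.4]])
dist1 = dist_n(["the cat sat", "the cat ran"], 1)
dist2 = dist_n(["the cat sat", "the cat ran"], 2)
```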

According to automatic metrics shown in Table 1, DExperts substantially outperforms all existing baselines at detoxification. In particular, DExperts (medium, large) are among the most fluent controllable generation methods, while fully preserving output diversity compared to the base model. Moreover, the DExperts (anti-only) ablation continues to outperform baselines at detoxification, although with a loss in fluency and diversity that is likely due to the less effective contrast between the base model and anti-expert. We report the per-generation runtime of each method in Appendix A.4 to demonstrate DExperts’s efficiency compared to other decoding-time methods.

3.2.4 Human Evaluation

While automatic toxicity classifiers like Perspective API enable the kind of large-scale evaluation required for systematic comparison of methods, an abundance of work shows that their accuracy is far from ideal Dixon et al. (2018); Sap et al. (2019); Davidson et al. (2019); Hutchinson et al. (2020) in part due to reliance on spurious features, which we discuss in §8. Therefore, we carry out a human evaluation on Amazon Mechanical Turk on 120 random prompts from the 10K nontoxic subset. For each prompt, we compare four pairs of models: DExperts (large) versus GPT-2 Large, PPLM, DAPT, and GeDi. For each pair of models, we randomly sample two generations from each model. This results in a total of $120\text{ prompts}\times 4\frac{\text{pairings}}{\text{prompt}}\times 2\frac{\text{generations}}{\text{pairing}}=960$ comparisons. Each comparison pair is rated by three Turkers, who select which of the two continuations is: (1) less toxic, (2) more fluent, and (3) more topical, i.e., whether the continuation is natural, relevant, and follows logically from the prompt. A screenshot of the user interface is provided in Appendix C.

According to human evaluations, DExperts is rated as less toxic more often than all baselines (Figure 2). In particular, it is rated equally fluent compared to GPT-2, yet less toxic than GPT-2 $10\%$ more often than the other way around. See Appendix E for examples of generations.

3.3 Steering GPT-3

We next use DExperts to steer GPT-3 Ada. Because the OpenAI API (https://openai.com/api/) allows access to only the top 100 log probabilities at each time step, we can only modify and sample from the probability distribution over the top 100 tokens. Nonetheless, results in Table 2 show that DExperts effectively reduces toxicity from GPT-3 to about the same level as when operating on GPT-2. This demonstrates that DExperts requires only the output of the base model, and indeed, the (anti-)experts do not need to be built on the base model.
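A minimal sketch of steering when only a restricted set of base log probabilities is visible is given below. It assumes the visible tokens arrive as a plain dictionary from token string to log probability (a hypothetical shape, not the actual API response format) and that (anti-)expert scores can be looked up for the same token strings. Because log probabilities differ from logits only by a constant, that constant cancels when the combined scores are renormalized over the visible tokens.

```python
import math
import random


def steer_visible_tokens(base_logprobs, expert_scores, anti_scores, alpha):
    """Reweight only the tokens whose base log probabilities are available."""
    tokens = list(base_logprobs)
    scores = [
        base_logprobs[t] + alpha * (expert_scores[t] - anti_scores[t])
        for t in tokens
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {t: e / total for t, e in zip(tokens, exps)}


def sample_token(dist, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Three visible tokens stand in for the top 100 (all values are made up).
base_logprobs = {" her": -0.5, " the": -1.5, " a": -2.0}
expert_scores = {" her": 0.0, " the": 1.0, " a": 1.0}
anti_scores = {" her": 2.0, " the": 0.0, " a": 0.0}

dist = steer_visible_tokens(base_logprobs, expert_scores, anti_scores, alpha=2.0)
next_token = sample_token(dist, random.Random(0))
```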

3.4 Analysis: Dataset Size

In practice, gathering large amounts of toxic data may be challenging, especially in applications where we would want to customize the anti-expert LM for differing notions of harmful language. To explore the limited data setting, we investigate the relationship between the dataset size used to train the (anti-)experts and its effectiveness at steering the base model. We finetune GPT-2 Large on five different dataset sizes of exactly 40,960, 204.8K, 1.024M, 5.12M, and 10.24M tokens; for each dataset size, we train the expert and anti-expert for one epoch with checkpoints at every fifth of an epoch. The performance of each ensemble, at every (anti-)expert checkpoint, is shown in Figure 3.

We can see that even with a dataset of 40,960 tokens ( $\sim$ 650 comments) corresponding to $<0.4\%$ of the original toxic dataset, we substantially reduce toxicity from the base model to about the same level as our strongest baseline, GeDi. (On one GPU, this corresponds to $\sim$ 3 minutes of finetuning.) Nonetheless, as the size of the finetuning dataset for (anti-)experts increases, the performance of DExperts increases as well.

Sentiment-Controlled Generation

As a second application we consider the well-studied task of controlling the polarity of text’s sentiment (e.g., Li et al., 2018; Sudhakar et al., 2019), steering towards either positive or negative sentiment.

We use the same pretrained model from §3, GPT-2 Large, as our base LM. We finetune GPT-2 (Small, Medium, Large) on a positive sentiment corpus for our positive LM, and on a negative sentiment corpus for our negative LM. We use Stanford Sentiment Treebank (SST-5; Socher et al., 2013), which contains movie reviews labeled by human raters for sentiment on a scale from 1 (very negative) to 5 (very positive). Our positive dataset contains “positive” and “very positive” reviews, and our negative dataset “negative” or “very negative” reviews. Each of these sentiment datasets has about 4K reviews.

For ease of notation we consider the positive LM our expert and negative LM our anti-expert, and use $\alpha=\pm 3.2$ for steering in each direction. The tradeoff between fluency and sentiment control for many values of $\alpha$ is shown in §4.3.

4.2 Evaluation

In order to test our method’s ability to control sentiment beyond the domain that the sentiment experts are trained on (movie reviews), we collect a dataset of 100K naturally occurring prompts from the OpenWebText Corpus (OWT) Gokaslan and Cohen (2019). Details are outlined in Appendix B. We generate $25$ continuations for each prompt from the base LM, and score them using HuggingFace’s sentiment analysis classifier Wolf et al. (2020) trained on SST-5 movie reviews. Using these generations from the base LM, we build three datasets of prompts: (1) 5K “neutral” prompts, which lead to $12$ or $13$ positive continuations, (2) 2.5K “negative” prompts, which lead to $25$ negative continuations, and (3) 2.5K “positive” prompts, which lead to $24$ or $25$ positive continuations. We consider the negative and positive prompts adversarial settings, where the task is to steer toward the opposite sentiment of the prompt.
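The bucketing rule for prompts can be written down directly from the thresholds above. The sketch assumes each prompt has already been reduced to its count of positive continuations out of 25; the function name is hypothetical, and the subsequent subsampling to 5K/2.5K/2.5K prompts is not shown.

```python
def bucket_prompt(num_positive):
    """Assign a prompt to a set based on how many of its 25 continuations are positive."""
    if num_positive in (12, 13):
        return "neutral"
    if num_positive == 0:        # all 25 continuations are negative
        return "negative"
    if num_positive in (24, 25):
        return "positive"
    return None                  # prompt is not used in any of the three sets


# Counts of positive continuations for a handful of prompts (made-up values).
counts = [0, 5, 12, 13, 20, 24, 25]
buckets = [bucket_prompt(c) for c in counts]
```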

4.2.2 Baselines

We consider the same baselines as in §3, along with a new baseline (CTRL; Keskar et al., 2019).

Corresponding to our DAPT baseline in §3, we score all documents in OpenWebText with the HuggingFace sentiment classifier, and keep the most positive 2% and most negative 2% (according to the probability of the predicted label) to obtain the positive and negative corpora. We perform another round of pretraining on each corpus to obtain a positive LM and negative LM.

As with toxicity (§3), we retrain the sentiment classifier for PPLM with a larger embedding size compatible with our base model. The training data used is SST-5. Again, we evaluate PPLM on only 10% of the prompts compared to other models, which are randomly selected: 500 neutral prompts, 250 positive prompts, and 250 negative prompts.

We use GeDi with the sentiment class-conditioned LMs released by the original authors, which are trained on IMDB movie reviews Maas et al. (2011). (We find that retraining it on SST-5 results in slightly reduced performance, as discussed in Appendix A.)

To explore whether simply steering away from one sentiment will yield the opposite sentiment, we again explore an anti-expert-only version of DExperts. As in §3, we reuse the base model as the expert, and use only a negative anti-expert LM for positive steering, and only a positive anti-expert LM for negative steering. We use $\alpha=\pm 2.0$ for this setting.

Again, we consider decoding directly from the corresponding sentiment expert for positive and negative steering.

To control the sentiment of generations from CTRL, we use the “Reviews” control code and append a rating of “5.0” for positive generations and a rating of “1.0” for negative generations. The sentiment training examples for CTRL came from Amazon reviews McAuley et al. (2015).

As with toxicity experiments (§3), we use nucleus sampling with $p=0.9$ , and include our training and generation details in Appendix A.

4.2.3 Automatic Evaluation

We evaluate our generations for the target sentiment, fluency, and diversity. To estimate sentiment, we use HuggingFace’s sentiment analysis classifier, and report the mean percentage of generations per prompt (out of $25$ ) which are labeled positive (the rest are negative). We evaluate fluency and diversity in the same ways as §3.

As shown in Table 3, DExperts greatly outperforms previous controllable generation methods (PPLM, CTRL, DAPT, GeDi) on both neutral prompts and adversarial prompts. The limited performance of CTRL suggests that the effectiveness of class-conditioned training on domain-specific data is limited to the domain of that data; training on Amazon reviews does not allow generalization outside of the reviews domain. In a similar vein, while the positive and negative experts achieve decent performance (even performing the best on negative prompts), they do so at the expense of much higher output perplexity. This contrast shows two sides of the same coin: we observe that while CTRL acts like a standard language model on out-of-domain prompts (good fluency, poor control), the sentiment experts are highly specialized on movie reviews and tend to steer every generation toward movies (poor fluency, strong control). Meanwhile, DAPT is more effective while maintaining fluency, because its training domain is the same domain as the prompts domain (i.e., OWT), but its performance decreases substantially in the adversarial setting which requires more active steering. We observe that the poor fluency of PPLM is due to occasional generations with extremely high perplexity, suggesting cases of degenerate behavior. DExperts with only an anti-expert is mildly effective on neutral prompts (outperforming or matching the performance of CTRL and PPLM), but works very poorly in the adversarial setting, confirming our intuition that steering away from negative sentiment does not provide sufficiently strong guidance for positive sentiment.

4.2.4 Human Evaluation

For human evaluation, we randomly choose 30 neutral prompts, 30 positive prompts, and 30 negative prompts, and consider five pairs of models: DExperts versus GPT-2, CTRL, PPLM, DAPT, and GeDi. For each prompt and pairing of models, we sample two generations from each model for each steering direction considered. Counting each neutral prompt once per steering direction and each adversarial prompt once gives $30\times 2+30+30=120$ prompt settings, which results in a total of $120\text{ prompts}\times 5\frac{\text{pairings}}{\text{prompt}}\times 2\frac{\text{generations}}{\text{pairing}}=1200$ pairs, each rated by $3$ MTurk workers. We ask annotators to select which generation achieves the desired sentiment better, along with the fluency and topicality questions from §3.2.4.

As shown in Figure 4, DExperts is substantially more effective at steering toward positivity on negative prompts while achieving better topicality and better fluency compared to all other baselines, including GPT-2. In the opposite setting of steering toward negativity on positive prompts, the gap in sentiment control performance between DExperts and each of GPT-2, CTRL, DAPT, and PPLM is even more pronounced: DExperts is rated better than its comparison 62–78% of the time. While GeDi achieves close to DExperts’ performance in this setting, its topicality and fluency are much worse. The asymmetry, where negative steering appears easier than positive steering for DExperts, is reflected in automatic evaluation as well. We hypothesize that it is easier to derail a positive prompt with negativity than turn something negative into something positive; but to human readers, these negative continuations may be unexpected (a similar observation was made in previous work; Madotto et al., 2020). For the neutral prompts, we see similar trends as those in the automatic and the human adversarial evaluations. Due to space constraints, we include those in Appendix D.2.

4.3 Analysis: Sentiment versus Fluency

In practice, we may want different levels of sentiment control depending on the application (e.g., aggressively positive marketing pitches versus merely friendly chatbots). Figure 5 shows the relationship between output sentiment and fluency for different choices of $\alpha\in[-3.4,3.4]$ , conditioned on neutral prompts. The smooth tradeoff suggests that $\alpha$ can be adjusted by a practitioner or user, depending on their application. In our experiments, we pick $\alpha=\pm 3.2$ because the curve becomes less steep, meaning that a greater cost in fluency does not return as great an increase in the desired sentiment. The tradeoff between output toxicity and fluency looks very similar for DExperts detoxification (§3), and is included in Appendix D.1.
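The role of $\alpha$ as a control knob can be illustrated on toy logits: sweeping $\alpha$ from negative to positive moves probability mass smoothly from the token favoured by the negative LM to the one favoured by the positive LM, and $\alpha=0$ recovers the base distribution. The values below are hypothetical and say nothing about fluency, which the paper measures separately with perplexity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_prob(token, z, z_pos, z_neg, alpha):
    """Probability of one token under softmax(z + alpha * (z_pos - z_neg))."""
    combined = [b + alpha * (p - n) for b, p, n in zip(z, z_pos, z_neg)]
    return softmax(combined)[token]


z = [1.0, 1.0, 0.0]      # base LM is indifferent between tokens 0 and 1
z_pos = [2.0, 0.0, 0.0]  # positive LM favours token 0
z_neg = [0.0, 2.0, 0.0]  # negative LM favours token 1

alphas = [-3.2, -1.6, 0.0, 1.6, 3.2]
positive_curve = [token_prob(0, z, z_pos, z_neg, a) for a in alphas]
negative_curve = [token_prob(1, z, z_pos, z_neg, a) for a in alphas]
```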

Stylistic Rewriting with DExperts

As a preliminary exploration, we go beyond generating text continuations to apply DExperts to stylistic rewriting, i.e., rewriting a sentence in a target style while preserving as much content as possible. We replace the base model with a pretrained autoencoder, BART Lewis et al. (2020), and use GPT-2 Large sentiment (anti-)experts from §4 for steering. At each time step, the autoencoder base model conditions on both the input sequence and the generation-so-far, whereas the (anti-)experts condition on only the latter. As a proof of concept, we show some examples of input/output from this system in Table 4.
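The conditioning asymmetry in this setup can be sketched with toy stand-ins for the models. Everything here is an assumption made for illustration: the three callables are hypothetical placeholders for BART and the GPT-2 (anti-)experts, the four-token vocabulary is invented, and greedy decoding is used only to keep the example deterministic.

```python
EOS, GOOD, BAD, MOVIE = 0, 1, 2, 3  # toy vocabulary ids


def rewrite(source, base_logits, expert_logits, anti_logits, alpha, eos, max_len=20):
    """Greedy decoding where only the base model sees the source sequence."""
    prefix = []
    for _ in range(max_len):
        z = base_logits(source, prefix)  # conditions on input and generation-so-far
        z_pos = expert_logits(prefix)    # conditions on generation-so-far only
        z_neg = anti_logits(prefix)
        combined = [b + alpha * (p - n) for b, p, n in zip(z, z_pos, z_neg)]
        token = max(range(len(combined)), key=combined.__getitem__)
        if token == eos:
            break
        prefix.append(token)
    return prefix


def toy_base(source, prefix):
    """Stand-in autoencoder: prefers to copy the source, then emit EOS."""
    target = source[len(prefix)] if len(prefix) < len(source) else EOS
    return [2.0 if tok == target else 0.0 for tok in range(4)]


def toy_expert(prefix):
    return [0.0, 1.5, 0.0, 0.0]  # positive LM favours GOOD


def toy_anti(prefix):
    return [0.0, 0.0, 1.5, 0.0]  # negative LM favours BAD


copied = rewrite([BAD, MOVIE], toy_base, toy_expert, toy_anti, alpha=0.0, eos=EOS)
steered = rewrite([BAD, MOVIE], toy_base, toy_expert, toy_anti, alpha=1.0, eos=EOS)
```

With $\alpha=0$ the toy base model simply reproduces its input, while with $\alpha=1$ the first token is swapped and the rest of the content is preserved.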

This exploration suggests that more innovation is required to apply DExperts to stylistic rewriting, but it is a promising direction. We anticipate future work on the subject.

Related Work

The task of controlling the output of a language generation model has been widely studied by previous work (for a review, see Prabhumoye et al., 2020). Prior to using pretrained LMs as a backbone, most work used custom neural models trained for their respective downstream generation tasks, including emotion-aware text generation Ghosh et al. (2017); Ficler and Goldberg (2017), attribute-aware product review generation Dong et al. (2017), and friendly or empathetic dialogue response generation See et al. (2019); Rashkin et al. (2019).

Since pretrained LMs have shown impressive text generation ability Radford et al. (2018, 2019), two directions have emerged to control their language generation: training approaches and decoding-time approaches. Training approaches include finetuning the pretrained LMs on datasets that contain the desired attributes Gururangan et al. (2020) as well as creating a class-conditioned pretrained LM trained on text prefixed with attribute-specific control codes Keskar et al. (2019). In contrast to our method, such approaches can only steer towards desired text attributes; they cannot steer away from undesired ones. Additionally, training approaches require significant computational resources, which may no longer be feasible with the size of more recent pretrained LMs Brown et al. (2020); Fedus et al. (2021).

Decoding-time methods, a more lightweight approach, have been used for controlling the attributes of generated text, as well as for improving its quality Li et al. (2016); Holtzman et al. (2018); Welleck et al. (2020). PPLM Dathathri et al. (2020) is a steering method that updates a pretrained model’s hidden representations according to the gradient of a classifier with respect to the desired class. Unfortunately, this approach is computationally expensive, as shown in this and previous work Gehman et al. (2020). Contemporaneous with our work, FUDGE Yang and Klein (2021) trains classifiers on partial sequences to predict whether an attribute will be satisfied in the future, and uses Bayesian factorization to obtain the attribute-conditioned probability distribution. GeDi Krause et al. (2020) uses Bayes’ rule similarly, but computes classification probabilities using the output of class-conditioned LMs rather than directly training a classifier. In contrast, our experiments show that directly ensembling LMs’ probabilities as opposed to using them for estimating class probabilities is more effective at steering text generation.

Conclusion

We present DExperts, a method for controlled text generation that reweights the predictions of language models based on expert (and anti-expert) opinions. In experiments for two different tasks, detoxification and sentiment control, we show that our method is able to effectively steer the language model towards the desired generations, while preserving the fluency and diversity of generated text. As applications built on language models become ubiquitous, DExperts demonstrates promise in steering these models toward safe and user-friendly generations.

Acknowledgments

This research is supported in part by NSF (IIS-1714566), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and Allen Institute for AI. We thank OpenAI, specifically Bianca Martin and Miles Brundage, for providing access to GPT-3 through the OpenAI API Academic Access Program. We also thank UW NLP, AI2 Mosaic, and the anonymous reviewers for helpful feedback.

Broader Impact and Ethical Implications

Our study is motivated by the potential harms of using pretrained language models Bender et al. (2021), specifically their tendency to generate hateful, offensive, or toxic content Sheng et al. (2020); Gehman et al. (2020). Part of our work requires automatically detecting toxicity in generated texts, for which we use the Perspective API (https://github.com/conversationai/perspectiveapi), a commercially deployed toxicity detection tool. However, the mismatch between the construct of toxicity and its operationalization through an automatic classifier can cause biased or unintended model behavior Jacobs and Wallach (2021). Specifically, recent work has shown that such hate speech classifiers overestimate the prevalence of toxicity in text that contains a minority identity mention Hutchinson et al. (2020); Dixon et al. (2018) or text written by racial minorities Sap et al. (2019); Davidson et al. (2019), therefore having the real possibility of backfiring against its very aim of fairness and inclusive dialogue. To address this limitation, we also perform a human evaluation of toxicity, for which we obtained IRB approval and sought to pay our workers a fair wage ( $\sim$ US$7–9/h).

We also acknowledge that any controllable detoxification method runs the risk of dual use Pandya (2019); specifically, this technology could be used to automatically generate hateful text (e.g., extremist texts; McGuffie and Newhouse, 2020). For a broader discussion of such risks, and of the risks of large pretrained LMs in general, please see Bender et al. (2021).

Nevertheless, toxicity in pretrained LMs is an unsolved issue Sheng et al. (2019); Gehman et al. (2020). Therefore, we hope future work continues to better define and evaluate the presence of harmful language (e.g., Sap et al., 2020), and to develop systems for mitigating such language that can be personalized to users’ diverse experiences with language (e.g., dealing with reclaimed slurs appropriately; Croom, 2013).

References

Appendix Overview

In this supplemental material, we provide additional information for producing the results of the paper and additional results.

Appendix A Modeling Details

We use HuggingFace Transformers Wolf et al. (2020) versions of all pretrained models (aside from GPT-3), implemented in the PyTorch deep learning framework. For GPT-3, we use the Ada model, which is accessed with the OpenAI API (https://openai.com/api/).

A.2 Training Details

All training is performed on a single NVIDIA Quadro 6000 GPU.

Hyperparameters for finetuning (anti-)experts for DExperts are given in Table 5.

The finetuning time for each model size is shown in Table 6.

For our implementation of DAPT in sentiment experiments (§4), we use HuggingFace’s sentiment analysis classifier to filter documents from OpenWebText for the most positive 2% and most negative 2% of documents. Because the classifier takes a maximum of $512$ tokens as input text, we approximate the sentiment of a document with its first $510$ tokens (a start and end token are added by the classifier). The hyperparameters for the additional phase of pretraining on the attribute data are given in Table 5.

For our implementation of PPLM in experiments, we retrain the toxicity and sentiment classifiers to be compatible with our base model GPT-2 (large), as the original paper used GPT-2 medium for experiments. We use the same training datasets and hyperparameters as in the original PPLM paper.

For toxicity and sentiment steering with GeDi, we download the class-conditioned language models (based on GPT-2 Medium) made available by the original authors. As an experiment, we also align the finetuning data for the sentiment GeDis and the (anti-)experts used in DExperts by finetuning a new class-conditioned LM on SST-5 data (as opposed to the IMDB data used in GeDi). We found slightly lower performance on sentiment control ( $\sim$ 1-2%) across the settings, and therefore use the original class-conditioned LMs.

A.3 Dataset Details

Details of datasets used for further pretraining in the DAPT baselines are given in Table 8, and those for finetuning our experts and anti-experts are given in Table 9 and Table 10.

A.4 Generation Details

Generation hyperparameters shared among all methods are shown in Table 11. Hyperparameters for PPLM generation are shown in Table 12. Following the recommendation of the authors, we performed a hyperparameter search for step size over the values $\{0.02,0.06,0.10,0.20,0.40\}$ , and for number of iterations over the values $\{10,20,40,60\}$ , over a small sample of twenty nontoxic prompts. We picked step size $0.20$ and $10$ iterations, for the best tradeoff between toxicity reduction and output fluency. Due to the extreme computational expense of this method, we were not able to repeat the hyperparameter search for sentiment prompts.

Hyperparameters for GeDi generation are shown in Table 13.

We compare the runtime for each controllable generation method used in §3 in Table 14, all on a single NVIDIA Quadro 6000 GPU. We see that DExperts takes 2 to 3 times as long as decoding directly from the base model, depending on the size of the (anti-)experts. When using the same model size for the guiding language model as in GeDi (GPT-2 Medium), DExperts is more efficient than GeDi, and both methods are 100 $\times$ faster than PPLM.

Appendix B Collection of Sentiment Prompts

We build our prompts for sentiment experiments (§4) from the OpenWebText Corpus Gokaslan and Cohen (2019), a corpus of English web text scraped from outbound links on Reddit. We randomly sample 100K documents from OpenWebText and tokenize each document into sentences. Following the creation of RealToxicityPrompts Gehman et al. (2020), we split each sentence into the prompt, consisting of the first half of tokens, and the continuation, consisting of the remaining tokens. We keep only prompts that are between $4$ and $10$ tokens long (inclusive). For all tokenization, we use the NLTK library Bird and Loper (2004). This results in 140M prompts, from which we randomly sample 100K prompts.
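The splitting and length filter can be sketched as follows. Two details are assumptions: whitespace splitting stands in for NLTK tokenization, and for sentences with an odd number of tokens the prompt takes the shorter half. The function name is hypothetical.

```python
def make_prompt(sentence_tokens, min_len=4, max_len=10):
    """Split a tokenized sentence into (prompt, continuation), or return None if filtered out."""
    half = len(sentence_tokens) // 2  # floor division for odd lengths is an assumption
    prompt, continuation = sentence_tokens[:half], sentence_tokens[half:]
    if min_len <= len(prompt) <= max_len:
        return prompt, continuation
    return None


# Whitespace splitting is used here in place of NLTK tokenization.
long_sentence = "The quick brown fox jumps over the lazy dog today".split()
short_sentence = "Too short to keep here".split()

kept = make_prompt(long_sentence)
dropped = make_prompt(short_sentence)
```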

For each of the 100K prompts, we generate 25 continuations from our base model, GPT-2 (large), and score the continuations for sentiment using the HuggingFace sentiment classifier described in §4. The distribution of prompts with $n\in[0,25]$ positive continuations out of $25$ is shown in Figure 6. Interestingly, we observe that more prompts have a majority of negative continuations than a majority of positive ones. Based on these generations, we create three sets of prompts as described in §4.

Appendix C Human Evaluation

Our interface for human evaluation is shown in Figure 7. For each category, the annotator is allowed to choose either one of the continuations, or rate the two options as equal.

Appendix D Additional Results

Figure 8 shows the relationship between output toxicity and fluency for different values of $\alpha$ in our method. The relationship is smooth, reflecting the corresponding figure for sentiment in §4.3.

D.2 Human Evaluation on Neutral Prompts

Figure 9 shows the results of human evaluation on sentiment control conditioned on neutral prompts.

Appendix E Generation Examples

Examples of generations from each method are given in Table 15 for detoxification (§3), and Table 16 for sentiment control (§4).