Encoder-Agnostic Adaptation for Conditional Language Generation

Zachary M. Ziegler, Luke Melas-Kyriazi, Sebastian Gehrmann, Alexander M. Rush

Introduction

Large-scale language models have been shown to dramatically improve the performance of natural language understanding (NLU) systems on a broad range of tasks (Peters et al., 2018; Devlin et al., 2018; Radford & Salimans, 2018; McCann et al., 2017). The dominant paradigm is to pretrain a self-attention based language model on a large corpus of unlabeled text and then finetune the language model and task-specific head on supervised data. Optimizing the effectiveness of this approach has been the focus of much study (Houlsby et al., 2019; Wang et al., 2019; Chronopoulou et al., 2019).

Given the success of pretraining for NLU tasks, how can large language models best be adapted for conditional language generation? Ideally, one should only need to train a large language model once and then apply it as part of the decoder to a range of tasks with different source modalities (e.g. text, images, bits). In the encoder/decoder framework, a task-specific encoder can be used which encodes source information into a continuous vector. The central question is therefore how to adapt a pretrained decoder to effectively utilize arbitrary source information, i.e. in an encoder-agnostic way.

Given the demonstrated quality of samples from large language models (Radford et al., 2019), it is natural to expect that encoder-agnostic adaptation should give improvements in coherence and grammaticality even when the source modality is not text, such as with image captioning or class-conditional generation. Unfortunately, past results indicate otherwise. Edunov et al. (2019) show for example that a straightforward extension of Peters et al. (2018) to the conditional generation setting actually hurts performance compared to a model without any pretraining. Other pretraining approaches for language generation (Song et al., 2019; Dong et al., 2019; Lample & Conneau, 2019) have demonstrated strong performance on text-to-text tasks, but these methods are constrained to tasks where the source is natural language and do not address the encoder-agnostic setting.

In this work we consider several different approaches for the problem of encoder-agnostic adaptation. We first make the observation that standard adaptation approaches perform poorly on this task. We posit that because these techniques require relearning key parts of the network structure to inject contextual conditioning, they move the parameters too far from the pretrained values. In contrast, Radford et al. (2019) observe that even trivial conditioning with the original model produces reasonable zero-shot generations without fine-tuning.

These results motivate an approach that learns the correct conditioning to control the model’s output, which we call pseudo self attention. The idea is to learn a task specific encoder that injects pseudo history into a pretrained self attention model. Because self attention works with sets of any size, the model can immediately utilize or ignore this history. Finetuning adapts the model to this new input while training a task-specific encoder.

Experiments utilize the GPT-2 (Radford et al., 2019) transformer as a pretrained model. We consider four diverse generation tasks spanning a range of source modalities: class-conditional generation, document summarization, story generation, and image paragraph captioning. Across all tasks, we find that pseudo self attention outperforms the other pretraining methods and is the most consistent. As a practical tool, pseudo self attention improves performance compared to a baseline without pretraining by large margins without sacrificing adherence to the source, even for tasks with large amounts of supervised data. We further demonstrate that the approach is data efficient and produces qualitatively more coherent outputs. Code is available at https://github.com/harvardnlp/encoder-agnostic-adaptation.

Related Work

Extending upon the success of pretrained word embeddings (Mikolov et al., 2013), contextual word vectors based on LSTMs first demonstrated strong results across discriminative NLU tasks (McCann et al., 2017; Howard & Ruder, 2018; Peters et al., 2018). Recent work has shown that the transformer (Vaswani et al., 2017) could further improve language representation. BERT (Devlin et al., 2018) trains a transformer via a cloze task and next sentence prediction objectives, leading to state-of-the-art results on many NLU tasks. GPT and GPT-2 (Radford & Salimans, 2018; Radford et al., 2019) use a similar model in a unidirectional language modeling setting, the latter showing the additional ability to generate impressively coherent unconditional text. As they take the form of standard language models, the GPT models are a natural starting point for pretraining generation models.

Pretrained Decoder Transfer Learning for NLG

Natural language generation (NLG) tasks have a long history of incorporating unconditional language models with conditional input, especially for machine translation and speech recognition (Bahl et al., 1983; Koehn et al., 2003). These approaches traditionally use the noisy channel model (i.e. Bayes’ rule) and $n$-gram models as the language model. Recent adaptations of these ideas include the Neural Noisy Channel (Yu et al., 2017) as well as “fusion” methods (Koehn et al., 2003; Gulcehre et al., 2015; Sriram et al., 2018; Stahlberg et al., 2018) in which the output logits of a language model and a conditional model are combined to calculate the output probabilities. We consider this class of transfer learning as a baseline in a preliminary experiment (see Section 4.1), but focus on alternative “deep” approaches that incorporate the language model weights as an integral part of the model instead of an add-on at the end. Along these lines, Ramachandran et al. (2017) propose a finetuning-based method for machine translation with LSTMs, in which some of the layers of the LSTM are initialized with pretrained language model weights. As their method is specific to LSTMs, however, it is incompatible with modern transformer architectures.
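As a rough illustration of the fusion family described above, the sketch below combines the scores of a language model and a conditional model over a toy vocabulary. The additive log-space rule, the `lm_weight` parameter, and the function names are assumptions chosen for illustration; this is not the exact formulation of any of the cited methods.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def fuse(cond_logits, lm_logits, lm_weight=1.0):
    """Combine conditional-model logits with LM log-probabilities.

    Assumed rule (illustrative only): add the weighted LM log-probs to the
    conditional logits, then renormalize with a softmax.
    """
    lm_logp = log_softmax(lm_logits)
    combined = [c + lm_weight * l for c, l in zip(cond_logits, lm_logp)]
    return [math.exp(v) for v in log_softmax(combined)]


# Toy 3-word vocabulary: the conditional model prefers word 0,
# while the language model strongly prefers word 1.
probs = fuse(cond_logits=[2.0, 1.0, 0.0], lm_logits=[0.0, 3.0, 0.0])
```

In this kind of scheme the language model only rescales the final distribution, which is the "add-on at the end" behaviour that the deep approaches try to avoid.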

Pretraining-Based Transfer Learning for NLG

Zhang et al. (2019) use BERT in the encoder and decoder of a summarization model via a unique cloze generative process. They demonstrate strong abstractive summarization performance, but the value of the BERT pretraining relative to other model components is not clear and the cloze process significantly reduces the practicality of the model. More closely related, Edunov et al. (2019) experiment with a representation-based approach for applying ELMo (Peters et al., 2018) to the source and target sides of a standard seq2seq model separately. Their approach consistently improves performance when applied to the source, but actually hurts performance when applied to the decoder. We consider such a representation approach as a baseline in this work.

Most recently, a number of studies experiment with BERT-like masking approaches that are compatible with natural language generation (Song et al., 2019; Dong et al., 2019; Lample & Conneau, 2019). While these works demonstrate impressive performance, they are constrained to text-to-text tasks because they do not have a way to handle arbitrary conditional information. Whereas these works study pretraining methods that optimize transfer for text-to-text tasks, our study considers the separate problem of adapting a fixed pretrained model to arbitrary source conditioning.

Concurrent with this work, Golovanov et al. (2019) propose a similar approach to pseudo self attention and report initial experiments with dialogue generation. This study complements ours with positive results on dialogue generation, though we aim for experimental evidence over a wide range of language generation tasks and input modalities and comparison to strong encoder-agnostic baselines.

Methods

We assume that we have a large pretrained language model, $p(\boldsymbol{y})=p(y_{1},\ldots,y_{T};\theta)$, that the model is an auto-regressive neural network, and that it is based on self attention to implement conditioning on previous tokens, i.e.,

$$\text{SA}(Y)=\text{softmax}\left((YW_{q})(YW_{k})^{\top}\right)(YW_{v}),$$

where input $Y\in T\times D$ for hidden dimension $D$, $W_{k},W_{v},W_{q}\in D\times D^{\prime}$ are parameters, representing the key, value, and query projections respectively, and the output is $T\times D^{\prime}$. In practice many of these units (“heads”) are stacked together via concatenation across dimension followed by a final linear projection $W_{f}\in D\times D$.
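The single-head self attention unit described here can be sketched in a few lines of plain Python. The sketch assumes the standard form softmax((Y W_q)(Y W_k)^T)(Y W_v), applies a causal mask so that each position only sees earlier positions, and omits the usual 1/sqrt(D') scaling and the multi-head projection; the helper names and toy sizes are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Y, Wq, Wk, Wv):
    """Single-head causal self attention: softmax((Y Wq)(Y Wk)^T)(Y Wv)."""
    Q, K, V = matmul(Y, Wq), matmul(Y, Wk), matmul(Y, Wv)
    scores = matmul(Q, [list(col) for col in zip(*K)])  # T x T score matrix
    out = []
    for t, row in enumerate(scores):
        # Position t may only attend to positions 0..t (auto-regressive mask).
        weights = softmax(row[: t + 1])
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Toy sizes: T = 3 tokens, D = D' = 2, identity projections.
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(Y, I2, I2, I2)
```

Because the softmax runs over however many positions are visible, nothing in the unit depends on a fixed sequence length.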

We are interested in using this model to estimate the conditional probability $p(\boldsymbol{y}\ |\ \boldsymbol{x})$ for an arbitrary input $\boldsymbol{x}$ for which we have a small amount of supervised $(\boldsymbol{x},\boldsymbol{y})$ pairs. The goal is to learn a model on this new data that best makes use of the pretrained model $p(\boldsymbol{y})$ with a method that is agnostic to the form of $\boldsymbol{x}$ .

All models considered are based on the encoder/decoder architecture, and for each we follow the same high-level procedure: First, some of the weights of the decoder are initialized with weight values from a pretrained language model. Next, a problem-specific encoder and all non-pretrained decoder weights are randomly initialized. Finally, the entire model is trained/fine-tuned end-to-end using the supervised data for the given task. In all cases the input and output embeddings are tied. The models differ only in where and how they use the pretrained weights in the decoder.

Baseline 1: Repr-Transformer

The first approach considered (Fig 1(a)) views the function of the pretrained LM as giving a general-purpose representation of the target text before the source information is introduced. For this method, a standard transformer decoder is used with the target word embeddings replaced by the output representation of the pretrained language model. Preliminary experiments considered both fixing and updating these representations, and found that a fixed weighted-averaging (“ELMo-Style”) method performed better, consistent with Edunov et al. (2019). One possible downside to this approach is that the conditioning information from the encoder is injected after all of the pretrained weights.
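A weighted average over layer outputs in the style of ELMo can be sketched as follows. This assumes the ELMo formulation of Peters et al. (2018), i.e. softmax-normalized scalar weights per layer and a global scale `gamma`, applied to frozen per-layer hidden states; whether the experiments here use exactly this parameterization is an assumption, and the toy vectors are made up.

```python
import math


def elmo_style_mix(layer_states, scalars, gamma=1.0):
    """Mix per-layer vectors for one token: gamma * sum_l softmax(s)_l * h_l."""
    m = max(scalars)
    exps = [math.exp(s - m) for s in scalars]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]


# Three hypothetical LM layers with hidden size 2, for a single target token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_style_mix(layers, scalars=[0.0, 0.0, 0.0])
```

Only the scalars and `gamma` would be trained in such a setup; the language model producing `layers` stays fixed.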

Baseline 2: Context-Attn

The second approach (Fig 1(b)) considers initializing a standard transformer decoder with the shared weights of a pretrained LM. The newly added context attention weights at each layer are randomly initialized. While compared to Repr-Transformer the conditioning information is injected alongside the pretrained weights, the randomly initialized context attention block may interfere with the carefully co-tuned pretrained weights of the rest of the model. This may lead to reduced performance and optimization challenges.

Proposed Model: Pseudo-Self

A more radical approach to incorporating conditional information is the “zero-shot” model proposed by Radford et al. (2019). Instead of learning a representation for $\boldsymbol{x}$ and passing it into a separate context attention block, they note that an auto-regressive model, $p(y_{t}\ |\ y_{<t})$, is already a conditional model. If $\boldsymbol{x}$ is the same modality as $\boldsymbol{y}$ (e.g. both language), one can condition on $\boldsymbol{x}$ by prepending the source to the target: $p(y_{t}\ |\ \boldsymbol{x},y_{<t})=p(y_{t}\ |\ \boldsymbol{x}\odot y_{<t})$. This method is most successful when hand-selected task-dependent buffer words, such as “tl;dr” for summarization, are inserted between $\boldsymbol{x}$ and $y_{<t}$. While this does not produce competitive models and is limited in its applicability, it is surprising that it works at all.
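The prepending trick amounts to building one token sequence for the language model to continue. The helper below is a hypothetical illustration that treats the buffer phrase as a single token; real tokenization would split it differently.

```python
def zero_shot_context(source_tokens, target_prefix, buffer_tokens=("tl;dr",)):
    """Return the sequence the LM conditions on: x, buffer words, then y_<t."""
    return list(source_tokens) + list(buffer_tokens) + list(target_prefix)


# The LM would then predict the next target token given this whole context.
context = zero_shot_context(["the", "article", "text"], ["a", "short"])
```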

Taking inspiration from this approach, we propose learning this contextualization in an encoder-agnostic way. Our approach, pseudo self attention, simply injects learned encoder conditioning directly into the pretrained self attention of the model. Assuming that we have a matrix $X\in S\times D$ representing a size $S$ encoding of $\boldsymbol{x}$, we define pseudo self attention as

$$\text{PSA}(X,Y)=\text{softmax}\left((YW_{q})\begin{bmatrix}XU_{k}\\ YW_{k}\end{bmatrix}^{\top}\right)\begin{bmatrix}XU_{v}\\ YW_{v}\end{bmatrix},$$

where $U_{k},U_{v}\in D\times D^{\prime}$ are new parameters tasked with projecting encoder outputs into decoder self attention space. Because attention is inherently variable length, these additional inputs can be injected without changing the module and only act additively on the attention output. The full model is shown in Figure 1(c).
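A minimal single-head sketch of pseudo self attention is given below. It assumes the form softmax((Y W_q)[X U_k; Y W_k]^T)[X U_v; Y W_v], where the projected encoder rows are prepended to the decoder keys and values and are visible from every target position, with a causal mask over the target positions only. Helper names and toy sizes are illustrative, and scaling and multi-head details are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_self_attention(X, Y, Wq, Wk, Wv, Uk, Uv):
    """Self attention over Y with S pseudo-history rows computed from X."""
    Q = matmul(Y, Wq)
    K = matmul(X, Uk) + matmul(Y, Wk)  # (S + T) x D' keys
    V = matmul(X, Uv) + matmul(Y, Wv)  # (S + T) x D' values
    S = len(X)
    out = []
    for t, q in enumerate(Q):
        # All S encoder rows are visible; target rows only up to position t.
        visible = S + t + 1
        scores = [sum(a * b for a, b in zip(q, k)) for k in K[:visible]]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 target positions
X = [[0.5, 0.5]]                           # S = 1 encoder row
I2 = [[1.0, 0.0], [0.0, 1.0]]
with_source = pseudo_self_attention(X, Y, I2, I2, I2, I2, I2)
without_source = pseudo_self_attention([], Y, I2, I2, I2, I2, I2)
```

With an empty `X` the function reduces to ordinary causal self attention, which illustrates why the pretrained module can be reused unchanged.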

Compared to Context-Attn, the proposed approach only introduces new parameters in the self attention block, which we expect leads to only minimal interference. To explore this quantitatively, we plot the root median squared deviation of parameters from their original values in the feed-forward layer of our first task (Figure 2). While both start with the same parameters, the Context-Attn parameters change significantly more than Pseudo-Self over training. As the pretrained LM weights encode for generation capability, deviating further from this initialization may lead to worse generation performance.
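The drift statistic mentioned above can be computed as in the sketch below, which reads "root median squared deviation" as the square root of the median of the squared elementwise differences between current and original parameters. That reading, the flattening of parameters into lists, and the toy values are assumptions.

```python
import math
import statistics


def root_median_squared_deviation(current, original):
    """sqrt(median((theta - theta_0)^2)) over flattened parameter lists."""
    squared = [(c - o) ** 2 for c, o in zip(current, original)]
    return math.sqrt(statistics.median(squared))


# Made-up parameter vectors before and after fine-tuning.
theta_0 = [0.0, 1.0, -1.0, 2.0, 0.5]
theta = [0.1, 1.0, -1.3, 2.2, 0.5]
drift = root_median_squared_deviation(theta, theta_0)
```

Using the median rather than the mean keeps a few heavily updated weights from dominating the measurement.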

Experiments and Results

Experiments consider four diverse tasks spanning input modalities, training dataset sizes, and information about the target contained in the source. Tasks are chosen to emphasize long-form targets to probe the decoder generation capabilities of the different models in a conditional setting. Perplexity is used to measure overall performance and diversity of output, combined with standard task-specific metrics.
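For reference, perplexity can be computed from per-token log-probabilities as in the short sketch below. It assumes natural-log probabilities and averaging over whatever token unit the model scores; the normalization unit used for the reported numbers is not specified here.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(1 / 4)] * 10
ppl = perplexity(uniform)
```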

For all tasks, GPT-2 is used as the pretrained language model. GPT-2 is a large autoregressive transformer LM trained on 40 GB of non-Wikipedia text (Radford et al., 2019). We use the original publicly available version of the model (117M parameters); it has 12 layers, 12 heads per layer, and a model dimension of 768 units. The Context-Attn and Pseudo-Self models use the same architecture hyperparameters. For the Repr-Transformer model, to avoid overfitting, we use 6/8/512 layers/heads/dim for the decoder (in addition to the 12/12/768 that make up GPT-2 for the initial contextual embedding in the decoder). All experiments use the same 50k type BPE GPT-2 vocabulary.

We first consider a control experiment with a minimal encoder model. We consider producing class-conditional samples, e.g. $p(\boldsymbol{y}\ |\ x=0)$ and $p(\boldsymbol{y}\ |\ x=1)$ , from the IMDb sentiment classification dataset (Maas et al., ), similar to previous works for sentiment transfer (Shen et al., 2017; Zhao et al., 2018). We set $x$ to be a sentiment bit (positive/negative), and the movie review as the target $\boldsymbol{y}$ . We maintain the original IMDb 25k/25k train/test split, with 2.5k reviews of the original train split held out for validation, and truncate reviews to 400 BPE tokens during training. Model quality is evaluated by perplexity, and adherence to the source bit $x$ is evaluated by the sentiment classification accuracy of an external classifier on generated reviews as in Shen et al. (2017). Reviews are generated via random sampling with a temperature of 0.7. To detect sentiment, we use the fastText external classifier from Joulin et al. (2016) which has an accuracy of 90.1% on the IMDb test set.
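The generation and adherence-scoring steps described above can be sketched as follows. The functions are illustrative stand-ins: `sample_token` draws from temperature-scaled logits, and `adherence_accuracy` compares the requested sentiment bits with the labels an external classifier assigns to the generated reviews. The classifier itself is not modeled here.

```python
import math
import random


def temperature_probs(logits, temperature=0.7):
    """Softmax over logits divided by the sampling temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def adherence_accuracy(requested_labels, predicted_labels):
    """Fraction of generations whose classified label matches the source bit."""
    matches = sum(r == p for r, p in zip(requested_labels, predicted_labels))
    return matches / len(requested_labels)


token = sample_token([2.0, 1.0, 0.0], rng=random.Random(0))
accuracy = adherence_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

A temperature below 1 sharpens the distribution, trading diversity for more conservative samples.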

Table 1 shows results for all models, as well as unconditional GPT-2 and the results using Simple Fusion (Stahlberg et al., 2018). The GPT-2 model itself already shows a greatly reduced PPL compared to a problem-specific transformer. All pretraining methods further improve perplexity. The pseudo self attention approach significantly outperforms the other approaches in terms of class adherence. Despite being initialized as a language model, the approach only sees a decrease of 0.4% classification accuracy compared to the randomly initialized model. In contrast, the Repr-Transformer model sees a decrease in accuracy of 20.0% and the Context-Attn model sees a decrease in accuracy of 3.9%. As a point of comparison, we additionally report the results of Simple Fusion in Table 1. Compared to Pseudo-Self it gives a worse PPL and extremely poor classification accuracy. Given the weak results, we focus on comparisons between the deep models for the rest of the paper.

4.2 Document Summarization

Abstractive document summarization requires the model to produce a long-form summary given a full news article. For these experiments we use the non-anonymized CNN-Daily Mail dataset (Hermann et al., 2015). The dataset comprises 280k training examples of document-scale source news articles and corresponding 2-4 sentence target summaries. Summarization is a mature testbed with state-of-the-art models that use task-specific architecture modifications, so transfer learning methods need to be able to mesh well with these changes. We use the transformer version of the copy mechanism from Gehrmann et al. (2018) and employ bottom-up (BU) summarization attention pruning (Gehrmann et al., 2018). Generation is conducted via beam search with a beam size of 5 and tri-gram blocking, consistent with the literature models (Edunov et al., 2019).
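Tri-gram blocking can be sketched as a filter on candidate next tokens during beam search: a candidate is rejected if appending it would create a trigram that already occurs in the hypothesis. The function names are illustrative, and the surrounding beam search is omitted.

```python
def repeats_trigram(prefix, candidate):
    """True if appending `candidate` would repeat a trigram already in `prefix`."""
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen


def allowed_tokens(prefix, candidates):
    """Keep only candidates that do not trigger a repeated trigram."""
    return [c for c in candidates if not repeats_trigram(prefix, c)]


# "the cat sat" already occurs, so "sat" is blocked after "... the cat".
prefix = ["the", "cat", "sat", "on", "the", "cat"]
allowed = allowed_tokens(prefix, ["sat", "ran"])
```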

Table 2 shows the performance of the models tested with recent state-of-the-art models for comparison. Compared to the baseline model without pretraining, Pseudo-Self improves ROUGE-1 by 0.78, ROUGE-2 by 0.65, ROUGE-L by 0.37, and reduces PPL by 20%. The Context-Attn approach nearly matches these results for this task, but the Repr-Transformer approach performs more poorly.

We additionally experiment with the simple bottom-up summarization attention pruning approach without pretraining applied at inference time as in Gehrmann et al. (2018). With this modification Pseudo-Self outperforms all literature models in ROUGE-1 except the text-to-text UniLM+ExtractLoss, which uses joint pretraining of the source and target and is trained with an additional extractive loss. The performance of all of our models can potentially be further improved with the addition of pretraining on the encoder side.

4.3 Conditional Story Generation

Conditional story generation with the WritingPrompts dataset (Fan et al., 2018) requires the model to produce an on-topic story given a short textual prompt. While summarization relies heavily on the encoder, this task gives more flexibility to the decoder. The dataset is well supervised, containing 300k single sentence writing prompts (the source) and stories (the target). Following the preprocessing of Fan et al. (2018), we truncate the stories to 1000 tokens. Due to the story lengths the total number of training tokens is on the order of 100 million, resulting in a large in-domain data setting.

To compare models we compute two metrics: perplexity (PPL) and prompt ranking. Perplexity is used as a proxy for generation quality, whereas prompt ranking is used to measure the relevance of the story to the prompt. To calculate prompt ranking, we use the procedure from Fan et al. (2018): For each story in the test set, the likelihood is evaluated under the model for the “true” corresponding prompt and 9 other randomly selected “fake” prompts from the test set. Then, the rank accuracy is the percentage of stories for which the model gave the highest likelihood to the true prompt.
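The prompt ranking procedure can be sketched as below. The scoring function is a hypothetical stand-in for the model likelihood of a story given a prompt, ties are counted as failures, and the fake prompts are drawn from the other test prompts; these choices are assumptions made for the sketch.

```python
import random


def prompt_rank_accuracy(pairs, log_likelihood, num_fake=9, seed=0):
    """Fraction of stories whose true prompt scores above all sampled fake prompts."""
    rng = random.Random(seed)
    prompts = [prompt for prompt, _ in pairs]
    correct = 0
    for i, (true_prompt, story) in enumerate(pairs):
        others = prompts[:i] + prompts[i + 1:]
        fakes = rng.sample(others, min(num_fake, len(others)))
        true_score = log_likelihood(story, true_prompt)
        if all(true_score > log_likelihood(story, fake) for fake in fakes):
            correct += 1
    return correct / len(pairs)


def toy_score(story, prompt):
    """Hypothetical scorer: counts prompt words that appear in the story."""
    words = story.split()
    return sum(word in words for word in prompt.split())


pairs = [
    ("dragon castle", "a dragon guards the castle"),
    ("space pirate", "the pirate sails through space"),
    ("lost robot", "a robot is lost in the city"),
]
acc = prompt_rank_accuracy(pairs, toy_score, num_fake=2)
```

With 9 fake prompts per story, a model that ignores the prompt would be expected to score about 10% under this measure.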

Table 4 shows the results. Despite the large dataset size, the Repr-Transformer and Pseudo-Self approaches still substantially reduce the PPL. That the models are able to improve PPL, despite the 100 million+ target tokens, suggests these models are able to effectively make use of the GPT-2 LM. Pseudo-Self sees only a 0.3% decrease in prompt ranking accuracy, while the Repr-Transformer approach sees a larger decrease. The Context-Attn model runs into optimization challenges and fails to learn in this setting.

4.4 Image Paragraph Captioning

Image paragraph captioning on the Visual Genome dataset from Krause et al. (2017) differs from the standard image captioning task, where captions are single sentences or sentence fragments: it requires the model to generate an entire paragraph (usually 5-8 sentences) describing a given image. Recent work in the image captioning literature has argued for a greater focus on paragraph captioning because the descriptive capacity of single-sentence image captions is inherently limited. However, due to the difficulty of producing labeled paragraph captions, existing paragraph captioning datasets are quite small; whereas the MSCOCO (single-sentence captioning) dataset contains around 600,000 image-caption pairs, Visual Genome contains fewer than 20,000 image-paragraph pairs. As a result, models trained from scratch on Visual Genome have been observed to have difficulty learning the structure of language, necessitating the use of heuristics.

We use the same convolutional encoder as Krause et al. (2017), without the final pooling layer; that is, for each image, the output of the encoder is a tensor of size $(36,2048)$ extracted from a ResNet. Note that in this experiment, unlike those above, the encoder (CNN) and decoder (finetuned LM) are trained separately rather than end-to-end. Since we are interested in analyzing how to most effectively utilize pretraining for generation, we only compare with approaches using the same loss function (cross-entropy). Recent work shows it is possible to improve paragraph captioning models by incorporating sequence-level (Melas-Kyriazi et al., 2018) and adversarial (Chatterjee & Schwing, 2018) losses, but these loss function improvements are orthogonal to improvements in the underlying model architecture.

Table 4 shows the results on the captioning task, as measured by the widely-used CIDEr and BLEU-4 metrics. We compare the three transfer learning methods with a non-pretraining baseline and models from the literature. Of the three pretraining approaches Pseudo-Self gives the best performance, and is the only model to improve both CIDEr and BLEU-4 compared to the Transformer baseline. Furthermore, Pseudo-Self outperforms all other models on CIDEr but gives a slightly worse BLEU-4.

Analysis and Discussion

There is a continuing trend to larger pretrained LMs. During the preparation of this manuscript, a larger version of GPT-2 was made available with 345M parameters, increasing the model dimension to 1024, the number of attention heads to 16, and the number of layers to 24. We retrained our model using this larger LM for class-conditional generation, using the same training hyperparameters and re-tuning the generation temperature (Table 4). The larger model improves PPL by 4.5 points while attaining similarly high classification accuracy. This datapoint suggests that transfer learning effectiveness can continue to improve along with the quality of the pretrained model used.

5.2 Low-data supervision

Many of our tasks showed improvements even with medium-to-large training sets. To study the effectiveness of the approach in low data regimes, we create artificial small datasets by subsampling the IMDb dataset to sizes between 200 and 16k datapoints. We retrain our model using the same hyperparameters and use datasize-dependent early stopping to prevent overfitting. To reduce variance and measure uncertainty we repeat the process 8 times for each dataset size, calculating the PPL and classification accuracy. Results are shown in Figure 4. Note that a non-pretrained model has a PPL of over 1000 when trained on 200 examples. The pretrained model starts with reasonable outputs (44.4 PPL after 200 examples) and increases task accuracy steadily with more data. (See Section 5.4 for representative samples.)
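The subsampling protocol can be sketched as drawing several independent subsets per dataset size and summarizing a metric across the repeated runs. The function names, the seed handling, and the placeholder metric values are illustrative; no real results are reproduced here.

```python
import random
import statistics


def subsample_trials(dataset, size, trials=8, seed=0):
    """Draw `trials` independent subsets of `size` examples without replacement."""
    rng = random.Random(seed)
    return [rng.sample(dataset, size) for _ in range(trials)]


def mean_and_std(values):
    """Mean and sample standard deviation of a metric across trials."""
    return statistics.mean(values), statistics.stdev(values)


dataset = list(range(1000))  # stand-in for training examples
subsets = subsample_trials(dataset, size=200)

# Placeholder metric values, one per trial (not measurements).
mean, std = mean_and_std([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```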

5.3 Human evaluation

To assess the quality of generations, we conducted a human evaluation based on the story generation task. Generation uses a temperature of 0.9 and a top-k value of 100. We ask participants on Amazon Mechanical Turk a series of four yes/no questions mapped to desirable linguistic properties outlined in Dang (2006): grammaticality, non-redundancy, consistency, and typicality. 125 stories are evaluated for each model, and each story is evaluated by 5 unique workers. Scores are calculated for each property as the total percent of positive responses. A combined score rates the model overall on a scale from 0-4 based on the equally-weighted combination of the four properties.
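The scoring scheme can be sketched as follows, assuming each worker judgment is a set of four yes/no answers and that the combined 0-4 score is the plain sum of the four per-property positive-response fractions. The summation rule is one reading of "equally-weighted combination", and the example responses are made up.

```python
PROPERTIES = ("grammaticality", "non-redundancy", "consistency", "typicality")


def property_scores(responses):
    """Fraction of positive answers per property over all worker judgments."""
    return {p: sum(r[p] for r in responses) / len(responses) for p in PROPERTIES}


def combined_score(scores):
    """Equally weighted sum of the four fractions, giving a value in [0, 4]."""
    return sum(scores[p] for p in PROPERTIES)


# Two made-up judgments for one model.
responses = [
    {"grammaticality": True, "non-redundancy": True,
     "consistency": True, "typicality": True},
    {"grammaticality": True, "non-redundancy": False,
     "consistency": False, "typicality": False},
]
scores = property_scores(responses)
overall = combined_score(scores)
```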

The results are shown in Table 5. In all four categories the Pseudo-Self and Repr-Transformer models show statistically significant performance gains compared to the baseline Transformer model. The Pseudo-Self model achieves a grammaticality score only 6.1% below that of the test set, indicating that grammaticality, likely a more localized property, is well learned by the pretrained LM and effectively transferred to the conditional models. In contrast, all models score significantly worse than the test data in terms of consistency and typicality. This suggests that these higher level properties, while best transferred in the Pseudo-Self case, still represent a challenge for neural models.

5.4 Qualitative examples

Representative samples for the movie review dataset are shown in Table 6. The No-Pretraining model is the transformer from Table 1, and the number in the left column indicates the number of supervised examples in the training dataset. Samples are generated via random sampling with a temperature of 0.75. Without pretraining, the model makes a number of clear coherence mistakes. The Pseudo-Self 22K model makes no grammatical mistakes and follows a single train of thought, although it is somewhat generic.

The distinction between the models is further exaggerated when only 1.8k supervised examples are given. The baseline model trained on only 1.8k datapoints leads to an exceptionally poor generation. In contrast, the Pseudo-Self model shows significantly improved grammar and sentence structure. Despite a handful of mistakes, the review follows a consistent description of a movie over multiple sentences. Given the poor performance of the baseline model, these properties must have been transferred from the original unconditional LM. These samples were selected to be representative of the broader set for the indicated models.

Conclusion

We study encoder-agnostic approaches for adapting a pretrained language model for general purpose conditional language generation. Across a set of diverse long-form conditional generation tasks we show that pseudo self attention consistently improves performance over strong encoder-agnostic pretraining baselines. From a practical perspective, the approach gives robust, sizable improvements over a non-pretraining baseline while maintaining adherence to the source context. Furthermore, we demonstrate the data efficiency and qualitative properties of the approach.

Beyond empirical results, this study highlights the distinction between improving contextual representations of the source language and improving language generation capability of the target language. While they appear to be similar problems, they exhibit substantially different phenomenology. For example, the representation-based approach which works well for NLU gives poor performance for NLG. Future work can study this distinction further.