Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Jason Phang, Thibault Févry, Samuel R. Bowman

Introduction

Recent work has shown mounting evidence that pretraining sentence encoder neural networks on unsupervised tasks like language modeling, and then fine-tuning them on individual target tasks, can yield significantly better target task performance than could be achieved using target task training data alone (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Large-scale unsupervised pretraining in works like these seems to produce sentence encoders with substantial knowledge of the target language (which, so far, is generally English). These works have shown that the one-size-fits-all approach of fine-tuning a large pretrained model with a thin output layer for a given task can achieve results as good as or better than carefully designed task-specific models without such pretraining.

However, it is not obvious that the model parameters obtained during unsupervised pretraining should be ideally suited to supporting this kind of transfer learning. Especially when only a small amount of training data is available for the target task, fine-tuning experiments are potentially brittle, and rely on the pretrained encoder parameters to be reasonably close to an ideal setting for the target task. During target task training, the encoder must learn and adapt enough to be able to solve the target task—potentially involving a very different input distribution and output label space than was seen in pretraining—but it must also avoid overfitting or catastrophic forgetting of what was learned during pretraining.

This work explores the possibility that the use of a second stage of pretraining with data-rich intermediate supervised tasks might mitigate this brittleness, improving both the robustness and effectiveness of the resulting target task model. We name this approach, which is meant to be combined with existing approaches to pretraining, Supplementary Training on Intermediate Labeled-data Tasks (STILTs).

Experiments with sentence encoders on STILTs take the following form: (i) A model is first trained on an unlabeled-data task like language modeling that can teach it to reason about the target language; (ii) The model is then further trained on an intermediate, labeled-data task for which ample data is available; (iii) The model is finally fine-tuned further on the target task and evaluated. Our experiments evaluate STILTs as a means of improving target task performance on the GLUE benchmark suite (Wang et al., 2018)—a collection of language understanding tasks drawn from the NLP literature.

We apply STILTs to three separate pretrained sentence encoders: BERT (Devlin et al., 2018), GPT (Radford et al., 2018), and a variant of ELMo (Peters et al., 2018a). We follow Radford et al. and Devlin et al. in our basic mechanism for fine-tuning both for the intermediate and final tasks, and use the following four intermediate tasks: (i) the Multi-Genre NLI Corpus (MNLI; Williams et al., 2018), (ii) the Stanford NLI Corpus (SNLI; Bowman et al., 2015), (iii) the Quora Question Pairs (QQP) dataset (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), and (iv) a custom fake-sentence-detection task based on the BooksCorpus dataset (Zhu et al., 2015a) using a method adapted from Warstadt et al. (2018). The use of MNLI and SNLI is motivated by prior work on using natural language inference tasks to pretrain sentence encoders (Conneau et al., 2017; Subramanian et al., 2018; Bowman et al., 2019). QQP has a similar format and dataset scale, while requiring a different notion of sentence similarity. The fake-sentence-detection task is motivated by Warstadt et al.’s analysis on CoLA and linguistic acceptability, and adapted for our experiments. These four tasks are a sample of data-rich supervised tasks that we can use to demonstrate the benefits of STILTs, but they do not represent an exhaustive exploration of the space of promising intermediate tasks.

We show that using STILTs yields significant gains across most of the GLUE tasks, across all three sentence encoders we used, and claims the state of the art on the overall GLUE benchmark. In addition, for the 24-layer version of BERT, which can require multiple random restarts for good performance on target tasks with limited training data, we find that STILTs substantially reduces the number of runs with degenerate results across random restarts. For instance, using STILTs with 5k training examples, we reduce the number of degenerate runs from five to one on SST and from two to none on STS.

As we expect that any kind of pretraining will be most valuable in a limited training data regime, we also conduct a set of experiments where a model is fine-tuned on only 1k- or 5k-example subsamples of the target task training set. The results show that STILTs substantially improves model performance across most tasks in this downsampled data setting, even more so than in the full-data setting.

Related Work

In the area of pretraining for sentence encoders, Zhang and Bowman (2018) compare several pretraining tasks for syntactic target tasks, and find that language model pretraining reliably performs well. Peters et al. (2018b) investigate the architectural choices behind ELMo-style pretraining with a fixed encoder, and find that the precise choice of encoder architecture strongly influences training speed, but has a relatively small impact on performance. Bowman et al. (2019) compare a variety of tasks for pretraining in an ELMo-style setting with no encoder fine-tuning. They conclude that language modeling generally works best among candidate single tasks for pretraining, but show some cases in which a cascade of a model pretrained on language modeling followed by another model pretrained on tasks like MNLI can work well. The paper introducing BERT (Devlin et al., 2018) briefly mentions encouraging results in a direction similar to ours: One footnote notes that unpublished experiments show “substantial improvements on RTE from multitask training with MNLI.”

Most prior work uses features from frozen, pretrained sentence encoders in downstream tasks. A more recent trend of fine-tuning the whole model for the target task from a pretrained state (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018) has led to state-of-the-art results on several benchmarks. For that reason, we focus our analysis on the paradigm of fine-tuning the whole model for each task.

In the area of sentence-to-vector encoding, Conneau et al. (2018) offer one of the most comprehensive suites of diagnostic tasks, and highlight the importance of ensuring that these models preserve lexical content information.

In earlier work less closely tied to the unsupervised pretraining setup studied here, Bingel and Søgaard (2017) and Kerinec et al. (2018) investigate the conditions under which tasks can be productively combined in multitask learning. They show that multitask learning is more likely to work when the target task quickly plateaus and the auxiliary task keeps improving. They also report that gains are lowest when the Jensen-Shannon Divergence between the unigram distributions of tasks is highest, i.e., when auxiliary and target tasks have different vocabulary.

In word representations, this work shares motivations with work on embedding space retrofitting (Faruqui et al., 2015) wherein a labeled dataset like WordNet is used to refine representations learned by an unsupervised embedding learning algorithm before those representations are used for a target task.

Methods

We primarily study the impact of STILTs on three sentence encoders: BERT (Devlin et al., 2018), GPT (Radford et al., 2018) and ELMo (Peters et al., 2018a). These models are distributed with pretrained weights from their respective authors, and are the best performing sentence encoders as measured by GLUE benchmark performance at time of writing. All three models are pretrained with large amounts of unlabeled text. ELMo uses a BiLSTM architecture whereas BERT and GPT use the Transformer architecture (Vaswani et al., 2017). These models are also trained with different objectives and corpora. BERT is a bi-directional Transformer trained on BooksCorpus (Zhu et al., 2015b) and English Wikipedia, with a masked-language model and next sentence prediction objective. GPT is a uni-directional masked Transformer trained only on BooksCorpus with a standard language modeling objective. ELMo is trained on the 1B Word Benchmark (Chelba et al., 2013) with a standard language modeling objective.

For all three pretrained models, we follow BERT and GPT in using an inductive approach to transfer learning, in which the model parameters learned during pretraining are used to initialize a target task model, but are not fixed and do not constrain the solution learned for the target task. This stands in contrast to the approach originally used for ELMo (Peters et al., 2018b) and for earlier methods like McCann et al. (2017) and Subramanian et al. (2018), in which a sentence encoder component is pretrained and then attached to a target task model as a non-trainable input layer.

To implement intermediate-task and target-task training for GPT and ELMo, we use the public jiant transfer learning toolkit (https://github.com/jsalt18-sentence-repl/jiant), which is built on AllenNLP (Gardner et al., 2017) and PyTorch (Paszke et al., 2017). For BERT, we use the publicly available implementation of BERT released by Devlin et al. (2018), ported into PyTorch (Paszke et al., 2017) by Hugging Face (https://github.com/huggingface/pytorch-pretrained-BERT).

Target Tasks and Evaluation

We evaluate on the nine target tasks in the GLUE benchmark (Wang et al., 2018). These include MNLI, QQP, and seven others: acceptability classification with CoLA (Warstadt et al., 2018); binary sentiment classification with SST (Socher et al., 2013); semantic similarity with the MSR Paraphrase Corpus (MRPC; Dolan and Brockett, 2005) and STS-Benchmark (STS; Cer et al., 2017); and textual entailment with a subset of the RTE challenge corpora (Dagan et al., 2006, et seq.), and data from SQuAD (QNLI, Rajpurkar et al., 2016) and the Winograd Schema Challenge (WNLI, Levesque et al., 2011) converted to entailment format as in White et al. (2017). A newer version of QNLI was recently released by the maintainers of the GLUE benchmark; all reported numbers in this work, including the aggregated GLUE score, reflect evaluation on the older version of QNLI (QNLIv1). Because of the adversarial nature of WNLI, our models do not generally perform better than chance, and we follow the recipe of Devlin et al. (2018) by predicting the most frequent label for all examples.

Most of our experiments—including all of our experiments using downsampled training sets for our target tasks—are evaluated on the development set of GLUE. Based on the results on the development set, we choose the best intermediate-task training scheme for each task and submit the best-per-task model for evaluation on the test set on the public leaderboard.

Intermediate Task Training

Our experiments follow the standard pretrain-then-fine-tune approach, except that we add a supplementary training phase on an intermediate task before target-task fine-tuning. We call this approach BERT on STILTs, GPT on STILTs and ELMo on STILTs for the respective models. We evaluate a sample of four intermediate tasks, which were chosen to represent readily available data-rich sentence-level tasks similar to those in GLUE: (i) textual entailment with MNLI; (ii) textual entailment with SNLI; (iii) paraphrase detection with QQP; and (iv) a custom fake-sentence-detection task.

Our use of MNLI is motivated by prior successes with MNLI pretraining by Conneau et al. (2018) and Subramanian et al. (2018). We include the single-genre captions-based SNLI in addition to the multi-genre MNLI to disambiguate between the benefits of domain shift and task shift from supplementary training on natural language inference. QQP is included as we believed it could improve performance on sentence similarity tasks such as MRPC and STS. Lastly, we construct a fake-sentence-detection task based on the BooksCorpus dataset in the style of Warstadt et al. Importantly, because both GPT and BERT are pretrained on BooksCorpus, the fake-sentence-detection task enables us to isolate the impact of task shift from that of domain shift away from the pretraining corpus. We construct this task by sampling sentences from BooksCorpus and generating fake sentences by randomly swapping 2–4 pairs of words in a sentence. We generate a dataset of 600,000 sentences with a 50/50 real/fake split for this intermediate task.
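
The construction of the fake-sentence-detection data can be sketched as follows. This is a minimal illustration rather than the authors' code: whitespace tokenization, the helper names `make_fake_sentence` and `build_dataset`, drawing real and fake examples from disjoint halves of the sampled sentences, and allowing swaps to overlap (so a later swap may partly undo an earlier one) are all assumptions not stated in the text.

```python
import random
from typing import List, Tuple


def make_fake_sentence(sentence: str, rng: random.Random) -> str:
    """Return a copy of `sentence` with 2-4 randomly chosen word pairs swapped."""
    words = sentence.split()  # assumption: whitespace tokenization
    if len(words) < 2:
        return sentence  # nothing to swap
    n_swaps = rng.randint(2, 4)  # inclusive on both ends
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def build_dataset(sentences: List[str], rng: random.Random) -> List[Tuple[str, int]]:
    """Build a 50/50 real (label 1) / fake (label 0) dataset."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    real = [(s, 1) for s in shuffled[:half]]
    fake = [(make_fake_sentence(s, rng), 0) for s in shuffled[half:]]
    data = real + fake
    rng.shuffle(data)
    return data
```

Because swapping only reorders words, a fake sentence always contains exactly the same words as its source, so the classifier cannot rely on vocabulary alone.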

Training Details

Unless otherwise stated, for replications and both stages of our STILTs experiments, we follow the model formulation and training regime of BERT and GPT specified in Devlin et al. (2018) and Radford et al. (2018) respectively. Specifically, for both models we use a three-epoch training limit for both supplementary training and target-task fine-tuning. We use a fresh optimizer for each phase of training. For each task, we add only a single task-specific, randomly initialized output layer to the pretrained Transformer model, following the setup laid out by each respective work. For our baseline, we do not fine-tune on any intermediate task: Other than the batch size, this is equivalent to the formulation presented in the papers introducing BERT and GPT respectively and serves as our attempt to replicate their results.
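
The control flow of this regime can be sketched as below. This is a schematic under stated assumptions, not the actual training code: `Model`, `fine_tune`, and `run` are hypothetical names, and the parameter updates are placeholders. It only illustrates that the encoder is carried across phases while the output layer and optimizer are recreated for each phase, and that the baseline simply skips the intermediate phase.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MAX_EPOCHS = 3  # three-epoch limit, applied to each training phase


@dataclass
class Model:
    encoder: Dict[str, int] = field(default_factory=dict)  # shared encoder state
    head_task: Optional[str] = None  # task that owns the current output layer


def fine_tune(model: Model, task: str, log: List[Tuple[str, int]]) -> Model:
    """Run one training phase on `task` (placeholder updates only)."""
    # Keep the encoder, but attach a new output layer for this task.
    model = Model(encoder=dict(model.encoder), head_task=task)
    optimizer_state: Dict[str, int] = {}  # fresh optimizer for this phase
    for epoch in range(MAX_EPOCHS):
        # Placeholder for one real pass over the task's training set.
        model.encoder["updates"] = model.encoder.get("updates", 0) + 1
        optimizer_state["steps"] = optimizer_state.get("steps", 0) + 1
        log.append((task, epoch))
    return model


def run(pretrained: Model, target: str,
        intermediate: Optional[str] = None) -> Tuple[Model, List[Tuple[str, int]]]:
    """Baseline if `intermediate` is None, otherwise the two-phase STILTs recipe."""
    log: List[Tuple[str, int]] = []
    model = pretrained
    if intermediate is not None:
        model = fine_tune(model, intermediate, log)  # supplementary training
    model = fine_tune(model, target, log)  # target-task fine-tuning
    return model, log


baseline, baseline_log = run(Model(), target="RTE")
stilts, stilts_log = run(Model(), target="RTE", intermediate="MNLI")
```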

For BERT, we use a batch size of 24 and a learning rate of 2e-5. This is within the range of hyperparameters recommended by the authors and initial experiments showed promising results. We use the larger, 24-layer version of BERT, which is the state of the art on the GLUE benchmark. For this model, fine-tuning can be unstable on small data sets—hence, for the tasks with limited data (CoLA, MRPC, STS, RTE), we perform 20 random restarts for each experiment and report the results of the model that performed best on the validation set.

For GPT, we choose the largest batch size out of 8/16/32 that a single GPU can accommodate. Radford et al. (2018) introduced two versions of GPT: one which includes an auxiliary language modeling objective when fine-tuning, and one without. We use the version with the auxiliary language modeling objective, corresponding to the entry on the GLUE leaderboard.

For ELMo, to facilitate a fair comparison with GPT and BERT, we adopt a similar fine-tuning setup where all the weights are fine-tuned. This differs from the original ELMo setup that freezes ELMo weights and trains an additional encoder module when fine-tuning. The details of our ELMo setup are described in Appendix A.

We also run our main experiment on the 12-layer BERT and the non-LM fine-tuned GPT. These results are in Table 4 in the Appendix.

Multitask Learning Strategies

To compare STILTs to alternative multitask learning regimes, we also experiment with the following two approaches: (i) a single phase of fine-tuning simultaneously on both an intermediate task and the target task; and (ii) fine-tuning simultaneously on an intermediate task and the target task, and then doing an additional phase of fine-tuning on the target task only. In the multitask learning phase, for both approaches, training steps are sampled proportionally to the sizes of the respective training sets and we do not weight the losses.
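
As an illustration of size-proportional sampling, the sketch below draws the task for each training step with probability proportional to its training-set size. The function name `sample_task_schedule` and the dataset sizes are assumptions for illustration only (the sizes are rough approximations), and the sketch makes no claim about how the sampling is implemented in the actual experiments.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_task_schedule(train_sizes: Dict[str, int], n_steps: int,
                         rng: random.Random) -> List[str]:
    """Pick the task for each step with probability proportional to its size."""
    tasks = list(train_sizes)
    weights = [train_sizes[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=n_steps)


# Approximate training-set sizes, used here only for illustration.
sizes = {"MNLI": 393_000, "RTE": 2_500}
schedule = sample_task_schedule(sizes, n_steps=10_000, rng=random.Random(0))
counts = Counter(schedule)
# Each step would then apply the sampled task's loss as-is (no loss weighting).
```

With sizes this imbalanced, the small target task accounts for well under 1% of the steps in expectation. That imbalance is one reason an additional target-only phase, as in approach (ii), may be useful.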

Models and Code

Our pretrained models and code for BERT on STILTs can be found at https://github.com/zphang/pytorch-pretrained-BERT, which is a fork of the Hugging Face implementation. We used the jiant framework for experiments on GPT and ELMo.

Results

Table 1 shows our results on GLUE with and without STILTs. Our addition of supplementary training boosts performance across many of the two-sentence tasks. We also find that most of the gains are on tasks with limited data. On each of our STILTs models, we show improved overall GLUE scores on the development set. Improvements from STILTs tend to be larger for ELMo and GPT and somewhat smaller for BERT. On the other hand, for pairs of pretraining and target tasks that are close, such as MNLI and RTE, we indeed find a marked improvement in performance from STILTs. For the two single-sentence tasks—the syntax-oriented CoLA task and the SST sentiment task—we find somewhat deteriorated performance. For CoLA, this mirrors results reported in Bowman et al. (2019), who show that few pretraining tasks other than language modeling offer any advantage for CoLA. The Best of Each score is computed based on taking the best score for each task, including no STILTs.
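
The Best of Each selection can be expressed compactly. In the sketch below, the function name `best_of_each` and the input layout are assumptions, and the scores are made-up placeholders rather than results from the paper.

```python
from typing import Dict, Tuple


def best_of_each(dev_scores: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[str, float]]:
    """For each task, pick the training scheme with the best development score.

    `dev_scores` maps scheme name -> {task name -> development-set score};
    the "no STILTs" baseline is included as one of the schemes.
    """
    tasks = next(iter(dev_scores.values())).keys()
    selection = {}
    for task in tasks:
        best_scheme = max(dev_scores, key=lambda scheme: dev_scores[scheme][task])
        selection[task] = (best_scheme, dev_scores[best_scheme][task])
    return selection


# Made-up placeholder scores, NOT results from the paper.
example = {
    "no STILTs": {"CoLA": 60.0, "RTE": 70.0},
    "MNLI": {"CoLA": 58.0, "RTE": 80.0},
    "QQP": {"CoLA": 55.0, "RTE": 72.0},
}
selection = best_of_each(example)
```

Since the baseline is one of the candidates, the selected per-task score can never fall below the score without STILTs on the development set.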

On the test set, we see similar performance gains across most tasks. Here, we compute the results for each model on STILTs, which shows scores from choosing the best corresponding model based on development set scores and evaluating on the test set. These also correspond to the selected models for Best of Each above. (For BERT, we run an additional 80 random restarts, for 100 random restarts in total, for the tasks with limited data, and select the best model based on validation score for test evaluation.) For both BERT and GPT, we show that using STILTs leads to improvements in test set performance: BERT on STILTs improves on the reported baseline by 1.4 points, setting the state of the art for the GLUE benchmark, while GPT on STILTs achieves a score of 76.9, improving on the baseline by 2.8 points and significantly closing the gap between GPT and the 12-layer BERT model with a similar number of parameters, which attains a GLUE score of 78.3.

Table 2 shows the same models fine-tuned on 5k training examples and 1k examples for each task, selected randomly without replacement. Artificially limiting the size of the training set allows us to examine the effect of STILTs in data constrained contexts. For tasks with training sets that are already smaller than these limits, we use the training sets as-is. For BERT, we show the maximum task performance across 20 random restarts for all experiments, and the data subsampling is also random for each restart.
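
A minimal sketch of this downsampling rule is shown below; the function name `subsample` is hypothetical. It samples without replacement and returns the training set unchanged when it is already within the limit.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(train_set: Sequence[T], limit: int, rng: random.Random) -> List[T]:
    """Return at most `limit` examples, drawn uniformly without replacement."""
    if len(train_set) <= limit:
        return list(train_set)  # small training sets are used as-is
    return rng.sample(list(train_set), limit)
```

Passing a differently seeded `random.Random` for each restart yields a different subsample per restart, matching the BERT setup described above.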

The results show that the benefits of supplementary training are generally more pronounced in these settings, with performance in several tasks showing improvements of more than 10 points. CoLA and SST are again the exceptions: Both tasks deteriorated moderately with supplementary training, and CoLA trained with the auxiliary language modeling objective in particular showed highly unstable results when trained on small amounts of data.

We see one obvious area for potential improvement: In our experiments, we follow the recipe for fine-tuning from the original works as closely as possible, only doing supplementary training and fine-tuning for three epochs each. Particularly in the case of the artificially data-constrained tasks, we expect that performance could be improved with more careful tuning of the training duration and learning rate schedule.

Fine-Tuning Stability

In the work that introduced BERT, Devlin et al. highlight that the larger, 24-layer version of BERT is particularly prone to degenerate performance on tasks with small training sets, and that multiple random restarts may be required to obtain a usable model. In Figure 1, we plot the distribution of performance scores for 20 random restarts for each task, using all training data and a maximum of 5k or 1k training examples. For conciseness, we only show results for BERT without STILTs, and BERT with intermediate fine-tuning on MNLI. We omit the random restarts for tasks with training sets of more than 10k examples, consistent with our training methodology.

We show that, in addition to improved performance, using STILTs significantly reduces the variance of performance across random restarts. A large part of this reduction can be attributed to the far smaller number of degenerate runs, that is, performance outliers that are close to random guessing. This effect is consistent across target tasks, though the magnitude varies from task to task. For instance, although we show above that STILTs with our four intermediate tasks does not improve model performance in CoLA and SST, using STILTs nevertheless reduces the variance across runs as well as the number of degenerate fine-tuning results.
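
One way to summarize a set of random restarts is sketched below. The text describes degenerate runs only as outliers close to random guessing, so the threshold used here (chance level plus a small `margin`) is an assumption for illustration, as is the function name `summarize_restarts`.

```python
import statistics
from typing import Dict, Sequence


def summarize_restarts(scores: Sequence[float], chance_level: float,
                       margin: float = 0.02) -> Dict[str, float]:
    """Summarize validation scores from several random restarts.

    A run counts as degenerate if its score is within `margin` of chance
    (or below it); this cutoff is an illustrative assumption.
    """
    n_degenerate = sum(1 for s in scores if s <= chance_level + margin)
    return {
        "best": max(scores),  # score of the model that would be selected
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),  # needs at least two scores
        "n_degenerate": n_degenerate,
    }
```

Comparing the `stdev` and `n_degenerate` entries for runs with and without supplementary training gives a simple numerical counterpart to the distributions plotted in Figure 1.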

Multitask Learning and STILTs

We investigate whether setups that leverage multitask learning are more effective than STILTs. We highlight results from one of the cases with the largest improvement: GPT with intermediary fine-tuning on MNLI with RTE as the target task. To better isolate the impact of multitask learning, we exclude the auxiliary language modeling training objective in this experiment. Table 3 shows all setups improve compared to only fine-tuning, with the STILTs format of consecutive single-task fine-tuning having the largest improvement. Although this does not represent an in-depth inquiry into all the ways to leverage multitask learning and balance multiple training objectives, naive multitask learning appears to yield worse performance than STILTs, at potentially greater computational cost.

Discussion

Broadly, we have shown that, across three different sentence encoders with different architectures and pretraining schemes, STILTs can lead to performance gains on many downstream target tasks. However, this benefit is not uniform. We find that sentence pair tasks seem to benefit more from supplementary training than single-sentence ones. We also find that tasks with little training data benefit much more from supplementary training. Indeed, when applied to RTE, supplementary training on the related MNLI task leads to an eight-point increase in test set score for BERT.

Overall, the benefit of STILTs is smaller for BERT than for GPT and ELMo. One possible reason is that BERT is better conditioned for fine-tuning for classification tasks, such as those in the GLUE Benchmark. Indeed, GPT uses the hidden state corresponding to the last token of the sentence as a proxy to encode the whole sentence, but this token is not used for classification during pretraining. On the other hand, BERT has a $<$CLS$>$ token which is used for classification during pretraining for its additional next-sentence-prediction objective. This token is then used in fine-tuning for classification. When adding STILTs to GPT, we bridge that gap by training the last token with the classification objective of the intermediary task. This might explain why fake-sentence-detection is a broadly beneficial task for GPT and not for BERT: Since fake-sentence-detection uses the same corpus that GPT and BERT are pretrained on, it is likely that the improvements we find for GPT are due to the better conditioning of this sentence-encoding token.

Applying STILTs also comes with little complexity or computational overhead. The same infrastructure used to fine-tune BERT or GPT models can be used to perform supplementary training. The computational cost of the supplementary training phase is another phase of fine-tuning, which is small compared to the cost of training the original model. In addition, in the case of BERT, the smaller number of degenerate runs induced by STILTs will reduce the computational cost of a full training procedure in some settings.

Our results also show where STILTs may be ineffective or counterproductive. In particular, we show that most of our intermediate tasks were actually detrimental to the single-sentence tasks in GLUE. The interaction between the intermediate task, the target task, and the use of the auxiliary language modeling objective is a subject due for further investigation. Moreover, the four intermediary training tasks we chose represent only a small sample of potential tasks, and it is likely that a more expansive survey might yield better performance on different downstream tasks. Therefore, for best target task performance, we recommend experimenting with supplementary training on several closely related data-rich tasks and using the development set to select the most promising approach for each task, as in the Best of Each formulation shown in Table 1.

Conclusion

This work represents only an initial investigation into the benefits of supplementary supervised pretraining. More work remains to be done to firmly establish when methods like STILTs can be productively applied and what criteria can be used to predict which combinations of intermediate and target tasks should work well. Nevertheless, in our initial work with four example intermediate training tasks, we showed significant gains from applying STILTs to three sentence encoders, BERT, GPT and ELMo, and set the state of the art on the GLUE benchmark with BERT on STILTs. STILTs also helps to significantly stabilize training in unstable training contexts, such as when using BERT on tasks with little data. Finally, we show that in data-constrained regimes, the benefits of using STILTs are even more pronounced, yielding up to 10-point score improvements on some intermediate/target task pairs.

Acknowledgments

We would like to thank Alex Wang, Ilya Kulikov, Nikita Nangia and Phu Mon Htut for their helpful feedback.

References

Appendix A ELMo on STILTs

We use the same architecture as Peters et al. (2018a) for the non-task-specific parameters. For task-specific parameters, we use the layer weights and the task weights described in the paper, as well as a classifier composed of max-pooling with projection and a logistic regression classifier. In contrast to the GLUE baselines and to Bowman et al. (2019), we refrain from adding many non-LM pretrained parameters by using neither pair attention nor an additional encoding layer. The whole model, including ELMo parameters, is trained during both supplementary training on the intermediate task and target-task tuning. For two-sentence tasks, we follow the model design of Wang et al. (2018) rather than that of Radford et al. (2018), since early experiments showed better performance with the former. Consequently, we run the shared encoder on the two sentences $u$ and $v$ independently to obtain their representations $u^{\prime}$ and $v^{\prime}$, and then use $[u^{\prime};v^{\prime};|u^{\prime}-v^{\prime}|;u^{\prime}*v^{\prime}]$ for our task-specific classifier. We use the default optimizer and learning rate schedule from jiant.
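
The pair representation $[u^{\prime};v^{\prime};|u^{\prime}-v^{\prime}|;u^{\prime}*v^{\prime}]$ can be illustrated with plain Python lists standing in for the two pooled sentence vectors. This is a sketch, not the jiant implementation: the function name `pair_features` is hypothetical, and it assumes that $*$ denotes the elementwise product and $|\cdot|$ the elementwise absolute value.

```python
from typing import List


def pair_features(u_enc: List[float], v_enc: List[float]) -> List[float]:
    """Concatenate [u'; v'; |u' - v'|; u' * v'] for two equal-length vectors."""
    if len(u_enc) != len(v_enc):
        raise ValueError("sentence vectors must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u_enc, v_enc)]  # elementwise |u' - v'|
    product = [a * b for a, b in zip(u_enc, v_enc)]  # elementwise u' * v'
    return u_enc + v_enc + abs_diff + product


features = pair_features([1.0, -2.0], [3.0, 0.5])
```

For sentence vectors of dimension d, the classifier input therefore has dimension 4d.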