SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer

Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, Daniel Cer

Introduction

The past few years have seen the rapid development of ever larger pre-trained language models, where it has repeatedly been shown that scaling up the model size is a key ingredient for achieving the best performance Devlin et al. (2019); Raffel et al. (2020); Brown et al. (2020). While this trend has continued to push the boundaries of possibility across various NLP benchmarks, the sheer size of these models presents a challenge for their practical application. For 100B+ parameter models, fine-tuning and deploying a separate instance of the model for each downstream task would be prohibitively expensive.

To get around the infeasibility of fine-tuning, Brown et al. (2020) propose PromptDesign, where every downstream task is cast as a language modeling task and the frozen pre-trained model performs different tasks by conditioning on manual text prompts provided at inference time. They demonstrate impressive few-shot performance with a single frozen GPT-3 model, although its performance depends highly on the choice of the prompt Zhao et al. (2021) and still lags far behind state-of-the-art fine-tuning results.

More recent work has explored methods for learning soft prompts Liu et al. (2021b); Qin and Eisner (2021); Li and Liang (2021); Lester et al. (2021), which can be seen as additional learnable parameters injected into the language model. Lester et al. (2021) propose PromptTuning, a simple method that learns a small task-specific prompt (a sequence of tunable tokens prepended to each example) for each downstream task during adaptation to condition the frozen language model to perform the task. Strikingly, as model capacity increases, PromptTuning becomes competitive with ModelTuning, which fine-tunes the entire model on each downstream task. Nevertheless, at smaller model sizes (below 11B parameters), there are still large gaps between PromptTuning and ModelTuning.

In this paper, we propose SPoT: Soft Prompt Transfer, a novel transfer learning approach in the context of prompt tuning. SPoT first trains a prompt on one or more source tasks, and then uses the resulting prompt to initialize the prompt for a target (downstream) task. Our experiments show that SPoT offers significant improvements over PromptTuning across tasks and model sizes. For instance, on the SuperGLUE benchmark Wang et al. (2019b), we obtain +10.1 and +2.4 point average accuracy improvements using the T5 Base (220M parameter) and T5 XXL (11B parameter) models Raffel et al. (2020), respectively. More importantly, SPoT is competitive with or outperforms ModelTuning across all model sizes (see Figure 1).

Motivated by these results, we investigate transferability between tasks, through the lens of soft task prompts. Our goal is to answer two questions: (a) For a given target task, when does initializing the prompt from a source task boost performance? (b) Can we use task prompts to efficiently predict which source tasks will transfer well onto a novel target task? To answer (a), we conduct a systematic study of the T5 model using 26 NLP tasks in 160 combinations of source and target tasks. Our results indicate that many tasks can benefit each other via prompt transfer. To address (b), we interpret the learned task prompts as task embeddings to construct a semantic space of tasks and formalize the similarity between tasks. We design an efficient retrieval algorithm that measures task embedding similarity, allowing practitioners to identify source tasks that will likely yield positive transfer.

To summarize, our main contributions are: (1) We propose SPoT, a novel prompt-based transfer learning approach, and show that scale is not necessary for PromptTuning to match the performance of ModelTuning; on SuperGLUE, SPoT matches or beats ModelTuning across all model sizes. (2) We conduct a large-scale and systematic study on task transferability, demonstrating conditions under which tasks can benefit each other via prompt transfer. (3) We propose an efficient retrieval method that interprets task prompts as task embeddings to construct a semantic space of tasks, and measures task embedding similarity to identify which tasks could benefit each other. (4) To facilitate future work on prompt-based learning, we will release our library of task prompts and pre-trained models, and provide NLP practitioners with practical recommendations for adapting our library at https://github.com/google-research/prompt-tuning/tree/main/prompt_tuning/spot.

Improving PromptTuning with SPoT

To improve performance of PromptTuning on a target task, SPoT introduces source prompt tuning, an intermediate training stage between language model pre-training and target prompt tuning (Figure 2, left). This stage learns a prompt on one or more source tasks (while still keeping the base model frozen), and the resulting prompt is then used to initialize the prompt for the target task. (The target task can itself be treated as one of the source tasks being mixed together.) Our approach retains all the computational benefits of PromptTuning: for each target task, it only requires storing a small task-specific prompt, enabling the reuse of a single frozen pre-trained model across all tasks. In this section, we present a generic SPoT approach where a single transferred prompt is reused for all target tasks. In §3, we explore a targeted approach that retrieves different source prompts for different target tasks.
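The two mechanical pieces of this setup, initializing the target prompt from a source prompt and prepending the prompt to each example, can be sketched in a few lines. This is a minimal illustration with toy values, not the released implementation; the function names `init_target_prompt` and `prepend_prompt` are hypothetical, and prompts are represented as plain lists of vectors.

```python
import copy

def init_target_prompt(source_prompt):
    """Initialize the target prompt as an independent copy of the source prompt."""
    return copy.deepcopy(source_prompt)

def prepend_prompt(prompt, input_embeddings):
    """Prepend the prompt vectors to one example's embedded input tokens."""
    width = len(prompt[0])
    if any(len(vec) != width for vec in prompt + input_embeddings):
        raise ValueError("prompt and input embeddings must share one width")
    return prompt + input_embeddings

# Toy values: a 2-token prompt and a 3-token example, embedding width 2.
source_prompt = [[0.1, 0.2], [0.3, 0.4]]  # stands in for a tuned source prompt
target_prompt = init_target_prompt(source_prompt)
example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
model_input = prepend_prompt(target_prompt, example)
print(len(model_input))  # 5 vectors: 2 prompt tokens + 3 input tokens
```

Only the prompt vectors would be updated during target prompt tuning; the frozen model that consumes `model_input` is deliberately left out of the sketch.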

Our frozen models are built on top of the pre-trained T5 checkpoints of all sizes: Small, Base, Large, XL, XXL with 60M, 220M, 770M, 3B, and 11B parameters, respectively. In our experiments with SPoT, we leverage the LM-adapted version of T5 (T5 1.1 checkpoints trained for an additional 100K steps using the “prefix LM” objective of Raffel et al. (2020), available at https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md), which was found to be easier to optimize for PromptTuning Lester et al. (2021).

We compare SPoT to the following baselines:

PromptTuning: The vanilla prompt tuning approach of Lester et al. (2021), where an independent prompt is directly trained on each target task.

ModelTuning and Multi-task ModelTuning: We compare prompt tuning approaches to ModelTuning, the standard fine-tuning approach Devlin et al. (2019); Raffel et al. (2020), where all model parameters are fine-tuned on each target task separately. For an apples-to-apples comparison, we include Multi-task ModelTuning, a more competitive baseline that first fine-tunes the entire model on the same mixture of source tasks used for SPoT before fine-tuning it on individual target tasks. (In preliminary experiments, we found that using the original version of T5 1.1, which was pre-trained exclusively on span corruption, for model tuning approaches results in better performance than using the LM-adapted version. We therefore report results corresponding to the original T5 1.1 for ModelTuning and Multi-task ModelTuning.)

2.1.2 Evaluation datasets

We study downstream performance on a diverse set of tasks from the GLUE Wang et al. (2019c) and SuperGLUE Wang et al. (2019b) benchmarks. These datasets include grammatical acceptability judgments (CoLA Warstadt et al. (2019)), sentiment analysis (SST-2 Socher et al. (2013)), paraphrasing/semantic similarity (MRPC Dolan and Brockett (2005), STS-B Cer et al. (2017), QQP Iyer et al. (2017)), natural language inference (MNLI Williams et al. (2018), QNLI Wang et al. (2019c), RTE (Dagan et al., 2005, et seq.), CB De Marneffe et al. (2019)), coreference resolution (WSC Levesque et al. (2012)), sentence completion (COPA Roemmele et al. (2011)), word sense disambiguation (WiC Pilehvar and Camacho-Collados (2019)), and question answering (MultiRC Khashabi et al. (2018), ReCoRD Zhang et al. (2018), BoolQ Clark et al. (2019)). We exclude the problematic WNLI Levesque et al. (2012) dataset from GLUE, following Devlin et al. (2019). We train for a fixed number of steps and report results on the validation set associated with each dataset. For tasks with multiple metrics, we average the metrics.

2.1.3 Data for source prompt tuning

As with language model pre-training, the choice of training data is crucial for successful prompt transfer. To investigate the impact of source training data on downstream performance, we compare a diverse set of source tasks.

We first consider training the prompt on a fraction of the C4 (Colossal Clean Crawled Corpus) dataset Raffel et al. (2020) using the “prefix LM” objective discussed in Raffel et al. (2020). Although this task was used to pre-train our frozen T5 models already, it could still be helpful for learning a general-purpose prompt.

Alternatively, we can train the prompt using a supervised task. We use either MNLI Williams et al. (2018) or SQuAD Rajpurkar et al. (2016) as a single source task. MNLI was shown to be helpful for many sentence-level classification tasks Phang et al. (2019), while SQuAD was found to generalize well to QA tasks Talmor and Berant (2019).

So far, we have been using a single source task. An alternative approach is multi-task training. Within T5’s unified text-to-text framework, this simply corresponds to mixing different datasets together. We explore mixing datasets from different NLP benchmarks or families of tasks, including GLUE, SuperGLUE, natural language inference (NLI), paraphrasing/semantic similarity, sentiment analysis, question answering (QA) on MRQA Fisch et al. (2019), commonsense reasoning on RAINBOW Lourie et al. (2021), machine translation, summarization, and natural language generation on GEM Gehrmann et al. (2021). See Appendix B for details about datasets. We create a mixture of source tasks from each of the NLP benchmarks/families of tasks above, and a mixture comprising all datasets (C4 + 55 labeled datasets), using the examples-proportional mixing strategy in Raffel et al. (2020) with an artificial dataset size limit $\bm{\mathcal{K}}=2^{19}$ examples.
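For readers unfamiliar with examples-proportional mixing, the sketch below illustrates the rule as described by Raffel et al. (2020): each dataset is sampled in proportion to its size, with the size capped at the limit $\bm{\mathcal{K}}$. The dataset names and sizes are hypothetical placeholders, and the function name is ours.

```python
def examples_proportional_rates(dataset_sizes, limit=2**19):
    """Sampling rate per dataset: min(size, limit), normalized to sum to 1."""
    capped = {name: min(size, limit) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: count / total for name, count in capped.items()}

# Hypothetical sizes, chosen only to show the effect of the cap.
sizes = {"large_task": 4_000_000, "medium_task": 100_000, "small_task": 2_500}
rates = examples_proportional_rates(sizes)
for name, rate in rates.items():
    print(f"{name}: {rate:.4f}")
```

The cap keeps a very large dataset from dominating the mixture: here `large_task` counts as $2^{19}$ examples rather than 4M, so smaller tasks are sampled more often than their raw share would allow.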

2.1.4 Training details

2.2 Effect of SPoT

We compare the results of SPoT and other approaches in Table 1 and Figure 1. Below, we summarize and analyze each of our findings in detail.

Our results on the GLUE and SuperGLUE benchmarks with T5 Base (Table 1) suggest that prompt transfer provides an effective means of improving performance for PromptTuning. For example, the best-performing variant of SPoT outperforms the vanilla PromptTuning approach on both GLUE and SuperGLUE by a substantial margin, obtaining +4.4 and +10.1 point average accuracy improvements, respectively. Our ablation study indicates that longer tuning is also an important ingredient for achieving our best performance, and is complementary to prompt transfer. Additionally, when longer tuning is omitted, we observe that SPoT improves stability across runs.

Within SPoT, we can compare the effectiveness of different source mixtures (see Table 1). Source prompt tuning on GLUE performs best on both GLUE and SuperGLUE, obtaining average scores of 82.8 and 73.2, respectively. (SuperGLUE tasks benefit less from source prompt tuning on SuperGLUE, likely due to the small size of these datasets.) Interestingly, unsupervised source prompt tuning on C4 (the same task used to pre-train our frozen models) still yields considerable improvements, even outperforming using SuperGLUE for SuperGLUE tasks. Using MNLI or SQuAD as a single source dataset is also particularly helpful across target tasks. Other source mixtures can lead to significant gains, with some families of tasks (e.g., NLI and paraphrasing/semantic similarity) showing more benefit than others. Mixing all the datasets together does not yield the best results, possibly due to task interference/negative transfer issues, where achieving good performance on one or more source tasks can hurt performance on a target task.

Figure 1 shows our SuperGLUE results across model sizes (see Appendix A for full results). As shown in Lester et al. (2021), PromptTuning becomes more competitive with scale, and at the XXL size, it nearly matches the performance of ModelTuning. However, at smaller model sizes, there are still large gaps between the two approaches. We show that SPoT helps close these gaps and even exceeds ModelTuning’s performance by a large margin at several model sizes, while retaining all the computational benefits conferred by PromptTuning. Finally, at the XXL size, SPoT achieves the best average score of 91.2, +1.1 points better than the strong Multi-task ModelTuning baseline, despite having 27,000$\times$ fewer task-specific parameters in both multi-task source tuning and target tuning.
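The order of magnitude of the 27,000$\times$ figure can be checked with a back-of-the-envelope count. The sketch below assumes a prompt of 100 tokens and an embedding width of 4096 for T5 XXL; neither value is stated in this section, so both are assumptions, and only the 11B model size comes from the text.

```python
prompt_length = 100      # assumption: number of tunable prompt tokens
embedding_dim = 4096     # assumption: embedding width of T5 XXL
model_params = 11e9      # T5 XXL size as stated in the text

prompt_params = prompt_length * embedding_dim        # tunable prompt parameters
ratio = model_params / prompt_params                 # full model vs. prompt
fraction_percent = 100 * prompt_params / model_params

print(f"prompt parameters: {prompt_params:,}")
print(f"ratio: about {ratio:,.0f}x")
print(f"share of model size: {fraction_percent:.4f}%")
```

Under these assumptions the prompt holds roughly 410K parameters, which is in line with both the 27,000$\times$ figure and the "less than 0.01%" task-specific parameter share mentioned later in the paper.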

As a final test of SPoT’s effectiveness, we submitted our XXL model’s predictions to the SuperGLUE leaderboard, achieving a score of 89.2. This far exceeds all previous submissions using parameter-efficient adaptation, such as GPT-3 (71.8), and almost matches fully fine-tuned T5 XXL (89.3), despite tuning 27,000$\times$ fewer parameters. (Note that the T5 submission uses the original version of T5, which was pre-trained on a multi-task mixture of unsupervised and supervised tasks, while we use T5 1.1, which was pre-trained on C4 only without mixing in supervised tasks.) To the best of our knowledge, SPoT is the first parameter-efficient adaptation approach that is competitive with methods that tune billions of parameters. See Appendix D for details.

Predicting task transferability

So far, we have seen that soft prompt transfer can significantly boost the performance of prompt tuning, but it is critical to pick the right source tasks for transfer. For instance, through an extensive search, we found that GLUE and MNLI provide excellent source tasks for transferring to individual GLUE and SuperGLUE tasks. But what about a resource-constrained scenario where a user is not able to exhaustively search for a set of source tasks? Can we predict which tasks will best transfer onto a novel target task without testing them one by one?

To investigate this, we conduct a large-scale empirical study with 26 NLP tasks. We first measure transferability across all task combinations (§3.1). Next, we show that by interpreting task prompts as task embeddings, we can construct a semantic space of tasks, wherein similar tasks cluster together (§3.2). Based on this observation, we propose a retrieval algorithm (§3.3) that leverages task embedding similarity to choose which source tasks to use for a given novel target task (Figure 2, right). Our proposed approach can eliminate 69% of the source task search space while keeping 90% of the best-case quality gain.

We study a diverse set of 16 source datasets and 10 target datasets (see Table 2). Beyond the datasets from §2, we use DocNLI Yin et al. (2021), Yelp-2 Zhang et al. (2015), CxC Parekh et al. (2021), DROP Dua et al. (2019), WinoGrande Sakaguchi et al. (2020), HellaSWAG Zellers et al. (2019), CosmosQA Huang et al. (2019), RACE Lai et al. (2017), and CR Hu and Liu (2004). We consider all 160 possible source-target pairs, and perform transfer from each source task to each target task. All source tasks are data-rich or have been shown to yield positive transfer in prior work. To simulate a realistic scenario, we use low-resource tasks (fewer than 10K training examples) as target tasks. The source tasks comprise one unsupervised task (C4) and 15 supervised tasks covering natural language inference (NLI), paraphrasing/semantic similarity, sentiment analysis, question answering (QA), and commonsense reasoning. The target tasks additionally include grammatical acceptability, word sense disambiguation, and coreference resolution.

To limit computational costs, we use T5 Base in all of our task transferability experiments. We perform $262{,}144$ prompt tuning steps on each source task. The prompt checkpoint with the highest source task validation performance is selected to initialize prompts for different target tasks. Since the target datasets are small, we only perform 100K prompt tuning steps on each target task. We repeat each experiment three times with different random seeds. Other training details match §2.1.4.

Figure 3 shows a heatmap of our results (see Appendix E for full results). In many cases, prompt transfer provides a significant gain on the target task. The transfer MNLI $\rightarrow$ CB yields the largest relative error reduction of 58.9% (from an average score of 92.7 to 97.0), followed by MNLI $\rightarrow$ COPA (29.1%) and ReCoRD $\rightarrow$ WSC (20.0%). Using the best source prompt (out of 48) for each target task dramatically improves the average score across 10 target tasks from 74.7 to 80.7. Overall, our results show effective transfer from large source tasks that involve high-level reasoning about semantic relationships among sentences (e.g., MNLI), or when the source and target tasks are similar (e.g., CxC $\rightarrow$ STS-B). Interestingly, positive transfer can occur between relatively dissimilar tasks (e.g., ReCoRD $\rightarrow$ WSC, SQuAD $\rightarrow$ MRPC, CxC $\rightarrow$ WiC). Table 7 in Appendix E contains more cases.
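The relative error reduction quoted for MNLI $\rightarrow$ CB can be reproduced from the two scores given above. The sketch assumes that error is measured as the distance from a perfect score of 100; this definition is not spelled out in the section, but it matches the reported figure.

```python
def relative_error_reduction(baseline_score, new_score, max_score=100.0):
    """Fraction of the baseline error removed when moving to the new score."""
    baseline_error = max_score - baseline_score
    new_error = max_score - new_score
    return (baseline_error - new_error) / baseline_error

# MNLI -> CB: average score improves from 92.7 to 97.0.
rer = relative_error_reduction(92.7, 97.0)
print(f"{100 * rer:.1f}%")  # prints 58.9%
```

Because the baseline on CB is already high, an absolute gain of 4.3 points removes more than half of the remaining error, which is why this metric highlights the transfer more clearly than the raw score difference.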

3.2 Defining task similarity through prompts

Since only prompt parameters are updated during prompt tuning on specific tasks, the learned prompts likely encode task-specific knowledge. This suggests that they could be used to reason about the nature of tasks and their relationships. To test this idea, we interpret task prompts as task embeddings and construct a semantic space of tasks. More concretely, we define a task’s embedding as the prompt checkpoint after training for 10K steps on that task. (Our preliminary experiments with other checkpoint alternatives, in the range 1K to 100K, yielded worse performance. We also found that measuring task similarity using task embeddings derived from a fixed prompt checkpoint (10K steps) gave better results than those derived from the best-performing prompt checkpoint per task. This suggests that prompts trained for a differing number of steps may be less directly comparable than those trained for the same length.) Note that using early checkpoints allows for quick computation of task embeddings for novel target tasks. We estimate the similarity between two tasks $t^{1},t^{2}$ by measuring the similarity between their corresponding task embeddings $\bm{e}^{1},\bm{e}^{2}$, using the following metrics:

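Cosine Similarity of Average Tokens: We compute the cosine similarity between the average pooled representations of the prompt tokens:

$$sim(t^{1},t^{2})=cos\Big(\frac{1}{\mathcal{L}}\sum_{i}\bm{e}_{i}^{1},\frac{1}{\mathcal{L}}\sum_{j}\bm{e}_{j}^{2}\Big),$$

where $\bm{e}_{i}^{1},\bm{e}_{j}^{2}$ denote the respective prompt tokens of $\bm{e}^{1},\bm{e}^{2}$, $\mathcal{L}$ denotes the prompt length (the number of prompt tokens), and $cos$ denotes the cosine similarity.

Per-token Average Cosine Similarity: We compute the average cosine similarity between every prompt token pair $(\bm{e}_{i}^{1},\bm{e}_{j}^{2})$:

$$sim(t^{1},t^{2})=\frac{1}{\mathcal{L}^{2}}\sum_{i}\sum_{j}cos(\bm{e}_{i}^{1},\bm{e}_{j}^{2}).$$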
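The two metrics, referred to in this paper as Cosine Similarity of Average Tokens and Per-token Average Cosine Similarity, can be sketched as follows. This is a minimal illustration on toy data rather than the released implementation; the function names are ours, and each task embedding is represented as a list of prompt token vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_token(prompt):
    """Average-pool the prompt tokens into a single vector."""
    return [sum(column) / len(prompt) for column in zip(*prompt)]

def cosine_of_average_tokens(e1, e2):
    """Cosine similarity between the average-pooled prompt tokens."""
    return cosine(mean_token(e1), mean_token(e2))

def per_token_average_cosine(e1, e2):
    """Mean cosine similarity over every pair of prompt tokens."""
    total = sum(cosine(u, v) for u in e1 for v in e2)
    return total / (len(e1) * len(e2))

# Toy embeddings: two prompts of 2 tokens each, width 2.
e1 = [[1.0, 0.0], [0.0, 1.0]]
e2 = [[1.0, 0.0], [0.0, 1.0]]
print(round(cosine_of_average_tokens(e1, e2), 6))  # 1.0
print(round(per_token_average_cosine(e1, e2), 6))  # 0.5
```

The toy example shows that the metrics can disagree: the pooled vectors are identical, while half of the token pairs are orthogonal, so the per-token average is lower.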

Figure 4 shows a hierarchically-clustered heatmap of cosine similarities between the task embeddings using the Cosine Similarity of Average Tokens metric. To obtain the highest resolution of similarity between two tasks, we use the average of cosine similarities between their task embeddings obtained with all three prompt tuning runs per task (9 combinations). We observe that our learned task embeddings capture many intuitive task relationships. Specifically, similar tasks group together into clusters, including QA (SQuAD, ReCoRD, and DROP; MultiRC and BoolQ), sentiment analysis (Yelp-2, SST-2, and CR), NLI (MNLI and CB; DocNLI and RTE), semantic similarity (STS-B and CxC), paraphrasing (MRPC and QQP), and commonsense reasoning (WinoGrande, HellaSWAG, and CosmosQA). We note that QNLI, which is an NLI task built from the SQuAD dataset, is not closely linked to SQuAD; this suggests that our task embeddings are more sensitive to the type of task than domain similarity. Interestingly, they also capture the unintuitive case of ReCoRD’s high transferability to WSC. Additionally, task embeddings that are derived from different prompts of the same task have high similarity scores (see Appendix F).
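The averaging over the 9 run combinations used for the heatmap can be written generically. In this sketch, `sim` stands for whichever similarity metric is chosen, and the toy scalar "embeddings" and product-based similarity are placeholders used only to keep the example short.

```python
from itertools import product

def average_similarity_across_runs(runs_a, runs_b, sim):
    """Average sim(a, b) over every pairing of runs from the two tasks."""
    pairs = list(product(runs_a, runs_b))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Toy stand-ins: three runs per task, each "embedding" a single number.
runs_task_a = [1.0, 2.0, 3.0]
runs_task_b = [1.0, 1.0, 2.0]
toy_sim = lambda a, b: a * b  # placeholder for a real similarity metric

score = average_similarity_across_runs(runs_task_a, runs_task_b, toy_sim)
print(len(list(product(runs_task_a, runs_task_b))))  # 9 combinations
```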

3.3 Predicting transferability via similarity

We leverage our task embeddings to predict and exploit task transferability. Specifically, we explore methods to predict the most beneficial source tasks for a given target task and then make use of their prompts to improve performance on the target task. To enlarge our set of source prompts, we use the prompts from each of the three different prompt tuning runs on each source task, resulting in 48 source prompts. Given a target task $t$ with task embedding $\bm{e}^{t}$ , we rank all the source prompts $\bm{\rho}^{s}$ with associated embeddings $\bm{e}^{s}$ in descending order by the similarity $sim(\bm{e}^{s},\bm{e}^{t})$ . We denote the ranked list of source prompts as $\bm{\rho}^{s_{r}}$ , where $r$ denotes the rank $(r=1,2,\ldots,48)$ . We experiment with three methods for exploiting the ranked source prompts:

Best of Top-$k$: We select the top-$k$ source prompts and use each of them individually to initialize the target prompt. This procedure requires prompt tuning $k$ times on the target task $t$. The best individual result is used for evaluating the effectiveness of this method.

Top-$k$ Weighted Average: We initialize the target prompt with a weighted average of the top-$k$ source prompts $\sum_{r=1}^{k}\alpha_{r}\bm{\rho}^{s_{r}}$ so that we only perform prompt tuning on the target task $t$ once. The weights $\alpha_{r}$ are computed as:

$$\alpha_{r}=\frac{sim(\bm{e}^{s_{r}},\bm{e}^{t})}{\sum_{l=1}^{k}sim(\bm{e}^{s_{l}},\bm{e}^{t})},$$

where $\bm{e}^{s_{r}}$ denotes the corresponding task embedding of $\bm{\rho}^{s_{r}}$.

Top-$k$ Multi-task Mixture: We first identify the source tasks whose prompts are in the top-$k$ prompts and mix their datasets and the target dataset together, using the examples-proportional mixing strategy of Raffel et al. (2020). Then, we perform source prompt tuning on this multi-task mixture and use the final prompt checkpoint to initialize the prompt for target prompt tuning.
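The ranking step and the weighted-average initialization can be sketched as below. The sketch takes the weights to be the similarity scores normalized over the top-$k$ prompts, which should be read as an assumption here; the prompt identifiers, similarity values, and function names are hypothetical, and positive similarities are assumed so that the normalization is well defined.

```python
def rank_sources(similarities):
    """Source prompt ids sorted by descending similarity to the target."""
    return sorted(similarities, key=similarities.get, reverse=True)

def top_k_weighted_average(prompts, similarities, k):
    """Average the top-k source prompts, weighted by normalized similarity."""
    top = rank_sources(similarities)[:k]
    total = sum(similarities[s] for s in top)  # assumed positive
    weights = {s: similarities[s] / total for s in top}
    n_tokens = len(prompts[top[0]])
    width = len(prompts[top[0]][0])
    mixed = [
        [sum(weights[s] * prompts[s][i][j] for s in top) for j in range(width)]
        for i in range(n_tokens)
    ]
    return mixed, weights

# Toy data: three 1-token source prompts of width 2 (hypothetical ids).
prompts = {
    "mnli_run1": [[1.0, 0.0]],
    "squad_run1": [[0.0, 1.0]],
    "c4_run1": [[5.0, 5.0]],
}
similarities = {"mnli_run1": 0.75, "squad_run1": 0.25, "c4_run1": 0.1}

init_prompt, weights = top_k_weighted_average(prompts, similarities, k=2)
print(rank_sources(similarities))  # ['mnli_run1', 'squad_run1', 'c4_run1']
```

With $k=1$ the weighted average reduces to copying the single most similar source prompt, so the same function also covers the simplest retrieval setting.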

We report the average score across all target tasks achieved by each method. For comparison, we measure the absolute and relative improvements over Baseline, i.e., prompt tuning on each target task from scratch (without any prompt transfer). For each target task $t$, we report the average and standard deviation of performance across three prompt tuning runs. Additionally, we include Oracle, the results achieved by a brute-force search to identify the best possible out of 48 source prompts for each target task.

Figure 5 shows how the relative error reduction on a target task changes as a function of the similarity between the source and target task embeddings. Overall, we observe a significant positive correlation between task embedding similarity and task transferability on four (out of 10) target tasks, including STS-B ($p<0.001$), CB ($p<0.001$), WSC ($p<0.01$), and RTE ($p<0.05$), while it is less significant on the other tasks. See Appendix G for full results. In some cases (e.g., on BoolQ), we observe a large relative error reduction (19.0%, achieved by a source prompt of MNLI) despite a low cosine similarity (0.4). This suggests that factors other than task similarity (data size, task difficulty, domain similarity, etc.) may also play a role in determining transferability.

Table 3 compares different methods for identifying which source prompts could be beneficial for a given target task. Overall, our results show the effectiveness of Best of Top- $k$ : simply choosing the source prompt with the highest task embedding similarity to the target task using Per-token Average Cosine Similarity improves over the baseline by a large margin (from an average score of 74.7 to 76.7, a 12.1% average relative error reduction). Trying all the top-3 (out of 48) source prompts for each target task yields an average score of 77.5. With larger values of $k$ , we can retain most of the benefits of oracle selection (80% of the gain in terms of average score with $k=9$ and 90% with $k=15$ ), while still eliminating over 2/3 of the candidate source prompts. Top- $k$ Weighted Average has similar average performance to Best of Top- $k$ with $k=1$ , but achieves lower variance. Thus, this may be an appealing alternative to Best of Top- $k$ in scenarios where trying multiple prompt tuning runs on the target task is prohibited. Finally, Top- $k$ Multi-task Mixture also provides a means of obtaining strong performance with an average score of 77.8, even outperforming Best of Top- $k$ with $k\leq 3$ .
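The share of the search space that retrieval removes follows directly from $k$ and the number of candidate source prompts. The short calculation below assumes that the 69% figure quoted earlier corresponds to keeping $k=15$ of the 48 source prompts, which is our reading of the numbers rather than something stated explicitly.

```python
total_prompts = 48  # candidate source prompts (16 source tasks x 3 runs)

def eliminated_fraction(k, total=total_prompts):
    """Fraction of candidate source prompts that never need to be tried."""
    return 1 - k / total

for k in (1, 3, 9, 15):
    print(f"k={k}: {100 * eliminated_fraction(k):.2f}% of candidates skipped")
```

For $k=15$ this gives 68.75%, which rounds to 69% and is slightly above two thirds, in line with the statement that over 2/3 of the candidate source prompts are eliminated.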

Related Work

Large-scale pre-trained language models have been shown to exhibit remarkable performance on many NLP tasks Devlin et al. (2019); Liu et al. (2019b); Yang et al. (2019); Lan et al. (2020); Raffel et al. (2020); Brown et al. (2020); He et al. (2021). To improve practical applicability of these models, early work uses compression techniques Sanh et al. (2019); Jiao et al. (2020); Fan et al. (2020); Sanh et al. (2020) to obtain lightweight models. Other work involves updating only small parts of the model Zaken et al. (2021) or task-specific modules, such as adapters Houlsby et al. (2019); Karimi Mahabadi et al. (2021) or low-rank structures Mahabadi et al. (2021); Hu et al. (2021), while keeping the rest of the model fixed.

Recently, Brown et al. (2020) demonstrate impressive few-shot performance with PromptDesign, where their model is conditioned on a manual text prompt at inference time to perform different tasks. Several efforts have since focused on developing prompt-based learning approaches with carefully handcrafted prompts Schick and Schütze (2021), prompt mining and paraphrasing Jiang et al. (2020b), gradient-based search for improved prompts Shin et al. (2020), and automatic prompt generation Gao et al. (2021). The use of hard prompts, however, was found to be sub-optimal and sensitive to the choice of the prompt Zhao et al. (2021); Liu et al. (2021b). As such, more recent work has shifted toward learning soft prompts Liu et al. (2021b); Qin and Eisner (2021); Li and Liang (2021); Lester et al. (2021), which can be seen as learnable parameters injected into the model. We refer readers to Liu et al. (2021a) for a recent survey on prompt-based learning research.

In concurrent work, Gu et al. (2021) also explore the effectiveness of prompt transfer. Their method uses hand-crafted pre-training tasks tailored to specific types of downstream task, and thus may be less extensible to novel downstream tasks. In contrast, we use existing tasks as source tasks and show that prompt transfer can confer benefits even when there are mismatches (e.g., in task type or input/output format) between the source and target.

We also build on existing work on task transferability Wang et al. (2019a); Liu et al. (2019a); Talmor and Berant (2019); Pruksachatkun et al. (2020); Vu et al. (2020, 2021). Prior work shows effective transfer from data-rich source tasks Phang et al. (2019), those that require complex reasoning and inference Pruksachatkun et al. (2020), or those that are similar to the target task Vu et al. (2020). There have also been efforts to predict task transferability Bingel and Søgaard (2017); Vu et al. (2020); Poth et al. (2021). Vu et al. (2020) use task embeddings derived from either the input text or the diagonal Fisher information matrix of the model, while Poth et al. (2021) explore adapter-based alternatives. Here, our use of the same model (without task-specific components) and a unified text-to-text format allows us to better model the space of tasks. Additionally, prompt-based task embeddings are comparatively cheaper to obtain.

Limitations & Future work

As other parameter-efficient adaptation methods (see §4) may outperform PromptTuning in specific situations, it would be interesting to test whether an approach similar to SPoT could extend successfully to these methods. At the same time, we believe that PromptTuning has its own merit. As pre-trained language models become larger and larger, some advantages of PromptTuning over other methods are: (1) Among current methods with learnable parameters, PromptTuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for most model sizes. (2) PromptTuning is simpler than other methods, as it does not modify the internal model architecture (cf. the Prefix-Tuning method of Li and Liang (2021), which adds a prefix to each layer of both the Transformer encoder and decoder); as such, PromptTuning allows mixed-task inference and facilitates transfer learning between tasks. (3) As model capacity increases, PromptTuning becomes more competitive with ModelTuning; to the best of our knowledge, this has not been shown for other methods. (4) Soft prompts could possibly be interpreted as natural language instructions.

Additionally, since our prompt-based task embedding approach does not capture all of the factors that influence task transferability, we leave further exploration of other task embedding methods to future work.

Conclusion

In this paper, we study transfer learning in the context of prompt tuning. We show that scale is not necessary for PromptTuning to match the performance of ModelTuning. On SuperGLUE, our SPoT approach matches or even exceeds the performance of ModelTuning by a large margin across model sizes while being more parameter-efficient. Our large-scale study on task transferability indicates that tasks can benefit each other via prompt transfer in various scenarios. Finally, we demonstrate that task prompts can be interpreted as task embeddings to formalize the similarity between tasks. We propose a simple yet efficient retrieval approach that measures task similarity to identify which source tasks could confer benefits to a novel target task. Taken as a whole, we hope that our work will spur more research into prompt-based transfer learning.

Acknowledgements

We thank Mohit Iyyer, Sebastian Ruder, Kalpesh Krishna, Thang Luong, Quoc Le, and the members of the Descartes team and the UMass NLP group for helpful discussion and feedback. We would also like to thank Grady Simon, Lucas Dixon, Slav Petrov, Nader Akoury, Haw-Shiuan Chang, Katherine Thai, Marzena Karpinska, and Shufan Wang for their comments on this manuscript. Finally, we are grateful to Vamsi Aribandi for his work on preprocessing several datasets used in our experiments.

References

Appendices

Appendix A Full results for Figure 1

Table 4 shows the performance of different model tuning and prompt tuning methods (described in §2.1.1) on the SuperGLUE benchmark.

Appendix B Source datasets used in our SPoT experiments in §2

Figure 6 displays the datasets used in our SPoT experiments in §2. In addition to the C4 unlabeled dataset Raffel et al. (2020), we use 55 labeled datasets. These datasets come from common NLP benchmarks/families of tasks, namely:

GLUE Wang et al. (2019c), including CoLA Warstadt et al. (2019), SST-2 Socher et al. (2013), MRPC Dolan and Brockett (2005), QQP Iyer et al. (2017), STS-B Cer et al. (2017), MNLI Williams et al. (2018), QNLI Wang et al. (2019c), and RTE (Dagan et al., 2005, et seq.).

SuperGLUE Wang et al. (2019b), including BoolQ Clark et al. (2019), CB De Marneffe et al. (2019), COPA Roemmele et al. (2011), MultiRC Khashabi et al. (2018), ReCoRD Zhang et al. (2018), RTE, WiC Pilehvar and Camacho-Collados (2019), and WSC Levesque et al. (2012).

Natural language inference (NLI), including ANLI Nie et al. (2020), CB, DocNLI Yin et al. (2021), MNLI, QNLI, RTE, and SNLI Bowman et al. (2015).

Paraphrasing/semantic similarity, including CxC Parekh et al. (2021), MRPC, QQP, and STS-B.

Sentiment analysis, including CR Hu and Liu (2004), GoEmotions Demszky et al. (2020), Sentiment140 Go et al. (2009), SST-2, and Yelp-2 Zhang et al. (2015).

Question answering (QA) on MRQA Fisch et al. (2019), including SQuAD Rajpurkar et al. (2016), NewsQA Trischler et al. (2017), TriviaQA Joshi et al. (2017), SearchQA Dunn et al. (2017), HotpotQA Yang et al. (2018), and NaturalQuestions (NQ Kwiatkowski et al. (2019)).

Commonsense reasoning on RAINBOW Lourie et al. (2021) including $\alpha$ NLI Bhagavatula et al. (2020), CosmosQA Huang et al. (2019), HellaSWAG Zellers et al. (2019), PIQA Bisk et al. (2020), SocialIQa Sap et al. (2019), and WinoGrande Sakaguchi et al. (2020).

Machine translation, including WMT EnDe Bojar et al. (2014), WMT EnFr Bojar et al. (2015), and WMT EnRo Bojar et al. (2016).

Summarization, including Aeslc Zhang and Tetreault (2019), BillSum Kornilova and Eidelman (2019), CNN/Dailymail Hermann et al. (2015); See et al. (2017), Wikilingua Ladhak et al. (2020), Gigaword Graff et al. (2003); Rush et al. (2015), MultiNews Fabbri et al. (2019), Newsroom Grusky et al. (2018), SAMSum Gliwa et al. (2019), and XSum Narayan et al. (2018).

Natural language generation on GEM Gehrmann et al. (2021), including CommonGen Lin et al. (2020), DART Nan et al. (2021), E2E Dušek et al. (2019), SGD Rastogi et al. (2020), WebNLG Gardent et al. (2017), WikiAuto Jiang et al. (2020a), XSum, and Wikilingua.

Appendix C Additional training details

For PromptTuning, following Lester et al. (2021), we initialize the prompt tokens with embeddings that represent an enumeration of the output classes, with a back-off to sampled vocabulary to fill any remaining prompt positions.
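The initialization described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function name `init_prompt`, the `top_k` cutoff for the sampling pool, and the assumption that token ids are ordered by frequency are all hypothetical choices made for the example.

```python
import numpy as np

def init_prompt(embedding_table, class_token_ids, prompt_length, top_k=5000, seed=0):
    """Return an initial prompt of shape (prompt_length, d_model).

    embedding_table: (vocab_size, d_model) array of frozen input embeddings.
    class_token_ids: token ids enumerating the output classes (hypothetical input).
    top_k: size of the sampling pool; assumes low ids are the most common tokens.
    """
    rng = np.random.default_rng(seed)
    # Start with the embeddings of the output class tokens, in order.
    ids = list(class_token_ids)[:prompt_length]
    # Back off: fill any remaining positions with tokens sampled from the vocabulary.
    n_missing = prompt_length - len(ids)
    if n_missing > 0:
        pool = min(top_k, embedding_table.shape[0])
        # Sample without replacement unless the pool is too small.
        sampled = rng.choice(pool, size=n_missing, replace=n_missing > pool)
        ids.extend(int(i) for i in sampled)
    # Copy so that later prompt updates never touch the frozen embedding table.
    return embedding_table[np.asarray(ids)].copy()
```

The returned array is the only tensor that would be trained; the embedding table itself stays frozen.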

For model tuning approaches, we use the default hyperparameters for T5 Raffel et al. (2020), i.e., learning rate 0.001, Adafactor optimizer with pre-training parameter states restored, and dropout probability 0.1. To improve the model tuning baselines, we perform a sweep over the batch size hyperparameter and select $2^{16}$ tokens per batch, following Lester et al. (2021).

Appendix D Details of our SuperGLUE submission

Table 5 shows the performance of our SPoT XXL SuperGLUE submission, along with several strong competitors from the public SuperGLUE leaderboard. Apart from the human baseline, the top-7 submissions all tune >3B parameters directly on the final tasks. Only three previous SuperGLUE submissions use parameter-efficient adaptation, in the sense of tuning <1M parameters on the final tasks; all other submissions tune >50M parameters. (The “AILabs Team, Transformers” submission is listed as tuning 3M parameters, but we suspect this is in error, as the submission mentions using the T5-3B and T5-Large models.)

Our SPoT submission achieves a score of 89.2, which far exceeds all other parameter-efficient adaptation methods, including GPT-3, which benefits from over 10$\times$ more frozen parameters (although it uses no tuned parameters). Compared to WARP Hambardzumyan et al. (2021), our SPoT approach tunes 16$\times$ more parameters (410K vs. 25K), and benefits from 50$\times$ more frozen parameters.

To the best of our knowledge, SPoT is the first parameter-efficient adaptation approach that is competitive with methods that tune billions of parameters. Most notably, SPoT’s performance almost matches that of fully fine-tuned T5 XXL (89.3), despite building on the same underlying model and tuning 27,000$\times$ fewer parameters. We note that SPoT outperforms T5 on three of the eight SuperGLUE tasks (namely, CB, COPA, and RTE).
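The parameter ratios quoted in this appendix can be checked with a few lines of arithmetic. The prompt length (100), embedding dimension (4096), and the round figure of 11B parameters for T5 XXL are assumptions made for this sketch; only the 410K and 25K tuned-parameter counts come from the text above.

```python
# Assumed configuration (not stated in this appendix).
PROMPT_LENGTH = 100      # assumed number of prompt tokens
EMBED_DIM = 4096         # assumed embedding dimension of T5 XXL
T5_XXL_PARAMS = 11e9     # assumed approximate size of T5 XXL

WARP_TUNED = 25_000      # tuned parameters reported for WARP in the text

# A soft prompt has one vector of size EMBED_DIM per prompt token.
spot_tuned = PROMPT_LENGTH * EMBED_DIM
ratio_vs_warp = spot_tuned / WARP_TUNED
ratio_vs_full = T5_XXL_PARAMS / spot_tuned

print(f"SPoT tuned parameters: {spot_tuned:,}")
print(f"Tuned parameters relative to WARP: {ratio_vs_warp:.1f}x")
print(f"Reduction relative to full fine-tuning: {ratio_vs_full:,.0f}x")
```

Under these assumptions the prompt has 409,600 parameters, which is consistent with the rounded 410K, 16$\times$, and 27,000$\times$ figures above.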

Appendix E Task transferability results

The full results of our task transferability experiments can be found in Table 6. We show that in many cases, initializing the prompt to that of a source task can provide a significant gain on a target task. Table 7 displays positive transfers with more than 10% relative error reduction on the target task.
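For concreteness, relative error reduction can be computed as in the sketch below. It assumes that scores lie on a 0 to 100 scale and that error is measured as the gap to the maximum score; the function name and the example numbers are hypothetical and are not taken from Table 6 or Table 7.

```python
def relative_error_reduction(baseline_score, transfer_score, max_score=100.0):
    """Fraction of the baseline error removed by transferring from a source prompt.

    Assumes error = max_score - score (an assumption made for this sketch).
    """
    baseline_error = max_score - baseline_score
    if baseline_error <= 0:
        raise ValueError("baseline has no remaining error to reduce")
    transfer_error = max_score - transfer_score
    return (baseline_error - transfer_error) / baseline_error

# Hypothetical example: the baseline scores 80.0, transfer raises it to 83.0.
# Error drops from 20.0 to 17.0, a 15% relative error reduction.
print(relative_error_reduction(80.0, 83.0))
```

A negative value under this definition would indicate negative transfer, i.e., the source prompt hurt the target task.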

Appendix F Task embedding similarity

In Figure 7, we show a clustered heatmap of cosine similarities between the task embeddings of the 26 NLP tasks we study in our task transferability experiments. For each task, we include the task embeddings from all three prompt tuning runs on the task. As can be seen, our task embeddings capture task relationships: similar tasks cluster together. Additionally, task embeddings that are derived from different prompts of the same task are linked together.
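The pairwise similarities underlying such a heatmap can be computed as sketched below. This assumes each task embedding has already been reduced to a single fixed-size vector; how the vectors are obtained from the prompts is outside the scope of the sketch, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(task_embeddings):
    """Return the (n, n) matrix of cosine similarities between n embedding vectors."""
    x = np.asarray(task_embeddings, dtype=float)
    # Normalize each row to unit length, guarding against zero vectors.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is their cosine similarity.
    return unit @ unit.T
```

With three runs per task, the input would contain three rows per task, and rows belonging to the same task would be expected to show high mutual similarity.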

Appendix G Correlation between task similarity and task transferability

Figure 8 shows how the relative error reduction on a target task changes as a function of the similarity between the source and target task embeddings.