What to Pre-Train on? Efficient Intermediate Task Selection

Clifton Poth, Jonas Pfeiffer, Andreas Rücklé, Iryna Gurevych

Introduction

Large pre-trained language models (LMs) are continuously pushing the state of the art across various NLP tasks. The established procedure performs self-supervised pre-training on a large text corpus and subsequently fine-tunes the model on a specific target task (Devlin et al., 2019; Liu et al., 2019b). The same procedure has also been applied to adapter-based training strategies, which achieve on-par task performance to full model fine-tuning while being considerably more parameter efficient (Houlsby et al., 2019) and faster to train (Rücklé et al., 2021). Adapters are new weights introduced at every layer of a pre-trained transformer model. To fine-tune a model on a downstream task, all pre-trained transformer weights are frozen and only the newly introduced adapter weights are trained. Besides being more efficient, adapters are also highly modular, enabling a wider range of transfer learning techniques (Pfeiffer et al., 2020b, 2021a, 2021b; Üstün et al., 2020; Vidoni et al., 2020; Rust et al., 2021; Ansell et al., 2021).

Extending the established two-step learning procedure, incorporating intermediate stages of knowledge transfer can yield further gains for fully fine-tuned models. For instance, Phang et al. (2018) sequentially fine-tune a pre-trained language model on a compatible intermediate task before target task fine-tuning. It has been shown that this is most effective for low-resource target tasks. However, not all task combinations are beneficial, and many yield decreased performance (Phang et al., 2018; Wang et al., 2019a; Pruksachatkun et al., 2020). The abundance of diverse labeled datasets as well as the continuous development of new pre-trained LMs calls for methods that efficiently identify intermediate datasets that benefit the target task.

So far, it is unclear how adapter-based approaches behave with intermediate fine-tuning. In the first part of this work, we thus establish that this setup results in similar gains for adapters as have been shown for full model fine-tuning (Phang et al., 2018; Pruksachatkun et al., 2020; Gururangan et al., 2020). Focusing on a low-resource target task setup, we find that only a subset of intermediate adapters yield positive gains, while others hurt the performance considerably (see Table 1 and Figure 2). Our results demonstrate the need for methods that efficiently identify beneficial intermediately trained adapters.

In the second part, we leverage the transfer results from part one to automatically rank and identify beneficial intermediate tasks. With the rise of large publicly accessible repositories for NLP models (Wolf et al., 2020; Pfeiffer et al., 2020a), the chances of finding pre-trained models that yield positive transfer gains are high. However, it is infeasible to brute-force the identification of the best intermediate task. Existing approaches have focused on beneficial task selection for multi-task learning (Bingel and Søgaard, 2017), full fine-tuning of intermediate and target transformer-based LMs for NLP tasks (Vu et al., 2020), adapter-based models for vision tasks (Puigcerver et al., 2021), and unsupervised approaches for zero-shot transfer for community question answering (Rücklé et al., 2020). Each of these works requires different types of data, such as intermediate task data and/or intermediate model weights, which, depending on the scenario, are potentially not accessible. Bingel and Søgaard (2017) and Vu et al. (2020) require access to both intermediate task data and models, Puigcerver et al. (2021) require access to only the intermediate model, and Rücklé et al. (2020) only to the intermediate task data.

In this work we thus aim to address the efficiency aspect of transfer learning in NLP from multiple angles, resulting in the following contributions: 1) We focus on adapter-based transfer learning, which is considerably more parameter efficient (Houlsby et al., 2019) and computationally efficient (Rücklé et al., 2021) than full model fine-tuning, while achieving on-par performance; 2) We evaluate sequential fine-tuning of adapter-based approaches on a diverse set of 42 intermediate and 11 target tasks (i.e. classification, multiple choice, question answering, and sequence tagging); 3) We identify the best intermediate task for transfer learning without the need for computationally expensive, explicit training on all potential candidates. We compare different selection techniques, consolidating previously proposed and new methods; 4) We provide a thorough analysis of the different techniques, available data scenarios, and task and model types, thus presenting deeper insights into the best approach for each respective setting; 5) We provide computational cost estimates, enabling informed decision making for trade-offs between expense and downstream task performance.

Related Work

Phang et al. (2018) show that training on intermediate tasks results in performance gains for many target tasks. Subsequent work further explores the effects on more diverse sets of tasks (Wang et al., 2019a; Talmor and Berant, 2019; Liu et al., 2019a; Sap et al., 2019; Pruksachatkun et al., 2020; Vu et al., 2020). Wang et al. (2019a), Yogatama et al. (2019), and Pruksachatkun et al. (2020) emphasize the risks of catastrophic forgetting and negative transfer results, finding that the success of sequential transfer varies largely when considering different intermediate tasks.

While previous work has shown that intermediate task training improves the performance on the target task in full fine-tuning setups, we establish that the same holds true for adapter-based training.

Predicting Beneficial Transfer Sources

Automatically selecting intermediate tasks that yield transfer gains is critical when considering the increasing availability of tasks and models.

Proxy estimators have been proposed to evaluate the transferability of pre-trained models towards a target task. Nguyen et al. (2020), Li et al. (2021) and Deshpande et al. (2021) estimate the transferability between classification tasks by building an empirical classifier from the source and target task label distribution. Puigcerver et al. (2021) experiment with multiple model selection methods, including kNN proxy models to estimate the target task performance. In a similar direction, Renggli et al. (2020) study proxy models based on kNN and linear classifiers, finding that a hybrid combination of task-aware and task-agnostic strategies yields the best results.

Bingel and Søgaard (2017) find that gradients of the learning curves correlate with multi-task learning success. Zamir et al. (2018) build a taxonomy of vision tasks, giving insights into non-trivial transfer relations between tasks. Multiple works propose using embeddings that capture statistics, features, or the domain of a dataset. Edwards and Storkey (2017) leverage variational autoencoders (Kingma and Welling, 2014) to encode all samples of a dataset. Jomaa et al. (2019) train a dataset meta-feature extractor that can successfully capture the domain of a dataset. Vu et al. (2020) encode each training example of a dataset by averaging over BERT’s representations of the last layer. Rücklé et al. (2020) capture domain similarity by embedding dataset examples using a sentence embedding model. Achille et al. (2019) and Vu et al. (2020) compute task embeddings based on the Fisher Information Matrix of a probe network.

While many different methods have been proposed, a direct comparison among them is lacking. Additionally, previous work has only focused on BERT, which we find to behave considerably differently from other model types such as RoBERTa for some methods. In this work we aim to consolidate all methods and experiment with newer model types to provide a more thorough perspective.

Adapter-Based Sequential Transfer

We present a large-scale study on adapter-based sequential fine-tuning, finding that around half of the task combinations yield no positive gains. This demonstrates the importance of finding approaches that efficiently identify suitable intermediate tasks.

We select QA tasks from the MultiQA repository (Talmor and Berant, 2019) and sequence tagging tasks from Liu et al. (2019a). Most of our classification tasks are available in the (Super)GLUE (Wang et al., 2018, 2019b) benchmarks. We experiment with multiple choice commonsense reasoning tasks to cover a broader range of task types and domains. In total, we experiment with 53 tasks, divided into 42 intermediate and 11 target tasks. The choice of our intermediate and target task split was motivated by previous work (Sap et al., 2019; Vu et al., 2020, inter alia). For more details see Appendix A.

Experimental Setup

We experiment with BERT-base (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019b), training adapters with the configuration proposed by Pfeiffer et al. (2021a). We adopt the two-stage sequential fine-tuning setup of Phang et al. (2018), splitting the tasks into two disjoint subsets $\mathcal{S}$ and $\mathcal{T}$ , denoted as intermediate and target tasks, respectively. For each pair $(s,t)$ with $s\in\mathcal{S}$ and $t\in\mathcal{T}$ , we first train a randomly initialized adapter on $s$ (keeping the base model’s parameters fixed). We then fine-tune the trained adapter on $t$ . For more details please refer to Appendix B.

For target task fine-tuning, we simulate a low-resource setup by limiting the maximum number of training examples on $t$ to 1000. This choice is motivated by the observation that smaller target tasks benefit the most from sequential fine-tuning while at the same time revealing the largest performance variances (Phang et al., 2018; Vu et al., 2020). Low-resource setups, thus, reflect the most beneficial application setting for our transfer learning strategy and also allow us to more thoroughly study different transfer relations.

Results

Figure 2 shows the relative transfer gains and Table 1 lists the absolute scores of all intermediate and target task combinations for RoBERTa. We list the corresponding transfer results for BERT in Table 10 of the Appendix. We observe large variations in transfer gains (and losses) across the different combinations. Even though larger variances may be explained by a higher task difficulty (see ‘No Transfer’ in Table 1), they also illustrate the heterogeneity and potential of sequential fine-tuning in our adapter-based setting. At the same time, we find several cases of transfer losses, with up to 60% lower performance (see Figure 2), potentially occurring due to catastrophic forgetting.

Overall, for RoBERTa, 243 ( $53\%$ ) transfer combinations yield positive transfer gains whereas 203 ( $44\%$ ) yield losses. The mean of all transfer gains is $2.3\%$ . However, from our eleven target tasks only five benefit on average (see ‘Avg. Transfer’ in Table 1). This illustrates the high risk of choosing the wrong intermediate tasks. Avoiding such hurtful combinations and efficiently identifying the best ones is necessary; evaluating all combinations is inefficient and often not feasible.
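Summary statistics of this kind can be derived from a table of transfer scores. The following minimal sketch assumes that the relative transfer gain of a pair $(s,t)$ is the transfer score minus the ‘No Transfer’ score, divided by the ‘No Transfer’ score; this definition, the function names, and the toy numbers are illustrative assumptions and are not taken from Table 1.

```python
def relative_gain(transfer_score, baseline_score):
    """Relative change of the transfer score over the no-transfer baseline."""
    return (transfer_score - baseline_score) / baseline_score


def summarize(transfer, baseline):
    """Count positive/negative pairs and compute the mean relative gain.

    transfer: dict mapping (intermediate, target) -> score after transfer
    baseline: dict mapping target -> score without intermediate training
    """
    gains = {
        pair: relative_gain(score, baseline[pair[1]])
        for pair, score in transfer.items()
    }
    n_pos = sum(g > 0 for g in gains.values())
    n_neg = sum(g < 0 for g in gains.values())
    mean_gain = sum(gains.values()) / len(gains)
    return gains, n_pos, n_neg, mean_gain


# Hypothetical scores for two intermediate and two target tasks.
baseline = {"t1": 50.0, "t2": 80.0}
transfer = {
    ("s1", "t1"): 55.0,
    ("s2", "t1"): 45.0,
    ("s1", "t2"): 80.0,
    ("s2", "t2"): 84.0,
}
gains, n_pos, n_neg, mean_gain = summarize(transfer, baseline)
```

Pairs with a gain of exactly zero count as neither positive nor negative, which is one way the positive and negative shares can sum to less than 100%.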

We further find that the best performing intermediate tasks for BERT and RoBERTa overlap considerably as illustrated in Figure 1, with transfer performances correlating with a Spearman correlation of 0.94 when averaged over all settings, and 0.68 when averaged per target task.

Methods for the Efficient Selection of Intermediate Tasks

We now present different model selection methods, and later in §5, study their effectiveness in our setting outlined above. We group the different methods based on the assumptions they make with regard to the availability of intermediate task data $D_{S}$ and intermediate models $M_{S}$ . Access to both can be expensive when considering large pre-trained model repositories with hundreds of tasks.

A setting in which there is access to neither the intermediate task data $D_{S}$ nor models trained on the data $M_{S}$ can be regarded as an educated guess scenario. The selection criterion can only rely on metadata available for an intermediate task dataset.

Dataset Size. Under the assumption that more data implies better transfer performance, the selection criterion denoted as Size ranks all intermediate tasks in descending order by the training data size.

Task Type. Under the assumption that similar objective functions transfer well, we pre-select the subset of tasks of the same type. This approach may be combined with a random selection of the remaining tasks, or with ranking them by size.

Intermediate Task Data

With an abundance of available datasets (e.g. via https://huggingface.co/datasets) and the continuous development of new LMs, fine-tuned versions for every task-model combination are not (immediately) available. The following methods, thus, leverage the intermediate task data $D_{S}$ without requiring the respective fine-tuned models $M_{S}$ .

Text Embeddings (TextEmb). Vu et al. (2020) pass each example through a LM and average over the output representations of the final layer (across all examples and all input tokens). Assuming that similar embeddings imply positive transfer gains, they rank the intermediate tasks according to their embeddings’ cosine similarity to the target task.

SBERT Embeddings (SEmb). Sentence embedding models such as Sentence-BERT (SBERT; Reimers and Gurevych, 2019) may be better suited to represent the dataset examples. Similar to TextEmb, we rank the intermediate tasks according to their embedding cosine similarity.
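Both embedding-based criteria reduce to the same ranking step once per-example vectors are available. The sketch below is a minimal illustration under stated assumptions: the per-example vectors are hypothetical 3-dimensional stand-ins for LM or SBERT outputs, and the helper names (`mean_embedding`, `rank_by_embedding`) are ours rather than part of any library.

```python
import math


def mean_embedding(example_embeddings):
    """Average per-example vectors into a single dataset embedding."""
    dim = len(example_embeddings[0])
    n = len(example_embeddings)
    return [sum(vec[d] for vec in example_embeddings) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_embedding(target_embedding, intermediate_embeddings):
    """Return intermediate task names, most similar to the target first."""
    return sorted(
        intermediate_embeddings,
        key=lambda name: cosine_similarity(
            target_embedding, intermediate_embeddings[name]
        ),
        reverse=True,
    )


# Hypothetical example vectors; real ones would come from an LM or SBERT.
target = mean_embedding([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
candidates = {
    "task_a": mean_embedding([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]),
    "task_b": mean_embedding([[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]]),
    "task_c": mean_embedding([[0.5, 0.5, 0.0], [0.4, 0.6, 0.0]]),
}
ranking = rank_by_embedding(target, candidates)
```

Because each dataset is represented by a single vector, the intermediate task embeddings can be computed once and shared, so only the target dataset has to be embedded at selection time.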

Intermediate Model

Scenarios in which we only have access to the trained intermediate models ( $M_{S}$ ) occur when the training data is proprietary or if implementing all datasets is too tedious. With the availability of model repositories (Wolf et al., 2020; Pfeiffer et al., 2020a), such approaches can be implemented without requiring additional data during model upload (i.e. in contrast to TaskEmbs, where the training dataset information needs to be made available). The following describes methods only requiring access to the intermediate models $M_{S}$ .

Few-Shot Fine-Tuning (FSFT). Fine-tuning of all available intermediate task models on the entire target task is infeasible. As an alternative, we can train models for a few steps on the target task to approximate the final performance. After $N$ steps on the target task, we rank the intermediate models based on their respective transfer performance.

Proxy Models. Following Puigcerver et al. (2021), we leverage simple proxy models to obtain a performance estimate of each trained model $M_{S}$ on the target dataset $D_{T}$ . Specifically, we experiment with k-Nearest Neighbors (kNN), with $k=1$ and Euclidean distance, and logistic/linear regression (linear) as proxy models. For both, we first compute $\mathbf{h}_{x_{i}}^{M}$ , the token-wise averaged output representations of $M_{S}$ , for each training input $x_{i}\in D_{T}$ . Using these, we define $D_{T}^{M}=\{(\mathbf{h}_{x_{i}}^{M},y_{i})\}_{i=1}^{N}$ as the target dataset embedded by $M_{S}$ . In the next step, we apply the proxy model on $D_{T}^{M}$ and obtain its performance using cross-validation. By repeating this process for each intermediate task model, we obtain a list of performance scores which we leverage to rank the intermediate tasks.
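The kNN variant of this procedure can be sketched in a few lines. The example below is a simplified illustration, not the implementation used in the experiments: the embedded datasets are hypothetical one-dimensional toy data, the folds are formed by interleaving indices rather than by random shuffling, and all function names are our own.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_nn_predict(train, query):
    """Label of the training point closest to `query` (k = 1)."""
    nearest = min(train, key=lambda pair: euclidean(pair[0], query))
    return nearest[1]


def cross_val_accuracy(dataset, n_folds=5):
    """Mean held-out accuracy of 1-NN over `n_folds` interleaved folds."""
    fold_scores = []
    for fold in range(n_folds):
        test = [ex for i, ex in enumerate(dataset) if i % n_folds == fold]
        train = [ex for i, ex in enumerate(dataset) if i % n_folds != fold]
        correct = sum(one_nn_predict(train, x) == y for x, y in test)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / n_folds


def rank_models(embedded_targets, n_folds=5):
    """Rank intermediate models by proxy score on their embedded target data."""
    scores = {
        name: cross_val_accuracy(data, n_folds)
        for name, data in embedded_targets.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical embedded target datasets D_T^M for two intermediate models:
# one separates the two classes cleanly, the other interleaves them.
separable = [([float(i % 2) * 10 + 0.1 * i], i % 2) for i in range(20)]
mixed = [([0.1 * i], i % 2) for i in range(20)]
ranking, scores = rank_models({"model_good": separable, "model_poor": mixed})
```

The proxy never updates the intermediate model itself; it only needs one forward pass per target example to build the embedded dataset.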

Intermediate Model and Task Data

Access to both intermediate datasets $D_{S}$ and intermediate models $M_{S}$ provides a complete picture of the intermediate task, as all previously mentioned methods are applicable in this scenario. Further methods which require access to both are:

Task Embeddings (TaskEmb). Achille et al. (2019) and Vu et al. (2020) obtain task embeddings via the Fisher Information Matrix (FIM). The FIM captures how sensitive the loss function is towards small perturbations in the weights of the model and thus gives an indication on the importance of certain weights towards solving a task.

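Given the model weights $\theta$ and the joint distribution of task features and labels $P_{\theta}(X,Y)$ , we can define the FIM as the expected covariance of the gradients of the log-likelihood w.r.t. $\theta$ :

$$F_{\theta}=\mathbb{E}_{x,y\sim P_{\theta}(X,Y)}\left[\nabla_{\theta}\log P_{\theta}(y|x)\,\nabla_{\theta}\log P_{\theta}(y|x)^{\top}\right]$$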

We follow the implementation details given in Vu et al. (2020). For a dataset $D$ and a model $M$ fine-tuned on $D$ , we compute the empirical FIM based on $D$ ’s examples. The task embeddings are the diagonal entries of the FIM.
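To make the diagonal empirical FIM concrete, the following sketch computes it for a toy logistic regression model, which stands in for the fine-tuned model $M$ purely for illustration. It assumes that the expectation is approximated by averaging squared per-example gradients of the log-likelihood using the observed labels; the data and function names are hypothetical, and the actual implementation details follow Vu et al. (2020) and are not reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood_gradient(weights, x, y):
    """Gradient of log P(y | x) w.r.t. the weights of a logistic regression."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return [(y - p) * xi for xi in x]


def diagonal_fim(weights, dataset):
    """Average of squared per-example gradients: diagonal of the empirical FIM."""
    diag = [0.0] * len(weights)
    for x, y in dataset:
        grad = log_likelihood_gradient(weights, x, y)
        for j, g in enumerate(grad):
            diag[j] += g * g
    return [d / len(dataset) for d in diag]


# Hypothetical toy data: the second feature is always zero, so the
# corresponding weight has no influence on the likelihood.
data = [([1.0, 0.0], 1), ([2.0, 0.0], 1), ([-1.0, 0.0], 0), ([-2.0, 0.0], 0)]
task_embedding = diagonal_fim([0.0, 0.0], data)
```

The resulting vector has one entry per weight; entries near zero mark weights that barely affect the likelihood on this dataset, which is what makes the diagonal usable as a task embedding.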

Few-Shot Task Embeddings (FS-TaskEmb). We also leverage task embeddings in our few-shot scenario outlined above (see FSFT), where we fine-tune intermediate models for a few steps on the target dataset. With very few training instances, the accuracy scores of FSFT (alone) may not be reliable indicators of the final transfer performances. As an alternative, we compute the TaskEmb similarity of each intermediate model before and after training $N$ steps on the target task. We then rank all intermediate models in decreasing order of this similarity.

Experimental Setup

We evaluate the approaches of §4, each having the objective to rank the intermediate adapters $s\in\mathcal{S}$ with respect to their performance on $t\in\mathcal{T}$ when applied in a sequential adapter training setup. We leverage the transfer performance results of our 462 experiments obtained in §3 for our ranking task.

If not otherwise mentioned, we follow the experimental setup as described in §3. We describe method specific hyperparameters in the following.

SEmb. We use Sentence-(Ro)BERT(a)-base models, fine-tuned on NLI and STS tasks, in concordance with the respective target model type.

FSFT. We fine-tune each intermediate adapter on the target task for one full epoch and rank them based on their target task performances. As this represents a rather optimistic estimate of the few-shot transfer performance, in Appendix D we also investigate settings in which we train for only 5, 10, or 25 update steps.

Proxy Models. For both kNN and linear, we obtain performance scores with 5-fold cross-validation on each target task. The architectures slightly vary across task types. For classification, regression, and multiple-choice target tasks, proxy models predict the label or answer choice. For sequence tagging tasks, each token in a sequence represents a training instance of $D_{T}^{M}$ , with the tag being the class label. Since this would increase the total number of training examples, we randomly select 1000 embedded examples from $D_{T}$ , to maintain equal sizes of $D_{T}^{M}$ across all target tasks. We do not study proxy models on extractive QA tasks as they cannot directly be transformed into classification tasks.

TaskEmb. We perform standard fine-tuning of randomly initialized adapter modules within the pre-trained LM to obtain task embeddings.

FS-TaskEmb. We follow the setup of FSFT by training for one epoch (50 update steps).

Metrics

We compute the NDCG (Järvelin and Kekäläinen, 2002), a widely used information retrieval metric that evaluates a ranking with attached relevances (which correspond to our transfer results of §3).

Furthermore, we calculate Regret@k (Renggli et al., 2020), which measures the relative performance difference between the top $k$ selected intermediate tasks and the optimal intermediate task:

$$\text{Regret@}k=\frac{O(\mathcal{S},t)-M_{k}(\mathcal{S},t)}{O(\mathcal{S},t)}$$

where $T(s,t)$ is the performance on target task $t$ when transferring from intermediate task $s$ . $O(\mathcal{S},t)$ denotes the expected target task performance of an optimal selection. $M_{k}(\mathcal{S},t)$ is the highest performance on $t$ among the $k$ top-ranked intermediate tasks of the tested selection method. We take the difference between both measures and normalize it by the optimal target task performance to obtain our final relative notion of regret. We provide more details about our selection of metrics in Appendix C.
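As a worked illustration of this definition, the sketch below computes the regret for a single target task. The scores and task names are hypothetical, and the mean target performance per intermediate task is assumed to be given as a plain dictionary.

```python
def regret_at_k(ranking, performance, k):
    """Relative gap between the best task overall and the best of the top k.

    ranking: intermediate task names ordered by the selection method, best first
    performance: dict mapping task name -> mean target performance T(s, t)
    """
    optimal = max(performance.values())  # O(S, t)
    best_in_top_k = max(performance[s] for s in ranking[:k])  # M_k(S, t)
    return (optimal - best_in_top_k) / optimal


# Hypothetical transfer scores for one target task.
performance = {"s1": 80.0, "s2": 72.0, "s3": 60.0, "s4": 76.0}
predicted = ["s2", "s4", "s1", "s3"]
regret_1 = regret_at_k(predicted, performance, 1)
regret_3 = regret_at_k(predicted, performance, 3)
```

Here the top-ranked task scores 72 against an optimum of 80, giving a Regret@1 of 0.1, while the optimal task appears within the top three, giving a Regret@3 of 0.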

Experimental Results

Table 2 shows the results when selecting among all available intermediate tasks for BERT and RoBERTa. Table 5 in the appendix shows results when preferring tasks of the same type for BERT and RoBERTa. As expected, the Random and Size baselines do not yield good rankings when selecting among all intermediate tasks.

Access to only $\mathbf{D}_{\mathbf{S}}$ or $\mathbf{M}_{\mathbf{S}}$ . These methods typically perform better than our baselines.

TextEmb and SEmb perform on par in most cases. The SBERT model used is trained on NLI and STS-B tasks, which are included in our set of intermediate and target tasks, respectively. A direct comparison between TextEmb and SEmb for the respective classification tasks is thus difficult. While FSFT outperforms the other approaches in most cases, it comes at the high cost of downloading and fine-tuning all intermediate models for a few steps. This can be prohibitive if we consider many intermediate tasks. If we have access to TextEmb or SEmb information of the intermediate task (i.e., individual vectors distributed as part of a model repository), these techniques yield similar performances at a much lower cost.

Access to both $\mathbf{D}_{\mathbf{S}}$ and $\mathbf{M}_{\mathbf{S}}$ . Assuming the availability of both intermediate models and intermediate data is the most prohibitive setting. Surprisingly, we find BERT and RoBERTa to behave considerably differently, especially evident for QA tasks. As shown by Vu et al. (2020), TaskEmb performs very well for BERT, however we find that the results of this gradient based approach do not translate to RoBERTa. While these approaches perform best or competitively for all task types using BERT, they considerably underperform all methods when leveraging pre-trained RoBERTa weights. Here, the two much simpler domain embedding methods outperform the TaskEmb method based on the FIM.

Summary. We find that simple indicators such as domain similarity are suitable for selecting intermediate pre-training tasks for both BERT and RoBERTa based models. Our evaluated methods are able to efficiently select the best performing intermediate tasks with a Regret@3 of 0.0 in many cases. Our results, thus, show that the selection methods are able to effectively rank the top tasks with relative certainty, considerably reducing the number of necessary experiments. We also find that combining domain and task type match indicators often yields the best overall results, outperforming computationally more expensive methods. See the appendix for more experiments with task type pre-selection.

Analysis

Computational Costs. Table 3 estimates the computational costs of each transfer source selection method. Complexity shows the required data passes through the model. We neglect computations related to embedding similarities and proxy models as they are cheap compared to model forward/backward passes. For the embedding-based approaches, we assume pre-computed embeddings for all intermediate tasks. For TaskEmb, we only train an adapter on the target task for $e$ epochs.

In addition to the complexity, we calculate the required Multiply-Accumulate computations (MAC) for 42 intermediate tasks and one target task with 1000 training examples, each with an average sequence length of 128. We recorded MAC with the pytorch-OpCounter package. Following our experimental setup in §5, we set $e=15$ for TaskEmb and $e=1$ for FSFT/ FS-TaskEmb. We find that embedding-based methods require two orders of magnitude fewer computations compared to fine-tuning approaches. The difference may be even larger when we consider more intermediate tasks. Since fine-tuning approaches do not yield gains that would warrant the high computational expense (see §6), we conclude that SEmb has the most favorable trade-off between efficiency and effectiveness.

SEmb Model Dependency. We compare different pre-trained sentence-embedding model variants to identify the extent to which SEmb is invariant to such changes. We experiment with BERT and RoBERTa variants of sizes Distill, Base, and Large, and present results for RoBERTa tasks in Table 4. The full results can be found in Table 7 of the appendix. We find that all variants perform comparably, demonstrating that SEmb is a computationally efficient, model-type invariant method for selecting beneficial intermediate tasks.

BERT vs RoBERTa TaskEmb Space. To better understand the TaskEmb performance differences between BERT and RoBERTa models, we visualize the respective embedding spaces using t-SNE in Figure 3. We find that BERT embeddings are clustered much more closely in the vector space than RoBERTa embeddings. While TaskEmbs of BERT also seem to be located in the proximity of related tasks, TaskEmbs of RoBERTa are distributed further apart. This can result in worse performance due to the curse of dimensionality.

Overall, our results and analysis suggest that TaskEmb, unlike SEmb, depends considerably on the chosen base model.

Within- and Across-Type Transfer. Our experimental setup includes tasks of four different types, i.e. Transformer prediction head structures: sequence classification/regression, multiple choice, extractive question answering and sequence tagging. Figure 4 compares the relative transfer gains within and across these task types for RoBERTa. We see that within-type transfer is consistently stronger across all target tasks. We find the largest differences between within-type and across-type transfer for the extractive QA target tasks. These observations may be partly explained by the homogeneity of the included QA intermediate tasks; they overwhelmingly focus on general reading comprehension across multiple domains with paragraphs from Wikipedia or the web as contexts. Tasks of other types more distinctly focus on individual domains and scenarios.

Overall, we find a negative across-type transfer gain (i.e., loss) for 8 out of 11 tested target tasks (on average). This suggests that task type match between intermediate and target task is a strong indicator for transfer success. Thus, in the next section, we evaluate variants of all methods presented in §4 that prefer intermediate tasks of the same type as the target task.

Pre-Ranking by Task Types. We implement a simple mechanism to ensure that tasks with the same type as the target task are always ranked before tasks of other types during intermediate task selection. Given a task selection method, we first rank all tasks of the same type at the top before ranking tasks of all other types below. Results for applying this mechanism to all presented task selection methods are given for BERT and RoBERTa in Table 5 of the Appendix.
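This mechanism amounts to a stable partition of an existing ranking. A minimal sketch, with hypothetical task names and types, could look as follows:

```python
def prerank_by_type(ranking, task_types, target_type):
    """Move same-type tasks to the top, keeping the method's relative order."""
    same = [s for s in ranking if task_types[s] == target_type]
    other = [s for s in ranking if task_types[s] != target_type]
    return same + other


# Hypothetical ranking produced by some selection method, best first.
ranking = ["cls_a", "qa_a", "tag_a", "qa_b"]
task_types = {
    "cls_a": "classification",
    "qa_a": "qa",
    "tag_a": "tagging",
    "qa_b": "qa",
}
reranked = prerank_by_type(ranking, task_types, target_type="qa")
```

Since the order inside each group is left untouched, the underlying selection method still decides which same-type task is ranked first.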

We find that even though the Random and Size baselines do not yield good rankings when selecting among all intermediate tasks (cf. Table 2), the scores considerably improve when preferring tasks of the same type. In general, we see almost consistent improvements across all task selection methods for both BERT and RoBERTa when implementing pre-ranking by task types. Considering all target tasks and all methods, preferring intermediate tasks of the same type yields improved NDCG scores in 77 of 99 cases.

Further Analysis. We further find that embedding-based approaches are sample efficient, while FSFT approaches are not (§D). We also report results for combining ranking approaches with Rank Fusion, which does not yield consistent improvements over the individual approaches presented before (§E).

Conclusion

In this work we have established that intermediate pre-training can yield gains in adapter-based setups; however, around 44% of all transfer combinations result in decreased performance. We have consolidated several existing and new methods for efficiently identifying beneficial intermediate tasks. Experimenting with different model types, we find that the previously proposed best performing approaches for BERT do not translate to RoBERTa.

Overall, efficient embedding-based methods, such as those relying on pre-computable sentence representations, perform better than or on par with more expensive approaches. The best methods achieve a Regret@3 of less than 1% on average, demonstrating that they are effective at efficiently identifying the best intermediate tasks. The approaches evaluated and proposed in this work, thus, enable the automatic identification of beneficial intermediate tasks, rendering exhaustive experimentation on many task combinations unnecessary. When applied on a broad scale, these methods can contribute to more sustainable (Strubell et al., 2019; Moosavi et al., 2020) and more inclusive (Joshi et al., 2020) natural language processing.

Acknowledgements

Clifton and Jonas are supported by the LOEWE initiative (Hesse, Germany) within the emergenCITY center. Andreas was supported by the German Research Foundation (DFG) as part of the UKP-SQuARE project (grant GU 798/29-1).

We thank Leonardo Ribeiro and the anonymous reviewers for insightful feedback and suggestions on a draft of this paper.

References

Appendix A Tasks

Our experiments cover a diverse set of 53 different tasks, broadly divided into the four task types sequence classification/regression, multiple choice, extractive question answering and sequence tagging. Motivated by previous work, we first select tasks that are either part of widely used benchmarks (Wang et al., 2018, 2019b; Talmor and Berant, 2019) or have been successfully applied to sequential transfer setups previously (Sap et al., 2019; Liu et al., 2019a; Pruksachatkun et al., 2020; Vu et al., 2020). Additionally, we include other recent challenging tasks that fall under the four defined task types (e.g. Bhagavatula et al. (2020); Rogers et al. (2020)) and tasks that extend the range of included dataset sizes and task domains. In general, we focus on tasks with publicly available datasets, e.g. via HuggingFace Datasets (https://huggingface.co/datasets). Our full set of tasks is split into 42 intermediate tasks and 11 target tasks (the latter presented in Table 9).

Appendix B Transfer training details

For all our experiments, we use the PyTorch implementations of BERT and RoBERTa in the HuggingFace Transformers library (Wolf et al., 2020) as the basis. The adapter implementation is provided by the AdapterHub framework (Pfeiffer et al., 2020a) and integrated into the Transformers library (https://github.com/Adapter-Hub/adapter-transformers).

In light of the number and variety of different tasks used, we do not perform any extensive hyperparameter tuning on each training task. We mostly adhere to the hyperparameter recommendations of the Transformers library and Pfeiffer et al. (2021a) for adapter training. Specifically, we train all adapters for a maximum of 15 epochs, with early stopping after 3 epochs without improvements on the validation set. We use a learning rate of $10^{-4}$ and batch sizes between $4$ and $32$ , depending on the size of the dataset. These settings apply to the adapter training on each intermediate task as well as the subsequent fine-tuning on the target dataset. Additionally, since performances on the low-resource target tasks can be unstable, we perform multiple random restarts (five restarts for RoBERTa and three restarts for BERT) for all training runs on the target tasks, reporting the mean of all restarts. The final scores on each task are computed on the respective test sets if publicly available, otherwise on the validation sets.
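The stopping rule and the averaging over restarts can be sketched independently of any training framework. The example below is a simplified illustration that operates on a list of already recorded validation scores; the function names and the numbers are hypothetical, and it assumes that a tie with the best score so far does not count as an improvement.

```python
def early_stopping_epoch(validation_scores, max_epochs=15, patience=3):
    """Return (best_epoch, best_score) under the stopping rule described above.

    Training ends after `max_epochs`, or once `patience` consecutive epochs
    bring no improvement over the best validation score seen so far.
    """
    best_epoch, best_score, epochs_without_gain = 0, float("-inf"), 0
    for epoch, score in enumerate(validation_scores[:max_epochs], start=1):
        if score > best_score:
            best_epoch, best_score, epochs_without_gain = epoch, score, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_epoch, best_score


def mean_over_restarts(scores):
    """Average the final scores of several random restarts."""
    return sum(scores) / len(scores)


# Hypothetical validation curve: the late improvement in epoch 7 is never
# reached because training stops after three epochs without a gain.
curve = [0.60, 0.65, 0.70, 0.69, 0.70, 0.68, 0.75]
best_epoch, best_score = early_stopping_epoch(curve)
reported = mean_over_restarts([70.0, 72.0, 74.0])
```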

Results for RoBERTa are shown in Table 1 and results for BERT are shown in Table 10.

Appendix C Metrics for transfer source selection

Following Vu et al. (2020), we compute the Normalized Discounted Cumulative Gain (NDCG) (Järvelin and Kekäläinen, 2002), a widely used information retrieval metric that evaluates a ranking with attached relevances. The NDCG is defined via the Discounted Cumulative Gain (DCG), which represents a relevance score for a set of items, each discounted by its position in the ranking. The DCG of a ranking $R$ , accumulated at a particular rank position $p$ , can be computed as:

$$\text{DCG}_{p}(R)=\sum_{i=1}^{p}\frac{2^{\text{rel}_{i}}-1}{\log_{2}(i+1)}$$

In our setting, $R$ refers to a ranking of intermediate tasks where the relevance $\text{rel}_{i}$ of the intermediate task with rank $i$ is set to the mean target performance when transferring the adapter trained on this intermediate task, i.e. $\text{rel}_{i}\in$ . We always evaluate the full ranking of intermediate tasks, thus we set $p=|\mathcal{S}|$ .

The NDCG finally normalizes the DCG of the ranking predicted by the task selection method ( $R_{pred}$ ) by the perfect ranking produced by the empirical transfer results ( $R_{true}$ ). An NDCG of $100\%$ indicates a perfect ranking.
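As a hedged illustration of the computation described above, the following sketch evaluates a predicted ranking against the empirical transfer results. It assumes the exponential-gain form of the DCG, $\sum_{i=1}^{p}(2^{\text{rel}_{i}}-1)/\log_{2}(i+1)$, which is one common variant; the gain function is an assumption here and can be swapped (e.g. for a linear gain) through the `gain` parameter. The task names in the usage note are hypothetical.

```python
import math


def exponential_gain(rel):
    # Assumed gain function; a linear gain (lambda rel: rel) is another common choice.
    return 2.0 ** rel - 1.0


def dcg(relevances, gain=exponential_gain):
    """Discounted cumulative gain over the full list (p = number of items)."""
    return sum(
        gain(rel) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg(predicted_ranking, performance, gain=exponential_gain):
    """NDCG of a predicted ranking of intermediate tasks.

    predicted_ranking: task names, best-first, as output by a selection method.
    performance: dict mapping each task name to its mean target performance.
    """
    rel_pred = [performance[task] for task in predicted_ranking]
    rel_true = sorted(performance.values(), reverse=True)  # perfect ranking
    return dcg(rel_pred, gain) / dcg(rel_true, gain)
```

For example, with `performance = {"task_a": 0.8, "task_b": 0.7, "task_c": 0.5}`, the ranking `["task_a", "task_b", "task_c"]` reproduces the perfect ordering and therefore attains the maximum NDCG, while any other ordering scores lower.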

C.2 Choice of metrics

Our selection of evaluation metrics combines two measures that evaluate the quality of the full ranking (NDCG) and of the top selections of each method (Regret). We prefer this combination over other common evaluation metrics. We experimented with classical correlation measures such as the Spearman rank correlation, finding that they give a poor indication of the overall quality of a selection method. The Spearman correlation is agnostic to the location within the ranking, thus penalizing mismatches at the bottom of the ranking with the same weight as mismatches at the top. In our setting, the top ranks are more important, making the NDCG, which is biased towards correct rankings at the top, a better fit. Renggli et al. (2020) further discuss the limitations of correlation as an evaluation metric for task selection.
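The difference in position sensitivity can be made concrete with a small sketch. The relevance values below are hypothetical, and the DCG uses the exponential-gain variant as an assumption. Swapping the top two tasks and swapping the bottom two tasks yield the same Spearman correlation, whereas the NDCG penalizes the swap at the top more strongly.

```python
import math


def spearman(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def dcg(relevances):
    # Exponential-gain DCG (assumed variant).
    return sum(
        (2.0 ** rel - 1.0) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg_from_ranks(predicted_ranks, true_relevances):
    """NDCG when task j (with relevance true_relevances[j]) is given rank predicted_ranks[j]."""
    ordered = [rel for _, rel in sorted(zip(predicted_ranks, true_relevances))]
    ideal = sorted(true_relevances, reverse=True)
    return dcg(ordered) / dcg(ideal)


# Hypothetical mean target performances, listed best-first.
true_relevances = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
true_ranks = [1, 2, 3, 4, 5, 6]
swap_top = [2, 1, 3, 4, 5, 6]     # top two tasks exchanged
swap_bottom = [1, 2, 3, 4, 6, 5]  # bottom two tasks exchanged

spearman_top = spearman(true_ranks, swap_top)
spearman_bottom = spearman(true_ranks, swap_bottom)
ndcg_top = ndcg_from_ranks(swap_top, true_relevances)
ndcg_bottom = ndcg_from_ranks(swap_bottom, true_relevances)

print(spearman_top == spearman_bottom)  # both swaps look equally bad to Spearman
print(ndcg_top < ndcg_bottom)           # NDCG penalizes the top swap more
```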

Vu et al. (2020) use the average predicted rank $\rho$ of the source task with the best target performance as an additional metric. However, this metric does not account for the actual difference in target performance between the top-ranked source tasks of different methods. As a simple example, assume two selection methods $A$ and $B$ assign the top-performing source task $s_{max}$ to the same average rank. Further, $A$ ranks a different source task on top that performs nearly on par with $s_{max}$, while $B$ ranks a much weaker source task on top. In this case, we would clearly prefer method $A$ over method $B$. Unlike $\rho$, our choice of regret as an evaluation metric accounts for these differences.
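The example can be sketched in code. Regret is not restated in this appendix, so the sketch assumes a relative regret at $k$: the gap between the best achievable target performance and the best performance among the top-$k$ predicted tasks, divided by the best achievable performance. The task names and scores are hypothetical.

```python
def rank_of_best(ranking, performance):
    """1-based predicted rank of the source task with the best target performance."""
    best_task = max(performance, key=performance.get)
    return ranking.index(best_task) + 1


def regret_at_k(ranking, performance, k=1):
    """Relative regret of choosing the best task among the top-k predictions (assumed definition)."""
    best = max(performance.values())
    best_selected = max(performance[task] for task in ranking[:k])
    return (best - best_selected) / best


# Hypothetical mean target performances for three source tasks.
performance = {"s_max": 0.80, "s_near": 0.79, "s_weak": 0.50}

ranking_a = ["s_near", "s_max", "s_weak"]  # method A: near-optimal task on top
ranking_b = ["s_weak", "s_max", "s_near"]  # method B: weak task on top

# Both methods place s_max at the same rank, so the rank-based metric cannot separate them.
print(rank_of_best(ranking_a, performance), rank_of_best(ranking_b, performance))

# Regret at k=1 separates them: A loses little, B loses a lot.
print(regret_at_k(ranking_a, performance), regret_at_k(ranking_b, performance))
```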

Appendix D Sample Efficiency

Embedding-based approaches. Intermediate pre-training can have a larger impact on small target tasks. We therefore analyze and compare the effectiveness of embedding-based approaches with only $10$ , $100$ , and $1000$ target examples.

Figure 5 plots the results for all feature embedding methods when applied to intermediate task selection for RoBERTa. We find that the quality of the rankings can decrease substantially in the smallest setting with only 10 target examples. SEmb is a notable exception, achieving results close to those obtained with the full 1000 examples ($73\%$ vs. $74.9\%$ NDCG). SEmb thus consistently outperforms all other methods in all settings.

Few-Shot approaches. We experiment with $\text{N}\in\{5,10,25,50\}$ update steps for the fine-tuning methods FSFT and FS-TaskEmb. Results for RoBERTa are shown in Figure 6. Unsurprisingly, the performance of both methods improves consistently with the number of fine-tuning steps. FS-TaskEmb produces superior rankings at earlier checkpoints but is outperformed by FSFT in the long run. The results indicate that fewer than $25$ update steps do not provide sufficient evidence to reliably predict the best intermediate tasks.

Appendix E Rank Fusion

Vu et al. (2020) use the Reciprocal Rank Fusion algorithm (Cormack et al., 2009) to aggregate the rankings of TextEmb and TaskEmb. We further experiment with various combinations of ranks produced by methods of different categories, e.g. Size + SEmb. Table 6 shows the results for a selection of all possible method combinations when applied to intermediate task selection for RoBERTa.
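For reference, a minimal sketch of Reciprocal Rank Fusion: each item receives the score $\sum_{r}1/(k+\text{rank}_{r})$ summed over all input rankings, and items are sorted by descending score. The constant $k=60$ follows the value suggested by Cormack et al. (2009); whether the experiments here use the same constant is an assumption, as is the alphabetical tie-breaking. The task names in the usage note are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings of the same items into one ranking.

    rankings: list of lists, each a best-first ranking of task names.
    k: smoothing constant (60 as suggested by Cormack et al., 2009).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending fused score; ties are broken alphabetically (assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For instance, fusing `["task_a", "task_b", "task_c"]` with `["task_b", "task_a", "task_c"]` and `["task_b", "task_c", "task_a"]` places `task_b` first, since it is ranked at or near the top by all three inputs.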

In a few cases, fusing improves performance over that of every included single method (e.g. TaskEmb+TextEmb). In most cases, however, rank fusion either performs roughly on par with the best included single method (e.g. SEmb+TaskEmb) or even hurts task selection performance, sometimes significantly (e.g. Size+SEmb). Thus, while adding computational overhead to the task selection process, fusing does not yield better performance in general.

Appendix F SEmb Model Dependency

The full results of our experiments with sentence-embedding model variants can be found in Table 7. Experiments were conducted on RoBERTa transfer results.