Exploring and Predicting Transferability across NLP Tasks

Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, Mohit Iyyer

cs.CL

Introduction

With the advent of methods such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), the dominant paradigm for developing NLP models has shifted to transfer learning: first, pretrain a large language model, and then fine-tune it on the target dataset. Prior work has explored whether fine-tuning on intermediate source tasks before the target task can further improve this pipeline (Phang et al., 2018), but the conditions for successful transfer remain opaque, and choosing arbitrary source tasks can even adversely impact downstream performance Wang et al. (2019b). Our work has two main contributions: (1) we perform a large-scale empirical study across 33 different datasets to shed light on the transferability between NLP tasks, and (2) we develop task embeddings to predict which source tasks to use for a given target task.

Our study includes over 3,000 combinations of tasks and data regimes within and across three broad classes of problems (text classification, question answering, and sequence labeling), which is considerably more comprehensive than prior work (Wang et al., 2019a; Talmor and Berant, 2019a; Liu et al., 2019a). Our results show that transfer learning is more beneficial than previously thought (Wang et al., 2019b), especially for low-data target tasks, and that even low-data source tasks that are on the surface very different from the target task can result in transfer gains. While previous work has recommended using the amount of labeled data as a criterion to select source tasks (Phang et al., 2018), our analysis suggests that the similarity between the source and target tasks and domains is crucial for successful transfer, particularly in data-constrained regimes.

Motivated by these results, we move on to a more practical research question: given a particular target task, can we predict which source tasks (out of some predefined set) will yield the largest transfer learning improvement, especially in low-data settings? We address this challenge by learning embeddings of tasks that encode their individual characteristics (Figure 2). More specifically, we process all examples from a dataset through BERT and compute a task embedding based on the model’s gradients with respect to the task-specific loss, following recent meta-learning work in computer vision Achille et al. (2019). We empirically demonstrate the practical value of these task embeddings for selecting source tasks (via simple cosine similarity) that effectively transfer to a given target task. To the best of our knowledge, this is the first work that builds explicit representations of NLP tasks to investigate transferability.

We publicly release our task library, which consists of pretrained models and task embeddings for the 33 NLP tasks we study, along with a codebase that computes task embeddings for new tasks and identifies source tasks that will likely yield positive transfer. The library and code are available at http://github.com/tuvuumass/task-transferability.

Exploring task transferability

To shed light on the transferability between different NLP tasks, where a task is defined as a (dataset, objective function) pair, we perform an empirical study with 33 tasks across three broad classes of problems: text classification/regression (CR), question answering (QA), and sequence labeling (SL). We divide tasks into classes based on how they are modeled; there is considerable in-class linguistic diversity. In each experiment, we follow the STILTs pipeline of Phang et al. (2018) by taking a pretrained BERT model (BERT-Base Uncased, which has 12 layers, 768-d hidden size, 12 heads, and 110M total parameters), fine-tuning it on an intermediate source task, and then fine-tuning the resulting model on a target task. We explore in-class and out-of-class transfer in both data-rich and data-constrained regimes and demonstrate that positive transfer can occur in a more diverse array of settings than previously thought (Wang et al., 2019b).

We denote a dataset $D=\{(x^{i},y^{i})\}_{i=1}^{n}$ , with $n$ total examples of inputs $x$ and associated outputs $y$ . Each input $x$ , which can be either a single text or a concatenation of multiple text segments (e.g., a question-passage pair), is represented as:

$x=\left[\text{[cls]}\;w^{1}_{1}\;w^{2}_{1}\;\ldots\;\text{[sep]}\;w^{1}_{2}\;w^{2}_{2}\;\ldots\right]$

where $w^{i}_{j}$ is token $i$ of the $j^{\text{th}}$ segment, [cls] is a special symbol for classification output, and [sep] is a special symbol to separate any text segments if they exist. Finally, each task is solved by applying a classification layer over either the final [cls] token representation (for CR) or the entire sequence of final layer token representations (for QA or SL). For both stages of fine-tuning, we follow Devlin et al. (2019) by backpropagating into all model parameters for a fixed number of epochs. We fine-tune all CR and QA tasks for three epochs, and SL tasks for six epochs, using the Transformers library (Wolf et al., 2019) and its recommended hyperparameters. While individual task performance can likely be further improved with more involved hyperparameter tuning for each experimental setting, we standardize hyperparameters across each of the three classes to cut down on computational expense, following prior work (Phang et al., 2018; Wang et al., 2019b).

Table 1 lists the 33 datasets in our study; Appendix A.1 contains more details about dataset characteristics and their associated evaluation metrics. We select these datasets by mostly following prior work: nine of the eleven CR tasks come from the GLUE benchmark (Wang et al., 2019b); all eleven QA tasks are from the MultiQA repository (Talmor and Berant, 2019b); and all eleven SL tasks were used by Liu et al. (2019a). We consider all possible pairs of source and target datasets; while some training datasets contain overlapping examples (e.g., SQuAD-1 and 2), we evaluate our models on target development sets, which do not contain overlap. All experiments were conducted on a GPU cluster operating on renewable energy.

For each (source, target) dataset pair, we perform transfer experiments in three data regimes to examine the impact of data size on source $\rightarrow$ target transfer: Full $\rightarrow$ Full, Full $\rightarrow$ Limited, and Limited $\rightarrow$ Limited. In the Full training regime, all training data for the associated task is used for fine-tuning. In the Limited setting, we artificially limit the amount of training data by randomly selecting 1K training examples without replacement, following Phang et al. (2018); since fine-tuning BERT can be unstable on small datasets (Devlin et al., 2019), we perform 20 random restarts for each experiment and report the mean (see Appendix B for variance statistics). We resample 1K examples for each restart; for tasks with fewer than 1K training examples, we use the full training dataset.
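As an illustration of the Limited protocol, the following minimal Python sketch resamples a 1K subset for each restart and averages the resulting scores. The function names and the `run_experiment` callback are hypothetical placeholders, not part of the released codebase; the callback stands in for whatever fine-tuning and evaluation routine is used.

```python
import random
from statistics import mean


def sample_limited(train_examples, k=1000, seed=0):
    """Draw k training examples without replacement.

    Tasks with at most k examples keep their full training set.
    """
    if len(train_examples) <= k:
        return list(train_examples)
    return random.Random(seed).sample(train_examples, k)


def mean_over_restarts(train_examples, run_experiment, restarts=20, k=1000):
    """Resample a fresh subset for every restart and report the mean score.

    `run_experiment(subset, seed=...)` is a hypothetical callback that
    fine-tunes on `subset` and returns a development-set score.
    """
    scores = []
    for restart in range(restarts):
        subset = sample_limited(train_examples, k=k, seed=restart)
        scores.append(run_experiment(subset, seed=restart))
    return mean(scores)
```

Using the restart index as the sampling seed is one simple way to make each restart reproducible; it is a design choice of this sketch rather than something the text specifies.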

We measure the impact of transfer learning by computing the relative transfer gain given a source task $s$ and target task $t$ . More concretely, if a baseline model that is directly fine-tuned on the target dataset (without any intermediate fine-tuning) achieves a performance of $p_{t}$ , while a transferred model achieves a performance of $p_{s\rightarrow t}$ , the relative transfer gain is defined as: $g_{s\rightarrow t}=\dfrac{p_{s\rightarrow t}-p_{t}}{p_{t}}.$
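The relative transfer gain is a one-line computation; the sketch below spells it out. The function name is our own, and the scores in the usage note are made-up numbers used only to show the arithmetic.

```python
def relative_transfer_gain(p_transfer, p_baseline):
    """Relative transfer gain g_{s->t} = (p_{s->t} - p_t) / p_t.

    p_transfer: target performance after intermediate fine-tuning on s.
    p_baseline: target performance when fine-tuning directly on t.
    """
    if p_baseline == 0:
        raise ValueError("baseline performance must be non-zero")
    return (p_transfer - p_baseline) / p_baseline
```

For example, with a hypothetical baseline of 60.0 and a transferred score of 66.0, the gain is (66.0 - 60.0) / 60.0 = 0.1, i.e., a 10% relative improvement.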

Analyzing the transfer results

Table 2 contains the results of our transfer experiments across each combination of classes and data regimes (see Appendix B for tables for each individual task). In each cell, we first compute the transfer gain of the best source task for each target task in a particular class, and then average across all target tasks in the same class. We summarize our findings as follows:

Contrary to prior belief, transfer gains are possible even when the source dataset is small.

Out-of-class transfer succeeds in many cases, some of which are unintuitive.

Factors other than source dataset size, such as the similarity between source and target tasks, matter more in low-data regimes.

In the rest of this section, we analyze each of these three findings in more detail.
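The per-cell aggregation described for Table 2 (take the best source for each target, then average over the targets in a class) can be sketched as follows. The nested-dictionary layout and the function name are assumptions made for illustration.

```python
from statistics import mean


def best_source_average_gain(gains):
    """Average, over target tasks, of the gain from each target's best source.

    gains: {target_name: {source_name: relative_transfer_gain}}
    (a hypothetical layout; any equivalent table works).
    """
    return mean(max(by_source.values()) for by_source in gains.values())
```

Because the maximum is taken per target before averaging, the value reflects what is achievable with an oracle choice of source task, not the gain of a typical source.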

The diagonal of each block of Table 2 shows the results for in-class transfer, in which source tasks are from the same class as the target task. Across all three data regimes, most target tasks benefit from in-class transfer, and the average transfer gain is larger for CR and QA tasks than for SL tasks. Changing the data regimes significantly impacts the average transfer gain, which is lowest in the Full $\rightarrow$ Full regime (+5.4% average relative gain across all tasks) and highest in the Full $\rightarrow$ Limited regime (+47.0%). In general, tasks with fewer training examples benefit the most from transfer, such as RTE (+17.0 accuracy points) and CQ (+14.9 F1), and the best source tasks in the Full $\rightarrow$ Full regime tend to be data-rich tasks such as MNLI, SNLI, and SQuAD-2 (Figure 2). As in Phang et al. (2018), we find that intermediate fine-tuning reduces variance across random restarts (Appendix B).

We switch gears now to out-of-class transfer, in which the source task comes from a different class than the target task. The off-diagonal entries of each block of Table 2 summarize our results. In general, we observe that most tasks benefit from out-of-class transfer, although the magnitude of the transfer gains is lower than for in-class transfer, and that CR and QA tasks benefit more than SL tasks (similar to our in-class transfer results). While some of the results are intuitive (e.g., SQuAD is a good source task for QNLI, which is an entailment task built from QA pairs), others are more difficult to explain (using part-of-speech tagging as a source task for DROP results in huge transfer gains in limited target regimes).

Phang et al. (2018) observe that source data size is a good heuristic to obtain positive transfer gain. In the Full $\rightarrow$ Limited regime, we find to the contrary that the largest source datasets do not always result in the largest transfer gains. For CR tasks, MNLI/SNLI are the best sources for only four targets (three of which are entailment tasks), compared to seven in Full $\rightarrow$ Full . STS-B, which is much smaller than MNLI and SNLI, is the best source for MRPC and QQP, while MRPC, an even smaller dataset, is the best source for STS-B. As STS-B, QQP, and MRPC are all sentence similarity and paraphrase tasks, this result suggests that the similarity between the source and target tasks matters more for data-constrained targets. We observe similar task similarity patterns for QA (the best source for WikiHop is the other multi-hop QA task, HotpotQA) and SL (POS-PTB is the best source for POS-EWT, the only other POS tagging task). However, the large SQuAD-2 dataset is almost always the best source within QA. Another important factor especially apparent in our QA tasks is domain similarity (e.g., SQuAD and several other datasets were all built from Wikipedia).

We now turn to the Limited $\rightarrow$ Limited regime, which eliminates the source data size confound. For CR, STS-B is the best source for six targets out of 11, including four entailment tasks (MNLI, QNLI, SNLI, SciTail), whereas MNLI/SNLI are the best sources for only two tasks (RTE, WNLI). This result suggests that source/target task similarity, which we found to be a factor in the Full $\rightarrow$ Limited regime, is not the only important factor for effective transfer in data-constrained scenarios. We hypothesize that the complexity of the source task can also play a role: perhaps regression objectives (as used in STS-B) are more useful for transfer learning than classification objectives (MNLI/SNLI). Unknown factors may also play a role: in QA, SQuAD-2 is no longer the best source for any targets, while NewsQA is the best source for five tasks.

Predicting task transferability

The above analysis suggests that no single factor (e.g., data size, task and domain similarity, task complexity) is predictive of transfer gain across all of our settings. Given a novel target task, how can we identify the single source task that maximizes transfer gain? One straightforward but extremely expensive approach is to enumerate every possible (source, target) task combination. Work on multi-task learning within NLP offers a more practical alternative by developing feature-based models to identify task and dataset characteristics that are predictive of task synergies Bingel and Søgaard (2017). Here, we take a different approach, inspired by recent computer vision methods (Achille et al., 2019), by computing task embeddings from layer-wise gradients of BERT. Our approach generally outperforms baseline methods that use the data size heuristic Phang et al. (2018) and the gradients of the learning curve Bingel and Søgaard (2017) in terms of selecting the most transferable source tasks across settings.

We develop two methods for computing task embeddings from BERT. The first, TextEmb, is computed by pooling BERT’s representations across an entire dataset, and as such captures properties of the text and domain. The second, TaskEmb, relies on the correlation between the fine-tuning loss function and the parameters of BERT, and encodes more information about the type of knowledge and reasoning required to solve the task.

As our analysis indicates that domain similarity is a relevant factor for transfer, we first explore a simple method based on averaging BERT token-level representations of the inputs. Given a dataset $D$, we process each input sample $x^{i}$ through the pretrained BERT model without any fine-tuning and compute $\boldsymbol{h}_{x}$, the average of final layer token-level representations. The final task embedding is the average of these pooled vectors over the entire dataset: $\sum_{x\in D}\dfrac{\boldsymbol{h}_{x}}{|D|}$. This method captures linguistic properties of the input text $x$ and does not depend on the training labels $y$.
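The two-level averaging behind TextEmb can be sketched in plain Python. The sketch assumes the final-layer token vectors have already been extracted from the pretrained model and are supplied as nested lists; the function names are hypothetical, and no BERT call is made here.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def text_embedding(dataset_token_states):
    """TextEmb-style pooling.

    dataset_token_states: one entry per input x, each a list of
    final-layer token vectors (assumed to be precomputed).
    """
    # h_x: average over the tokens of each input
    pooled = [mean_vector(token_states) for token_states in dataset_token_states]
    # task embedding: average of h_x over the whole dataset
    return mean_vector(pooled)
```

Each input contributes equally to the final vector regardless of its length, since tokens are averaged within an input before inputs are averaged across the dataset.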

Ideally, we want a way of capturing task similarity beyond just input properties represented by TextEmb. Following the methodology of Task2Vec Achille et al. (2019), which develops task embeddings for meta-learning over vision tasks, we create representations of tasks derived from the Fisher information matrix (or simply Fisher). The Fisher captures the curvature of the loss surface (the sensitivity of the loss to small perturbations of model parameters), which intuitively tells us which of the model parameters are most useful for the task and thus provides a rich source of knowledge about the task itself.

To begin, we fine-tune BERT on the training dataset of a given task; the model without the final task-specific layer forms our feature extractor. Next, we feed the entire training dataset into the model and compute the task embedding based on the Fisher of the feature extractor’s parameters (weights) $\theta$ , i.e., the expected covariance of the gradients of the log-likelihood with respect to $\theta$ :

$\displaystyle F_{\theta}=\mathop{\mathbb{E}}\limits_{x,y\sim P_{\theta}(x,y)}\left[\nabla_{\theta}\log P_{\theta}(y|x)\nabla_{\theta}\log P_{\theta}(y|x)^{T}\right]\text{.}$

In our experiments, we compute the empirical Fisher, which uses the training labels instead of sampling from $P_{\theta}(x,y)$: $\displaystyle F_{\theta}=\frac{1}{n}\sum\limits_{i=1}^{n}\left[\nabla_{\theta}\log P_{\theta}(y^{i}|x^{i})\nabla_{\theta}\log P_{\theta}(y^{i}|x^{i})^{T}\right]\text{,}$ and only consider the diagonal entries to reduce computational complexity. Additionally, we consider the Fisher $F_{\phi}$ with respect to the feature extractor’s outputs (activations) $\phi$, which encodes useful features about the inputs to solve the task. The diagonal $F_{\phi}$ is averaged over the input tokens and over the entire dataset. While Fisher matrices are theoretically more comparable when the feature extractor is fixed during fine-tuning, as done in Task2Vec, we find empirically that TaskEmb computed from a fine-tuned task-specific BERT results in better correlations to task transferability in data-constrained scenarios. We leave further exploration of this phenomenon to future work.
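To make the diagonal empirical Fisher concrete, the sketch below computes it for a toy binary logistic regression model, which stands in for BERT purely for illustration. For this model the gradient of $\log P_{\theta}(y|x)$ with respect to the weights is $(y-p)\,x$, so the diagonal entry for weight $j$ is the average of $((y-p)\,x_{j})^{2}$ over the training examples. The model choice and function names are assumptions of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diag_empirical_fisher(weights, inputs, labels):
    """Diagonal of the empirical Fisher for a toy logistic regression.

    Uses the observed training labels (empirical Fisher) and keeps only
    the squared per-parameter gradients (diagonal entries).
    """
    n = len(inputs)
    fisher = [0.0] * len(weights)
    for x, y in zip(inputs, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        # Gradient of log P(y|x) w.r.t. the weights: (y - p) * x
        grad = [(y - p) * xi for xi in x]
        for j, g in enumerate(grad):
            fisher[j] += g * g / n
    return fisher
```

With a deep model the same quantity would be accumulated from per-example gradients obtained by backpropagation; only the squared gradients are stored, so memory scales with the number of parameters rather than its square.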

We explore task embeddings derived from the diagonal Fisher of different components of BERT, including the token embeddings, multi-head attention, feed-forward network, and the layer output, performing layer-wise averaging. Since our base model is BERT, this method may result in high-dimensional task embeddings (from 768-d to millions of dimensions). While one can optionally perform dimensionality reduction (e.g., through PCA), all of our experiments are conducted directly on the original task embeddings.

Task embedding evaluation

We investigate whether a high similarity between two different task embeddings correlates with a high degree of transferability between those two tasks. Our evaluation centers around the meta-task of selecting the best source task for a given target task. Specifically, given a target task, we rank all the other source tasks in our library in descending order by the cosine similarity between their task embeddings and the target task’s embedding (we leave the exploration of asymmetric similarity metrics to future work). This ranking is evaluated using two metrics: (1) the average rank $\rho$ of the source task with the highest absolute transfer gain from Section 2’s experiments, and (2) the Normalized Discounted Cumulative Gain (NDCG; Järvelin and Kekäläinen, 2002), a common information retrieval measure that evaluates the quality of the entire ranking, not just the rank of the best source task. We use NDCG instead of Spearman correlation, as the latter penalizes top-ranked and bottom-ranked mismatches with the same weight. The NDCG at position $p$ is defined as: $\displaystyle\text{NDCG}_{p}=\dfrac{\text{DCG}_{p}(R_{pred})}{\text{DCG}_{p}(R_{true})}$, where $R_{pred},R_{true}$ are the predicted and gold rankings of the source tasks, respectively; and $\displaystyle\text{DCG}_{p}(R)=\sum\limits_{i=1}^{p}\dfrac{2^{rel_{i}}-1}{\log_{2}(i+1)}$, where $rel_{i}$ is the relevance (target performance) of the source task with rank $i$ in the evaluated ranking $R$. In our experiments, we set $p$ to the number of source tasks in each setting. An NDCG of 100% indicates a perfect ranking.
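The ranking and the two evaluation metrics can be sketched as below, with $p$ set to the number of source tasks as in the text. The function names and the dictionary-based inputs are hypothetical, and the scale of the relevance values (here assumed to be target performance expressed as a fraction) is an assumption of this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_sources(target_emb, source_embs):
    """Source names sorted by descending cosine similarity to the target."""
    return sorted(source_embs,
                  key=lambda s: cosine(target_emb, source_embs[s]),
                  reverse=True)


def dcg(relevances):
    """DCG over a full ranking: sum of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))


def ndcg(predicted, relevance):
    """NDCG of a predicted ranking given {source: relevance}."""
    predicted_rels = [relevance[s] for s in predicted]
    gold_rels = sorted(relevance.values(), reverse=True)
    return dcg(predicted_rels) / dcg(gold_rels)


def rank_of_best(predicted, relevance):
    """1-based position of the truly best source in the predicted ranking."""
    best = max(relevance, key=relevance.get)
    return predicted.index(best) + 1
```

Averaging `rank_of_best` over target tasks gives the first metric, while `ndcg` scores the whole ordering and penalizes mistakes near the top of the ranking more than mistakes near the bottom.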

For our TaskEmb approach, we aggregate rankings from all of the different components of BERT rather than evaluate each component-specific ranking separately. We expect that task embeddings derived from different components might contain complementary information about the task, which motivates this decision; empirically, we observe that rankings derived from certain components are more useful than others (e.g., token embeddings are crucial for classification), but aggregating across all components generally outperforms individual ones. Concretely, given a target task $t$, assume that $r_{1:c}$ are the rank scores assigned to a source task $s$ by $c$ different components of BERT. Then, the aggregated score is computed according to the reciprocal rank fusion algorithm (Cormack et al., 2009): $\displaystyle\text{RRF}(s)=\sum\limits_{i=1}^{c}\dfrac{1}{60+r_{i}}$. We also use this approach to aggregate rankings from TextEmb and TaskEmb, which results in Text + Task.
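Reciprocal rank fusion with the constant 60 from the formula above can be sketched as follows. The function names and the list-of-rankings input are hypothetical; each ranking is assumed to list source tasks from best to worst, so that ranks start at 1.

```python
def rrf_scores(rankings, k=60):
    """RRF(s) = sum over rankings of 1 / (k + rank of s in that ranking)."""
    scores = {}
    for ranking in rankings:
        for rank, source in enumerate(ranking, start=1):
            scores[source] = scores.get(source, 0.0) + 1.0 / (k + rank)
    return scores


def fuse(rankings, k=60):
    """Source names sorted by descending fused score."""
    scores = rrf_scores(rankings, k)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks enter the score, components whose similarity values live on different scales can be combined without any normalization.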

Baseline methods

To measure the effect of data size, we compare rankings derived from TextEmb and TaskEmb to DataSize, a heuristic baseline that ranks all source tasks by the number of training examples.

We also consider CurveGrad, a baseline that uses the gradients of the loss curve of BERT for each task. Bingel and Søgaard (2017) find such learning curve features to be good predictors of gains from multi-task learning. They suggest that multi-task learning is more likely to work when the main tasks quickly plateau (small negative gradients) while the auxiliary tasks continue to improve (large negative gradients). Following the setup in Bingel and Søgaard (2017), we fine-tune BERT on each source task for a fixed number of steps (i.e., 10,000) and compute the gradients of the loss curve at 10, 20, 30, 50 and 70 percent of the fine-tuning process. Given a target task, we rank all the source tasks in descending order by the gradients and aggregate the rankings using the reciprocal rank fusion algorithm.
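One way to obtain loss-curve gradients at fixed fractions of training is a finite-difference slope over the logged losses, sketched below. This is an assumption-laden illustration: the text does not specify how the gradients are estimated, so the central-difference scheme, the evenly spaced logging, and the function name are all choices made here.

```python
def loss_curve_gradients(losses, percents=(10, 20, 30, 50, 70)):
    """Finite-difference slopes of a loss curve at given training percentages.

    losses: loss values logged at evenly spaced steps (at least two values).
    Returns one slope (loss change per logging interval) per percentage.
    """
    last = len(losses) - 1
    slopes = []
    for pct in percents:
        i = round(last * pct / 100)
        lo, hi = max(i - 1, 0), min(i + 1, last)
        slopes.append((losses[hi] - losses[lo]) / (hi - lo))
    return slopes
```

A strongly negative slope means the loss is still dropping quickly at that point, while a slope near zero indicates a plateau.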

Source task selection experiments

The average performance of selecting the best source task across target tasks using different methods is shown in Table 3; in the Limited settings, we report the mean results across random restarts. Here, we provide an overview and analysis of these results.

DataSize is a good heuristic when the full source training data is available, but it struggles in all out-of-class transfer scenarios as well as on SL tasks, for which most datasets contain roughly the same number of examples (Table 1). All methods obtain a higher NDCG score on SL tasks in the Full $\rightarrow$ Full regime because there is little difference in target task performance between source tasks here (see Figure 2), and thus the rankings are not penalized heavily. CurveGrad lags far behind DataSize in most cases, though its performance is better on SL tasks in the Full $\rightarrow$ Full regime. This indicates that CurveGrad cannot reliably predict the most transferable source tasks in our transfer scenarios.

Table 3 shows that TextEmb performs better than DataSize on average, especially within the limited data regimes. Interestingly, TextEmb underperforms significantly on CR tasks compared to QA and SL. We theorize that this effect is partly due to the relative homogeneity of the QA and SL datasets (i.e., many QA datasets use Wikipedia while many SL tasks are extracted from the Penn Treebank) compared to the more diverse CR datasets. If TextEmb captures mainly domain similarity, then it may struggle when that is not a relevant transfer factor.

TaskEmb can substantially boost the quality of the rankings, frequently outperforming the other methods across different classes of problems, data regimes, and transfer scenarios. These results demonstrate that the task similarity between the computed embeddings is a robust predictor of effective transfer. The ensemble of Text + Task results in further slight improvements, but the small magnitude of these gains suggests that TaskEmb partially encodes domain similarity. For Limited $\rightarrow$ Limited , where the DataSize heuristic does not apply, TaskEmb still performs strongly, although not as well as in the full source data regimes. Figure 2 shows that TaskEmb usually selects the best or near the best available source task for a given target task across data regimes.

What kind of information is encoded by TaskEmb and TextEmb? Figure 3 visualizes the different task spaces in the Full $\rightarrow$ Full regime using the Fruchterman-Reingold force-directed placement algorithm (Fruchterman and Reingold, 1991), an alternative to dimensionality reduction algorithms that better preserves the data’s topology (see Appendix A.2).

The task space of TextEmb (Figure 3, top) shows that datasets with similar sources are near one another: in QA, tasks built from web snippets are closely linked (CQ and ComQA), while in SL, tasks extracted from Penn Treebank are clustered together (CCG, POS-PTB, Parent, GParent, GGParent, Chunk, and Conj). Additionally, the SQuAD datasets are strongly linked to QNLI, which was created by converting SQuAD questions. TaskEmb captures domain information to some extent (Figure 3, bottom), but it also encodes task similarity: for example, POS-PTB is closer to POS-EWT, another part-of-speech tagging task that uses a different data source. Neither method captures some unintuitive cases in low-data regimes, such as STS-B’s high transferability to CR target tasks, or that DROP benefits most from SL tasks in low-data regimes (see Tables 9, 10, 27, and 28 in Appendix B). Our methods clearly do not capture all of the factors that influence task transferability, which motivates the future development of more sophisticated task embedding methods.

Related Work

We build on existing work in exploring and predicting transferability across tasks.

Sharing knowledge across different tasks, as in multi-task/transfer learning, often improves over standard single-task learning Ruder (2017). Within multi-task learning, several works (e.g., Luong et al., 2016; Liu et al., 2019b; Raffel et al., 2020) combine multiple tasks for better regularization and transfer. More related to our work, Phang et al. (2018) explore intermediate fine-tuning and find that transferring from data-rich source tasks boosts target task performance for text classification, while Liu et al. (2019a) observe transfer gains between related sequence labeling tasks. Expanding from single to multi-source transfer, Talmor and Berant (2019a) show that pretraining on multiple datasets improves generalization on QA tasks. Nevertheless, exploiting synergies between tasks remains difficult, with many combinations of tasks negatively impacting downstream performance Bingel and Søgaard (2017); McCann et al. (2018); Wang et al. (2019a), and the factors that determine successful transfer still remain murky. Concurrent work indicates that intermediate tasks that require high-level inference and reasoning abilities tend to work best (Pruksachatkun et al., 2020).

To predict transferable tasks, some methods (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017) rely on features derived from dataset characteristics and learning curves. However, manually designing such features is time-consuming and may not generalize well across classes of problems Kerinec et al. (2018). Recent work on task embeddings in computer vision offers a more principled way to encode tasks for meta-learning (Zamir et al., 2018; Achille et al., 2019; Yan et al., 2020). Taskonomy Zamir et al. (2018) models the underlying structure among tasks to reduce the need for supervision, while Task2Vec Achille et al. (2019) uses a frozen feature extractor pretrained on ImageNet to represent tasks in a topological space (analogous to our approach’s reliance on BERT). Finally, recent work in NLP augments a generative model with an embedding space for modeling latent skills Cao and Yogatama (2020).

Conclusion

We conduct a large-scale empirical study of the transferability between 33 NLP tasks across three broad classes of problems. We show that the benefits of transfer learning are more pronounced than previously thought, especially when target training data is limited, and we develop methods that learn vector representations of tasks that can be used to reason about the relationships between them. These task embeddings allow us to predict source tasks that will likely improve target task performance. Our analysis suggests that data size, the similarity between the source and target tasks and domains, and task complexity are crucial for effective transfer, particularly in data-constrained regimes.

Acknowledgments

We thank Yoshua Bengio and researchers at Microsoft Research Montreal for valuable feedback on this project. We also thank the anonymous reviewers, Kalpesh Krishna, Nader Akoury, Shiv Shankar, and the rest of the UMass NLP group for their helpful comments. We are grateful to Alon Talmor and Nelson Liu for sharing the QA and SL datasets. Finally, we thank Peter Potash for additional experimentation efforts. Vu and Iyyer were supported by an Intuit AI Award for this project.

References

Appendices

Appendix A Additional details for experimental setup

In this work, we experiment with 33 datasets across three broad classes of problems (text classification/regression, question answering, and sequence labeling). Below, we briefly describe the datasets, and summarize their characteristics in Table 4.

We use the nine GLUE datasets (Wang et al., 2019b), including grammatical acceptability judgments (CoLA; Warstadt et al., 2019); sentiment analysis (SST-2; Socher et al., 2013); paraphrase identification (MRPC; Dolan and Brockett, 2005); semantic similarity with STS-Benchmark (STS-B; Cer et al., 2017) and Quora Question Pairs (QQP; https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs); natural language inference (NLI) with Multi-Genre NLI (MNLI; Williams et al., 2018), SQuAD (Rajpurkar et al., 2016) converted into Question-answering NLI (QNLI; Wang et al., 2019b), Recognizing Textual Entailment 1,2,3,5 (RTE; Dagan et al., 2005, et seq.), and the Winograd Schema Challenge (Levesque, 2011) recast as Winograd NLI (WNLI). Additionally, we include the Stanford NLI dataset (SNLI; Bowman et al., 2015) and the science QA dataset (Khot et al., 2018) converted into NLI (SciTail). We report F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. For MNLI, we report the average score on the “matched” and “mismatched” development sets.

We use eleven QA datasets from the MultiQA repository (Talmor and Berant, 2019a; https://github.com/alontalmor/MultiQA), including the Stanford Question Answering datasets SQuAD-1 and SQuAD-2 (Rajpurkar et al., 2016, 2018); NewsQA (Trischler et al., 2017); HotpotQA (Yang et al., 2018), in the version where the context includes 10 paragraphs retrieved by an information retrieval system; the Natural Yes/No Questions dataset (BoolQ; Clark et al., 2019); the Discrete Reasoning Over Paragraphs dataset (DROP; Dua et al., 2019), for which we only use the extractive examples in the original dataset but evaluate on the entire development set, following Talmor and Berant (2019a); WikiHop (Welbl et al., 2018); the DuoRC Self (DuoRC-s) and DuoRC Paraphrase (DuoRC-p) datasets (Saha et al., 2018), where the questions are taken from either the same version or a different version of the document from which the questions were asked, respectively; ComplexQuestions (CQ; Bao et al., 2016; Talmor et al., 2017); and ComQA (Abujabal et al., 2019), for which contexts are not provided but the questions are augmented with web snippets retrieved from the Google search engine (Talmor and Berant, 2019a). We report F1 scores for all QA tasks.

We experiment with eleven sequence labeling tasks used by Liu et al. (2019a), including CCG supertagging with CCGbank (CCG; Hockenmaier and Steedman, 2007); part-of-speech tagging with the Penn Treebank (POS-PTB; Marcus et al., 1993) and the Universal Dependencies English Web Treebank (POS-EWT; Silveira et al., 2014); syntactic constituency ancestor tagging, i.e., predicting the constituent label of the parent (Parent), grandparent (GParent), and great-grandparent (GGParent) of each word in the PTB phrase-structure tree; semantic tagging task (ST; Bjerva et al., 2016; Abzianidze et al., 2017); syntactic chunking with the CoNLL 2000 shared task dataset (Chunk; Tjong Kim Sang and Buchholz, 2000); named entity recognition with the CoNLL 2003 shared task dataset (NER; Tjong Kim Sang and De Meulder, 2003); grammatical error detection with the First Certificate in English dataset (GED; Yannakoudakis et al., 2011; Rei and Yannakoudakis, 2016); and conjunct identification, i.e., identifying the tokens that comprise the conjuncts in a coordination construction, with the coordination annotated PTB dataset (Conj; Ficler and Goldberg, 2016). We report F1 scores for all SL tasks.

A.2 Fruchterman-Reingold force-directed placement algorithm

The Fruchterman-Reingold force-directed placement algorithm (Fruchterman and Reingold, 1991) simulates a space of nodes (in our setup, tasks) as a system of atomic particles/celestial bodies exerting attractive forces on one another. In our setup, the algorithm resembles molecular/planetary simulations: the transferability between tasks specifies the forces that pull the tasks toward each other in order to minimize the energy of the system. The force between a pair of tasks $(t_{1},t_{2})$ is defined as: $\displaystyle f(t_{1},t_{2})=\dfrac{1}{r_{\rightarrow t_{2}}(t_{1})}+\dfrac{1}{r_{\rightarrow t_{1}}(t_{2})}$, where $r_{\rightarrow t}(s)$ is the rank of the source task $s$ in the list of source tasks to transfer to the target task $t$.
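The pairwise force can be computed directly from the per-target source rankings, as in the sketch below. The nested dictionary `rank_into`, where `rank_into[t][s]` holds $r_{\rightarrow t}(s)$, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_force(rank_into, t1, t2):
    """f(t1, t2) = 1 / r_{->t2}(t1) + 1 / r_{->t1}(t2).

    rank_into[t][s] is the rank (1 = best) of source s for target t.
    """
    return 1.0 / rank_into[t2][t1] + 1.0 / rank_into[t1][t2]


def all_forces(rank_into):
    """Symmetric force for every unordered pair of tasks."""
    return {(a, b): pairwise_force(rank_into, a, b)
            for a, b in combinations(sorted(rank_into), 2)}
```

The resulting values could then be supplied as edge weights to a force-directed layout routine; for instance, networkx's `spring_layout` implements the Fruchterman-Reingold algorithm, although whether the original figures were produced with it is not stated here.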

Appendix B Full results for fine-tuning and transfer learning across tasks

For both fine-tuning and transfer learning, we use the same architecture across tasks, apart from the task-specific output layer. The feature extractor, i.e., BERT, is pretrained while the task-specific output layer is randomly initialized for each task. All the parameters are fine-tuned end-to-end. An alternative approach is to keep the feature extractor frozen during fine-tuning. We find that fine-tuning the whole model for a given task leads to better performance in most cases, except for WNLI and DROP, possibly because of their adversarial nature (see Tables 5, 6, and 7). In our experiments, we follow the fine-tuning recipe of Devlin et al. (2019), i.e., only fine-tuning for a fixed number of $t$ epochs for each class of problems. We develop our infrastructure using HuggingFace’s Transformers library (Wolf et al., 2019) and its recommended hyperparameters for each class.

We show the full results for fine-tuning and transfer learning across tasks from Table 5 to Table 34. Below, we describe the setting for these tables in more detail:

In Tables 5, 6, and 7, we report the results of fine-tuning BERT (without any intermediate fine-tuning) on the 33 NLP tasks studied in this work. We perform experiments in two data regimes: Full and Limited . In the Full regime, all training data for the associated task is used while in the Limited setting, we artificially limit the amount of training data by randomly selecting 1K training examples without replacement, following Phang et al. (2018). For each experiment in the Limited regime, we perform 20 random restarts (1K examples are resampled for each restart) and report the mean and standard deviation. We show the results after each training epoch $t$ .

For our transfer experiments, we consider every possible pair of (source, target) tasks within and across classes of problems in the three data regimes described in Section 2.1.1, which results in 3,267 combinations of tasks and data regimes. We follow the transfer recipe of Phang et al. (2018) by first fine-tuning BERT on the source task (intermediate fine-tuning) before fine-tuning on the target task. For both stages, we only perform training for a fixed number $t$ of epochs, following previous work (Devlin et al., 2019; Phang et al., 2018). For each task, we use the same value of $t$ as in our fine-tuning experiments.

From Table 8 to Table 16, we show our in-class transfer results for each combination of (source, target) tasks, in which source tasks come from the same class as the target task. In each table, rows denote source tasks while columns denote target tasks. Each cell represents the target task performance of the transferred model from the associated source task to the associated target task. The orange-colored cells along the diagonal indicate the results of fine-tuning BERT on target tasks without any intermediate fine-tuning. Positive transfers are shown in blue and the best results are highlighted in bold (blue). For transfer results in the Limited setting, we report the mean and standard deviation across 20 random restarts.

Finally, from Table 17 to Table 34, we present our out-of-class transfer results, in which source tasks come from a different class than the target task. In each table, results are shown in a similar way as above, except that the orange-colored row Baseline shows the results of fine-tuning BERT on target tasks without any intermediate fine-tuning.