Parameter-Efficient Transfer Learning with Diff Pruning
Demi Guo, Alexander M. Rush, Yoon Kim
Introduction
Task-specific finetuning of pretrained deep networks is the dominant paradigm in contemporary NLP, achieving state-of-the-art results across a suite of natural language understanding tasks (Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019; Lan et al., 2020). While straightforward and empirically effective, this approach is difficult to scale to multi-task, memory-constrained settings (e.g. for on-device applications), as it requires shipping and storing a full set of model parameters for each task. Inasmuch as these models are learning generalizable, task-agnostic language representations through self-supervised pretraining, finetuning the entire model for each task seems especially profligate.
A popular approach to parameter-efficiency is to learn smaller compressed models for each task (Gordon et al., 2020; Sajjad et al., 2020; Zhao et al., 2020; Sanh et al., 2020). Such approaches face a steep sparsity/performance tradeoff and keep a substantial amount of nonzero parameters per task (e.g. 10%-30%). Multi-task learning and feature-based transfer allow for more parameter-efficient transfer learning per task (Liu et al., 2019b; Clark et al., 2019; Stickland & Murray, 2019; Reimers & Gurevych, 2019). These methods train a small number of additional parameters (e.g. a linear layer) on top of a shared model. However, multi-task learning generally requires access to all tasks during training to prevent catastrophic forgetting (French, 1999), while feature-based transfer learning (e.g. based on task-agnostic sentence representations) is typically outperformed by finetuning (Howard & Ruder, 2018).
An appealing middle ground is to finetune an extension of the base model for specific tasks. This approach captures the training benefits of finetuning while maintaining the task modularity of feature-based transfer. For example, Adapters (Rebuffi et al., 2018) use smaller, task-specific modules that are inserted between layers of a model. This approach does not require access to all tasks during training, targeting realistic settings where new tasks arrive in stream (Houlsby et al., 2019; Pfeiffer et al., 2020a, b, c). Houlsby et al. (2019) find that adapter layers can match the performance of fully finetuned BERT on the GLUE benchmark while requiring 3.6% additional parameters (on average) per task.
Diff pruning is a new extension to pretrained models with the goal of even more parameter-efficient transfer learning. Instead of modifying the architecture of the model, diff pruning extends the base model through a task-specific difference vector.
In order to learn this vector, we reparameterize the task-specific model parameters for a task $\tau$ as $\theta_\tau = \theta + \delta_\tau$, where the pretrained parameter vector $\theta$ is fixed and the task-specific diff vector $\delta_\tau$ is finetuned. The diff vector is regularized with a differentiable approximation to the $L_0$-norm penalty (Louizos et al., 2018) to encourage sparsity.
Diff pruning can become extremely parameter-efficient, as it only requires storing the nonzero positions and weights of the diff vector for each task. The cost of storing the shared pretrained model remains constant and is amortized across multiple tasks. On the GLUE benchmark (Wang et al., 2019a), diff pruning can match the performance of the fully finetuned BERT baselines while finetuning only 0.5% of the pretrained parameters per task. As the number of tasks increases, diff pruning outperforms popular pruning-based methods in the amount of storage required.
Background: Transfer Learning
Transfer learning in NLP mostly uses a pretrain-and-finetune paradigm, which initializes a subset of the model parameters for all tasks from a pretrained model and then finetunes on a task-specific objective. Pretraining objectives include context prediction (Mikolov et al., 2013), autoencoding (Dai & Le, 2015), machine translation (McCann et al., 2017), and more recently, variants of language modeling (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) objectives.
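Here we consider applying transfer learning to multiple tasks. We consider a setting with a potentially unknown set of tasks $\mathcal{T}$ (which may arrive in stream), where each task $\tau \in \mathcal{T}$ has an associated training set $\mathcal{D}_\tau = \{x_\tau^{(n)}, y_\tau^{(n)}\}_{n=1}^{N}$. For all tasks, the goal is to produce (possibly tied) model parameters $\theta_\tau$ to minimize the empirical risk,
$$\min_{\theta_\tau} \; \frac{1}{N} \sum_{n=1}^{N} C\big(f_\tau(x_\tau^{(n)}; \theta_\tau),\, y_\tau^{(n)}\big) + \lambda R(\theta_\tau),$$
where $f_\tau(\cdot\,; \theta_\tau)$ is a parameterized function over the input (e.g. a neural network), $C(\cdot, \cdot)$ is a loss function (e.g. cross-entropy), and $R(\cdot)$ is an optional regularizer with hyperparameter $\lambda$.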
We can use the pretrain-finetune approach by simply learning independent parameters for each task. However, the large size of pretrained models makes this approach exceedingly parameter inefficient. For example, widely-adopted models such as BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$ have 110M and 340M parameters respectively, while their contemporaries have parameter counts in the billions (Raffel et al., 2020; Shoeybi et al., 2019; Rajbhandari et al., 2019). Storing the fully finetuned models therefore becomes difficult even for a moderate number of tasks. (An intriguing line of work suggests that large-scale language models can be used without finetuning for a variety of tasks if given the appropriate context (Radford et al., 2019; Brown et al., 2020). While interesting, these models generally underperform task-specific models and require billions of parameters, though recent work suggests that they can be made substantially smaller (Schick & Schutze, 2020).)
A classic approach to tackling this parameter-inefficiency is to train a single shared model (along with a task-specific output layer) against multiple tasks through joint training (Caruana, 1997). However, the usual formulation of multi-task learning requires the set of tasks to be known in advance in order to prevent catastrophic forgetting (French, 1999), making it unsuitable for applications in which the set of tasks is unknown or when tasks arrive in stream. (Work on continual learning mitigates these issues to an extent (Shin et al., 2017; Lopez-Paz & Ranzato, 2017; Lee et al., 2017; Kirkpatrick et al., 2017).)
Diff Pruning
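Diff pruning formulates task-specific finetuning as learning a diff vector $\delta_\tau$ that is added to the pretrained model parameters $\theta$, which remain fixed. We first reparameterize the task-specific model parameters,
$$\theta_\tau = \theta + \delta_\tau,$$
which results in the following empirical risk minimization problem,
$$\min_{\delta_\tau} \; L(\mathcal{D}_\tau, f_\tau, \theta + \delta_\tau) + \lambda R(\theta + \delta_\tau),$$
where for brevity we define the average training loss $L(\mathcal{D}_\tau, f_\tau, \theta_\tau)$ over the training set $\mathcal{D}_\tau$ as
$$L(\mathcal{D}_\tau, f_\tau, \theta_\tau) = \frac{1}{N} \sum_{n=1}^{N} C\big(f_\tau(x_\tau^{(n)}; \theta_\tau),\, y_\tau^{(n)}\big).$$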
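This trivial reparameterization shows that the cost of storing the pretrained parameters $\theta$ is amortized across tasks, and the only marginal cost for new tasks is the diff vector. If we can regularize $\delta_\tau$ to be sparse such that $\|\delta_\tau\|_0 \ll \|\theta\|_0$, then this approach can become more parameter-efficient as the number of tasks increases. We can specify this goal with an $L_0$-norm penalty on the diff vector,
$$R(\theta + \delta_\tau) = \|\delta_\tau\|_0 = \sum_{i=1}^{d} \mathbb{1}\{\delta_{\tau,i} \neq 0\},$$
where $d$ is the dimension of $\delta_\tau$.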
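As a minimal illustration of the reparameterization above, the following Python sketch stores a diff vector sparsely and adds it to a shared parameter vector. The names `apply_diff` and `l0_norm`, the dictionary representation, and the toy numbers are assumptions made for this example and are not taken from the authors' implementation.

```python
from typing import Dict, List


def apply_diff(theta: List[float], diff: Dict[int, float]) -> List[float]:
    """Return theta + delta, with delta stored sparsely as {index: value}."""
    theta_task = list(theta)  # copy, so the shared pretrained vector is untouched
    for i, v in diff.items():
        theta_task[i] += v
    return theta_task


def l0_norm(diff: Dict[int, float]) -> int:
    """Count the nonzero entries of the diff vector."""
    return sum(1 for v in diff.values() if v != 0.0)


theta = [0.5, -1.0, 2.0, 0.0, 0.25]   # toy stand-in for pretrained parameters
delta_a = {1: 0.5}                    # hypothetical task A changes one coordinate
delta_b = {0: -0.25, 4: 0.75}         # hypothetical task B changes two coordinates

theta_a = apply_diff(theta, delta_a)
theta_b = apply_diff(theta, delta_b)
```

Only `delta_a` and `delta_b` would need to be stored per task in this toy setting; `theta` is shared by both.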
This regularizer is difficult to optimize as it is non-differentiable. In order to approximate this objective, we follow an approach for gradient-based learning with $L_0$ sparsity using a relaxed mask vector (Louizos et al., 2018). This approach involves relaxing a binary vector into continuous space, and then multiplying it with a dense weight vector to determine how much of the weight vector is applied during training. After training, the mask is made deterministic, and a large portion of the diff vector is zero. (It is also possible to learn sparse diff vectors through other penalties such as the $L_1$-norm. We chose to work with the relaxed $L_0$-norm formulation as past work has shown that SGD-based optimization works well in this setting.)
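To apply this method we first decompose $\delta_\tau$ into a binary mask vector multiplied with a dense vector,
$$\delta_\tau = z_\tau \odot w_\tau, \qquad z_\tau \in \{0, 1\}^d, \; w_\tau \in \mathbb{R}^d.$$
We now lower bound the true objective and optimize an expectation with respect to $z_\tau$, whose distribution $p(z_\tau; \alpha_\tau)$ is initially Bernoulli with introduced parameters $\alpha_\tau$,
$$\min_{\alpha_\tau, w_\tau} \; \mathbb{E}_{z_\tau \sim p(z_\tau; \alpha_\tau)} \big[ L(\mathcal{D}_\tau, f_\tau, \theta + \delta_\tau) + \lambda \|\delta_\tau\|_0 \big].$$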
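This objective is still complicated by the discrete nature of the $z_\tau$'s, but the expectation provides some guidance for empirically effective relaxations. We follow prior work (Louizos et al., 2018; Wang et al., 2019b) and relax $z_\tau$ into continuous space $[0, 1]^d$ with a stretched Hard-Concrete distribution (Jang et al., 2017; Maddison et al., 2017), which allows for the use of pathwise gradient estimators. Specifically, $z_\tau$ is now defined to be a deterministic and (sub)differentiable function of a sample $u$ from a uniform distribution,
$$u \sim U(0, 1)^d, \quad s_\tau = \sigma\big(\log u - \log(1 - u) + \alpha_\tau\big), \quad \bar{s}_\tau = s_\tau \times (r - l) + l, \quad z_\tau = \min(1, \max(0, \bar{s}_\tau)).$$
Here $l < 0$ and $r > 1$ are two constants used to stretch $s_\tau$ into the interval $(l, r)^d$ before it is clamped to $[0, 1]^d$ with the $\min(1, \max(0, \cdot))$ operation. In this case we have a differentiable closed-form expression for the expected $L_0$-norm,
$$\mathbb{E}\big[\|\delta_\tau\|_0\big] = \sum_{i=1}^{d} \sigma\!\left(\alpha_{\tau,i} - \log\frac{-l}{r}\right).$$
Thus the final optimization problem is given by,
$$\min_{\alpha_\tau, w_\tau} \; \mathbb{E}_{u \sim U(0,1)^d} \big[ L(\mathcal{D}_\tau, f_\tau, \theta + z_\tau \odot w_\tau) \big] + \lambda \sum_{i=1}^{d} \sigma\!\left(\alpha_{\tau,i} - \log\frac{-l}{r}\right),$$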
and we can now utilize pathwise gradient estimators to optimize the first term with respect to $\alpha_\tau$ since the expectation no longer depends on it. (To reduce notation clutter we subsume the parameters of the task-specific output layer, which is not pretrained, into $\theta$. We do not apply the $L_0$-norm penalty on these parameters during training.) After training we obtain the final diff vector $\delta_\tau$ by sampling $u$ once to obtain $z_\tau$ (which is not necessarily a binary vector but has a significant number of dimensions equal to exactly zero due to the clamping function), then setting $\delta_\tau = z_\tau \odot w_\tau$. (We found sampling once to work as well as other alternatives, e.g. based on multiple samples.)
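The sampling step and the closed-form penalty described above can be sketched in plain Python as follows. This is only a forward-pass illustration without automatic differentiation, so it does not train anything. The function names, the clamping of $u$ away from 0 and 1, and the stretch constants `L_STRETCH` and `R_STRETCH` are assumptions chosen for the example; the method itself only requires $l < 0$ and $r > 1$.

```python
import math
import random

# Illustrative stretch constants (assumed for this sketch).
L_STRETCH, R_STRETCH = -1.5, 1.5


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_mask(alpha, rng, l=L_STRETCH, r=R_STRETCH):
    """Draw one relaxed mask z in [0, 1]^d from the stretched Hard-Concrete."""
    z = []
    for a in alpha:
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)      # keep both logs finite
        s = sigmoid(math.log(u) - math.log(1.0 - u) + a)  # s lies in (0, 1)
        s_bar = s * (r - l) + l                           # stretch to (l, r)
        z.append(min(1.0, max(0.0, s_bar)))               # clamp to [0, 1]
    return z


def expected_l0(alpha, l=L_STRETCH, r=R_STRETCH):
    """Closed-form expected number of nonzero mask entries."""
    shift = math.log(-l / r)
    return sum(sigmoid(a - shift) for a in alpha)


def final_diff(alpha, w, rng):
    """Sample u once, then return delta = z * w elementwise."""
    z = sample_mask(alpha, rng)
    return [z_i * w_i for z_i, w_i in zip(z, w)]


rng = random.Random(0)
alpha = [-20.0, 20.0, 0.0]   # strongly closed, strongly open, undecided
w = [0.3, -0.7, 0.1]
delta = final_diff(alpha, w, rng)
```

With a very negative `alpha` entry the clamp yields an exact zero in `delta`, and with a very positive entry the mask saturates at one, which is the behavior that lets the relaxed mask produce genuinely sparse diffs.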
To achieve an exact sparsity rate, we project onto a target $L_0$-ball after training. Specifically, we use magnitude pruning on the diff vector $\delta_\tau$ and target a sparsity rate $t\%$ by only keeping the top $t\% \times d$ values in $\delta_\tau$. (Wang et al. (2019b) show that it also is possible to inject such a constraint softly into the training objective by regularizing the expected model size towards a certain rate. However, since the constraint is soft this approach also makes it difficult to target an exact sparsity rate.) Note that unlike standard magnitude pruning, this is based on the magnitude of the diff vector values and not the model parameters. We found it important to further finetune $\delta_\tau$ with the nonzero masks fixed to maintain good performance, as is often the case in magnitude pruning (Han et al., 2016). Since this type of parameter-efficiency through projection onto the $L_0$-ball can be applied without adaptive diff pruning, such an approach will serve as one of our baselines in the empirical study. (Concretely, one can obtain $\theta_\tau$ through usual finetuning, set $\delta_\tau = \theta_\tau - \theta$, and then apply magnitude pruning followed by additional finetuning on $\delta_\tau$.)
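The projection step can be sketched as below. This is a hedged, pure-Python illustration: the function name `project_onto_l0_ball`, the rounding-down of the budget, and the toy diff vector are assumptions for the example. The returned mask is what would stay fixed during the subsequent finetuning phase.

```python
def project_onto_l0_ball(delta, t_percent):
    """Keep the top t% of entries of delta by absolute value; zero the rest.

    Returns the pruned diff vector and the 0/1 mask of kept positions.
    """
    d = len(delta)
    k = int(d * t_percent / 100.0)  # parameter budget, rounded down (assumption)
    # Rank indices by magnitude of the diff values, not of the model parameters.
    keep = sorted(range(d), key=lambda i: abs(delta[i]), reverse=True)[:k]
    mask = [0] * d
    for i in keep:
        mask[i] = 1
    pruned = [delta[i] if mask[i] else 0.0 for i in range(d)]
    return pruned, mask


# Toy diff vector; for the non-adaptive baseline this would be theta_task - theta.
delta = [0.01, -0.9, 0.0, 0.3, -0.02, 0.5, 0.0, -0.04, 0.002, 0.1]
pruned, mask = project_onto_l0_ball(delta, t_percent=30)
```

Because the budget is computed from the vector length, the number of nonzero entries after projection never exceeds the target, which is what a soft penalty alone cannot guarantee.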
Structured Diff Pruning
To allow diff pruning to adapt to the model architecture, we consider a structured extension which incorporates dependence between dimensions. We hypothesize that this approach can allow the model to learn to modify parameters in local regions, as opposed to treating each parameter independently.
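We modify the regularizer to first partition the parameter indices into $G$ groups $\{g(1), \ldots, g(G)\}$ where $g(j)$ is a subset of parameter indices governed by group $g(j)$. (While groups can be defined in various ways, we found that defining groups based on each matrix/bias vector of the pretrained model was simple and worked well enough.) We then introduce a scalar $z_\tau^{j}$ (with the associated parameter $\alpha_\tau^{j}$) for each group $g(j)$, and decompose the task-specific parameter for index $i \in g(j)$ as
$$\delta_{\tau,i}^{j} = z_{\tau,i} \cdot z_\tau^{j} \cdot w_{\tau,i}.$$
The expected $L_0$-norm is then given by
$$\mathbb{E}\big[\|\delta_\tau\|_0\big] = \sum_{j=1}^{G} \sum_{i \in g(j)} \mathbb{E}\big[\mathbb{1}\{z_{\tau,i} \cdot z_\tau^{j} > 0\}\big] = \sum_{j=1}^{G} \sum_{i \in g(j)} \sigma\!\left(\alpha_{\tau,i} - \log\frac{-l}{r}\right) \cdot \sigma\!\left(\alpha_\tau^{j} - \log\frac{-l}{r}\right).$$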
We can train with gradient-based optimization as before. Parameters in a group are encouraged by the regularizer to be removed jointly.
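To make the group decomposition concrete, here is a small Python sketch of the structured diff vector and its expected $L_0$-norm. The function names, the list-of-index-lists representation of groups, and the default stretch constants are assumptions for illustration; in the paper's setting each group would correspond to one matrix or bias vector of the pretrained model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def structured_diff(z, z_group, w, groups):
    """delta_i = z_i * z_group[j] * w_i for every index i in group j."""
    delta = [0.0] * len(w)
    for j, indices in enumerate(groups):
        for i in indices:
            delta[i] = z[i] * z_group[j] * w[i]
    return delta


def expected_l0_structured(alpha, alpha_group, groups, l=-1.5, r=1.5):
    """Sum over groups of P(z_i > 0) * P(z_group[j] > 0) for i in group j."""
    shift = math.log(-l / r)
    total = 0.0
    for j, indices in enumerate(groups):
        p_group = sigmoid(alpha_group[j] - shift)
        total += sum(sigmoid(alpha[i] - shift) * p_group for i in indices)
    return total


groups = [[0, 1], [2, 3, 4]]          # toy partition of five parameter indices
z = [1.0, 1.0, 1.0, 0.0, 1.0]         # per-parameter masks
z_group = [0.0, 1.0]                  # group 0 is switched off entirely
w = [0.3, -0.2, 0.5, 0.7, -0.1]
delta = structured_diff(z, z_group, w, groups)
```

Setting a single group mask to zero removes every diff in that group at once, while parameters inside an active group can still be pruned individually.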
Experiments
For evaluation we use the GLUE benchmark (Wang et al., 2019a) as well as the SQuAD extractive question answering dataset (Rajpurkar et al., 2016). Following Adapters (Houlsby et al., 2019), we test our approach on the following subset of the GLUE tasks: Multi-Genre Natural Language Inference (MNLI), where the goal is to predict whether the relationship between two sentences is entailment, contradiction, or neutral (we test on both MNLI$_m$ and MNLI$_{mm}$, which respectively test on matched/mismatched domains); Quora Question Pairs (QQP), a classification task to predict whether two questions are semantically equivalent; Question Natural Language Inference (QNLI), which must predict whether a sentence is a correct answer to the question; Stanford Sentiment Treebank (SST-2), a sentence classification task to predict the sentiment of movie reviews; Corpus of Linguistic Acceptability (CoLA), where the goal is to predict whether a sentence is linguistically acceptable or not; Semantic Textual Similarity Benchmark (STS-B), which must predict a similarity rating between two sentences; Microsoft Research Paraphrase Corpus (MRPC), where the goal is to predict whether two sentences are semantically equivalent; Recognizing Textual Entailment (RTE), which must predict whether a second sentence is entailed by the first. The benchmark uses Matthews correlation for CoLA, Spearman correlation for STS-B, F1 score for MRPC/QQP, and accuracy for MNLI/QNLI/SST-2/RTE.
For the main experiments and analysis, we use the BERT$_{\text{LARGE}}$ model from Devlin et al. (2019) to compare against the adapter-based approach of Houlsby et al. (2019). Our implementation is based on the Hugging Face Transformers library (Wolf et al., 2019).
Baselines
We compare both structured and non-structured variants of diff pruning against the following baselines: Full finetuning, which fully finetunes BERT$_{\text{LARGE}}$ as usual; Last layer finetuning, which only finetunes the penultimate layer along with the final output layer (Wu et al. (2020) observe that finetuning later layers generally performs better than finetuning earlier layers); Adapters from Houlsby et al. (2019), which train task-specific bottleneck layers between each layer of a pretrained model, where parameter-efficiency can be controlled by varying the size of the bottleneck layers; and Non-adaptive diff pruning, which performs diff pruning just based on magnitude pruning (i.e., we obtain $\theta_\tau$ through usual finetuning, set $\delta_\tau = \theta_\tau - \theta$, and then apply magnitude pruning followed by additional finetuning on $\delta_\tau$). For diff pruning we set our target sparsity rate to 0.5% and investigate the effect of different target sparsity rates in the Analysis section.
Implementation details and hyperparameters
Diff pruning introduces additional hyperparameters $l, r$ (for stretching the Hard-Concrete distribution) and $\lambda$ (for weighting the approximate $L_0$-norm penalty). We found a single setting of these hyperparameters to work well across all tasks. We also initialize the weight vector $w_\tau$ to $\mathbf{0}$, and $\alpha_\tau$ to a positive vector to encourage $z_\tau$ to be close to $\mathbf{1}$ at the start of training. (These values were found via a light hyperparameter search on the SST-2 validation set.) While we mainly experiment with BERT models to facilitate comparison against existing work, in preliminary experiments we found these hyperparameters to work for finetuning RoBERTa (Liu et al., 2019c) and XLNet (Yang et al., 2019) models as well.
For all tasks we initially train for 3 epochs and perform a hyperparameter search over batch size and learning rate. (However, we found the default settings used for regular finetuning as suggested in the original BERT paper to work well for most tasks.) Finetuning with the fixed mask after projecting onto the $L_0$-ball with magnitude pruning is done for 3 epochs with the same learning rate for all datasets, except for the MRPC/STS-B/RTE/SST-2 datasets, where we finetune for 5 epochs. The exact hyperparameters for each task are given in section A.1 of the appendix. Grouping for the structured version of diff pruning is based on the matrix/bias vectors (i.e. parameters that belong to the same matrix or bias vector are assumed to be in the same group), which results in 393 groups. (This definition of groups is implementation-specific since it depends on how one concatenates the input vector before each affine layer. Our grouping is based on Hugging Face's BERT implementation at commit 656e1386a296d696327a9db37de2ccccc79e2cc7.) We found this simple definition to work well compared to alternative definitions (e.g. based on individual neurons).
Results
Our main results on the GLUE benchmark are shown in Table 1. Structured diff pruning can match the performance of a fully finetuned model while only requiring 0.5% additional parameters per task. Diff pruning without structured sparsity also performs well, though slightly worse than the structured approach. Non-adaptive diff pruning, which magnitude prunes the diff vector without learning the binary mask $z_\tau$, performs significantly worse, indicating the importance of learning the masking vector. Compared to Adapters, diff pruning obtains similar performance while requiring many fewer parameters per task, making it a potential alternative for parameter-efficient transfer learning. (Comparing storage costs is a bit more challenging as it is implementation-specific. Diff pruning incurs additional storage cost due to storing the nonzero positions of the diff vector. See the Storage cost section for a storage comparison against Adapters assuming float32 for weights and int32 for positions.)
Results on SQuAD
To demonstrate the effectiveness of our approach beyond the GLUE tasks, we additionally experiment on SQuAD (Rajpurkar et al., 2016), an extractive question answering dataset where the model has to select the answer span to a question given a Wikipedia paragraph. To make direct comparisons with Houlsby et al. (2019), we run all experiments on SQuAD v1.1. For diff pruning, we use the same general hyperparameters as our full finetuning baseline (see section A.1). As shown in Figure 1 (right), diff pruning is able to achieve comparable or better performance with only a small number of additional parameters. Interestingly, diff pruning measurably improves upon the full finetuning baseline while modifying fewer parameters, which indicates that diff pruning can have a useful regularization effect on top of parameter-efficiency.
Analysis
In Figure 1 (left), we plot results on the GLUE validation set averaged across all tasks at several target sparsity rates for the different baselines. Structured diff pruning consistently outperforms the non-structured and non-adaptive variants across different sparsity rates. The advantage of adaptive methods becomes more pronounced at extreme sparsity rates. In Table 2, we report the breakdown of accuracy of structured diff pruning across different tasks and sparsity rates, where we observe that different tasks have different sensitivity to target sparsity rates. This suggests that we can obtain even greater parameter-efficiency through targeting task-specific sparsity rates in the diff vector.
Structured vs. Non-structured Diff Pruning
Structured diff pruning introduces an additional mask per group, which encourages pruning of entire groups. This is less restrictive than traditional group sparsity techniques that have been used with $L_0$-norm relaxations, which force all parameters in a group to share the same mask (Louizos et al., 2018; Wang et al., 2019b). However, we still expect entire groups to be pruned out more often, which might bias the learning process towards either eliminating completely or clustering together nonzero diffs. In Table 3, we indeed find that structured diff pruning leads to finetuned models that are much more likely to leave entire groups unchanged from their pretrained values (zero diffs).
Task-specific Sparsity
Different layers of pretrained models have been argued to encode different information (Liu et al., 2019a; Tenney et al., 2019). Given that each task will likely recruit different kinds of language phenomena embedded in the hidden layers, we hypothesize that diff pruning will modify different parts of the pretrained model through task-specific finetuning. Figure 2 shows the percentage of nonzero diff parameters attributable to the different layers for each task. We find that different tasks indeed modify different parts of the network, although there are some qualitative similarities between some tasks, for example between QNLI & QQP (both must encode questions), and MRPC & STS-B (both must predict similarity between sentences). The embedding layer is very sparsely modified for all tasks. While some of the variation in the sparsity distributions is due to simple randomness, we do observe some level of consistency over multiple runs of the same task, as shown in section A.2 of the appendix.
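The per-layer statistic discussed above (the share of nonzero diff parameters attributable to each layer) can be computed along the following lines. This is a sketch under assumptions: the function name, the dictionary keyed by layer name, and the toy values are invented for illustration and do not reproduce the figures in the paper.

```python
def nonzero_share_by_layer(diff_by_layer):
    """Map layer name -> percentage of all nonzero diff entries in that layer."""
    counts = {
        name: sum(1 for v in values if v != 0.0)
        for name, values in diff_by_layer.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No parameter was modified, so every layer's share is zero.
        return {name: 0.0 for name in counts}
    return {name: 100.0 * c / total for name, c in counts.items()}


# Toy diff vectors for three hypothetical layers of one task.
diff_by_layer = {
    "embeddings": [0.0, 0.0, 0.0, 0.1],
    "layer_1": [0.2, 0.0, -0.3, 0.0],
    "layer_2": [0.0, 0.5, 0.0, 0.0],
}
shares = nonzero_share_by_layer(diff_by_layer)
```

Comparing these dictionaries across tasks, or across repeated runs of one task, gives the kind of layer-wise profile that the analysis relies on.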
The ability to modify different parts of the pretrained model for each task could explain the improved parameter-efficiency of our approach compared to Houlsby et al. (2019)'s Adapters, which can only read/write to the pretrained model at certain points of the computational graph. (To simulate this restricted setting, we tried applying diff pruning only on the fully-connected layers after the self-attention layers, and observed much worse performance.) This potentially suggests that Adapters with more fine-grained access into model internals (e.g. Adapters for key/value/query transformations) might result in even greater parameter-efficiency. While left as future work, we also note that diff pruning can be applied in conjunction with Adapters, which might further improve results.
Effect of L0-ball projection
Applying magnitude pruning to project onto the L0-ball was crucial in achieving exact sparsity targets. As shown in Table 4, we observed little loss in performance through this approach. We reiterate that it was crucial to finetune with a fixed mask, even for the approach which does not apply magnitude pruning. (Without fixed-mask finetuning, GLUE performance decreases from 84.9 to 81.4.)
Comparison against BERT compression
Direct BERT compression methods also provide a straightforward approach to parameter-efficient transfer learning. Here we compare diff pruning against existing BERT compression methods, in particular DistilBERT (Sanh et al., 2019), MobileBERT (Sun et al., 2020b) and TinyBERT (Jiao et al., 2020). In these experiments we apply diff pruning on the smaller BERT$_{\text{BASE}}$ model as these works typically utilize BERT$_{\text{BASE}}$ as the baseline. As shown in Table 5, we observe that diff pruning is more parameter-efficient when considering all GLUE tasks while maintaining better performance. Of course, BERT compression methods typically have faster inference time (e.g. TinyBERT$_4$ is 9.4× faster than BERT$_{\text{BASE}}$). However, we note that diff pruning can be applied on these methods, which may further improve parameter-efficiency while maintaining fast inference.
Storage cost
Finally, Table 6 shows the actual memory requirements for diff pruning compared to Adapters for a Python implementation. While diff pruning requires storing positions in addition to the weights (unlike Adapters which can just store the weights), diff pruning is still more storage-efficient due to the greater parameter-efficiency.
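A back-of-the-envelope version of this storage accounting can be written as follows, assuming float32 weights and int32 positions as in the comparison above. The function names and the example setting (a 340M-parameter model, a 0.5% diff, nine tasks) are illustrative assumptions; the sketch ignores task-specific output layers and any file-format overhead, so it is not a reproduction of Table 6.

```python
FLOAT32_BYTES = 4  # one stored weight
INT32_BYTES = 4    # one stored position


def full_finetuning_bytes(num_params, num_tasks):
    """Every task keeps its own full copy of the model."""
    return num_tasks * num_params * FLOAT32_BYTES


def diff_pruning_bytes(num_params, num_tasks, sparsity):
    """One shared model plus, per task, (position, weight) pairs for the diff."""
    shared = num_params * FLOAT32_BYTES
    nonzero = round(num_params * sparsity)
    per_task = nonzero * (FLOAT32_BYTES + INT32_BYTES)
    return shared + num_tasks * per_task


num_params = 340_000_000   # roughly the size of a large BERT model
full = full_finetuning_bytes(num_params, num_tasks=9)
diff = diff_pruning_bytes(num_params, num_tasks=9, sparsity=0.005)
```

Each stored diff entry costs twice as much as a plain weight because of its position, yet the per-task cost stays a small fraction of the shared model, so the total grows slowly with the number of tasks.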
Discussion and caveats
For training, our approach requires more memory than usual finetuning due to additionally optimizing $\alpha_\tau$ and $w_\tau$. Since the majority of GPU memory is typically utilized by a minibatch's intermediate layers, this did not present a significant challenge for the pretrained models that we experimented with in this study. However, this could present an issue as model sizes get larger and larger. After training, storing the task-specific diff vector requires storing a compressed version with both the nonzero positions and weights, which incurs additional storage requirements. Finally, while training efficiency was not a primary concern of this work, diff pruning was also slower to train per minibatch than regular finetuning.
Related Work
Multi-task learning (Caruana, 1997), broadly construed, aims to learn models and representations that can be utilized across a diverse range of tasks, and offers a natural approach to training parameter-efficient deep models. Several works have shown that a single BERT model can obtain good performance across multiple tasks when jointly trained (Liu et al., 2019b; Clark et al., 2019; Stickland & Murray, 2019). An alternative approach to multi-task learning that does not require access to all tasks during training involves training smaller task-specific layers that interact with a fixed pretrained model (Rebuffi et al., 2018; Zhang et al., 2020a). In particular, Adapters (Rebuffi et al., 2018), which learn to read and write to layers of a shared model, have been applied to obtain parameter-efficient BERT models (Houlsby et al., 2019; Pfeiffer et al., 2020a, b, c). In recent work, Li & Liang (2021) and Qin & Eisner (2021) explore the use of learned prompts on top of pretrained models to obtain task-specific models. Yet another line of work targets extreme parameter-efficiency through task-agnostic sentence representations that can be used without finetuning for downstream tasks (Le & Mikolov, 2014; Kiros et al., 2015; Wieting et al., 2016; Hill et al., 2016; Arora et al., 2017; Conneau et al., 2017; Cer et al., 2018; Zhang et al., 2018; Subramanian et al., 2018; Reimers & Gurevych, 2019; Zhang et al., 2020b). These feature-based transfer learning methods are, however, generally outperformed by fully finetuned models (Howard & Ruder, 2018).
Model compression
There has been much recent work on compressing models pretrained with self-supervision (see Ganesh et al. (2020) for a recent survey). A particularly promising line of work focuses on obtaining smaller pretrained models (for subsequent finetuning) through weight pruning (Gordon et al., 2020; Sajjad et al., 2020; Chen et al., 2020) and/or knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Turc et al., 2019; Jiao et al., 2020; Sun et al., 2020b). It would be interesting to see whether our approach can be applied on top of these smaller pretrained models for even greater parameter-efficiency.
Learning to mask
Our work is closely related to the line of work on learning to mask parts of deep networks with differentiable relaxations of binary masks for model pruning and parameter sharing (Wang et al., 2019b; Zhao et al., 2020; Sanh et al., 2020; Radiya-Dixit & Wang, 2020; Mallya et al., 2018; Guo et al., 2019; Sun et al., 2020a; Cao et al., 2021). While these works also enable parameter-efficient transfer learning, they generally apply the masks directly on the pretrained parameters instead of on the difference vector as in the present work.
Regularization towards pretrained models
Finally, diff pruning is also related to works which regularize the learning process towards pretrained/shared models for continual learning (Rusu et al., 2016; Kirkpatrick et al., 2017; Schwarz et al., 2018), domain adaptation (Wiese et al., 2017; Miceli Barone et al., 2017), and stable finetuning (Lee et al., 2020). These works typically do not utilize sparse regularizers and target a different goal than parameter-efficiency.
Conclusion
We propose diff pruning as a simple approach for parameter-efficient transfer learning with pretrained models. Experiments on standard NLP benchmarks and models show that diff pruning can match the performance of fully finetuned baselines while requiring only a few additional parameters per task, and can sometimes have a regularization effect and improve upon regular finetuning. We also propose a structured variant of diff pruning which provides further improvements. Avenues for future work include (i) injecting parameter-efficiency objectives directly into the pretraining process (to pretrain models that are better suited towards sparse transfer learning), and (ii) combining diff pruning with other techniques (e.g. adapters, model compression) to achieve even greater parameter-efficiency.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable feedback on the initial draft. AMR was supported by NSF 1704834 and NSF Career 2037519.
References
Appendix
A.1 Hyperparameters
Table 7 shows hyperparameters we used for training GLUE tasks. For SQuAD v1.1 experiments, we ran distributed training across 8 GPUs, and used per gpu batch size 3, maximum sequence length 384, document stride 128, learning rate , number of initial training epochs 2 and number of finetuning epochs 2.
A.2 Consistency of Nonzero Parameters
Figure 3 shows the percentage of modified parameters attributable to each layer across 5 runs of SST-2. We find that there is nontrivial variation in sparsity across runs, but also a degree of consistency. For example, the first layer is modified considerably more than other layers across all runs.