AdapterDrop: On the Efficiency of Adapters in Transformers

Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, Iryna Gurevych

Introduction

While transfer learning has become the go-to method for solving NLP tasks Pan and Yang (2010); Torrey and Shavlik (2010); Ruder (2019); Howard and Ruder (2018); Peters et al. (2018), transformer-based models are notoriously deep, requiring millions or even billions of parameters Radford et al. (2018); Devlin et al. (2019); Radford et al. (2019); Liu et al. (2019); Brown et al. (2020). This results in slow inference and large storage requirements.

At least three independent lines of research have recently evolved to tackle these shortcomings. (1) Smaller and faster models that are either distilled or trained from scratch Sanh et al. (2019); Sun et al. (2020); Bai et al. (2021); Wang et al. (2020). (2) Robustly trained transformers in which the model depth can be reduced at run-time, thereby decreasing inference time dynamically Fan et al. (2020); Elbayad et al. (2020); Xin et al. (2020); Hou et al. (2020). (3) Adapters, which, instead of fully fine-tuning the model, only train a newly introduced set of weights at every layer, thereby sharing the majority of parameters between tasks Houlsby et al. (2019); Bapna and Firat (2019); Pfeiffer et al. (2020a). Adapters have been shown to work well for machine translation (Bapna and Firat, 2019), cross-lingual transfer Pfeiffer et al. (2020b, 2021b); Üstün et al. (2020); Vidoni et al. (2020); Ansell et al. (2021), community QA (Rücklé et al., 2020), and task composition for transfer learning Stickland and Murray (2019); Pfeiffer et al. (2021a); Lauscher et al. (2020); Wang et al. (2021); Poth et al. (2021). Despite their recent popularity, the computational efficiency of adapters has not been explored beyond parameter efficiency.

We close this gap and establish the computational efficiency of two adapter architectures at training and inference time. We investigate different strategies to further improve the efficiency of adapter-based models by incorporating ideas from all three directions mentioned above. Our strategies rely on dropping out adapters from transformers, at training and inference time, resulting in models that are dynamically adjustable regarding the available computational resources. Our approaches are agnostic to the pre-trained transformer model (e.g., base, large), which makes them broadly applicable.

We are the first to establish the computational efficiency of adapters compared to full fine-tuning. We show that the training steps of adapters can be up to $60\%$ faster than full model fine-tuning with common hyperparameter choices, while being 4–6% slower at inference. Hence, adapters are a suitable choice for researchers interested in achieving faster training times, or when requiring extensive hyperparameter tuning.

We propose AdapterDrop, the efficient and dynamic removal of adapters with minimal impact on the task performances. We show that dropping adapters from lower transformer layers considerably improves the inference speed in multi-task settings. For example, with adapters dropped from the first five layers, AdapterDrop is $39\%$ faster when performing inference on 8 tasks simultaneously. This can be beneficial for researchers working on models that need to make multiple predictions on each input.

We prune adapters from adapter compositions in AdapterFusion (Pfeiffer et al., 2021a) and retain only the most important adapters after transfer learning, resulting in faster inference while maintaining the task performances entirely. This is suitable for settings with little labeled training data, where AdapterFusion can achieve ample improvements over standard single task models.

Efficiency of Adapters

We first establish the computational efficiency of adapters without AdapterDrop. As illustrated in Figure 1, significant differences exist in the forward and backward pass when fine-tuning adapters compared to fully fine-tuning the model. In the forward pass, adapters add complexity with the additional components; however, it is not necessary to backpropagate through the entire model during the backward pass. We compare the training and inference speed of full model fine-tuning against the adapter architectures of Houlsby et al. (2019) and Pfeiffer et al. (2021a) (depicted in Figure 1) using the AdapterHub.ml framework Pfeiffer et al. (2020a). We conduct our measurements with the transformer configuration of BERT base and verify them with different GPUs. (We experiment with newer and older GPUs, Nvidia V100 and Titan X, respectively; see Appendix A.1 for details.)

We provide measurements corresponding to common experiment configurations in Table 1.

Adapters can be considerably faster compared to full model fine-tuning (60% faster in some configurations). The two adapter architectures differ only marginally in terms of training efficiency: due to its simpler architecture, training steps of the Pfeiffer adapters are slightly faster. The magnitude of the differences depends on the input size; the available CUDA cores are the primary bottleneck (we include detailed plots in Appendix G.1). We do not observe any particular differences between adapters and full fine-tuning regarding the training convergence. We also pre-train adapters with masked language modeling, finding that this does not yield better results (Appendix B).

The training speedup can be explained by the decreased overhead of gradient computation. Most of the parameters are frozen when using adapters and it is not necessary to backpropagate through the first components (see Figure 1).

Inference.

The two adapter architectures are 94–96% as fast as fully fine-tuned models, which varies depending on the input size. This can have a considerable impact when deployed at scale.

AdapterDrop

We have established that adapters are more efficient in terms of training time; however, there is a perpetual need for sustainable and efficient models Strubell et al. (2019). Backpropagating through as few layers as possible would further improve the efficiency of training adapters. The efficiency for inference can be improved by sharing representations at lower transformer layers when simultaneously performing inference for multiple tasks, that is, when performing multiple independent classifications on the same input. We establish this in Table 2, finding that models are up to $8.4\%$ faster with every shared layer (16 tasks).

Motivated by these observations, we propose AdapterDrop: dynamically removing adapters from lower transformer layers (depicted in Figure 1). AdapterDrop is similar to dropping out entire transformer layers (Fan et al., 2020) but is specialized to adapter settings, where lower layers often have a small impact on the task performances (Houlsby et al., 2019).

We study two training methods for AdapterDrop: (1) Specialized AdapterDrop: Removing adapters from the first $n$ transformer layers, where $n$ is fixed during training. This yields separate models for each possible $n$. (2) Robust AdapterDrop: Drawing the integer $n$ randomly from $[0, 11]$ for each training batch. This yields one robust model that is applicable to a varying number of dropped layers. (We also explored dropping adapters from randomly chosen layers instead of early layers; this generally performs worse and requires selecting a suitable dropout rate.) We study the effectiveness of AdapterDrop on the devsets of the GLUE benchmark Wang et al. (2018) using RoBERTa base Liu et al. (2019). The detailed setup is listed in Appendix A.2.
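The two training variants can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes a 12-layer model with $n$ drawn uniformly from the integers 0 to 11.

```python
import random

NUM_LAYERS = 12  # assumption: a 12-layer base-size transformer


def active_adapter_layers(n_dropped, num_layers=NUM_LAYERS):
    """Return the indices of layers that keep their adapter when the first n are dropped."""
    if not 0 <= n_dropped <= num_layers:
        raise ValueError("n_dropped must be between 0 and num_layers")
    return list(range(n_dropped, num_layers))


def sample_robust_n(rng, num_layers=NUM_LAYERS):
    """Robust variant: draw n for one training batch (assumed uniform over 0..num_layers-1)."""
    return rng.randint(0, num_layers - 1)


# Specialized variant: n is fixed for the whole training run.
print(active_adapter_layers(5))

# Robust variant: a fresh n is drawn for every batch.
rng = random.Random(0)
for step in range(3):
    n = sample_robust_n(rng)
    print(step, n, active_adapter_layers(n))
```

At inference time, the same helper could be called with any desired $n$ to decide which adapters to skip.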

Figure 2 shows that specialized AdapterDrop maintains good results even with several dropped layers. With the first five layers dropped, specialized AdapterDrop maintains 97.1% of the original performance (averaged over all eight GLUE tasks; see Table 8). Moreover, robust AdapterDrop achieves comparable results, and with five layers dropped it maintains 95.4% of the original performance (on average). The advantage of robust over specialized AdapterDrop is that the robust variant can be dynamically scaled. Based on the currently available computational resources, robust AdapterDrop can (de)activate layers with the same set of parameters, whereas specialized AdapterDrop needs to be trained for every setting explicitly.

The efficiency gains can be large. When performing inference for multiple tasks simultaneously, we measure inference speedups of 21–42% with five dropped layers, depending on the number of simultaneous tasks (Table 2; for more details see Appendix G.2). Training of our robust adapters is also more efficient, which increases the speed of training steps by 26%. Every dropped adapter improves the speed of training steps by 4.7%, and we drop on average 5.5 adapters when training robust adapter models (more hyperparameter settings and details are given in Appendix G.2).
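The 26% figure can be checked from the two numbers given in the text. The following worked calculation assumes that $n$ is drawn uniformly from the integers 0 to 11 and that per-layer speedups add up linearly:

$$\mathbb{E}[n] = \frac{0 + 11}{2} = 5.5, \qquad 5.5 \times 4.7\% \approx 25.9\% \approx 26\%.$$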

Efficiency of AdapterFusion

AdapterFusion (Pfeiffer et al., 2021a) leverages the knowledge of several adapters from different tasks and learns an optimal combination of the adapters’ output representations for a single target task (see Figure 3). AdapterFusion (AF) is particularly useful for small training sets where learning adequate models is difficult. Despite its effectiveness, AF is computationally expensive because all included adapters are passed through sequentially. (We also tested AF with parallel operations and found no efficiency gains; see Appendix H.)

Table 3 shows that the differences can be substantial for both training and inference. For instance, compared to a fully fine-tuned model, AF with eight adapters is around 47% slower at training time and 62% slower at inference (all with Pfeiffer adapters and depending on the input size). We provide more measurements in Appendix G.3.

AdapterDrop for AdapterFusion

There exists considerable potential for improving the efficiency of AF, especially at inference time. We address this with two variants of AdapterDrop for AF by (1) removing entire AF layers; (2) pruning the least important adapters from AF models.

We fuse the adapters from all eight GLUE tasks and observe the largest gains of AF on RTE and CoLA. We additionally train robust AF models with the same procedure as in §3. We investigate from how many lower layers we can remove AF at test time while still outperforming the corresponding single-task adapter (without AdapterDrop).

Figure 4 shows that AF performs better than the single-task adapter on RTE until removing AF from the first five layers. This improves the inference efficiency by 26% (we include detailed measurements in Appendix G.4). On CoLA, we observe a different trend. Removing AF from the first layer results in more noticeable performance decreases, achieving lower task performances than the single-task adapter. This is in line with recent work showing that some linguistic tasks heavily rely on information from the first layers (Vulić et al., 2020). We deliberately highlight that AdapterDrop might not be suitable for all tasks. However, Figure 13 shows that CoLA represents the most extreme case. Nevertheless, our results suggest that researchers need to be cautious when removing AdapterFusion layers as there may exist a considerable performance/efficiency tradeoff.

AdapterFusion Pruning

The inference efficiency of AF largely depends on the number of fused adapters, see Table 3. We can, therefore, achieve efficiency improvements by pruning adapters from the trained AF models (depicted in Figure 3). Our hypothesis is that we can safely remove adapters if they are not usually activated by AF, which means that they do not contribute much to the output representations. In each fusion layer, we record the average adapter activations—their relative importance—using all instances of the respective AF training set. We then remove the adapters with lowest activations.
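The pruning criterion can be sketched for a single fusion layer as follows. This is an illustrative sketch only: the function names and the activation values are hypothetical, and the per-instance activations are assumed to be available as plain lists of fusion weights, one weight per adapter.

```python
def average_activations(activations):
    """Average the per-adapter fusion weights over all training instances.

    activations: list of per-instance lists, one weight per adapter.
    """
    n_adapters = len(activations[0])
    totals = [0.0] * n_adapters
    for weights in activations:
        for i, w in enumerate(weights):
            totals[i] += w
    return [t / len(activations) for t in totals]


def adapters_to_keep(avg_activation, k):
    """Indices of the k adapters with the highest average activation, in original order."""
    ranked = sorted(range(len(avg_activation)),
                    key=lambda i: avg_activation[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical fusion weights for one layer: 3 instances x 4 adapters.
layer_activations = [
    [0.50, 0.05, 0.40, 0.05],
    [0.60, 0.10, 0.25, 0.05],
    [0.40, 0.05, 0.50, 0.05],
]
avg = average_activations(layer_activations)
print(adapters_to_keep(avg, k=2))  # keeps adapters 0 and 2
```

Because the statistics are recorded per fusion layer, different layers may retain different adapters.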

Figure 5 demonstrates that we can remove most adapters in AF without affecting the task performance. With two remaining adapters, we achieve comparable results to the full AF models with eight adapters and improve the inference speed by 68%.

We therefore recommend performing AdapterFusion pruning before deploying these models in practice. This is a simple yet effective technique to achieve efficiency gains even when aiming at maintaining performance entirely.

Conclusion

Adapters have emerged as a suitable alternative to full model fine-tuning, and their most widely claimed computational advantage is the small model size. In this work, we have demonstrated that the advantages of adapters go far beyond mere parameter efficiency. Even without our extensions, the training steps of two common adapter architectures are up to 60% faster. However, these improvements come at the cost of 4–6% slower inference speed. Thus, if training is more important, adapters can be advantageous over full model fine-tuning.

AdapterDrop expands these advantages by dropping a variable number of adapters from lower transformer layers. We dynamically reduce the computational overhead at run-time when performing inference over multiple tasks and maintain task performances to a large extent. This benefits researchers working on models that need to make multiple independent predictions on a single input.

Finally, we also investigated the computational efficiency of AdapterFusion models. We find that dropping entire AdapterFusion layers comes at a considerable performance/efficiency tradeoff, whereas pruning of the least activated adapters in each layer can improve the model efficiency while maintaining performance entirely.

We believe that our work can be widely extended and that there exist many more directions to obtain efficient adapter-based models. For instance, we could explore more efficient pre-trained adapters, sharing the adapter weights across layers, or pruning adapters from AdapterFusion at training time. In Appendix B, we evaluate MLM pre-trained adapters; our results suggest that different strategies are necessary for adapters as compared to fully fine-tuned transformers, which can serve as a starting point for further experiments. Appendix D shows that an adapter with shared weights across layers achieves comparable results to a standard adapter while drastically reducing the number of parameters. Appendix E shows that we can randomly drop out 75% of the adapters during AdapterFusion training with a minimal impact on the task performance. In the Appendix to this paper, we present preliminary results for several related ideas, which may serve as a starting point for future work.

Acknowledgments

This work has received financial support from multiple sources. (1) The German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. (2) The European Regional Development Fund (ERDF) and the Hessian State Chancellery – Hessian Minister of Digital Strategy and Development under the promotional reference 20005482 (TexPrax). (3) The German Research Foundation (DFG) as part of the Research Training Group KRITIS No. GRK 2222. (4) The German Federal Ministry of Education and Research (BMBF) as part of the Software Campus program under the promotional reference 01|S17050. (5) The LOEWE initiative (Hesse, Germany) within the emergenCITY center. (6) The German Research Foundation (DFG) as part of the UKP-SQuARE project (grant GU 798/29-1). Finally, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References

Appendix A Measuring Computational and Task Performance

We use Python 3.6, PyTorch 1.5.1, CUDA 10.1 for all measurements. We repeat them with two different GPUs: an NVIDIA Tesla V100 PCIe (32GB) and an NVIDIA Titan X Pascal (12GB). We make use of the torch.cuda.Event class and torch.cuda.synchronize to measure only the exact period of time of a training (or inference) step; this is necessary due to the asynchronous nature of the command execution on CPU and GPU. For both inference and training, we repeat the respective step 300 times. We report the median to mitigate the impact of outliers caused by GPU warmup.
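The repeat-and-take-the-median protocol can be sketched with a CPU-only stand-in. The sketch below deliberately replaces the GPU event timers described above with time.perf_counter so that it runs without a GPU; the names median_step_time and dummy_step are hypothetical, and the dummy workload is not a real training step.

```python
import statistics
import time


def median_step_time(step_fn, repetitions=300):
    """Run step_fn repeatedly and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        step_fn()
        timings.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to a few slow warmup iterations.
    return statistics.median(timings)


def dummy_step():
    # Hypothetical stand-in for one training or inference step.
    sum(i * i for i in range(1000))


print(f"median step time: {median_step_time(dummy_step):.6f} s")
```

On a GPU, the timer would additionally have to wait for all queued kernels to finish before reading the clock, which is the role of the synchronization call mentioned above.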

We define the relative speed of an adapter compared to full model fine-tuning as $\frac{S_{f}}{S_{a}}$, where $S_{a}$ and $S_{f}$ are the time of one step with the adapter model and the fully fine-tuned model, respectively. For example, a relative speed of 1.5 means that the adapter model can perform 1.5 steps in the time the fully fine-tuned model performs one step.

Speedup.

Speedup describes the positive change in relative speed of an adapter model when using AdapterDrop (or another method). A speedup of $p$% means that the adapter model with AdapterDrop requires only $(1-p/100)\times$ the runtime of the adapter model without AdapterDrop.

The speedups of AdapterDrop (and AdapterFusion) are additive: if dropping one layer results in a $p$% speedup, dropping two layers results in a $2p$% speedup, etc.
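These definitions can be made concrete with a small worked sketch. The step times below are hypothetical, and relative speed is computed as the full model's step time divided by the adapter model's step time, which is the reading implied by the 1.5-steps example.

```python
def relative_speed(time_full, time_adapter):
    """Steps the adapter model completes in the time of one full fine-tuning step."""
    return time_full / time_adapter


def runtime_fraction(speedup_percent):
    """Fraction of the original runtime that remains after a speedup of p percent."""
    return 1.0 - speedup_percent / 100.0


def additive_speedup(per_layer_percent, dropped_layers):
    """Total speedup in percent, assuming per-layer speedups add up linearly."""
    return per_layer_percent * dropped_layers


# Hypothetical step times in milliseconds.
print(relative_speed(time_full=150.0, time_adapter=100.0))  # 1.5

# Five dropped layers at a per-layer speedup of 4.7%.
total = additive_speedup(4.7, 5)
print(total, runtime_fraction(total))
```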

A.2 Task Performances

We study the task performances of adapter models on the popular GLUE benchmark Wang et al. (2018). Following Devlin et al. (2019), we exclude WNLI because of the problematic data construction (see https://gluebenchmark.com/faq). We perform our analyses using RoBERTa base (Liu et al., 2019) as our pre-trained model and report the mean and standard deviation over three runs of the best development performance evaluated after every epoch. We train larger data sets (SST-2, MNLI, QNLI, and QQP) for 10 epochs and the rest of the data sets for 20 epochs. We use a batch size of 32 and, if not otherwise noted, the default hyperparameters for adapter fine-tuning as in Pfeiffer et al. (2021a).

Appendix B Adapter Initialization and Convergence

Besides measuring training and inference time, we are interested in (1) how using adapters compares to standard RoBERTa-base with regard to downstream task convergence, and (2) whether initializing adapters with pre-trained weights using masked language modeling can lead to faster convergence.

First, we compare RoBERTa-base with adapter models using the architecture proposed by Pfeiffer et al. (2021a). Second, we pretrain an adapter with masked language modeling (MLM) using documents from the English Wikipedia. (We used a recent dump of English Wikipedia. We train with a batch size of 64 and for 250k steps such that no sentence was used twice.) The results for both experiments are visualized in Figure 12. When comparing RoBERTa-base with randomly initialized adapters, we find that adapters do not come at the cost of requiring more training steps for convergence (1). For several of the eight GLUE tasks, we observe similar convergence behavior with the standard RoBERTa-base model and its counterpart using adapters.

Further, we observe across all tasks that initializing the adapter weights with MLM pre-training does not have a substantial impact on the downstream task convergence (compared to a randomly initialized adapter). Thus, we find no evidence that pre-training of adapters with our masked language modeling objective leads to better convergence performance in our experiments (2).

Appendix C Detailed Results: AdapterDrop Task Performances

We plot the detailed task performances of AdapterDrop with the different training strategies in Figure 13. The relative differences of AdapterDrop to a standard adapter with no AdapterDrop are given in Table 8.

Appendix D Adapter with Cross-Layer Parameter Sharing

We can further reduce the number of parameters required for each task by sharing the weights of the adapters across all transformer layers. This is similar to weight sharing in ALBERT (Lan et al., 2020), but specialized on adapters and can therefore be applied to a wide range of pre-trained models.

We use the Pfeiffer adapter architecture in our experiments with the same hyperparameters as in Appendix A.2. Because cross-layer parameter sharing reduces the capacity of adapter models, we study the impact of the adapter compression rate. The compression rate refers to the down-projection factor in the adapter’s bottleneck layer and thus impacts its capacity (the compression rate specifies by how much ‘FF Down’ in Figure 1 compresses the representations). The standard compression rate is 16, and smaller values result in a larger model capacity.

Table 6 shows that cross-layer parameter sharing with the same compression rate of 16 largely maintains the performance compared to separate weights with an average difference of 2.35%. With a smaller compression rate of 4, we close this gap by more than 50% while still requiring 66% fewer parameters (even smaller compression rates do not yield similar gains). The resulting models are light-weight: our shared adapter with a compression rate of 16 requires only 307KB storage space.
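The order of magnitude of the parameter savings can be sanity-checked with a back-of-the-envelope count. The sketch below rests on several assumptions that are not stated in the text: a hidden size of 768, 12 transformer layers with one adapter each, and an adapter that consists only of a down-projection and an up-projection with biases. The function name is hypothetical.

```python
def adapter_parameters(hidden_size, compression_rate):
    """Parameters of one bottleneck adapter: down- and up-projection with biases."""
    bottleneck = hidden_size // compression_rate
    down = hidden_size * bottleneck + bottleneck   # weights + bias of the down-projection
    up = bottleneck * hidden_size + hidden_size    # weights + bias of the up-projection
    return down + up


HIDDEN_SIZE = 768  # assumption: base-size transformer
NUM_LAYERS = 12    # assumption: one adapter per transformer layer

separate_16 = NUM_LAYERS * adapter_parameters(HIDDEN_SIZE, 16)  # separate weights per layer
shared_16 = adapter_parameters(HIDDEN_SIZE, 16)                 # one adapter shared by all layers
shared_4 = adapter_parameters(HIDDEN_SIZE, 4)                   # shared, larger bottleneck

print(separate_16, shared_16, shared_4)
print(f"reduction of shared (rate 4) vs. separate (rate 16): {1 - shared_4 / separate_16:.1%}")
```

Under these assumptions, the shared adapter with a compression rate of 4 needs roughly two thirds fewer parameters than separate adapters with a compression rate of 16, which is in line with the reduction reported above.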

Appendix E Training AdapterFusion with Dropout

We investigate the random dropout of adapters from AdapterFusion during training (using our eight task adapters as in §4) to improve the speed of training steps. Each layer randomly selects different adapters to drop out. This means that the model itself may still use the knowledge from all tasks, although not in the layers individually.

Table 7 shows the results for the four smallest GLUE tasks in terms of training data size. The speedup that we achieve with AdapterFusion dropout can be substantial: with a dropout rate of 75% (i.e., dropping out 6 out of our 8 adapters) each training step is 74% faster on average (with a sequence length of 128, a batch size of 32). We observe no clear trend in terms of task performances. Fusion dropout leads to consistent decreases on RTE and CoLA, only a small impact on STS-B (no difference when dropping out 25% of adapters), and yields improvements on MRPC.
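The per-layer random selection can be sketched as follows. This is an illustration under stated assumptions rather than the training code used for the experiments: the function name is hypothetical, and the number of dropped adapters is taken to be the dropout rate times the number of adapters, rounded to the nearest integer.

```python
import random


def fusion_dropout_mask(num_adapters, dropout_rate, rng):
    """Pick which adapters stay active in one fusion layer for one training step."""
    n_drop = int(round(num_adapters * dropout_rate))
    dropped = set(rng.sample(range(num_adapters), n_drop))
    return [i for i in range(num_adapters) if i not in dropped]


rng = random.Random(0)
# Each layer draws its own subset, so different layers may see different adapters.
for layer in range(3):
    print(layer, fusion_dropout_mask(num_adapters=8, dropout_rate=0.75, rng=rng))
```

With 8 adapters and a dropout rate of 75%, each layer keeps 2 adapters per step, matching the example of dropping 6 out of 8 adapters.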

The effectiveness of Fusion dropout, thus, depends on the individual downstream task. Nevertheless, we believe that this method could be suitable, e.g., for resource-constrained settings.

Appendix F Detailed Results: Removing AdapterFusion Layers

The computational overhead of AF can be reduced during inference by decreasing the number of adapters. We investigate how dropping AF layers impacts the performance on the four smallest GLUE tasks (MRPC, STS-B, CoLA, RTE) and visualize the results in Figure 7.

In this experiment we compare the performance of AF with and without AdapterDrop during training. For both, we use standard adapters as well as adapters created via AdapterDrop as basis for AF. Unsurprisingly, the performance of AF without AdapterDrop within the adapters or fusion drops fastest on all four datasets. Using AdapterDrop when creating the adapters, applying AdapterDrop on AF, or the combination of both significantly reduces the performance drop when omitting fusion layers during inference. On RTE and MRPC, multiple AF layers can be omitted while still performing on par with or better than a single-task adapter. We further find this robustness to be task dependent. Even AF with AdapterDrop shows a steep fall in performance on RTE and CoLA, while being relatively stable on MRPC and STS-B, even with most layers omitted.

Appendix G Detailed Efficiency Measurements

In this section, we present detailed results of our efficiency measurements for V100 and TitanX GPUs.

We present the efficiency results for adapters and fully fine-tuned models in Figure 6, where we plot the required time (absolute numbers) during training and inference. The relative speed of adapters compared to fully fine-tuned models is given in Table 9.

G.2 AdapterDrop

In Figure 8, we plot the speed of adapters in a multi-task setting compared to fully fine-tuned models with sequential processing of inputs. In Table 11, we present the relative speed of adapters in this setting and show the speedup gained with AdapterDrop for each dropped layer. The average speedup in Table 2 is calculated as the average speedup over the batch sizes 16, 32 and 64 in Table 11.

Training adapters with dropped layers.

Table 5 shows the speedup of AdapterDrop when training a single adapter. The average speedup for training with AdapterDrop is 4.7% per layer for the V100 and 4.5% for the TitanX. This is the average result over batch sizes 16, 32, 64 and sequence length 64, 128, 256, and 256 (see Table 5).

G.3 AdapterFusion

We plot the speed of AdapterFusion with different numbers of included adapters in Figure 9. In Table 10, we present the relative speed of AdapterFusion compared to a fully-finetuned model and a model with one adapter. This also shows the computational overhead (slowdown) that results from adding more adapters to AdapterFusion.

G.4 AdapterDrop for AdapterFusion

Table 4 shows the speedup gained with AdapterDrop for AdapterFusion during training and inference. Figure 10 shows the required time as a function of the dropped layers.

Appendix H Parallel Implementation of AdapterFusion

AdapterHub’s implementation of AdapterFusion passes through each task adapter sequentially. We hypothesized that a better efficiency can be achieved with parallel processing of adapters. We implement the parallel computation of the different adapters by reformulating the linear layers as two convolutions.

The first convolution is a convolution with a kernel size equal to the hidden dimension of the transformer and output channels equal to the number of adapters times the downprojection dimension of the adapters. The second convolution is a grouped convolution (using the 'groups' parameter in PyTorch, see https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html#torch.nn.Conv1d), which processes the channels in blocks the size of the downprojection dimension. It outputs channels equal to the number of adapters times the hidden dimension.

We show in Figure 11 and in Table 12 that the iterative implementation is faster than the parallel implementation for larger input sizes (e.g., batch sizes greater than). This indicates that once the input can no longer be processed entirely in parallel on the GPU (due to limited CUDA cores) the iterative implementation seems to be more efficient.