Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder

Introduction

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” (John von Neumann)

State-of-the-art pretrained language models (PLMs) in natural language processing (NLP) have used heavily over-parameterized representations consisting of hundreds of millions or billions of parameters to achieve success on a wide range of NLP benchmarks. These models are generally applied to downstream tasks via fine-tuning, which requires updating all parameters and storing one copy of the fine-tuned model per task. This causes substantial storage and deployment costs and hinders the applicability of large-scale PLMs to real-world applications. Additionally, fine-tuning of over-parameterized models on low-resource datasets has been shown to be subject to instabilities and may lead to poor performance.

Inspired by John von Neumann’s quotation, we ask, given that we have already learned general-purpose language representations via a PLM (i.e. we have fit our elephant), how many more parameters do we need to reach state-of-the-art performance on standard NLP tasks? Specifically, we aim to develop practical, memory-efficient methods that train a minimum set of parameters while achieving performance on par with or better than full fine-tuning for state-of-the-art NLP models.

Recent literature has introduced parameter-efficient fine-tuning methods. These approaches generally keep the pretrained model’s parameters fixed and introduce a set of trainable parameters per task, trading off the number of trainable parameters with task performance. At one end of the spectrum, prompts, i.e. natural language descriptions of a task, together with demonstrations have been used to achieve reasonable performance without any parameter updates on some benchmarks but their performance generally lags behind fine-tuned models. They also require huge models to work well but choosing good prompts becomes harder with larger model sizes . Soft prompt methods treat prompts as trainable continuous parameters, which are prepended to the inputs at the input layer or intermediate layers . Such methods, however, often require large models to achieve good performance and are very sensitive to initialization and unstable during training.

The theoretically motivated low-rank methods train a small number of parameters that lie in a low-dimensional subspace using random projections . However, storing the random projection matrices causes substantial memory overhead and leads to slow training times. At the other end of the spectrum, adapter methods that insert trainable transformations at different layers of the pretrained model require more parameters than the aforementioned approaches but are more memory-efficient and obtain performance comparable to full fine-tuning .

In this work, we propose Compacter, a method for fine-tuning large-scale language models with an excellent trade-off between the number of trainable parameters, task performance, and memory footprint, compared to existing methods (see Figure 2). Compacter builds on ideas from adapters, low-rank methods, as well as recent hypercomplex multiplication layers. Similar to adapters, Compacter inserts task-specific weight matrices into a pretrained model’s weights. Each Compacter weight matrix is computed as the sum of Kronecker products between shared “slow” weights and “fast” rank-one matrices defined per Compacter layer (see Figure 3). As a result, Compacter achieves a parameter complexity of $\mathcal{O}(k+d)$ compared to $\mathcal{O}(kd)$ for regular adapters, where the adapters are of size $k\!{\times}\!d$. In practice, Compacter trains $0.047\%$ of a PLM’s parameters. On the standard GLUE and SuperGLUE benchmarks, Compacter outperforms other parameter-efficient fine-tuning methods and obtains performance on par with or better than full fine-tuning. In low-resource settings, Compacter outperforms standard fine-tuning.

In summary, we make the following contributions: 1) We propose Compacter (Compact Adapter) layers, a parameter-efficient method to adapt large-scale language models. 2) We show that Compacter obtains strong empirical performance on GLUE and SuperGLUE. 3) We demonstrate that Compacter outperforms fine-tuning in low-resource settings. 4) We provide a parameter complexity analysis of Compacter, showing that it requires dramatically fewer parameters than adapters and fine-tuning. 5) We provide a systematic evaluation of recent parameter-efficient fine-tuning methods in terms of training time and memory consumption. We release our code to facilitate future work.

Background

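We start by introducing the required background on the Kronecker product and adapter layers.

The Kronecker product between matrix $\bm{A}\in\mathbb{R}^{m\times f}$ and $\bm{B}\in\mathbb{R}^{p\times q}$, denoted by $\bm{A}\otimes\bm{B}\in\mathbb{R}^{mp\times fq}$, is defined as:

$$\bm{A}\otimes\bm{B}=\begin{pmatrix}a_{11}\bm{B}&\cdots&a_{1f}\bm{B}\\ \vdots&\ddots&\vdots\\ a_{m1}\bm{B}&\cdots&a_{mf}\bm{B}\end{pmatrix},\qquad(1)$$

where $a_{ij}$ denotes the element in the $i^{\text{th}}$ row and $j^{\text{th}}$ column of $\bm{A}$.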
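As a quick numerical illustration, the Kronecker product replaces every entry $a_{ij}$ of $\bm{A}$ with the block $a_{ij}\bm{B}$. The sketch below computes this with plain Python lists; the helper name `kron` and the example matrices are illustrative choices, not part of the paper.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    out = []
    for a_row in a:
        for b_row in b:
            # Each output row concatenates a_ij * b_row for every a_ij in a_row.
            out.append([a_ij * b_kl for a_ij in a_row for b_kl in b_row])
    return out


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = kron(A, B)  # a (2*2) x (2*2) matrix made of the blocks a_ij * B

for row in C:
    print(row)
```

For an $m\times f$ matrix and a $p\times q$ matrix, the result has $mp$ rows and $fq$ columns, which is what lets a few small factors describe one large weight matrix.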

2.2 Adapter Layers

Recent work has shown that fine-tuning all parameters of a language model can lead to a sub-optimal solution, particularly for low-resource datasets . As an alternative, Rebuffi et al. and Houlsby et al. propose to transfer a model to new tasks by inserting small task-specific modules called adapter layers within the layers of a pretrained model, as depicted in Figure 2. They then only train adapters and layer normalizations, while the remaining parameters of the pretrained model remain fixed. This approach allows pretrained language models to efficiently adapt to new tasks.

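Each layer of a transformer model is composed of two primary modules: a) an attention block, and b) a feed-forward block. Both modules are followed by a skip connection. As shown in Figure 2, Houlsby et al. suggest inserting an adapter layer after each of these blocks before the skip connection.

The adapter layer for layer $l$ consists of a down-projection $D^{l}\in\mathbb{R}^{k\times d}$, a GeLU non-linearity, and an up-projection $U^{l}\in\mathbb{R}^{d\times k}$, where $k$ is the input dimension and $d$ is the bottleneck dimension of the adapter layer. Adapters are defined as:

$$A^{l}=U^{l}(\text{GeLU}(D^{l}(\bm{x})))+\bm{x},\qquad(2)$$

where $\bm{x}$ is the input hidden state.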
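A minimal sketch of a bottleneck adapter of this kind is shown below, assuming a GeLU non-linearity between a down-projection and an up-projection and a residual connection around the module. The function names, the tiny dimensions, and the zero initialization of the up-projection are illustrative assumptions rather than details taken from the paper.

```python
import math
import random


def gelu(v):
    """Exact GeLU via the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def adapter(x, down, up):
    """Down-project (k -> d), apply GeLU, up-project (d -> k), add the skip."""
    h = [gelu(v) for v in matvec(down, x)]
    return [u + x_i for u, x_i in zip(matvec(up, h), x)]


k, d = 8, 2  # hypothetical input and bottleneck dimensions
rng = random.Random(0)
down = [[rng.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]  # stored as d x k
up = [[0.0] * d for _ in range(k)]  # stored as k x d; zeros are an assumption
x = [rng.gauss(0, 1) for _ in range(k)]
y = adapter(x, down, up)
```

With the up-projection at zero, the adapter initially passes the hidden state through unchanged, so only the $2kd$ projection weights need to be learned to move away from the pretrained behaviour.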

Method

In this section, we present Compacter, a compact and efficient way to adapt large-scale PLMs.

Problem formulation We consider the general problem of fine-tuning large-scale language models, where we are given the training data $\mathcal{D}=\{(\bm{x^{i}},y^{i})\}_{i=1}^{P}$ with $P$ samples. We assume we are also given a large-scale pretrained language model $f_{\bm{\theta}}(.)$ parameterized by $\bm{\theta}$ that computes the output for input $\bm{x^{i}}$ . Our goal is to fine-tune $f_{\bm{\theta}}(.)$ efficiently to enable the model to adapt to new tasks.

3.2 Beyond Hypercomplex Adapters

Prior work indicates that some of the information captured in pretrained models can be ignored for transfer . Similarly, redundancies have been observed in the information captured by adapters, with adapters in lower layers being less important . In addition, sharing adapters across layers leads to a comparatively small drop of performance for some tasks . Motivated by these insights, we propose the following two extensions to make hypercomplex adapters more efficient.

Sharing information across adapters Sharing all adapter parameters across layers is overall too restrictive and is not able to perform on par with fine-tuning or using regular adapters ; however, our decomposition of adapters into $\bm{A_{i}}$ and $\bm{B_{i}}$ matrices as in Eq. (4) allows us to be more flexible. Consequently, we divide our adaptation weights into shared parameters that capture general information useful for adapting to the target task and adapter-specific parameters that focus on capturing information relevant for adapting each individual layer. Specifically, we define $\bm{A_{i}}$ as shared parameters that are common across all adapter layers while $\bm{B_{i}}$ are adapter-specific parameters.

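Low-rank parameterization We parameterize each $\bm{B_{i}}\in\mathbb{R}^{\frac{k}{n}\times\frac{d}{n}}$ as a low-rank matrix given by the product of two low-rank weights $\bm{s_{i}}\in\mathbb{R}^{\frac{k}{n}\times r}$ and $\bm{t_{i}}\in\mathbb{R}^{\frac{d}{n}\times r}$, where $r$ is the rank of the matrix. Combining this with the shared $\bm{A_{i}}$ yields the low-rank parameterized hypercomplex multiplication (LPHM) layer:

$$\bm{W}=\sum_{i=1}^{n}\bm{A_{i}}\otimes\bm{B_{i}}=\sum_{i=1}^{n}\bm{A_{i}}\otimes(\bm{s_{i}}\bm{t_{i}}^{\top}).\qquad(5)$$

In general, we set $r=1$ so that $\bm{B_{i}}$ is a rank-one matrix. Depending on the complexity of the target task, $r$ can be set to a higher value. If factors are over-parameterized, Compacter can be used for overcomplete knowledge distillation. Figure 3 illustrates our method. Overall, the LPHM layer reduces complexity further to $\mathcal{O}(k+d)$ (see §4). The LPHM layer can also be seen as leveraging “slow” weights $\bm{A_{i}}$ that are shared across adapters and capture general information and “fast” weights $\bm{B_{i}}$ that learn adapter-specific information for adaptation of each individual layer.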
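To make the construction concrete, the sketch below builds one weight matrix as a sum of Kronecker products between $n$ shared matrices $\bm{A_{i}}$ of size $n\times n$ and $n$ rank-one matrices $\bm{s_{i}}\bm{t_{i}}^{\top}$, i.e. the case $r=1$. It uses plain Python lists; all helper names and the toy sizes are assumptions made for illustration, and the random values stand in for trained weights.

```python
import random


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def outer(s, t):
    """Rank-one matrix s t^T from two vectors."""
    return [[s_i * t_j for t_j in t] for s_i in s]


def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lphm_weight(A, S, T):
    """W = sum_i A_i (x) (s_i t_i^T), with rank r = 1."""
    W = None
    for A_i, s_i, t_i in zip(A, S, T):
        term = kron(A_i, outer(s_i, t_i))
        W = term if W is None else mat_add(W, term)
    return W


k, d, n = 8, 4, 2  # toy sizes; k and d must be divisible by n
rng = random.Random(0)
A = [[[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)] for _ in range(n)]
S = [[rng.gauss(0, 1) for _ in range(k // n)] for _ in range(n)]
T = [[rng.gauss(0, 1) for _ in range(d // n)] for _ in range(n)]

W = lphm_weight(A, S, T)            # k x d weight matrix
n_slow = n ** 3                     # entries in the shared A_i
n_fast = n * (k // n + d // n)      # entries in all s_i and t_i, equal to k + d
```

The resulting $k\times d$ matrix is described by $n^{3}$ shared values plus $k+d$ layer-specific values, instead of $kd$ free entries.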

Compacter Based on the above formulation, we introduce Compacter layers, which replace the down-projection and up-projection layers in adapters as follows:

$$A^{l}=\text{LPHM}^{U^{l}}(\text{GeLU}(\text{LPHM}^{D^{l}}(\bm{x})))+\bm{x},\qquad(6)$$

where the up-projection weights $\text{LPHM}^{U^{l}}$ are computed as in (5), replacing the layer $U^{l}$ in (2). Similarly, down-projection weights $\text{LPHM}^{D^{l}}$ replace the layer $D^{l}$. While the two adapters in each layer of a transformer have their own $\bm{s_{i}}$ and $\bm{t_{i}}$ rank-one weights, we share the $\bm{A_{i}}$ across all layers and positions of the adapter layers.

Parameter Efficiency

In this section, we compare the number of parameters of Compacter with adapters.

Adapter parameters In the standard setting, two adapters are added per layer of a transformer model. Each adapter layer consists of $2kd$ parameters for the down- and up-projection matrices ($\bm{D^{l}}$, $\bm{U^{l}}$), where $k$ is the size of the input dimension and $d$ is the adapter’s bottleneck dimension. The total number of parameters for adapters for a transformer model with $L$ layers in both the encoder and the decoder is, therefore, $2L(2kd)$, which scales linearly with all three variables.

Similarly, employing PHM layers for modeling down and up-projection matrices offers a parameter reduction of almost $\frac{1}{n}$ . Each adapter with a PHM layer has in total $2(\frac{kd}{n}+n^{3})$ parameters. For a Transformer model with $L$ layers, the total number of parameters of PHM-Adapter is $4L(\frac{kd}{n}+n^{3})$ .

Compacter parameters Compacter shares the trained weight matrices $\{\bm{A_{i}}\}_{i=1}^{n}$ in (5) consisting of $n^{3}$ parameters across all layers. Compacter also has two rank-one weights for each adapter, $\bm{s_{i}},\bm{t_{i}}$ in (5) consisting of $\frac{k}{n}+\frac{d}{n}$ parameters, resulting in a total of $2n(\frac{k}{n}+\frac{d}{n})$ parameters for down and up-projection weights. Therefore, the total number of parameters of Compacter is $4L(k+d)+n^{3}$ for a transformer with $L$ layers in the encoder and decoder.

In settings with a large number of layers, the dominant term is $4L(k+d)$. Therefore, with a mild condition that $4L(k+d)>n^{3}$, Compacter has a complexity of $\mathcal{O}(k+d)$, which is far more efficient compared to adapters’ $\mathcal{O}(kd)$ and PHM-Adapter’s $\mathcal{O}(\frac{kd}{n})$ complexity respectively. In settings where $n$ is large, the number of parameters for the shared weight matrices $\{\bm{A_{i}}\}_{i=1}^{n}$ remains constant in Compacter, with a total of $n^{3}$ parameters for all layers, while this scales linearly with the number of layers $L$ for PHM and adapter layers. As an example, in the T5BASE model with 222M parameters, Compacter only learns $0.047\%$ of the parameters, and maintains comparable performance to full fine-tuning.
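The three closed-form counts from this section can be compared directly. The sketch below evaluates them for $k=768$ (T5BASE), $d=24$ and $n=4$; taking $L=12$ is an illustrative choice, and the function names are hypothetical. These formulas count projection weights only, so they do not include layer normalization or bias parameters.

```python
def adapter_params(k, d, L):
    """2L(2kd): down- and up-projection weights of standard adapters."""
    return 2 * L * (2 * k * d)


def phm_adapter_params(k, d, L, n):
    """4L(kd/n + n^3): PHM-Adapter weights."""
    return 4 * L * (k * d // n + n ** 3)


def compacter_params(k, d, L, n):
    """4L(k + d) + n^3: per-layer rank-one weights plus shared A_i."""
    return 4 * L * (k + d) + n ** 3


k, d, L, n = 768, 24, 12, 4
counts = {
    "Adapter": adapter_params(k, d, L),
    "PHM-Adapter": phm_adapter_params(k, d, L, n),
    "Compacter": compacter_params(k, d, L, n),
}
for name, value in counts.items():
    print(f"{name:12s} {value:>9,d}")
```

Under these assumptions the shared $n^{3}$ term contributes only 64 of Compacter's weights, which is why the $4L(k+d)$ term dominates.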

Experiments

Datasets Following Raffel et al., we evaluate the performance of the methods on the GLUE and SuperGLUE benchmarks. These benchmarks cover multiple tasks of paraphrase detection (MRPC, QQP), sentiment classification (SST-2), natural language inference (MNLI, RTE, QNLI, CB), linguistic acceptability (CoLA), question-answering (MultiRC, ReCoRD, BoolQ), word sense disambiguation (WiC), and sentence completion (COPA). Following Raffel et al. and Devlin et al., and as is common practice, we do not experiment with WNLI due to its adversarial nature with respect to the training set. As the original test sets are not publicly available, we follow Zhang et al. and split off 1k samples from the training set that we use for validation, while we use the original validation data as the test set. For datasets with fewer than 10k samples (RTE, MRPC, STS-B, CoLA, COPA, WiC, CB, BoolQ, MultiRC), we divide the original validation set in half, using one half for validation and the other for testing.
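The splitting rule described above can be sketched as follows. This is only an illustration: the function name, the shuffling with a fixed seed, and measuring dataset size on the training split are assumptions, since the text does not specify them.

```python
import random


def make_splits(train, validation, small_threshold=10_000, held_out=1_000, seed=0):
    """Return train/validation/test splits following the rule in the text."""
    rng = random.Random(seed)
    if len(train) < small_threshold:
        # Small dataset: halve the original validation set.
        val = list(validation)
        rng.shuffle(val)
        half = len(val) // 2
        return {"train": list(train), "validation": val[:half], "test": val[half:]}
    # Large dataset: hold out 1k training samples for validation and
    # use the original validation set as the test set.
    tr = list(train)
    rng.shuffle(tr)
    return {"train": tr[held_out:], "validation": tr[:held_out], "test": list(validation)}


small = make_splits(list(range(100)), list(range(20)))
large = make_splits(list(range(12_000)), list(range(500)))
print({name: len(rows) for name, rows in small.items()})
print({name: len(rows) for name, rows in large.items()})
```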

Experimental details We use the state-of-the-art encoder-decoder T5 model as the underlying model for all methods in our experiments. For computational efficiency, we report all results on T5BASE models (12 encoder and decoder layers and 222M parameters). We use its HuggingFace PyTorch implementation . We fine-tune all methods for 3 epochs on large datasets and 20 epochs for low-resource datasets of GLUE (MRPC, CoLA, STS-B, RTE, BoolQ, CB, COPA, WiC) to allow the models to converge . For all adapter-based methods, we experiment with adapters of bottleneck size of $\{96,48,24\}$ . We save a checkpoint every epoch for all models and report the results for the hyper-parameters performing the best on the validation set for each task. For the PHM layers, we use the PyTorch implementation of Le et al. . We include low-level details in Appendix A. For our methods, we experiment with $n=\{4,8,12\}$ and report the model performing the best. We include the results for all values of $n$ in Appendix B.

Following Mahabadi et al., we freeze the output layer of the pretrained model for all tasks across all methods. This is much more efficient as the output layer includes 11.1% of the parameters of T5BASE. Tasks are formulated in a text-to-text format so the model can be applied to them without learning a new output layer. We note that this is in contrast to the original adapter setting, which used an encoder-only masked PLM. We show the results with fine-tuning the output layer in Appendix C. Following Houlsby et al., we update the layer normalization parameters for all methods where applicable. For BitFit, we only update the biases. For Prompt Tuning, the entire model is frozen.

We compare against several recently proposed parameter-efficient fine-tuning methods:

T5BASE We compare our method to the standard practice of fine-tuning T5, where we fine-tune all parameters of the model on each individual task.

Adapter We compare to a strong adapter baseline , which adds adapters for each task after the feed-forward and attention modules in each transformer block of T5.

Pfeiffer-Adapter Pfeiffer et al. propose a more efficient adapter variant, which keeps only one of the adapters in each layer for better training efficiency. We experimented with keeping either adapter and found keeping the adapter after the self-attention module in each layer to perform the best.

Adapter-LowRank We parameterize each adapter’s weight as a product of two rank-one weights.

Prompt Tuning Prompt tuning is the successor variant of Li and Liang , which prepends a randomly initialized continuous prompt to the input (Prompt Tuning-R). We also compare to a variant, which initializes prompts using token embeddings of the pretrained language model’s vocabulary (Prompt Tuning-T) .

AdapterDrop We apply the method of Rücklé et al. , which drops the adapters from lower transformer layers for a better training efficiency to T5 with Adapter. Consequently, we drop adapters from the first five layers of both the encoder and the decoder in T5BASE.

BitFit Cai et al. propose to freeze the weights and only train the biases. By not storing intermediate activations, this method enables substantial memory savings. Ravfogel et al. study a similar method for PLMs that fine-tunes only the biases and the final output layer. Note that in the HuggingFace T5 implementation, the biases in layer normalizations, linear layers, the output layer and self-attention layers are removed. We re-introduce these biases for BitFit.

5.2 Our Methods

PHM-Adapter We learn the weights of adapters using PHM layers as in (4). To our knowledge, we are the first who exploit the idea of PHM for efficient fine-tuning of large-scale language models.

Compacter We learn adapter weights using LPHM layers as described in (5). We also explore a variant where we only keep the Compacter layer after the feed-forward layer in each transformer block (Compacter++). We found this to slightly outperform keeping the Compacter layer after the self-attention layer instead.

5.3 Results on the GLUE Benchmark

Table 1 shows the results on GLUE with T5BASE (see Appendix E for results on T5SMALL). Compacter and Compacter++ outperform all previous parameter-efficient methods and perform on par with full fine-tuning while only training 0.073% and 0.047% of parameters respectively. We now discuss the different methods in detail.

Adapter-based methods For Adapter, not fine-tuning the classifier hurts the performance substantially (85.78 versus 86.48; cf. Appendix C). Pfeiffer-Adapter, which adds adapters only after the self-attention module outperforms the standard Adapter while being more parameter-efficient. AdapterDrop obtains lower performance than fine-tuning, demonstrating that adapting the lower layers of an encoder-decoder T5 model is important for its performance. Additionally, Adapter-LowRank is not expressive enough to perform well on this benchmark.

Prompt tuning and BitFit For Prompt Tuning, we observe high sensitivity to initialization and learning rate, as also reported in prior work. We experimented with multiple random seeds but performance lags behind fine-tuning substantially, in particular on low-resource datasets. This can be explained by the low flexibility of such methods as all the information needs to be contained in the prefixes. As a result, the method only allows limited interaction with the rest of the model and good performance requires very large models. In addition, increasing the sequence length leads to memory overhead (see §5.5) and the number of prompt tokens is limited by the number of tokens that can fit in the model’s maximum input length, which makes such methods less flexible and unsuitable for dealing with large contexts. Similarly, BitFit performs worse than fine-tuning, especially on low-resource datasets.

Intrinsic-SAID Interestingly, the average performance of Intrinsic-SAID, which fine-tunes only 0.009% of a model’s parameters is only 1.05 points below the fine-tuning baseline. However, this method has two practical drawbacks: a) storing the random projection matrices results in a substantial memory overhead; b) it is very slow to train (see §5.5). Despite this, Intrinsic-SAID provides insights regarding the effectiveness of low-rank optimization of pretrained language models , which motivates the development of parameter-efficient methods such as Compacter.

Compacter For our proposed methods, we observe fine-tuning the output layer for both PHM-Adapter and Compacter++ does not provide much performance difference (see Appendix C). PHM-Adapter reduces the parameters of Adapter from 0.83% to 0.179% (with $n=12$ ), being 4.64 $\times$ more parameter-efficient. Compacter reduces the number of parameters to the remarkable rate of 0.073% while obtaining comparable results to full fine-tuning. By removing the Compacter layer after self-attention, Compacter++ obtains similar performance, while reducing the parameters to 0.047%. Adaptation without updating the layer normalization can be a promising direction to reduce the parameters further, for instance by building on recent advances in normalization-free models , which we leave to future work.

5.4 Results on the SuperGLUE Benchmark

Table 2 shows the performance of the methods on SuperGLUE . We include the results for all values of $n$ in Appendix D. We observe a similar pattern as on GLUE in Table 1. Compacter and Compacter++ perform substantially better compared to other parameter-efficient fine-tuning methods and even outperform full fine-tuning while only training 0.073% and 0.048% of the parameters.

5.5 Efficiency Evaluation

In this section, we compare the efficiency of our proposed methods with various recently proposed parameter-compact fine-tuning methods under the same computation budget. To this end, we train all methods for 1 epoch on the MNLI dataset. For each method, we select the largest batch size that fits a fixed budget of the GPU memory (24 GB). For all adapter-based methods, we fix the adapter size to 24. For Prompt Tuning, we set the number of prefix tokens to 100. For Intrinsic-SAID, we set $d^{\prime}=1400$ . Finally, we set $n=4$ . In Table 3, we report the percentage of trained parameters per task, training time per epoch, and memory usage of each method. Moreover, Figure 2 shows the trade-off between quantitative performance, percentage of trained parameters, and memory footprint.

Our approaches have several attractive properties. Based on our analysis in Table 1, Compacter and Compacter++ obtain the best combination of a high GLUE score averaged across all tasks and a substantially lower number of parameters (0.073% and 0.047% respectively). In addition to performing well, Compacter++ has the second-lowest memory requirement among all methods, reducing memory usage by 41.94% compared to T5BASE. Compacter and Compacter++ also reduce training time substantially, by 13.41% and 26.51% relative to T5BASE. On the other hand, BitFit, by not storing intermediate activations, has the lowest memory requirement (64.2% lower than T5BASE) and is the fastest (35.06% less training time than T5BASE), at the cost of lower quantitative performance (1.53 points lower; see Table 1).

Methods relying on pruning adapters, i.e., Pfeiffer-Adapter and AdapterDrop, reduce the memory overhead and improve training time. However, they have almost an order of magnitude more parameters than Compacter++, with 9.1 $\times$ and 10.5 $\times$ more parameters respectively. Moreover, although Pfeiffer-Adapter performs on par with full fine-tuning with a slight degradation (Table 1), AdapterDrop obtains lower performance (0.65 points lower on average across all tasks). We note that dropping adapters from transformer layers is a general technique and could be applied to Compacter for improving efficiency even further, which we leave to future work. Similarly, although Adapter-LowRank reduces the memory overhead and improves the training time, it obtains lower performance (0.68 points lower on average across all tasks; see Table 1).

At the other end of the spectrum, Intrinsic-SAID and Prompt Tuning methods have the lowest number of parameters. However, they both come with high memory overhead (41.14% and 24.42% relative to full fine-tuning (T5BASE) respectively), are the slowest to train, and their performance substantially lags behind full fine-tuning (see Table 1). For Prompt Tuning, high memory costs are due to the fact that the computational complexity of self-attention, which requires storing the full attention matrix for gradient computation, scales quadratically with the sequence length. For Intrinsic-SAID, the high memory requirement is due to storing large random projection matrices, which limits the application of Intrinsic-SAID for fine-tuning large-scale PLMs. Moreover, computing projections via the Fastfood transform, although theoretically possible in $\mathcal{O}(D\log d^{\prime})$, is slow in practice even with a CUDA implementation. For pretrained language models with a large number of parameters, allocating random projections for the full parameter space is intractable. While using the Fastfood transform partially ameliorates this issue by reducing the memory usage from $\mathcal{O}(Dd^{\prime})$ to $\mathcal{O}(D)$, the memory issue with such methods remains unresolved.
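The quadratic dependence on sequence length can be illustrated with a small back-of-the-envelope calculation. The sketch below counts the entries of one attention score matrix with and without the 100 prompt tokens used in this section; the input length of 128 tokens and the function name are hypothetical, chosen only for the example.

```python
def attention_entries(seq_len, num_prompt_tokens=0):
    """Entries in one (sequence x sequence) self-attention score matrix."""
    total = seq_len + num_prompt_tokens
    return total * total


base = attention_entries(128)              # hypothetical 128-token input
with_prompt = attention_entries(128, 100)  # same input plus 100 prompt tokens
ratio = with_prompt / base
print(base, with_prompt, round(ratio, 2))
```

Under these assumptions, prepending the prompt roughly triples the size of each attention matrix that has to be kept for the backward pass, even though no additional model weights are trained.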

Overall, given the size of large-scale transformer models with millions and billions of parameters, such as T5 , efficient memory usage is of paramount importance for practical applications. Compacter and Compacter++ offer a great trade-off in terms of performance, memory usage, and training time. With regard to our inspiration of von Neumann’s quotation, we thus find that only a comparatively small number of additional parameters are necessary for the practical and efficient adaptation of PLMs.

5.6 Low-resource Fine-tuning

Compacter++ has substantially fewer parameters compared to T5BASE. In this section, we investigate whether this could help Compacter++ to generalize better in resource-limited settings. We subsample each dataset of GLUE for varying sizes in the range $\{100,500,1000,2000,4000\}$ . Figure 4 shows the results. Compacter++ substantially improves the results in the low-resource setting, indicating more effective fine-tuning in this regime.

Related Work

Adapters Adapters have recently emerged as a new paradigm for fine-tuning pretrained language models . In another line of work, Üstün et al. proposed a multilingual dependency parsing method based on adapters and contextual parameter generator networks , where they generate adapter parameters conditioned on trained input language embeddings. This, however, leads to a large number of additional parameters compared to the base model. Contemporaneously, Mahabadi et al. use a single compact hypernetwork allowing to generate adapter weights efficiently conditioned on multiple tasks and layers of a transformer model. Pilault et al. also proposed a task-conditioned transformer for multi-task learning which is less parameter-efficient. The aforementioned work is complementary to Compacter, and one could potentially combine Compacter with contextual parameter generation to generate adapter modules. Compared to Mahabadi et al. , Compacter++ reduces the parameters by 6.2 $\times$ .

Hypercomplex representations Deep learning advances in the hypercomplex domain are in a nascent stage, and most work is fairly recent . Replacing matrix multiplications in standard networks with Hamilton products that have fewer degrees of freedom offers up to a 4 $\times$ saving of parameter size in a single multiplication operation . Very recently, Zhang et al. extend such methods in a way that they could reduce the parameters of a fully connected layer under a mild condition to $1/n$ , where $n$ is a user-specified parameter. To the best of our knowledge, there is no previous work that attempts to leverage the hypercomplex space for efficient fine-tuning of large-scale language models.

Other parameter-efficient models Li et al. and Aghajanyan et al. study training models in a low-dimensional randomly oriented subspace instead of their original parameter space. Another recent line of work has shown that pretrained models such as BERT are redundant in their capacity, allowing for significant sparsification without much degradation in end metrics . Such methods, however, remain not well supported by current hardware and often perform worse compared to dedicated efficient architectures .

Conclusion

We have proposed Compacter, a light-weight fine-tuning method for large-scale language models. Compacter generates weights by summing Kronecker products between shared “slow” weights and “fast” rank-one matrices, specific to each Compacter layer. Leveraging this formulation, Compacter reduces the number of parameters in adapters substantially from $\mathcal{O}(kd)$ to $\mathcal{O}(k+d)$ . Through extensive experiments, we demonstrate that despite learning 2127.66 $\times$ fewer parameters than standard fine-tuning, Compacter obtains comparable or better performance in a full-data setting and outperforms fine-tuning in data-limited scenarios.

Acknowledgements

We are grateful to Dani Yogatama for feedback on a draft of this manuscript. The authors would like to thank Tuan Le for his assistance in reproducing the results of Zhang et al. We would also like to thank Armen Aghajanyan for his assistance in reproducing the results of his work. We thank Jue Wang for his comments on an earlier version of this paper. The authors are grateful to Brian Lester, Rami Al-Rfou, Noah Constant, and Mostafa Dehghani for their assistance. Rabeeh Karimi Mahabadi was supported by the Swiss National Science Foundation under the project Learning Representations of Abstraction for Opinion Summarization (LAOS), grant number “FNS-30216”.


Appendix A Experimental Details

We run all experiments on the standard GLUE benchmark, released under a Creative Commons license (CC BY 4.0), and the SuperGLUE benchmark. These benchmarks consist of multiple datasets: CoLA, SST-2, MRPC, QQP (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs), STS-B, MNLI, QNLI, and RTE, which is a combination of data from RTE1, RTE2, RTE3, and RTE5; as well as COPA, CB, MultiRC, ReCoRD, BoolQ, and WiC, where sentences are selected from VerbNet, WordNet, and Wiktionary. We download all datasets from the HuggingFace Datasets library.

Low-resource fine-tuning

For the experiment conducted in §5.6, we set the number of epochs to 1000, 200, 100, 50, 25, for datasets subsampled to size 100, 500, 1000, 2000, and 4000 respectively. Based on our results, this is sufficient to allow the models to converge. We save a checkpoint every 250 steps for all models and report the results for the hyper-parameters performing the best on the validation set for each task.
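One way to read this schedule is that the number of epochs is inversely proportional to the subsample size, so every run processes the same total number of training examples. The short check below makes this explicit; it is an observation about the numbers above rather than something the text states, and the variable names are illustrative.

```python
# Epochs used for each subsample size, as listed in the text.
epochs_by_size = {100: 1000, 500: 200, 1000: 100, 2000: 50, 4000: 25}

# Total examples processed per run = subsample size * number of epochs.
examples_seen = {size: size * epochs for size, epochs in epochs_by_size.items()}

for size, total in examples_seen.items():
    print(f"{size:>5d} samples x {epochs_by_size[size]:>4d} epochs = {total:,d} examples")
```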

Data pre-processing:

Following Raffel et al. , we cast all datasets into a sequence-to-sequence format. We recast STS-B as a 21-class classification task by rounding its target scores to their nearest increment of 0.2.
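The STS-B recasting can be sketched as below. The function name is hypothetical, rendering the label as a one-decimal string is an assumption about the text-to-text target, and the 21 classes are enumerated under the assumption that scores lie between 1 and 5. How exact ties are broken is not specified in the text; Python's built-in `round` resolves them to the nearest even integer.

```python
def stsb_label(score):
    """Round a similarity score to the nearest multiple of 0.2, as a string."""
    return f"{round(score * 5) / 5:.1f}"


# Assumed label set: 1.0, 1.2, ..., 5.0.
classes = [f"{i / 5:.1f}" for i in range(5, 26)]

print(stsb_label(2.57), stsb_label(4.11), len(classes))
```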

Computing infrastructure:

We run the experiments in Tables 1, 2, 3, and 9 on one Nvidia GeForce RTX 3090 GPU, and the experiments in §5.6 on one GeForce GTX 1080 Ti GPU.

Training hyper-parameters:

Appendix B Impact of Hyper-parameters

In this section, we study the impact of hyper-parameters for each method reported in Table 1. We report the results in Table 6.

Impact of dimension ($d^{\prime}$) on Intrinsic-SAID Increasing the dimension $d^{\prime}$ for the Intrinsic-SAID method often improves results. However, as discussed in prior work, $d^{\prime}$ is task-dependent and therefore needs to be tuned for every new dataset to achieve optimal performance.

Table 6 shows the results for varying values of $n=\{4,8,12\}$ . We experiment with adapters of bottleneck size $d\in\{24,48,96\}$ .

For the T5BASE model with $k=768$ , the condition $kd>n^{4}$ discussed in §4 is partially satisfied for $d=24$ and $n=4,8$ and fully satisfied for $d\in\{48,96\}$ and $n=4,8,12$ . Note that this condition is satisfied for larger versions of the T5 model, i.e., T5-large (770 million parameters, $k=1024$ ), T5-3B (2.8 billion parameters, $k=1024$ ), and T5-11B (11 billion parameters, $k=1024$ ) with adapter hidden size $d\in\{24,32,48,96\}$ and $n=2,4,8,12$ . Due to the huge computational costs of training these models, we could not run experiments on such a large scale. Nevertheless, we observe substantial parameter reduction using PHM-Adapter.
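The condition $kd>n^{4}$ is easy to check for the configurations listed above. The sketch below does so for $k=768$ and $k=1024$; the function and variable names are illustrative.

```python
def phm_condition_holds(k, d, n):
    """True when kd > n^4, i.e. the kd/n term dominates the n^3 term."""
    return k * d > n ** 4


# T5BASE: k = 768, bottleneck sizes and n values used in the experiments.
base = {(d, n): phm_condition_holds(768, d, n)
        for d in (24, 48, 96) for n in (4, 8, 12)}

# Larger T5 variants as listed in the text: k = 1024.
larger = {(d, n): phm_condition_holds(1024, d, n)
          for d in (24, 32, 48, 96) for n in (2, 4, 8, 12)}

failing = [cfg for cfg, ok in base.items() if not ok]
print("T5BASE configurations violating kd > n^4:", failing)
print("all k=1024 configurations satisfy it:", all(larger.values()))
```

For $k=768$ the only violating configuration is $d=24$, $n=12$, since $768\cdot 24=18432<12^{4}=20736$.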

In Table 6, we report the number of parameters for $d=24$ for all methods. Compared to Adapter, PHM-Adapter with $n=8$ reduces the parameters substantially by 5.2 $\times$ .

Impact of $n$ on Compacter:

For Compacter and Compacter++, we observe that the number of trainable parameters is almost constant across different values of $n$ . This is due to the fact that the number of trainable parameters in layernorms (LN) and biases (B) in each LPHM layer make up a high proportion of parameters for our methods. For instance for $n=4$ , for Compacter with 0.073% of trainable parameters, LN and B make up 28.49% and 23.51% respectively of its trainable parameters; for Compacter++ with 0.047% of trainable parameters, LN and B make up 44.01% and 18.15% respectively of its parameters; while for PHM-Adapter with 0.239% of trainable parameters, LN and B make up only 8.63% and 7.12% respectively of its parameters. Consequently, simply removing biases from adapters, and exploring ideas of training language models without layer normalizations can be promising directions on reducing parameters further, which we leave to future work.

Compacter has more than an order of magnitude fewer parameters compared to Adapter, with a parameter reduction at a remarkable rate of 11.4 $\times$ . Compacter++ even reduces the parameters further by 17.7 $\times$ in total.
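These reduction factors follow directly from the percentages of trained parameters reported in the paper (0.83% for Adapter, 0.179% for PHM-Adapter with $n=12$, 0.073% for Compacter, and 0.047% for Compacter++). The arithmetic is reproduced below; the variable names are illustrative.

```python
# Percentage of trained parameters per method, as reported in the text.
adapter_pct = 0.83
phm_adapter_pct = 0.179   # PHM-Adapter with n = 12
compacter_pct = 0.073
compacter_pp_pct = 0.047

reductions = {
    "PHM-Adapter vs Adapter": round(adapter_pct / phm_adapter_pct, 2),
    "Compacter vs Adapter": round(adapter_pct / compacter_pct, 1),
    "Compacter++ vs Adapter": round(adapter_pct / compacter_pp_pct, 1),
    "Compacter++ vs full fine-tuning": round(100 / compacter_pp_pct, 2),
}
for name, factor in reductions.items():
    print(f"{name}: {factor}x")
```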

Appendix C Results with Fine-tuning the Output Layer

Table 7 shows the results for the methods in Table 1 with fine-tuning the output layer. The parameters of the output layer dominate the parameters of each method and thus reduce the relative parameter savings. The standard adapter obtains the largest improvement in performance when fine-tuning the output layer compared to the results in Table 1. In contrast, our proposed methods perform well with or without fine-tuning the output layer.

Appendix D Results on SuperGLUE

Table 8 shows the performance of our proposed methods on SuperGLUE for different values of $n$ . We include the learning rate obtaining the best validation performance for all methods reported in Table 2 in Table 11.

Appendix E Impact of Model Size

Table 9 shows the results of methods using T5SMALL (60M parameters) on GLUE. For all adapter-based methods, we experiment with adapters of bottleneck size of $\{16,32,64\}$ . For our methods, we experiment with $n=\{4,8,16\}$ .

All parameter-efficient fine-tuning methods perform worse than full fine-tuning with this small model size. This is in contrast to the results of Tables 1 and 2, where some parameter-efficient fine-tuning methods were able to perform on par with or outperform full fine-tuning with the larger model size of T5BASE (222M parameters). Among all methods, adapters and our proposed methods perform the best. We report the learning rate performing the best on the validation set of each method in Table 11.