Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, Colin Raffel

Introduction

Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest – i.e., using a pre-trained language model for initialization often produces better results with less labeled data. A historically common approach has been to use the pre-trained model’s parameters for initialization before performing gradient-based fine-tuning on a downstream task of interest. While fine-tuning has produced many state-of-the-art results, it results in a model that is specialized for a single task with an entirely new set of parameter values, which can become impractical when fine-tuning a model on many downstream tasks.

An alternative approach that has recently been popularized is in-context learning (ICL), which induces a model to perform a downstream task by inputting prompted examples. Few-shot prompting converts a small collection of input-target pairs into (typically) human-understandable instructions and examples, along with a single unlabeled example for which a prediction is desired. Notably, ICL requires no gradient-based training and therefore allows a single model to immediately perform a wide variety of tasks. Performing ICL therefore relies solely on the capabilities that a model learned during pre-training. These characteristics have led to a great deal of recent interest in ICL methods.

Despite the practical benefits of ICL, it has several major drawbacks. First, processing all prompted input-target pairs every time the model makes a prediction incurs significant compute costs. Second, ICL typically produces inferior performance compared to fine-tuning. Finally, the exact formatting of the prompt (including the wording and ordering of examples) can have significant and unpredictable impact on the model’s performance, far beyond inter-run variation of fine-tuning. Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all.

An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number of added or selected parameters. Recent methods have matched the performance of fine-tuning the full model while only updating or adding a small fraction (e.g. 0.01%) of the full model’s parameters. Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently, making both PEFT and ICL viable for multitask models.

While the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available. Our primary goal in this paper is to close this gap by proposing a recipe – i.e., a model, a PEFT method, and a fixed set of hyperparameters – that attains strong performance on novel, unseen tasks while only updating a tiny fraction of the model’s parameters. Specifically, we base our approach on the T0 model, a variant of T5 fine-tuned on a multitask mixture of prompted datasets. To improve performance on classification and multiple-choice tasks, we add unlikelihood and length normalization-based loss terms. In addition, we develop (IA)3, a PEFT method that multiplies intermediate activations by learned vectors. (IA)3 attains stronger performance than full-model fine-tuning while updating up to 10,000$\times$ fewer parameters. Finally, we demonstrate the benefits of pre-training the (IA)3 parameters before fine-tuning. Our overall recipe, which we dub “T-Few”, performs significantly better than ICL (even against $16\times$ larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT while requiring dramatically less compute and allowing for mixed-task batches during inference. To facilitate the use of T-Few on new problems and future research on PEFT, we release our code.

After providing background on ICL and PEFT in the following section, we discuss the design of T-Few in section 3. In section 4, we present experiments comparing T-Few to strong ICL baselines. Finally, we discuss related work in appendix B and conclude in section 5.

Background

In this section, we provide an overview of ICL and PEFT with a focus on characterizing the computation, memory, and on-disk storage costs of making a prediction. Real-world costs depend on implementation and hardware, so we report costs in terms of FLOPs for computation and bytes for memory and storage. Additional related work is discussed in appendix B.

ICL aims to induce a model to perform a task by feeding in concatenated and prompted input-target examples (called “shots”) along with an unlabeled query example. Taking the cycled letter task from Brown et al. as an example, a 4-shot input or context would be “Please unscramble the letters into a word, and write that word: asinoc = casino, yfrogg = froggy, plesim = simple, iggestb = biggest, astedro =”, for which the desired output would be “roasted”. ICL induces an autoregressive language model to perform this task by feeding in the context and sampling from the model. For classification tasks, each label is associated with a string (e.g. “positive” and “negative” for sentiment analysis) and a label is assigned by choosing the label string to which the model assigns the highest probability. For multiple-choice tasks (e.g. choosing between $N$ possible answers to a question), the model’s prediction is similarly the choice that is assigned the highest probability.
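To make the construction of the context concrete, the following is a minimal sketch of how the 4-shot cycled letter context above could be assembled from input-target pairs. The function name `build_icl_context` and the "input = target" joining format are illustrative assumptions that reproduce the quoted example; they are not taken from any particular ICL implementation.

```python
def build_icl_context(instruction, shots, query):
    """Concatenate an instruction, k solved shots, and one unsolved query.

    `shots` is a list of (input, target) pairs; the query is left open
    so that the model's continuation is read as the prediction.
    """
    solved = [f"{source} = {target}" for source, target in shots]
    return f"{instruction} " + ", ".join(solved + [f"{query} ="])


# The four shots and the query from the cycled letter example.
shots = [
    ("asinoc", "casino"),
    ("yfrogg", "froggy"),
    ("plesim", "simple"),
    ("iggestb", "biggest"),
]
context = build_icl_context(
    "Please unscramble the letters into a word, and write that word:",
    shots,
    "astedro",
)
print(context)
```

Every prediction requires a context of this form, so its length grows with the number of shots.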

The primary advantage of ICL is that it enables a single model to perform many tasks immediately without fine-tuning. This also enables mixed-task batches, where different examples in a batch of data correspond to different tasks by using different contexts in the input. ICL is also typically performed with only a limited number of labeled examples – called few-shot learning – making it data-efficient.

Despite these advantages, ICL comes with significant practical drawbacks. First, making a prediction is dramatically more expensive because the model needs to process all of the in-context labeled examples. Specifically, ignoring the quadratic complexity of self-attention operations in Transformer language models (which are typically small compared to the costs of the rest of the model), processing the $k$ training examples for $k$-shot ICL increases the computational cost by approximately $k+1$ times compared to processing the unlabeled example alone. Memory costs similarly scale approximately linearly with $k$, though during inference the memory costs are typically dominated by storing the model’s parameters. Separately, there is a small amount of on-disk storage required for storing the in-context examples for a given task. For example, storing $32$ examples for a task where the prompted input and target for each example is $512$ tokens long would require about $66$ kilobytes of storage on disk ($32$ examples $\times\;512$ tokens $\times\;32$ bits).
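The two estimates in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that every example has the same length and that each token is stored as a 32-bit integer; the function names are illustrative.

```python
def icl_compute_multiplier(k):
    """Approximate compute of k-shot ICL relative to the query alone.

    Assumes all k shots and the query have similar length and ignores
    the quadratic cost of self-attention.
    """
    return k + 1


def context_storage_bytes(num_examples, tokens_per_example, bits_per_token=32):
    """On-disk size of the tokenized in-context examples, in bytes."""
    return num_examples * tokens_per_example * bits_per_token // 8


storage = context_storage_bytes(num_examples=32, tokens_per_example=512)
print(icl_compute_multiplier(32))   # 33x the cost of the unlabeled example
print(storage, "bytes")             # 65536 bytes, i.e. about 66 kilobytes
```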

Beyond the aforementioned costs, ICL also exhibits unintuitive behavior. Zhao et al. showed that the ordering of examples in the context heavily influences the model’s predictions. Min et al. showed that ICL can still perform well even if the labels of the in-context examples are swapped (i.e. made incorrect), which raises questions about whether ICL is really “learning” from the labeled examples.

2.2 Parameter-efficient fine-tuning

While standard fine-tuning updates all parameters of the pre-trained model, it has been demonstrated that it is possible to instead update or add a relatively small number of parameters. Early methods proposed adding adapters, which are small trainable feed-forward networks inserted between the layers in the fixed pre-trained model. Since then, various sophisticated PEFT methods have been proposed, including methods that choose a sparse subset of parameters to train, produce low-rank updates, perform optimization in a lower-dimensional subspace, add low-rank adapters using hypercomplex multiplication, and more. Relatedly, prompt tuning and prefix tuning concatenate learned continuous embeddings to the model’s input or activations to induce it to perform a task; this can be seen as a PEFT method. State-of-the-art PEFT methods can match the performance of fine-tuning all of the model’s parameters while updating only a tiny fraction (e.g. 0.01%) of the model’s parameters.

PEFT drastically reduces the memory and storage requirements for training and saving the model. In addition, certain PEFT methods straightforwardly allow mixed-task batches – for example, prompt tuning enables a single model to perform many tasks simply by concatenating different prompt embeddings to each example in the batch. On the other hand, PEFT methods that re-parameterize the model are costly or onerous for mixed-task batches. Separately, different PEFT methods increase the computation and memory required to perform inference by different amounts. For example, adapters effectively add additional (small) layers to the model, resulting in small but non-negligible increases in computational costs and memory. An additional cost incurred by PEFT is the cost of fine-tuning itself, which must be performed once and is then amortized as the model is used for inference. However, we will show that PEFT can be dramatically more computationally efficient when considering both fine-tuning and inference while achieving better accuracy than ICL.

Designing the T-Few Recipe

Given that PEFT allows a model to be adapted to a new task with relatively small storage requirements and computational cost, we argue that PEFT presents a promising alternative to ICL. Our goal is therefore to develop a recipe that allows a model to attain high accuracy on new tasks with limited labeled examples while allowing mixed-task batches during inference and incurring minimal computational and storage costs. By recipe, we mean a specific model and hyperparameter setting that provides strong performance on any new task without manual tuning or per-task adjustments. In this way, we can ensure that our approach is a realistic option in few-shot settings where limited labeled data is available for evaluation.

As a first step, we must choose a pre-trained model. Ideally, the model should attain high performance on new tasks after fine-tuning on a limited number of labeled examples. In preliminary experiments applying PEFT methods to different pre-trained models, we attained the best performance with T0. T0 is based on T5, an encoder-decoder Transformer model that was pre-trained via a masked language modeling objective on a large corpus of unlabeled text data. T0 was created by fine-tuning T5 on a multitask mixture of datasets in order to enable zero-shot generalization, i.e. the ability to perform tasks without any additional gradient-based training. Examples in the datasets used to train T0 were prompted by applying the prompt templates from the Public Pool of Prompts (P3), which convert each example in each dataset to a prompted text-to-text format where each label corresponds to a different string. For brevity, we omit a detailed description of T0 and T5; interested readers can refer to Sanh et al. and Raffel et al. T0 was released in three billion and eleven billion parameter variants, referred to as “T0-3B” and simply “T0” respectively. In this section (where our goal is to design the T-Few recipe through extensive experimentation), we use T0-3B to reduce computational costs. For all models and experiments, we use Hugging Face Transformers.

While T0 was designed for zero-shot generalization, we will demonstrate that it also attains strong performance after fine-tuning with only a few labeled examples. To test T0’s generalization, Sanh et al. chose a set of tasks (and corresponding datasets) to hold out from the multitask training mixture – specifically, sentence completion (COPA, H-SWAG, and Story Cloze datasets), natural language inference (ANLI, CB, and RTE), coreference resolution (WSC and Winogrande), and word sense disambiguation (WiC). Evaluation of generalization capabilities can then be straightforwardly done by measuring performance on these held-out datasets. We will also later test T-Few’s abilities on the RAFT benchmark in section 4.3, a collection of unseen “real-world” few-shot tasks with no validation set and a held-out test set. ANLI, WiC, and WSC are licensed under a Creative Commons License. Winogrande is licensed under an Apache license. COPA is under a BSD-2 Clause license. We could not find the licenses of RTE and CB, but they are part of SuperGLUE, which states that the datasets are allowed for use in a research context.

To ease comparison, we use the same number of few-shot training examples for each dataset as Brown et al., which varies from 20 to 70. Unfortunately, the few-shot dataset subsets used by Brown et al. have not been publicly disclosed. To allow for a more robust comparison, we therefore constructed five few-shot datasets by sampling subsets with different seeds and report the median and interquartile range. We prompt examples from each dataset using the prompt templates from P3 (Bach et al.), using a randomly-sampled prompt template for each example at each step. Unless otherwise stated, we train our model for 1K steps with a batch size of 8 and report performance at the end of training.
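As a rough illustration of this protocol, the sketch below draws five seeded few-shot subsets from a training pool and summarizes per-seed accuracies by their median and interquartile range. The pool, the subset size of 32, and the accuracy values are hypothetical placeholders rather than data from our experiments, and the quartile convention is the default of Python's `statistics.quantiles`, which is an assumption.

```python
import random
from statistics import median, quantiles


def sample_few_shot_subset(examples, k, seed):
    """Draw k training examples without replacement, reproducibly per seed."""
    return random.Random(seed).sample(examples, k)


def median_and_iqr(accuracies):
    """Summarize per-seed accuracies by median and interquartile range."""
    q1, _, q3 = quantiles(accuracies, n=4)
    return median(accuracies), q3 - q1


# Hypothetical pool of 200 example ids; one subset per seed.
pool = list(range(200))
subsets = [sample_few_shot_subset(pool, k=32, seed=s) for s in range(5)]

# Hypothetical per-seed accuracies, for illustration only.
med, iqr = median_and_iqr([0.61, 0.64, 0.63, 0.66, 0.62])
print(med, iqr)
```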

For evaluation, we use “rank classification”, where the model’s log-probabilities for all possible label strings are ranked and the model’s prediction is considered correct if the highest-ranked choice is the correct answer. Rank classification evaluation is compatible with both classification and multiple-choice tasks. Since model performance can vary significantly depending on the prompt template used, we report the median accuracy across all prompt templates from P3 and across few-shot data subsets for each dataset. For all datasets, we report the accuracy on the test set or validation set when the test labels are not public (e.g. SuperGLUE datasets). In the main text, we report median accuracy across the nine datasets mentioned above. Detailed results on each dataset are provided in the appendices.
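Rank classification as described above reduces to an argmax over per-choice scores followed by a comparison with the gold label. The following is a minimal sketch; the scores are hypothetical log-probabilities and the function names are illustrative.

```python
from statistics import median


def rank_classify(choice_logprobs):
    """Return the index of the choice with the highest log-probability."""
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


def accuracy(predictions, gold):
    """Fraction of examples whose top-ranked choice is the correct one."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical scores for three examples with three answer choices each.
scores = [[-2.3, -0.4, -1.1], [-0.2, -3.0, -2.5], [-1.9, -1.7, -0.3]]
predictions = [rank_classify(s) for s in scores]
acc = accuracy(predictions, gold=[1, 0, 1])

# Median over hypothetical accuracies from several prompt templates.
median_acc = median([0.5, 0.7, 0.6])
```

The same routine applies to classification (choices are label strings) and multiple-choice tasks (choices are the candidate answers).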

3.2 Unlikelihood Training and Length Normalization

For evaluation, we use rank classification (described in section 3.1), which depends on both the probability that the model assigns to the correct choice and the probabilities that it assigns to the incorrect choices. To account for this during training, we consider adding an unlikelihood loss, which encourages the model to output lower probabilities for incorrect choices.
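The sketch below illustrates one way such loss terms could be computed from per-token probabilities. It is an assumption-laden illustration, not a reproduction of the exact losses used in T-Few: it assumes a token-level unlikelihood term of the form $-\log(1-p)$ averaged over all tokens of the incorrect choices, and a length-normalized term that scores each choice by its mean token log-probability and applies a softmax cross-entropy across choices. All function names are hypothetical.

```python
import math


def unlikelihood_loss(incorrect_token_probs):
    """Mean of -log(1 - p) over every token of every incorrect choice.

    `incorrect_token_probs` is a list of choices, each a list of the
    probabilities the model assigned to that choice's tokens (each < 1).
    """
    terms = [-math.log(1.0 - p) for choice in incorrect_token_probs for p in choice]
    return sum(terms) / len(terms)


def length_normalized_score(token_logprobs):
    """Average token log-probability, so longer choices are not penalized."""
    return sum(token_logprobs) / len(token_logprobs)


def length_normalized_loss(correct_logprobs, incorrect_logprobs):
    """Softmax cross-entropy over length-normalized scores of all choices."""
    scores = [length_normalized_score(correct_logprobs)]
    scores += [length_normalized_score(c) for c in incorrect_logprobs]
    top = max(scores)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return -(scores[0] - log_norm)


ul = unlikelihood_loss([[0.5, 0.5]])
ln = length_normalized_loss([-1.0, -1.0], [[-1.0, -1.0, -1.0]])
```

In this toy case the correct and incorrect choices have the same per-token log-probability, so the length-normalized term treats them as equally likely even though the incorrect choice is longer.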

3.3 Parameter-efficient fine-tuning with (IA)3

In order to compare favorably to few-shot ICL, we need a PEFT method that has the following properties: First, it must add or update as few parameters as possible to avoid incurring storage and memory costs. Second, it should achieve strong accuracy after few-shot training on new tasks. Finally, it must allow for mixed-task batches, since that is a capability of ICL. In order to easily enable mixed-task batches, a PEFT method should ideally not modify the model itself. Otherwise, each example in a batch would effectively need to be processed by a different model or computational graph. A more convenient alternative is provided by methods that directly modify the activations of the model since this can be done independently and cheaply to each example in the batch according to which task the example corresponds to. Prompt tuning and prefix tuning methods work by concatenating learned vectors to activation or embedding sequences and are therefore examples of activation-modifying PEFT methods that allow for mixed-task batches. However, as we will discuss later, we were unable to attain reasonable accuracy with prompt tuning and found that the more performant PEFT methods did not allow for mixed-task batches. We therefore developed a new PEFT method that meets our desiderata.

(IA)3 makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector. We also note that, in the event that a model will only be used on a single task, the modifications introduced by (IA)3 can also be applied to weight matrices permanently so that no element-wise multiplication is required and the model’s architecture remains unchanged. This is possible because element-wise multiplications performed in (IA)3 always co-occur with a matrix multiplication, and $l\odot Wx=(l\odot W)x$. In this case, our method incurs no additional computational cost compared to the original model.
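The identity $l\odot Wx=(l\odot W)x$ can be checked numerically. The sketch below uses plain Python lists and assumes that $l\odot W$ denotes scaling the $i$-th row of $W$ by $l_i$; the helper names, the toy values, and the task names in the mixed-task example are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rescale_activations(l, h):
    """Element-wise rescaling of an activation vector by a learned vector."""
    return [li * hi for li, hi in zip(l, h)]


def merge_into_weights(l, W):
    """Fold the learned vector into W by scaling row i of W by l[i]."""
    return [[li * w for w in row] for li, row in zip(l, W)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
l = [0.5, 2.0, -1.0]

unmerged = rescale_activations(l, matvec(W, x))  # l * (W x)
merged = matvec(merge_into_weights(l, W), x)     # (l * W) x

# Mixed-task batch: each example looks up its own task vector.
task_vectors = {"task_a": l, "task_b": [1.0, 1.0, 1.0]}
batch = [("task_a", x), ("task_b", x)]
outputs = [rescale_activations(task_vectors[t], matvec(W, xi)) for t, xi in batch]
```

The unmerged form keeps the base weights shared across tasks, which is what allows a batch to mix tasks; the merged form removes the extra multiplication when only one task is served.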

To validate (IA)3, we compare it to a large variety of existing adaptation methods in our setting of fine-tuning T0-3B on few-shot datasets from held-out tasks. Specifically, we compare with 9 strong PEFT methods: BitFit which updates only the bias parameters; Adapters which introduce task-specific layers after the self-attention and position-wise feed-forward networks; Compacter and Compacter++ which improve upon adapters by using low-rank matrices and hypercomplex multiplication; prompt tuning which learns task-specific prompt embeddings that are concatenated to the model’s input; FISH Mask which chooses a subset of parameters to update based on their approximate Fisher information; Intrinsic SAID which performs optimization in a low-dimensional subspace; prefix-tuning which learns task-specific vectors that are concatenated to the model’s activations; and LoRA which assigns low-rank updates to parameter matrices. Additionally, we include the baselines of full-model fine-tuning and updating only the layer normalization parameters. For certain methods that allow changing the parameter efficiency, we report results for different budgets: 0.2% and 0.02% sparsity for FISH Mask, 10 and 100 learned prompt vectors for prompt tuning, and 20,000- or 500,000-dimensional subspaces for Intrinsic SAID.

The results are shown in fig. 2, with detailed per-dataset results in appendix D. We find that (IA)3 is the only method that attains higher accuracy than the full-model fine-tuning baseline. While other PEFT methods (e.g. Intrinsic SAID and prompt tuning) update or introduce fewer parameters, (IA)3 performs considerably better. Our results and setting differ from some past work on the PEFT methods we compare against. Mahabadi et al. report that Compacter and Compacter++ outperform full-model fine-tuning, including in the few-shot setting. Lester et al. found that prompt tuning could match full-model fine-tuning, and in subsequent work Wei et al. found that prompt tuning performed well when applied to a multitask fine-tuned model in the few-shot setting. In both cases, we experimented with various hyperparameter choices to try to match past results. We hypothesize that the disagreement comes from our use of a different model and different datasets. For prompt tuning specifically, we noticed that the validation set performance could fluctuate wildly over the course of training, hinting at possible optimization issues.

3.4 Pre-training (IA)3

In recent work, Gu et al. and Vu et al. showed that pre-training the prompt embeddings in prompt tuning can improve performance when fine-tuning on downstream few-shot tasks. For pre-training, Gu et al. use a suite of self-supervised tasks applied to unlabeled text data, and Vu et al. consider using embeddings from a separate task or multitask mixture. We follow Vu et al. and simply pre-train the new parameters introduced by (IA)3 on the same multitask mixture used to train T0. We pre-train for 100,000 steps with a batch size of 16 before fine-tuning the (IA)3 parameters on each individual downstream dataset. A full comparison of accuracy with and without pre-training (IA)3 is detailed in appendix E. We find that pre-training improves fine-tuned accuracy from 64.6 to 65.8 and therefore add it to our recipe.

3.5 Combining the ingredients

Outperforming ICL with T-Few

Having designed and established the T-Few recipe on T0-3B, we now apply it to T0 (with 11 billion parameters) and compare performance to strong few-shot ICL baselines. From this point onwards, we use exactly the same recipe and hyperparameters across all tasks.

First, we evaluate T-Few on the datasets that were held out from T0’s training mixture. We compare against zero-shot learning with T0 (since we found few-shot ICL to perform worse than zero-shot for T0, see appendix F); few-shot ICL with T5+LM (the next-step-prediction language model upon which T0 is based); and few-shot ICL with the 6.7, 13, and 175 billion parameter variants of GPT-3. See appendix F for more details on these baselines. The accuracy on the held-out T0 datasets (described in section 3.1) is shown in table 1 and fig. 3, with per-dataset results reported in appendix F. We find that T-Few outperforms all other methods by a substantial margin. Notably, T-Few achieves a 6% higher accuracy than few-shot ICL with GPT-3 175B despite being about $16\times$ smaller and outperforms the smaller GPT-3 variants by an even larger margin. T-Few also attains significantly higher accuracy than both zero-shot learning with T0 and few-shot ICL with T5+LM.

4.2 Comparing computational costs

Having established that T-Few significantly outperforms ICL-based models, we now compare the relative costs of each few-shot learning approach. For simplicity, we use the FLOPs-per-token estimates for Transformer-based language models introduced by Kaplan et al. Specifically, we estimate that a decoder-only Transformer (e.g. the GPT series) with $N$ parameters uses $2N$ FLOPs per token for inference and $6N$ FLOPs per token for training. Encoder-decoder models like T0 and T5 (where the encoder and decoder have the same number of layers and layer sizes) only process each token with either the encoder or decoder (each having roughly half the parameters of the full model), so the FLOPs per token estimates are halved to $N$ and $3N$ FLOPs per token for inference and training. We note that FLOPs are not a direct measurement of real-world computational cost because latency, power usage, and other costs can vary significantly depending on hardware and other factors. However, we focus on FLOPs because it is a hardware-independent metric that correlates closely with real-world costs and because the hardware setup used for running the different methods we consider would likely vary significantly across methods. We summarize the costs in table 1 and discuss them below. For all estimates, we use the median number of shots (41) across the datasets we consider. Rank evaluation and our unlikelihood loss both require processing every possible output choice to attain a prediction for an unlabeled example. The median combined tokenized sequence length for the input and all possible targets is 103 for the datasets we consider. For in-context examples processed for few-shot ICL, only the correct target is required, producing a median sequence length of 98. Assuming that key and value vectors are cached, processing a single example with ICL therefore involves processing $41\times 98+103$ tokens.

Inference cost.

Beyond improved accuracy, the primary advantage of avoiding few-shot ICL is dramatically lower inference costs. Processing a single input and all target choices with T-Few requires $11{\text{e}}9\times 103=1.1{\text{e}}12$ FLOPs, whereas few-shot ICL with GPT-3 175B requires $2\times 175{\text{e}}9\times(41\times 98+103)=1.4{\text{e}}15$ FLOPs – more than 3 orders of magnitude more. Inference costs with ICL using the smaller GPT-3 variants are also dramatically higher than the inference cost of T-Few. As discussed in section 2.1, caching the key and value vectors when the same set of in-context examples is to be reused can reduce the computational cost of ICL. However, this would only result in an approximately $41\times$ reduction, which is not nearly enough to make any of the GPT-3 ICL costs as low as T-Few.
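The inference estimates above follow directly from the per-token FLOPs rules ($N$ for encoder-decoder models, $2N$ for decoder-only models). The sketch below reproduces the arithmetic with exact integers; the function and variable names are illustrative.

```python
def encoder_decoder_inference_flops(n_params, n_tokens):
    """Roughly N FLOPs per token for an encoder-decoder model."""
    return n_params * n_tokens


def decoder_only_inference_flops(n_params, n_tokens):
    """Roughly 2N FLOPs per token for a decoder-only model."""
    return 2 * n_params * n_tokens


SHOTS, SHOT_LEN, QUERY_LEN = 41, 98, 103  # medians quoted in the text

t_few_flops = encoder_decoder_inference_flops(11_000_000_000, QUERY_LEN)
icl_tokens = SHOTS * SHOT_LEN + QUERY_LEN
gpt3_icl_flops = decoder_only_inference_flops(175_000_000_000, icl_tokens)

print(f"T-Few:      {t_few_flops:.1e} FLOPs")     # 1.1e+12
print(f"GPT-3 175B: {gpt3_icl_flops:.1e} FLOPs")  # 1.4e+15
print(f"ratio:      {gpt3_icl_flops / t_few_flops:.0f}x")
```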

Training cost.

Since T-Few is the only method that involves updating parameters, it is the only method that incurs a training cost. Training an eleven billion parameter encoder-decoder model for 1,000 steps with a batch size of 8 length-103 sequences requires approximately $3\times 11{\text{e}}9\times 1,000\times 8\times 103=2.7{\text{e}}16$ FLOPs. While not insignificant, this is only about 20 times larger than the FLOPs required to process a single example with few-shot ICL using GPT-3 175B. In other words, training T-Few costs as much as using GPT-3 175B to process 20 examples with few-shot ICL. We also found that fine-tuning T0 with T-Few on a single dataset only takes about half an hour on a single NVIDIA A100 GPU. As of writing, this would cost about $2 USD using Microsoft Azure (https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series).
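As a quick check of the training estimate, the sketch below applies the $3N$ FLOPs-per-token rule for encoder-decoder training and compares the result with the cost of a single few-shot ICL prediction with GPT-3 175B. The names are illustrative; with unrounded values the ratio comes out slightly below 20.

```python
def encoder_decoder_training_flops(n_params, steps, batch_size, seq_len):
    """Roughly 3N FLOPs per token, summed over every token seen in training."""
    return 3 * n_params * steps * batch_size * seq_len


train_flops = encoder_decoder_training_flops(
    n_params=11_000_000_000, steps=1_000, batch_size=8, seq_len=103
)

# One few-shot ICL prediction with GPT-3 175B: 2N FLOPs per token over
# 41 shots of 98 tokens plus a 103-token query.
icl_example_flops = 2 * 175_000_000_000 * (41 * 98 + 103)

ratio = train_flops / icl_example_flops
print(f"training: {train_flops:.1e} FLOPs")  # 2.7e+16
print(f"equivalent ICL predictions: {ratio:.1f}")
```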

Storage cost.

T-Few also incurs the largest storage cost. When stored as single-precision floats, the parameters added by (IA)3 take up 4.2 MB of space on disk. In contrast, ICL methods only require storing the tokenized in-context examples (typically stored as 32-bit integers), resulting in a smaller $41\times 98\times 32\text{ bits}=16\text{ kB}$ disk space requirement. However, we note that 4.2 MB is dwarfed by the on-disk size of the model checkpoints themselves – storing the (IA)3 adaptation vectors for 10,000 tasks would take about as much space as the T0 checkpoint (41.5 GB).
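The storage figures can be verified in the same way. The sketch below assumes decimal units (1 MB = $10^6$ bytes, 1 GB = $10^9$ bytes) and takes the 4.2 MB size of the (IA)3 parameters from the text as given; the variable names are illustrative.

```python
# ICL: 41 in-context examples of 98 tokens, stored as 32-bit integers.
icl_bytes = 41 * 98 * 32 // 8

# (IA)3: 4.2 MB per task (figure quoted in the text), for 10,000 tasks.
ia3_bytes_per_task = 4_200_000
total_ia3_bytes = ia3_bytes_per_task * 10_000

print(f"ICL examples:        {icl_bytes / 1e3:.0f} kB")        # 16 kB
print(f"(IA)3, 10,000 tasks: {total_ia3_bytes / 1e9:.0f} GB")  # 42 GB
```

The second figure is close to the 41.5 GB quoted for the T0 checkpoint.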

Memory usage.

During inference, the primary memory cost is incurred by the model’s parameters. The only model smaller than T0 (used by T-Few) is GPT-3 6.7B; otherwise, T-Few will incur a lower memory cost during inference. Additional memory costs are incurred when training T-Few due to the need to cache intermediate activations for backpropagation and for the gradient accumulator variables in Adafactor. However, as mentioned above, it is possible to use the T-Few recipe on a single 80GB A100 GPU.

4.3 Performance on Real-world Few-shot Tasks (RAFT)

So far, we have evaluated performance on a collection of datasets that were not explicitly designed for benchmarking few-shot learning. To better evaluate T-Few’s performance in the real world, we evaluated our approach on the RAFT benchmark. RAFT consists of 11 “economically valuable” tasks that aim to mirror real-world applications. Importantly, each RAFT dataset has only 50 training examples with no validation set and a (larger) test set with no public labels, so it is impossible to “cheat” by tuning on an unrealistically large validation set or by peeking at the test set. We apply T-Few to RAFT by using the standard prompts released alongside the dataset. The accuracy of the current top-5 methods is shown in table 2, with further details provided in appendix H. T-Few attains a state-of-the-art accuracy of 75.8% and outperforms the human baseline (73.5% accuracy) for the first time. The next-best model (from Schick and Schütze) achieves 6% lower accuracy and GPT-3 175B attains only 62.7%. These results validate that T-Few can be readily applied as-is to novel real-world tasks to attain strong performance.

4.4 Ablation experiments

Given that our T-Few design experiments were on T0-3B, we perform an ablation of some of the ingredients of T-Few on T0. Detailed results are shown in appendix G. While adding each ingredient does not always significantly increase the accuracy on each individual dataset, each ingredient consistently improves the average performance across datasets: removing pre-training decreases accuracy by 1.6%, removing unlikelihood training and length normalization decreases accuracy by 4.1%, and removing both pre-training and our additional loss terms reduces accuracy by 2.5%.

Conclusion

We introduced T-Few, a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot ICL at a lower computational cost. T-Few uses (IA)3, a new PEFT method that rescales inner activations with learned vectors. Using (IA)3 produces better performance than fine-tuning the full model while only introducing a tiny number of additional parameters. T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices. When applying T-Few as-is (with no task-specific hyperparameter tuning or other changes) to the RAFT benchmark, we attained super-human performance for the first time and outperformed prior submissions by a large margin. Through detailed characterization of computational costs, we found that T-Few uses over 1,000$\times$ fewer FLOPs during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU. Since all of our experiments were on classification tasks, we are interested in applying T-Few to generative tasks such as summarization and question answering in future work. We hope our results provide a new perspective on how best to perform few-shot learning with large language models.

Appendix A Compute resources used

All T0-3B models were trained on 48GB A6000s. Training T0-3B with each PEFT method took about an hour, except for Intrinsic SAID and FISH Mask, which each took about two hours. Pre-training (IA)3 took 1 day on 4 A6000s. All T0 models were trained on 80GB A100s from DataCrunch (https://cloud.datacrunch.io/) and took about half an hour each to train. Pre-training (IA)3 took about 1 day on 4 A100s.

Appendix B Related Work

Currently, prompt tuning is one of the most parameter-efficient methods for large language models. Liu et al. introduce several tricks to improve prompt tuning, An et al. tune prompts along with input embeddings for a boost in performance, and Chen et al. improve prompt embeddings through continued pre-training. Given optimization difficulties when training prompt embeddings, Diao et al. recently used black-box optimization to train prompt embeddings without requiring gradients. Several works have analyzed prompt tuning from the perspective of interpretability (Khashabi et al.) and its similarity to other PEFT methods (He et al.). Prompt tuning has been applied to various NLP applications including continual learning, model robustness, summarization, machine translation, co-training, probing language models, inverse prompting, and transfer learning. He et al. recently proposed the use of a hypernetwork to predict prompts for new tasks (rather than training the prompt parameters with gradient descent). Prompt tuning and other PEFT methods have also been explored outside of the context of language models (e.g. vision and vision-and-language models).

Separately, various studies have considered few-shot full-model fine-tuning with discrete prompts. Recent work has analyzed training with discrete prompts, demonstrating a boost in performance with prompting when training on various numbers of examples, finding that models perform similarly when trained on good and bad prompts, and exploring which prompts work well for few-shot and full-shot settings. There have also been efforts to develop methods that find performant discrete prompts and training prompts using methods similar to prompt tuning.

There has also been a great deal of work on improving ICL. Chen et al. and Min et al. use ICL for meta-learning to perform few-shot learning on new tasks. Lampinen et al. show that ICL can improve when explanations are provided, and other work uses ICL with text retrieved from the web for open-domain question-answering. Meanwhile, Min et al. analyze how ICL works and show that ICL can still perform well when incorrect labels are provided for the in-context examples.

With the advent of large language models with billions of parameters, there has been a great deal of recent interest in PEFT methods. A small amount of recent work has also begun to explore the compatibility of PEFT methods with the few-shot setting. Mahabadi et al. found that PEFT can outperform standard fine-tuning in the low-resource setting. In concurrent work, Mahabadi et al. compare PEFT to the use of discrete prompts (e.g. PET) during few-shot fine-tuning and find that PEFT compares favorably. Also concurrently, Moosavi et al. propose a framework for introducing adapters whose architecture and design vary from task to task and demonstrate improved results in few-shot settings. Gu et al. and Vu et al. both explored how pre-training prompt tuning parameters can improve performance when limited labeled data is available. For few-shot learning, Triantafillou et al. explore learning universal and dataset-dependent parameters that can be blended for generalization. Requeima et al. use conditional neural adaptive processes and Li et al. leverage distillation from multiple feature extractors for learning new classes or domains in few-shot learning.

Appendix C Full Unlikelihood Training and Length Normalization Results

Table 3 shows the full results with unlikelihood training and length normalization.

Appendix D Full PEFT Results

We train for 300 steps with a learning rate of $3 \times 10^{-4}$.

We use a reduction factor of $32$, ReLU nonlinearity, and residual connections. We train for 500 steps with a learning rate of $3 \times 10^{-3}$.
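The settings above (a reduction factor, a ReLU nonlinearity, and a residual connection) have the shape of a bottleneck adapter layer. The following is a minimal sketch under that assumption, not the implementation used in the experiments: the function names, the toy hidden size of 64, and the zero-initialized up-projection are all illustrative choices that do not come from the text.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_adapter(d_model, reduction_factor=32, scale=0.01, seed=0):
    """Hypothetical init: small random down-projection, zero up-projection."""
    rng = random.Random(seed)
    d_bottleneck = d_model // reduction_factor  # bottleneck width
    w_down = [[rng.gauss(0.0, scale) for _ in range(d_model)]
              for _ in range(d_bottleneck)]
    w_up = [[0.0] * d_bottleneck for _ in range(d_model)]
    return w_down, w_up


def adapter_forward(hidden, w_down, w_up):
    """Down-project, apply ReLU, up-project, then add the residual."""
    bottleneck = [max(0.0, z) for z in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


d_model = 64  # toy size, chosen only so that 64 // 32 gives a bottleneck of 2
w_down, w_up = init_adapter(d_model)
hidden = [float(i) for i in range(d_model)]
output = adapter_forward(hidden, w_down, w_up)
```

With the zero up-projection assumed here, the layer starts out as the identity, so the residual path alone determines the initial output.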

We train for 500 steps with a learning rate of $3 \times 10^{-3}$ and a hypercomplex division factor of $n=4$.

We train for 1000 steps with a learning rate of $3 \times 10^{-1}$ and use 10 and 100 prompt embeddings.
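As a rough illustration of what using 10 prompt embeddings means, the sketch below prepends a block of trainable vectors to the embedded input sequence. It assumes the common formulation in which only the prompt vectors are trained; the helper name, the toy embedding width, and the zero initialization are hypothetical.

```python
def prepend_prompt(prompt_embeddings, input_embeddings):
    """Return the sequence seen by the model: prompt vectors, then input vectors."""
    return [list(p) for p in prompt_embeddings] + [list(x) for x in input_embeddings]


d_model = 4        # toy embedding width
num_prompts = 10   # one of the two prompt lengths mentioned above
prompt = [[0.0] * d_model for _ in range(num_prompts)]  # trainable in practice
tokens = [[1.0] * d_model for _ in range(3)]            # frozen input embeddings
sequence = prepend_prompt(prompt, tokens)
```

The number of trainable values in this setup is `num_prompts * d_model`, independent of the size of the underlying model.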

We train for 1000 steps with a learning rate of $3 \times 10^{-3}$ and adopt the two-layer MLP parameterization in the paper with hidden size 512. We use "Question:" and "Answer:" as initialization text for the prefixes attached to the input and target sequence, respectively.

The Fisher is first computed on the training examples and we keep $0.2\%$ or $0.02\%$ of the parameters. Then, these parameters are trained for 1500 steps with a learning rate of $3 \times 10^{-4}$.
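The selection step described above can be sketched as follows. This assumes a diagonal empirical Fisher estimated as the mean of squared per-example gradients, which is a common approximation but is not spelled out in the text; the function names and the synthetic gradients are purely illustrative.

```python
def diagonal_fisher(per_example_grads):
    """Mean of squared per-example gradients (assumed diagonal Fisher estimate)."""
    n = len(per_example_grads)
    num_params = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n
            for i in range(num_params)]


def top_fraction_mask(fisher, keep_fraction):
    """Boolean mask marking the keep_fraction of parameters with the largest Fisher."""
    k = max(1, round(len(fisher) * keep_fraction))
    ranked = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(fisher))]


# Synthetic gradients for 3 examples and 1000 parameters (made-up values).
per_example_grads = [
    [((37 * i + 11 * e) % 101) / 100.0 for i in range(1000)]
    for e in range(3)
]
fisher = diagonal_fisher(per_example_grads)
mask = top_fraction_mask(fisher, keep_fraction=0.002)  # keep 0.2% of parameters
```

Only the parameters where `mask` is true would then receive gradient updates during the subsequent training steps.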

We train for 3000 steps with a learning rate of $3 \times 10^{-2}$. Due to large model size, we use Intrinsic SAID to produce rank-1 updates for 2D weights via an outer product of two vectors.
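To make the rank-1 update concrete, the sketch below adds the outer product of two vectors to a small 2D weight. It illustrates only the outer-product structure, not how Intrinsic SAID produces the two vectors; the helper names and numbers are illustrative.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list with len(u) rows."""
    return [[ui * vj for vj in v] for ui in u]


def apply_rank1_update(weight, u, v):
    """Return weight + u v^T without modifying the original weight."""
    delta = outer(u, v)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
u = [1.0, -1.0]      # one entry per output row
v = [0.5, 0.0, 2.0]  # one entry per input column
updated = apply_rank1_update(weight, u, v)
# updated == [[1.5, 2.0, 5.0], [3.5, 5.0, 4.0]]
```

For a weight of shape $m \times n$, the update is described by $m + n$ numbers rather than $mn$, which is what makes it attractive for very large models.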

We use a rank of $4$ with an initialization scale of $0.01$ and update all the attention and feedforward modules. We train for 1000 steps with a learning rate of $3 \times 10^{-3}$.
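The rank and initialization scale above can be illustrated with a pair of low-rank factors whose product is added to a frozen weight. This is a sketch under assumptions: which factor is drawn from a Gaussian with standard deviation 0.01 and which starts at zero is a guess made for illustration, and the layer sizes are arbitrary.

```python
import random


def init_low_rank_factors(d_out, d_in, rank=4, init_scale=0.01, seed=0):
    """Assumed init: Gaussian factor a (rank x d_in), zero factor b (d_out x rank)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, init_scale) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def trainable_fraction(d_out, d_in, rank):
    """Trainable values in the two factors relative to the full weight matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


a, b = init_low_rank_factors(d_out=8, d_in=16)
fraction = trainable_fraction(d_out=1024, d_in=4096, rank=4)
# fraction == 0.0048828125, i.e. about 0.5% of the full matrix
```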

Appendix E Full Pre-training Results

Table 8 shows the per-dataset results of pre-training (IA)$^3$.

Appendix F Full Main Results

We compare against the following baselines:

T0.

To measure the improvement in performance conferred through parameter-efficient few-shot learning, we compare to zero-shot evaluation using T0 itself. In preliminary experiments, we found that T0 was not able to perform few-shot ICL – performance actually decreased as we increased the number of in-context examples. This is likely because of the zero-shot format used during multitask prompted fine-tuning, and it corroborates a recent finding.

T5+LM.

Since T0 is unable to perform ICL on its own, we also compare to T5+LM, the next-step-prediction language model upon which T0 is based. Specifically, we use the LM-adapted variant of T5.1.1.xxl released by Lester et al., which has the same architecture and number of parameters as T0. Due to memory constraints and because of its improved performance, we use ensemble ICL for T5+LM. Specifically, we perform one-shot ICL using each example in the training set individually and average the predictions for a given query example. For fair comparison with GPT-3 models, we use the EleutherAI evaluation harness, which was designed to replicate the evaluation setup used by Brown et al.
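The ensemble ICL procedure described above can be sketched as follows. The sketch assumes that "averaging the predictions" means averaging per-class probabilities, which the text does not specify; `score_fn` stands in for a model call and `toy_score_fn` is a made-up replacement so that the example runs without a model.

```python
def ensemble_icl_predict(train_examples, query, score_fn, num_classes):
    """Run one-shot ICL once per training example and average class probabilities."""
    totals = [0.0] * num_classes
    for example in train_examples:
        probs = score_fn(example, query)  # hypothetical one-shot model call
        for c, p in enumerate(probs):
            totals[c] += p
    averaged = [t / len(train_examples) for t in totals]
    prediction = max(range(num_classes), key=lambda c: averaged[c])
    return prediction, averaged


def toy_score_fn(example, query):
    """Stand-in for a language model: leans toward the in-context example's label."""
    _, label = example
    return [0.3, 0.7] if label == 1 else [0.6, 0.4]


train_examples = [("great movie", 1), ("terrible plot", 0), ("loved it", 1)]
prediction, averaged = ensemble_icl_predict(
    train_examples, "a fine film", toy_score_fn, num_classes=2
)
```

Because each prompt contains a single in-context example, the memory cost per forward pass stays close to that of zero-shot inference, at the price of one forward pass per training example.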

GPT-3.

For a strong ICL baseline, we consider models in the GPT-3 family. Specifically, we compare to the 6.7, 13, and 175 billion parameter variants of GPT-3. Because these models have not been publicly released, we report numbers directly from Brown et al. While GPT-3 is available through the commercial OpenAI API, re-running evaluation through the API would be more than an order of magnitude more expensive than running all of the experiments performed for this paper.

Appendix G Full Ablation Results

Table 10 shows the T-Few ablation results.

Appendix H RAFT Experiment Details

RAFT consists of 11 tasks: Ade Corpus V2, Banking 77, NeurIPS Impact Statement Risks, One Stop English, Overruling, Semiconductor Org Types, Systematic Review Inclusion, Tai Safety Research, Terms of Service, Tweet Eval Hate, and Twitter Complaints. We use the T-Few recipe on all datasets, without putting the labels into the input string, except for Banking 77. Because Banking 77 has 77 classes, which causes memory issues for unlikelihood training, we turn off unlikelihood training for it. We also feed in all the labels as part of the input string for Banking 77, since some labels were never seen during training, and clean the labels by replacing "." with ",".
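The Banking 77 preprocessing described above might look roughly like the sketch below. Only the replacement of "." with "," comes from the text; the input template, the separator, and the example labels and query are invented for illustration and are not the ones used in the experiments.

```python
def clean_label(label):
    """Replace every '.' in a label with ',' as described in the text."""
    return label.replace(".", ",")


def build_input(text, labels):
    """Append all cleaned labels to the input string (hypothetical template)."""
    options = "; ".join(clean_label(label) for label in labels)
    return f"{text}\nPossible labels: {options}"


# Made-up labels and query, used only to exercise the helpers.
labels = ["Refund not showing up", "Card lost. Or stolen"]
example_input = build_input("Where is my refund?", labels)
```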

Per-dataset results of T-Few and the other top-5 methods on RAFT are shown in Table 11.