Making Pre-trained Language Models Better Few-shot Learners

Tianyu Gao, Adam Fisch, Danqi Chen

Introduction

The GPT-3 model Brown et al. (2020) has made waves in the NLP community by demonstrating astounding few-shot capabilities on myriad language understanding tasks. Given only a natural language prompt and a few demonstrations of the task, GPT-3 is able to make accurate predictions without updating any of the weights of its underlying language model. However, while remarkable, GPT-3 consists of 175B parameters, which makes it challenging to use in most real-world applications.

In this work, we study a more practical scenario in which we only assume access to a moderately-sized language model such as BERT Devlin et al. (2019) or RoBERTa Liu et al. (2019), and a small number of examples (i.e., a few-shot setting), which we can use to fine-tune the weights of the language model. This setting is appealing as (1) such models can be trained on typical research hardware; (2) few-shot settings are realistic, as it is generally both easy to acquire a few annotations (e.g., 32 examples) and efficient to train on them; and (3) updating parameters typically leads to better performance. Inspired by GPT-3’s findings, we propose several novel strategies for expanding its few-shot learning abilities to our setting, considering both classification and—for the first time—regression.

First, we follow the route of prompt-based prediction, first developed by the GPT series Radford et al. (2018, 2019); Brown et al. (2020) for zero-shot prediction and recently studied by PET Schick and Schütze (2021a, b) for fine-tuning. Prompt-based prediction treats the downstream task as a (masked) language modeling problem, where the model directly generates a textual response (referred to as a label word) to a given prompt defined by a task-specific template (see Figure 1(c)). Finding the right prompts, however, is an art—requiring both domain expertise and an understanding of the language model’s inner workings. Even if significant effort is invested, manual prompts are likely to be suboptimal. We address this issue by introducing automatic prompt generation, including a pruned brute-force search to identify the best working label words, and a novel decoding objective to automatically generate templates using the generative T5 model Raffel et al. (2020)—all of which only require the few-shot training data. This allows us to cheaply obtain effective prompts that match or outperform our manually chosen ones.

Second, we adopt the idea of incorporating demonstrations as additional context. GPT-3’s naive “in-context learning” paradigm picks up to 32 randomly sampled examples, and concatenates them with the input. This method is not guaranteed to prioritize the most informative demonstrations, and mixing random examples from different classes together creates long contexts which can be hard to learn from. Additionally, the number of usable demonstrations is bounded by the model’s maximum input length. We develop a more refined strategy, where, for each input, we randomly sample a single example at a time from each class to create multiple, minimal demonstration sets. We also devise a novel sampling strategy that pairs inputs with similar examples, thereby providing the model with more discriminative comparisons.

We present a systematic evaluation for analyzing few-shot performance on 8 single-sentence and 7 sentence-pair NLP tasks. We observe that given a small number of training examples, (1) prompt-based fine-tuning largely outperforms standard fine-tuning; (2) our automatic prompt search method matches or outperforms manual prompts; and (3) incorporating demonstrations is effective for fine-tuning, and boosts few-shot performance. Together, these simple-yet-effective methods contribute towards a dramatic improvement across the tasks we evaluate on, and we obtain gains up to 30% absolute improvement (11% on average) compared to standard fine-tuning. For instance, we find that a RoBERTa-large model achieves around 90% accuracy on most binary sentence classification tasks, while only relying on 32 training examples. We refer to our approach as LM-BFF, better few-shot fine-tuning of language models: a strong, task-agnostic method for few-shot learning.

Related Work

Language model prompting. The GPT series (Radford et al., 2018, 2019; Brown et al., 2020) fueled the development of prompt-based learning, and we follow many of its core concepts. We are also greatly inspired by the recent PET work (Schick and Schütze, 2021a, b), although they mainly focus on a semi-supervised setting where a large set of unlabeled examples are provided. We only use a few annotated examples as supervision, and also explore automatically generated prompts and fine-tuning with demonstrations. Furthermore, we deviate from their evaluation by providing a more rigorous framework, as we will discuss in §3. Finally, there is a large body of work on prompting for mining knowledge from pre-trained models (Trinh and Le, 2018; Petroni et al., 2019; Davison et al., 2019; Talmor et al., 2020, inter alia). Different from these works, we focus on leveraging prompting for fine-tuning on downstream tasks.

Schick and Schütze (2021a) and Schick et al. (2020) explore ways of identifying label words automatically; however, none of these results lead to better performance compared to hand-picked ones. In contrast, our method searches over both templates and label words, and is able to match or outperform our manual prompts. Several other attempts have also been made, yet these approaches either operate in limited domains, such as finding patterns to express specific relations Jiang et al. (2020), or require a large number of examples for gradient-guided search Shin et al. (2020); Zhong et al. (2021). Our approach aims to develop general-purpose search methods that rely only on a few annotations.

Fine-tuning of language models.

A number of recent studies have focused on better methods for fine-tuning language models Howard and Ruder (2018); Dodge et al. (2020); Lee et al. (2020); Zhang et al. (2021). These works mainly focus on optimization and regularization techniques to stabilize fine-tuning. Here we use standard optimization techniques, and instead mainly focus our efforts on better prompt-based fine-tuning in a more extreme few-shot setting. We anticipate that results of these studies are largely complementary to ours.

Few-shot learning.

Broadly speaking, our setting is also connected to other few-shot learning paradigms in NLP, including (1) semi-supervised learning Miyato et al. (2017); Xie et al. (2020); Chen et al. (2020), where a set of unlabeled examples are given; (2) meta-learning Yu et al. (2018); Han et al. (2018); Bansal et al. (2020a, b); Bao et al. (2020), where a set of auxiliary tasks are given; and (3) intermediate training Phang et al. (2018); Yin et al. (2020), where a related, intermediate task is given. We deviate from these settings by making minimal assumptions about available resources: we only assume a few annotated examples and a pre-trained language model. Our focus is on understanding how far we can push without any other advantages.

Problem Setup

In this work, we assume access to a pre-trained language model $\mathcal{L}$ that we wish to fine-tune on a task $\mathcal{D}$ with a label space $\mathcal{Y}$. For the task, we only assume $K$ training examples per class for the task’s training set $\mathcal{D}_{\text{train}}$, such that the total number of examples is $K_{\text{tot}}=K\times|\mathcal{Y}|$, and $\mathcal{D}_{\text{train}}=\{({x}_{\mathrm{in}}^{i},y^{i})\}_{i=1}^{K_{\text{tot}}}$. (For regression, we partition the data into two “classes” according to being above or below the median value.) Our goal is then to develop task-agnostic learning strategies that generalize well to an unseen test set $({x}_{\mathrm{in}}^{\text{test}},y^{\text{test}})\sim\mathcal{D}_{\text{test}}$. For model selection and hyper-parameter tuning, we assume a development set $\mathcal{D}_{\text{dev}}$ of the same size as the few-shot training set, i.e., $|\mathcal{D}_{\text{dev}}|=|\mathcal{D}_{\text{train}}|$. This distinction is important: using a larger development set confers a significant advantage (see our experiments in Appendix A), and subverts our initial goal of learning from limited data. (In contrast, Schick and Schütze (2021a, b) do not use a development set, and adopt a set of hyper-parameters based on practical considerations. This is akin to “shooting in the dark” on a setting that we show can have unintuitive outcomes.) For all of the following experiments (unless specified otherwise), we take $\mathcal{L}=$ RoBERTa-large and $K=16$.
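The sampling setup above can be illustrated with a minimal sketch. The function names, the `(x_in, y)` pair representation, and the tie-handling at the median (targets equal to the median go to the lower group) are assumptions made for illustration, not details given in the text.

```python
import random
from collections import defaultdict
from statistics import median


def sample_few_shot(examples, k, seed=0):
    """Draw disjoint train and dev sets with k examples per class each.

    `examples` is a list of (x_in, y) pairs; returns (train, dev), so that
    |train| = |dev| = k * number_of_classes.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x_in, y in examples:
        by_class[y].append((x_in, y))

    train, dev = [], []
    for label in sorted(by_class):
        pool = by_class[label][:]
        rng.shuffle(pool)
        if len(pool) < 2 * k:
            raise ValueError(f"class {label!r} has fewer than {2 * k} examples")
        train.extend(pool[:k])
        dev.extend(pool[k:2 * k])
    return train, dev


def regression_classes(examples):
    """Assign pseudo-classes for regression by splitting at the median target.

    Assumption: targets equal to the median are placed in the "below" group.
    """
    m = median(y for _, y in examples)
    return [(x_in, "above" if y > m else "below") for x_in, y in examples]
```

With $K=16$ and a binary task, `sample_few_shot` returns 32 training and 32 development examples, matching $K_{\text{tot}}=K\times|\mathcal{Y}|$.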

Evaluation datasets.

We conduct a systematic study across $8$ single-sentence and $7$ sentence-pair English tasks, including 8 tasks from the GLUE benchmark Wang et al. (2019), SNLI Bowman et al. (2015), and 6 other popular sentence classification tasks (SST-5, MR, CR, MPQA, Subj, TREC). All of the dataset details are provided in Appendix B. For single-sentence tasks, the goal is to make a prediction based on an input sentence ${x}_{\mathrm{in}}=x_{1}$, such as whether a movie review is positive or not. For sentence-pair tasks, the goal is to take a pair of input sentences ${x}_{\mathrm{in}}=(x_{1},x_{2})$ and predict the relationship between them. We also interchangeably refer to the inputs as $\texttt{<}S_{1}\texttt{>}$ or ($\texttt{<}S_{1}\texttt{>}$, $\texttt{<}S_{2}\texttt{>}$). Note that we mainly use SST-2 and SNLI for pilot experiments and model development, making it close to a true few-shot setting, at least for all the other datasets we evaluate on.

Evaluation protocol.

Systematically evaluating few-shot performance can be tricky. It is well-known that fine-tuning on small datasets can suffer from instability Dodge et al. (2020); Zhang et al. (2021), and results may change dramatically given a new split of data. To account for this, we measure average performance across 5 different randomly sampled $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{dev}}$ splits. This issue has also been discussed in Schick and Schütze (2021b)—they suggest using a fixed set of training examples. We argue that sampling multiple splits gives a more robust measure of performance, and a better estimate of the variance. We also observe that hyper-parameters can make a significant difference, thus we sweep multiple hyper-parameters for each data sample, and take the best setting as measured on the $\mathcal{D}_{\text{dev}}$ of that sample (see Appendix C.1).
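The protocol described above can be sketched as follows. The callback `train_and_eval` is hypothetical and stands in for a full fine-tuning run; reporting the mean and standard deviation of the selected test scores is one reasonable way to summarize the splits.

```python
from statistics import mean, stdev


def evaluate_protocol(splits, configs, train_and_eval):
    """Per split, pick the config with the best dev score; report its test score.

    `train_and_eval(split, config)` is a hypothetical callback returning a
    (dev_score, test_score) pair for one fine-tuning run.
    """
    test_scores = []
    for split in splits:
        results = [train_and_eval(split, cfg) for cfg in configs]
        # Model selection uses only the dev score of this split.
        _, best_test = max(results, key=lambda r: r[0])
        test_scores.append(best_test)
    return mean(test_scores), stdev(test_scores)
```

Selecting hyper-parameters separately for each split keeps the development set the same size as the training set, as the setup requires.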

Prompt-based Fine-tuning

Given a masked language model $\mathcal{L}$ , we first convert input ${x}_{\mathrm{in}}$ to a token sequence $\tilde{x}$ , and the language model $\mathcal{L}$ then maps $\tilde{x}$ to a sequence of hidden vectors $\{\mathbf{h}_{k}\in\mathbb{R}^{d}\}$ . During standard fine-tuning, we usually take $\tilde{x}_{\text{single}}=\texttt{[CLS]}x_{1}\texttt{[SEP]}$ or $\tilde{x}_{\text{pair}}=\texttt{[CLS]}x_{1}\texttt{[SEP]}x_{2}\texttt{[SEP]}$ . For downstream classification tasks with a label space $\mathcal{Y}$ , we train a task-specific head, $\mathrm{softmax}(\mathbf{W}_{o}\mathbf{h}_{\texttt{[CLS]}})$ , by maximizing the log-probability of the correct label, where $\mathbf{h}_{\texttt{[CLS]}}$ is the hidden vector of [CLS], and $\mathbf{W}_{o}\in\mathbb{R}^{\mathcal{|\mathcal{Y}|}\times d}$ is a set of randomly initialized parameters introduced at the start of fine-tuning. Similarly, for a regression task, we can introduce $\mathbf{w}_{o}\in\mathbb{R}^{d}$ and optimize the mean squared error between $\mathbf{w}_{o}\cdot\mathbf{h}_{\texttt{[CLS]}}$ and the gold label. In either case, the number of new parameters can be substantial—for example, a simple binary classification task will introduce 2,048 new parameters for a RoBERTa-large model—making it challenging to learn from a small amount of annotated data (e.g., 32 examples).
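As a quick check of the parameter count quoted above, assuming the RoBERTa-large hidden size of $d=1024$ and counting only the entries of $\mathbf{W}_{o}$ (no bias term):

$\displaystyle |\mathcal{Y}|\times d=2\times 1024=2{,}048.$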

An alternative approach to solving this problem is prompt-based fine-tuning, in which $\mathcal{L}$ is directly tasked with “auto-completing” natural language prompts. For instance, we can formulate a binary sentiment classification task using a prompt with input $x_{1}$ (e.g., “No reason to watch it .”) as:

$x_{\mathrm{prompt}}=\texttt{[CLS]}~x_{1}~\text{It was}~\texttt{[MASK]}~\text{.}~\texttt{[SEP]}$

and let $\mathcal{L}$ decide whether it is more appropriate to fill in “great” (positive) or “terrible” (negative) for [MASK]. We now formalize this approach for classification and regression (§4.1 and §4.2), and discuss the importance of prompt selection (§4.3).

Let $\mathcal{M}\colon\mathcal{Y}\rightarrow\mathcal{V}$ be a mapping from the task label space to individual words in the vocabulary $\mathcal{V}$ of $\mathcal{L}$. (More generally, we can consider a one-to-many mapping $\mathcal{M}\colon\mathcal{Y}\rightarrow 2^{|\mathcal{Y}|}$ in which we map labels to sets of words. However, we did not find significant gains in our experiments.) Then for each ${x}_{\mathrm{in}}$, let the manipulation ${x}_{\mathrm{prompt}}=\mathcal{T}({x}_{\mathrm{in}})$ be a masked language modeling (MLM) input which contains one [MASK] token. In this way, we can treat our task as an MLM, and model the probability of predicting class $y\in\mathcal{Y}$ as:

$\begin{aligned} p(y\mid{x}_{\mathrm{in}})&=p\left(\texttt{[MASK]}=\mathcal{M}(y)\mid x_{\mathrm{prompt}}\right)\\ &=\frac{\exp\left(\mathbf{w}_{\mathcal{M}(y)}\cdot\mathbf{h}_{\texttt{[MASK]}}\right)}{\sum_{y^{\prime}\in\mathcal{Y}}{\exp\left(\mathbf{w}_{\mathcal{M}(y^{\prime})}\cdot\mathbf{h}_{\texttt{[MASK]}}\right)}},\end{aligned}$

where $\mathbf{h}_{\texttt{[MASK]}}$ is the hidden vector of [MASK] and $\mathbf{w}_{v}$ denotes the pre-softmax vector corresponding to $v\in\mathcal{V}$ . When supervised examples $\{({x}_{\mathrm{in}},y)\}$ are available, $\mathcal{L}$ can be fine-tuned to minimize the cross-entropy loss. It is important to note that this approach re-uses the pre-trained weights $\mathbf{w}_{v}$ and does not introduce any new parameters. It also reduces the gap between pre-training and fine-tuning, making it more effective in few-shot scenarios.
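The class probability above is a softmax restricted to the label words. A minimal sketch, assuming the pre-softmax scores $\mathbf{w}_{v}\cdot\mathbf{h}_{\texttt{[MASK]}}$ have already been computed by the language model and are passed in as a plain dictionary (the function names are illustrative):

```python
import math


def label_word_probs(mask_logits, label_words):
    """Softmax over the label words only.

    `mask_logits` maps vocabulary word -> pre-softmax score at [MASK];
    `label_words` maps class label -> its label word M(y).
    """
    scores = {y: mask_logits[w] for y, w in label_words.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def cross_entropy(probs, gold):
    """Negative log-probability of the gold class."""
    return -math.log(probs[gold])
```

Note that words outside the label set (for example a high-scoring but irrelevant token) do not enter the normalization, and no new parameters are introduced.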

4.2 Regression

We assume the same basic setup as in classification, but treat the label space $\mathcal{Y}$ as a bounded interval $[v_{l},v_{u}]$. Inspired by Mettes et al. (2019), we model the problem as an interpolation between two opposing poles, $\{y_{l},y_{u}\}$, with values $v_{l}$ and $v_{u}$ respectively. For instance, we can formulate our previous sentiment analysis task as a regression problem in the range $[0,1]$, where we slide between “terrible” ($v_{l}=0$) and “great” ($v_{u}=1$). In this way, we can express $y$ as a mixture model:

$\displaystyle y=v_{l}\cdot p(y_{l}\mid{x}_{\mathrm{in}})+v_{u}\cdot p(y_{u}\mid{x}_{\mathrm{in}}),\qquad(2)$

where $p(y_{u}\mid{x}_{\mathrm{in}})$ is the probability of $y_{u}$, and $p(y_{l}\mid{x}_{\mathrm{in}})=1-p(y_{u}\mid{x}_{\mathrm{in}})$. Then we define $\mathcal{M}\colon\{y_{l},y_{u}\}\rightarrow\mathcal{V}$, and model $p(y_{u}\mid{x}_{\mathrm{in}})$ the same as Eq. (1). We fine-tune $\mathcal{L}$ to minimize the KL-divergence between the inferred $p(y_{u}\mid{x}_{\mathrm{in}})$ and the observed mixture weight, $(y-v_{l})/(v_{u}-v_{l})$.
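The regression formulation can be sketched in a few lines. The direction of the KL-divergence (target distribution first) and the clipping constant `eps` are assumptions; the text only states that a KL-divergence between the two quantities is minimized.

```python
import math


def predict_value(p_upper, v_l, v_u):
    """Mixture prediction y = v_l * p(y_l | x) + v_u * p(y_u | x)."""
    return v_l * (1.0 - p_upper) + v_u * p_upper


def mixture_weight(y, v_l, v_u):
    """Observed weight on the upper pole, (y - v_l) / (v_u - v_l)."""
    return (y - v_l) / (v_u - v_l)


def kl_loss(p_upper, y, v_l, v_u, eps=1e-12):
    """KL(target || predicted) over the two poles {y_l, y_u}.

    Assumption: the target distribution is the first argument of the KL term.
    """
    q = mixture_weight(y, v_l, v_u)
    loss = 0.0
    for target, pred in ((q, p_upper), (1.0 - q, 1.0 - p_upper)):
        if target > 0.0:
            loss += target * math.log(target / max(pred, eps))
    return loss
```

For a task scored in $[0,5]$, a gold value of 1.25 corresponds to a mixture weight of 0.25 on the upper pole, and the loss is zero exactly when the model assigns that probability to the upper label word.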

4.3 Manual prompts: the good and the bad

The key challenge is to construct the template $\mathcal{T}$ and label words $\mathcal{M}(\mathcal{Y})$ —we refer to these two together as a prompt $\mathcal{P}$ . Previous works Schick and Schütze (2021a, b) hand-craft both the templates and label words, which usually requires domain expertise and trial-and-error. Table 1 summarizes manual templates and label words chosen for each dataset in our experiments. These templates and label words were designed by intuition, and by considering formats used in previous literature.

To better understand what constitutes a good template or label word, we conduct a pilot study on SST-2 and SNLI. Table 2 shows that different prompts can lead to substantial differences in final accuracy. Specifically, when a template is fixed, the better the label words match the “semantic classes”, the better the final accuracy is (great/terrible $>$ good/bad $>$ cat/dog). In extreme cases where we swap plausible label words (e.g., terrible/great), we achieve the worst overall performance. (It is unclear, however, why RoBERTa thinks that “cat” is more positive than “dog”. The authors tend to disagree.) Furthermore, with the same set of label words, even a small change in the template can make a difference. For example, for SNLI, if we put [MASK] at the end, or swap sentence order, we observe a $>$10% drop. The above evidence clearly underlines the importance of selecting good templates and label words. Searching for prompts, however, is hard, as the search space can be very large, especially for the template. Even worse, we only have a few examples to use to guide our search, which can easily overfit. We will address these issues next.

Automatic Prompt Generation

We now explore principled ways of automating the search process for label words (§5.1) and templates (§5.2). Our goals are to reduce the human involvement required to design prompts, and to find more optimal settings than those that we manually choose. Here, we assume a classification task, but the process for regression is analogous.

We first study how to construct a label word mapping $\mathcal{M}$ that maximizes accuracy on $\mathcal{D}_{\text{dev}}$ after fine-tuning, given a fixed template $\mathcal{T}$ . Naively searching all possible assignments, however, is (1) generally intractable, as the search space is exponential in the number of classes; and (2) prone to overfitting, as we will tend to uncover spurious correlations given only a few annotations. As a simple solution, for each class $c\in\mathcal{Y}$ , we construct a pruned set $\mathcal{V}^{c}\subset\mathcal{V}$ of the top $k$ vocabulary words based on their conditional likelihood using the initial $\mathcal{L}$ . That is, let $\mathcal{D}_{\text{train}}^{c}\subset\mathcal{D}_{\text{train}}$ be the subset of all examples of class $c$ . We take $\mathcal{V}^{c}$ as

$\displaystyle\underset{v\in\mathcal{V}}{\mathrm{Top}\text{-}k}\left\{\sum_{{x}_{\mathrm{in}}\in\mathcal{D}_{\text{train}}^{c}}\log P_{\mathcal{L}}\Big{(}\texttt{[MASK]}=v\mid\mathcal{T}({x}_{\mathrm{in}})\Big{)}\right\},$

where ${P}_{\mathcal{L}}$ denotes the output probability distribution of $\mathcal{L}$ . To further narrow down the search space, we find the top $n$ assignments over the pruned space that maximize zero-shot accuracy on $\mathcal{D}_{\text{train}}$ (both $n$ and $k$ are hyper-parameters, see Appendix C.2). Then we fine-tune all top $n$ assignments, and re-rank to find the best one using $\mathcal{D}_{\text{dev}}$ . This approach is similar to the automatic verbalizer search methods in Schick and Schütze (2021a); Schick et al. (2020), except that we use a much simpler search process (brute-force) and also apply re-ranking—which we find to be quite helpful.
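The pruning and ranking steps above can be sketched as follows, assuming the [MASK] log-probabilities from the initial language model are precomputed for every training example and stored as dictionaries over the full vocabulary. The function names and the rule that distinct classes receive distinct words are illustrative assumptions; the final fine-tuning and re-ranking on $\mathcal{D}_{\text{dev}}$ is not shown.

```python
import itertools


def top_k_words(log_probs_per_example, k):
    """Top-k vocabulary words by log-probability summed over one class.

    `log_probs_per_example` is a list of dicts (word -> log P([MASK]=word)),
    one dict per training example of that class.
    """
    totals = {}
    for log_probs in log_probs_per_example:
        for word, lp in log_probs.items():
            totals[word] = totals.get(word, 0.0) + lp
    return sorted(totals, key=totals.get, reverse=True)[:k]


def zero_shot_accuracy(assignment, examples):
    """`examples` holds (log_probs, gold) pairs; predict the best-scoring class."""
    correct = 0
    for log_probs, gold in examples:
        pred = max(assignment, key=lambda c: log_probs[assignment[c]])
        correct += pred == gold
    return correct / len(examples)


def top_n_assignments(examples, classes, k, n):
    """Enumerate assignments over the pruned sets; keep the n best zero-shot ones."""
    pruned = [
        top_k_words([lp for lp, y in examples if y == c], k) for c in classes
    ]
    candidates = []
    for words in itertools.product(*pruned):
        if len(set(words)) < len(words):
            continue  # assumption: distinct classes need distinct words
        assignment = dict(zip(classes, words))
        candidates.append((zero_shot_accuracy(assignment, examples), assignment))
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [a for _, a in candidates[:n]]
```

Pruning reduces the brute-force search from $|\mathcal{V}|^{|\mathcal{Y}|}$ candidate assignments to at most $k^{|\mathcal{Y}|}$.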

5.2 Automatic generation of templates

Next, we study how to generate a diverse set of templates $\{\mathcal{T}\}$ automatically from a fixed set of label words $\mathcal{M}(\mathcal{Y})$. To address this challenging problem, we propose to use T5 Raffel et al. (2020), a large pre-trained text-to-text Transformer. T5 is pre-trained to fill in missing spans (replaced by T5 mask tokens, e.g., <X> or <Y>) in its input. For example, given the input “Thank you <X> me to your party <Y> week”, T5 is trained to generate “<X> for inviting <Y> last <Z>”, meaning that “for inviting” is the replacement for <X> and “last” is the replacement for <Y>. This is well suited for prompt generation: we can simply take input sentences from $\mathcal{D}_{\text{train}}$ and let the T5 model construct the template $\mathcal{T}$, without having to specify a pre-defined number of tokens for it.

Given an input example $({x}_{\mathrm{in}},y)\in\mathcal{D}_{\text{train}}$, we consider the following simple conversions, denoted as $\mathcal{T}_{\mathrm{g}}({x}_{\mathrm{in}},y)$, for formulating the T5 model inputs (we consider putting the label word both before and after the input sentence for single-sentence tasks; however, we find that it is always better to put the label words in the middle, between the two sentences, for sentence-pair tasks):

$\begin{aligned} \texttt{<}S_{1}\texttt{>}&\longrightarrow~{}\texttt{<X>}~{}\mathcal{M}(y)~{}\texttt{<Y>}~{}\texttt{<}S_{1}\texttt{>},\\ \texttt{<}S_{1}\texttt{>}&\longrightarrow~{}\texttt{<}S_{1}\texttt{>}~{}\texttt{<X>}~{}\mathcal{M}(y)~{}\texttt{<Y>},\\ \texttt{<}S_{1}\texttt{>},\texttt{<}S_{2}\texttt{>}&\longrightarrow\texttt{<}S_{1}\texttt{>}~{}\texttt{<X>}~{}\mathcal{M}(y)~{}\texttt{<Y>}~{}\texttt{<}S_{2}\texttt{>}.\end{aligned}$
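The three conversions can be written as a small helper. This is an illustrative sketch: the function name is hypothetical, and the placeholders are kept as the literal strings `<X>` and `<Y>` used in the text, whereas an actual T5 tokenizer would need its own sentinel tokens substituted in.

```python
def t5_inputs(label_word, s1, s2=None):
    """Build the T5 inputs T_g(x_in, y) with <X> and <Y> as span placeholders.

    Single-sentence tasks yield two variants (label word before or after the
    sentence); sentence-pair tasks place the label word between the sentences.
    """
    if s2 is None:
        return [
            f"<X> {label_word} <Y> {s1}",
            f"{s1} <X> {label_word} <Y>",
        ]
    return [f"{s1} <X> {label_word} <Y> {s2}"]
```

For example, `t5_inputs("great", "A fun ride.")` gives the two single-sentence forms, and the spans that T5 generates for `<X>` and `<Y>` become the template text around the label word.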

As shown in Figure 2, we rely on the T5 model to fill in the placeholders. When decoding, our goal here is to find an output that can work well for all examples in $\mathcal{D}_{\text{train}}$ , i.e., the output template $\mathcal{T}$ that maximizes $\sum_{({x}_{\mathrm{in}},y)\in\mathcal{D}_{\text{train}}}{\log P_{\text{T5}}(\mathcal{T}\mid\mathcal{T}_{\mathrm{g}}({x}_{\mathrm{in}},y))}$ , where $P_{\text{T5}}$ denotes the output probability distribution of T5. It can be decomposed according to:

$\displaystyle\sum_{j=1}^{|\mathcal{T}|}\sum_{~{}~{}~{}~{}~{}({x}_{\mathrm{in}},y)\in\mathcal{D}_{\text{train}}}{\log{P_{\text{T5}}\big{(}t_{j}\mid t_{1},...,t_{j-1},\mathcal{T}_{\mathrm{g}}\big{(}{x}_{\mathrm{in}},y\big{)}\big{)}}},$

where $(t_{1},\ldots,t_{|\mathcal{T}|})$ are the template tokens.

We use beam search to decode multiple template candidates. Concretely, we use a wide beam width (e.g., 100) to cheaply obtain a large set of diverse templates. We then fine-tune each generated template on $\mathcal{D}_{\text{train}}$ and use $\mathcal{D}_{\text{dev}}$ to either pick the single template with the best performance (Table 3), or the top $k$ templates to use as an ensemble (Table 4). Though it might appear to be expensive to fine-tune the model on each individual template, this is fast in practice due to the small size of $\mathcal{D}_{\text{train}}$ , and is also fully automated: making it easy to use, compared to manually tuning prompts for each dataset.
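The decoding objective sums token log-probabilities over all training examples before ranking beam candidates. The sketch below shows that aggregation with a toy model; `next_log_probs` is a hypothetical callback standing in for T5, it is assumed to return scores for the same token set in every context, and the end-of-sequence handling is a simplification rather than the procedure used in the experiments.

```python
import math


def beam_search_templates(contexts, next_log_probs, beam_width, max_len, eos="</s>"):
    """Beam search scoring candidates by log-probability summed over contexts.

    `next_log_probs(prefix, context)` is a hypothetical callback returning a
    dict token -> log P(token | prefix, context).
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            totals = {}
            for ctx in contexts:  # aggregate over all training examples
                for tok, lp in next_log_probs(prefix, ctx).items():
                    totals[tok] = totals.get(tok, 0.0) + lp
            for tok, lp in totals.items():
                expanded.append((prefix + (tok,), score + lp))
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = []
        for prefix, score in expanded[:beam_width]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished, key=lambda b: b[1], reverse=True)


def toy_model(prefix, context):
    """Toy stand-in for T5 that ignores the context and prefers 'It was </s>'."""
    if not prefix:
        probs = {"It": 0.7, "was": 0.2, "</s>": 0.1}
    elif prefix[-1] == "It":
        probs = {"It": 0.1, "was": 0.8, "</s>": 0.1}
    else:
        probs = {"It": 0.1, "was": 0.1, "</s>": 0.8}
    return {tok: math.log(p) for tok, p in probs.items()}


templates = beam_search_templates(["ex1", "ex2"], toy_model, beam_width=3, max_len=4)
```

Because the score is a sum over examples, a template that fits only one training sentence well is ranked below one that is moderately likely for all of them.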

Fine-tuning with Demonstrations

In this section, we study whether we can leverage demonstrations when fine-tuning medium-sized LMs, and find better ways to exploit them.

GPT-3’s naive approach to in-context learning simply involves concatenating the input with up to 32 examples randomly drawn from the training set. This approach is suboptimal as (1) the number of available demonstrations is bounded by the model’s maximum input length (GPT-3 uses a context size of 2,048, while most smaller language models, e.g., RoBERTa, have a context size of 512); and (2) mixing numerous random examples from different classes together creates extremely long contexts which can be hard to leverage, especially for a smaller model. To address these issues, we propose a simpler solution: at each training step, we randomly sample one example $\big{(}{x}_{\mathrm{in}}^{(c)},y_{\phantom{t}}^{(c)}\big{)}\in\mathcal{D}_{\text{train}}$ from each class (we also explored sampling multiple examples per class, but did not observe any improvements), convert it into $\mathcal{T}\big{(}{x}_{\mathrm{in}}^{(c)}\big{)}$ with [MASK] replaced by $\mathcal{M}(y_{\phantom{t}}^{(c)})$, which we denote as $\tilde{\mathcal{T}}\big{(}{x}_{\mathrm{in}}^{(c)},y_{\phantom{t}}^{(c)}\big{)}$, and then concatenate them with ${x}_{\mathrm{in}}$ (Figure 1(c)):

$\displaystyle\mathcal{T}\big{(}{x}_{\mathrm{in}}\big{)}\oplus\tilde{\mathcal{T}}\big{(}{x}_{\mathrm{in}}^{(1)},y_{\phantom{t}}^{(1)}\big{)}\oplus\cdots\oplus\tilde{\mathcal{T}}\big{(}{x}_{\mathrm{in}}^{(|\mathcal{Y}|)},y_{\phantom{t}}^{(|\mathcal{Y}|)}\big{)}.$

Here $\oplus$ denotes concatenation of input sequences. During both training and inference we sample multiple demonstration sets for each ${x}_{\mathrm{in}}$ . Note that both ${x}_{\mathrm{in}}$ and demonstration examples are sampled from the same set $\mathcal{D}_{\text{train}}$ during training. At testing time, we still sample demonstration sets from $\mathcal{D}_{\text{train}}$ and ensemble predictions across all sets.
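The concatenation above can be sketched with plain string templates. The `{x}`/`{mask}` placeholder syntax, the space separator, and the exclusion of the query itself from the demonstration pool are assumptions made for this illustration.

```python
import random


def fill_template(template, x_in, word="[MASK]"):
    """Instantiate a template; `{x}` marks the input, `{mask}` the label slot."""
    return template.format(x=x_in, mask=word)


def with_demonstrations(template, label_words, x_in, train, rng):
    """Append one filled-in demonstration per class to the masked query.

    `train` is a list of (x, y) pairs. Assumption: the query itself is
    excluded from the pool of candidate demonstrations.
    """
    parts = [fill_template(template, x_in)]
    for label, word in label_words.items():
        pool = [x for x, y in train if y == label and x != x_in]
        demo = rng.choice(pool)
        parts.append(fill_template(template, demo, word))
    return " ".join(parts)
```

Calling `with_demonstrations` several times with different random draws yields the multiple demonstration sets whose predictions are ensembled at test time; the result always contains exactly one [MASK], belonging to the query.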

6.2 Sampling similar demonstrations

We observe that controlling the construction of the demonstration examples $\{({x}_{\mathrm{in}}^{(c)},y_{\phantom{t}}^{(c)})\}$ is crucial for good final performance. For example, if the set of contrastive demonstrations ${x}_{\mathrm{in}}^{(c)}$ are all dramatically different—from each other, or from the query ${x}_{\mathrm{in}}$ —then it becomes challenging for the language model to decipher meaningful patterns. As a result, the model may simply ignore the context, or even get confused by the additional examples. To address this issue, we devise a simple strategy in which we only sample examples that are semantically close to ${x}_{\mathrm{in}}$ . Specifically, we use a pre-trained SBERT Reimers and Gurevych (2019) model to obtain embeddings for all input sentences (for sentence-pair tasks, we use the concatenation of the two sentences). Here we just feed the raw sentences without the templates into SBERT. For each query ${x}_{\mathrm{in}}$ and each label $c\in\mathcal{Y}$ , we sort all training instances with the label $x\in\mathcal{D}_{\text{train}}^{c}$ by their similarity score to the query $\cos(\mathbf{e}({x}_{\mathrm{in}}),\mathbf{e}(x))$ , and only sample from the top $r=50\%$ instances for each class to use as demonstrations.
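The filtering step can be sketched as below, assuming sentence embeddings (for example from SBERT) are already available as lists of floats. The function names and the rounding rule that keeps at least one example per class are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similar_pool(query_vec, class_examples, ratio=0.5):
    """Keep the top `ratio` fraction of one class's examples by similarity.

    `class_examples` is a list of (x, embedding) pairs sharing one label;
    demonstrations for that label are then sampled from the returned pool.
    """
    ranked = sorted(class_examples, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    keep = max(1, int(len(ranked) * ratio))
    return [x for x, _ in ranked[:keep]]
```

With $K=16$ and $r=50\%$, each class contributes a pool of 8 candidate demonstrations per query.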

Experiments

We present our main results, and address several research questions pertaining to our LM-BFF approach. Implementation details are in Appendix C.

We use a RoBERTa-large model and set $K=16$ in our experiments. A comparison of using RoBERTa vs BERT can be found in Appendix D. For automatic prompt search, in our main table we report automatic template search only (which consistently performs the best, see Table 5). To put our results in perspective, we compare to a number of baselines, namely (1) standard fine-tuning in our few-shot setting; (2) standard fine-tuning using the full training set; (3) simply taking the most frequent class (measured on the full training set); (4) prompt-based zero-shot prediction where we take our manual prompts and use $\mathcal{L}$ “out-of-the-box” without using any training examples; and (5) “GPT-3” in-context learning, where we use the same prompt-based zero-shot setting, but augment the context with randomly sampled 32 demonstrations (and still use RoBERTa-large, not GPT-3).

Table 3 shows our main results using a single prompt, either from our manually designed ones (Table 1), or the best generated ones. First, prompt-based zero-shot prediction achieves much better performance than the majority class, showing the pre-encoded knowledge in RoBERTa. Also, “GPT-3” in-context learning does not always improve over zero-shot prediction, likely because smaller language models are not expressive enough to be used off-the-shelf like GPT-3.

Second, prompt-based fine-tuning can greatly outperform standard fine-tuning, both when using a manual prompt or a generated one. CoLA is one interesting exception, as the input may be a non-grammatical sentence which is out of the distribution of $\mathcal{L}$ . Generally, our automatically searched templates can achieve comparable or even higher results than manual ones, especially for tasks in which constructing strong manual templates is less intuitive (e.g., TREC, QNLI and MRPC).

Finally, using demonstrations in context leads to consistent gains in a majority of tasks. In summary, our combined solution—fine-tuning with automatically searched templates and sampled demonstration sets—achieves a $30\%$ gain on SNLI compared to standard fine-tuning, and $11\%$ gain on average.

Ensemble results.

An advantage of automatic prompt search is that we can generate as many prompts as we want, train individual models, and create large ensembles. PET Schick and Schütze (2021a, b) also ensembles multiple models trained with manual prompts. (They then use unlabeled data and distillation to get a single model, which is outside of our scope.) In Table 4, we make a direct comparison of our searched prompts and PET’s manual prompts on MNLI and RTE (two datasets that we evaluate in common). Note that in the PET NLI templates, the hypothesis is put before the premise, which we actually found to be suboptimal; in our experiments, we swap the two and get better results. As the results show, an ensemble with multiple templates always improves performance. An ensemble of the same number of automatic templates achieves comparable or better performance than the ensemble of PET’s manual prompts. Increasing the number of automatic templates brings further gains.
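One simple way to combine models trained with different templates is to average their class probabilities. The text does not specify the combination rule, so the averaging below is an assumption chosen for illustration.

```python
def ensemble_predict(prob_dicts):
    """Average class probabilities from several models and return the argmax.

    `prob_dicts` is a list of dicts (class -> probability), one per model.
    Assumption: unweighted probability averaging is used as the ensemble rule.
    """
    classes = prob_dicts[0].keys()
    avg = {c: sum(p[c] for p in prob_dicts) / len(prob_dicts) for c in classes}
    return max(avg, key=avg.get), avg
```

The same helper could also be applied to predictions from multiple demonstration sets for a single model.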

7.2 Analysis of generated prompts

Table 5 gives the results of using manual vs automatic prompts. For automatic prompts, we compare template search (Auto T), label word search (Auto L), and a joint variant (Auto T + L) in which we start from manual label words, apply Auto T, and then Auto L. In most cases, Auto T achieves comparable or higher performance than manual ones, and is consistently the best variant. Auto L outperforms manual prompts on TREC and MRPC—but is considerably worse on SNLI. Auto T + L is often better than Auto L, but only sometimes better than Auto T. Table 6 shows examples from Auto T and Auto L (A full list in Appendix E). Auto T templates generally fit the context and label words well, but can contain biased peculiarities (e.g., “{Yes/No}, no” in SNLI). For Auto L words, things are mixed: while most look intuitively reasonable, there are also some mysterious abnormalities (e.g., “Hi” for the “entailment” class in SNLI).

7.3 Analysis of demonstration sampling

Table 7 compares the performance of demonstrations using uniform sampling to selective sampling by SBERT. We acknowledge that SBERT is trained on SNLI and MNLI datasets, thus we also tried a simple sentence encoder using mean pooling of hidden representations from RoBERTa-large. We find that in either case, using selective sampling outperforms uniform sampling, highlighting the importance of sampling similar examples for incorporating demonstrations in context.

7.4 Sample efficiency

Figure 3 illustrates how standard fine-tuning and our LM-BFF compare as $K$ increases. For a simple task such as SST-2 (also see MR, CR and MPQA in Table 3), despite using only 32 total examples, LM-BFF has already nearly saturated its performance and is comparable to standard fine-tuning over the entire dataset. On the harder task of SNLI, LM-BFF continues to improve as $K$ increases while still maintaining a performance gap over standard fine-tuning, until the two converge around $K=256$ .

Discussion

Reformulating NLP tasks as MLM has exciting implications for few-shot learning, but also has limitations. First, while LM-BFF greatly outperforms standard fine-tuning, Table 3 shows that, overall, the performance still substantially lags behind fine-tuning with thousands of examples, especially for harder tasks. Additionally, just like standard fine-tuning, our results also suffer from high variance. As described in §2, several recent studies have tried to counter instability in few-shot fine-tuning and we expect these methods to also help here.

With respect to automatic prompt generation, despite its effectiveness, we still find it practically challenging to expand the search space, or generalize well based on only approximately 32 examples. This is partly due to our lingering reliance on some manual design—either manual templates (for label word search) or manual label words (for template search), which allows us to get our search off the ground, but does also bias it towards areas of the search space that we might have already imagined.

Finally, it is important to clarify that LM-BFF favors certain tasks which (1) can be naturally posed as a “fill-in-the-blank” problem; (2) have relatively short input sequences; and (3) do not contain many output classes. Issues (2) and (3) might be ameliorated with longer-context language models (e.g., Beltagy et al., 2020). For tasks that are not straightforward to formulate in prompting, such as structured prediction, issue (1) is more fundamental. We leave it as an open question for future work.

Conclusion

In this paper we presented LM-BFF, a set of simple but effective techniques for fine-tuning language models using only a few examples. Our approach proposes to (1) use prompt-based fine-tuning with automatically searched prompts; and (2) include selected task demonstrations (training examples) as part of the input context. We show that our method outperforms vanilla fine-tuning by up to $30\%$ (and $11\%$ on average). We conclude by discussing the limitations of our approach and posing open questions for future study.

Acknowledgements

We thank the members of the Princeton, MIT, and Tsinghua NLP groups and the anonymous reviewers for their valuable feedback. TG is supported by a Graduate Fellowship at Princeton University and AF is supported by an NSF Graduate Research Fellowship. This research is also partly supported by a Google Research Scholar Award.

References

Appendix A Impact of Development Sets

Table A.1 shows how the size of the development sets can affect the final performance of the model. For “No $\mathcal{D}_{\text{dev}}$”, we take the same hyper-parameters from Schick and Schütze (2021a, b): batch size = 16, learning rate = 1e-5 and training steps = 250. We also experiment with a variant in which we sample a development set 10 times larger than the training set. We can see that using larger development sets leads to better performance, and this is why we stick to $|\mathcal{D}_{\text{train}}|=|\mathcal{D}_{\text{dev}}|$ in our few-shot setting.

Appendix B Datasets

For SNLI Bowman et al. (2015) and datasets from GLUE Wang et al. (2019), including SST-2 Socher et al. (2013), CoLA Warstadt et al. (2019), MNLI Williams et al. (2018), QNLI Rajpurkar et al. (2016), RTE Dagan et al. (2005); Bar Haim et al. (2006); Giampiccolo et al. (2007); Bentivogli et al. (2009), MRPC Dolan and Brockett (2005), QQP (https://www.quora.com/q/quoradata/) and STS-B Cer et al. (2017), we follow Zhang et al. (2021) and use their original development sets for testing. For datasets which require a cross-validation evaluation (MR Pang and Lee (2005), CR Hu and Liu (2004), MPQA Wiebe et al. (2005), Subj Pang and Lee (2004)), we simply randomly sample 2,000 examples as the testing set and leave them out from training. For SST-5 Socher et al. (2013) and TREC Voorhees and Tice (2000), we use their official test sets. We show dataset statistics in Table B.1.

Appendix C Experimental Details

For grid search, we take learning rates from {1e-5, 2e-5, 5e-5} and batch sizes from {2, 4, 8}. These numbers are picked by pilot experiments on the SST-2 and SNLI datasets. We also use early stopping to avoid overfitting. For each trial, we train the model for 1,000 steps, validate the performance every 100 steps, and take the best checkpoint.
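The search procedure above can be sketched as a small loop. The grid values, the 1,000-step budget and the 100-step validation interval come from the text; the callables `train_steps_fn` and `evaluate_fn` are hypothetical placeholders for whatever training and development-set evaluation code is used, so this is an illustration of the control flow rather than the actual training script.

```python
from itertools import product

LEARNING_RATES = [1e-5, 2e-5, 5e-5]
BATCH_SIZES = [2, 4, 8]

def grid_search(train_steps_fn, evaluate_fn, total_steps=1000, eval_every=100):
    """Return (best_score, lr, batch_size, step) over the whole grid.

    train_steps_fn(state, lr, batch_size, n_steps) -> new state   (hypothetical)
    evaluate_fn(state) -> development score, higher is better     (hypothetical)
    """
    best = None
    for lr, batch_size in product(LEARNING_RATES, BATCH_SIZES):
        state = None  # each trial starts from the pre-trained model
        for step in range(eval_every, total_steps + 1, eval_every):
            state = train_steps_fn(state, lr, batch_size, eval_every)
            score = evaluate_fn(state)
            # Keeping the best checkpoint seen so far acts as early stopping.
            if best is None or score > best[0]:
                best = (score, lr, batch_size, step)
    return best
```

In practice the loop would also save the model weights at the best step; the sketch records only the configuration and step to stay self-contained.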

C.2 Prompt-based fine-tuning

Table 1 shows all the manual templates and label words we use in our experiments. For automatic template generation, we take the T5-3B model, which is the largest publicly available one that can fit on a single GPU. (We take the T5 1.0 checkpoint, which is trained on both unsupervised and downstream task data. We compared it to T5 1.1, which is trained without downstream task data, and did not find a significant difference in generated templates.) For automatically searching label words, we set $k$ to 100 for all tasks except SST-5 and TREC. For SST-5 we set a smaller $k=30$, as it is a 5-way classification task. For TREC, we observe that filtering $\mathcal{V}^{c}$ using conditional likelihood alone is still noisy, thus we set $k=1000$, and then re-rank $\mathcal{V}^{c}$ by the nearest neighbors of the original manual label words and take the top 30 per class. We set $n$ to 100 in all experiments. Due to the large number of trials in automatic search, we take a fixed set of hyper-parameters in this part: batch size of 8 and learning rate of 1e-5.

Since the idea of prompt-based fine-tuning is to make the input and output distributions close to those of pre-training, the implementation details are crucial. For templates, we put an extra space before a sentence if it is not at the beginning of the input. Also, we lowercase the first letter of a sentence if it is concatenated with a prefix (e.g., <$S_{2}$> in Table 1). If punctuation is appended to a sentence (e.g., <$S_{1}$> in Table 1), then the last character of the original sentence is discarded. Finally, we prepend a space to label words in $\mathcal{M}(\mathcal{Y})$. For example, we use “_great” instead of “great” in the RoBERTa vocabulary, where “_” stands for space.
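The formatting rules above can be made concrete with a short sketch. The helper names and the sentence-pair template used in the example are illustrative assumptions, not the authors' code; only the four rules themselves (leading space, lowercasing after a prefix, dropping the last character before appended punctuation, and space-prefixed label words) come from the text.

```python
def format_sentence(sentence, at_start=True, has_prefix=False, appended_punct=None):
    """Apply the template formatting rules to one input sentence."""
    text = sentence
    if appended_punct is not None:
        # The template appends punctuation: discard the original last character.
        text = text[:-1] + appended_punct
    if has_prefix and text:
        # The sentence follows a prefix: lowercase its first letter.
        text = text[0].lower() + text[1:]
    if not at_start:
        # Not at the beginning of the input: put an extra space in front.
        text = " " + text
    return text

def label_word_token(word):
    """Label words are looked up with a leading space (shown as "_" in the text)."""
    return " " + word

# Hypothetical sentence-pair template of the form "<S1>? <mask>, <S2>".
premise = format_sentence("A man is sleeping.", appended_punct="?")
hypothesis = format_sentence("The man is awake.", at_start=False, has_prefix=True)
prompt = premise + " <mask>," + hypothesis
```

Here `prompt` reads "A man is sleeping? <mask>, the man is awake.", which resembles ordinary running text more closely than a naive concatenation of the two raw sentences would.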

C.3 Fine-tuning with demonstrations

When using demonstrations, we sample $16$ different sets of demonstrations for each input and average the predicted log probability for each class during inference. We find that further increasing the number of samples does not bring substantial improvement. Additionally, we have tried different aggregation methods, such as taking the result with the maximum confidence, and did not find a meaningful improvement. For selective demonstrations, we take roberta-large-nli-stsb-mean-tokens (https://github.com/UKPLab/sentence-transformers) from Reimers and Gurevych (2019) as our sentence embedding model.
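The aggregation step described above amounts to averaging per-class log probabilities across the sampled demonstration sets and taking the argmax. The sketch below assumes each forward pass has already been reduced to a dictionary of class log probabilities; the function name and the toy numbers are hypothetical.

```python
import math

def ensemble_predict(log_probs_per_set):
    """Average class log probabilities over demonstration sets.

    log_probs_per_set: list of {label: log_prob} dicts, one per sampled
    set of demonstrations (16 sets in the setting described above).
    Returns (predicted_label, averaged_log_probs).
    """
    labels = list(log_probs_per_set[0].keys())
    n = len(log_probs_per_set)
    avg = {label: sum(lp[label] for lp in log_probs_per_set) / n for label in labels}
    return max(avg, key=avg.get), avg

# Toy example with three demonstration sets instead of 16.
runs = [
    {"pos": math.log(0.9), "neg": math.log(0.1)},
    {"pos": math.log(0.4), "neg": math.log(0.6)},
    {"pos": math.log(0.8), "neg": math.log(0.2)},
]
label, avg = ensemble_predict(runs)
```

Averaging in log space is equivalent to taking a geometric mean of the probabilities, so a single run that assigns a class very low probability pulls the ensemble down more than it would under an arithmetic mean.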

Appendix D Comparisons of BERT vs RoBERTa

Table D.1 compares the results of BERT-large (uncased) and RoBERTa-large in our settings. Pre-trained BERT provides two segment embeddings (A/B) for different parts of the input. The common practice when fine-tuning BERT is to use only segment A for single-sentence tasks, and to use segments A and B for the two sentences in sentence-pair tasks. In our case of incorporating demonstrations, however, we have more than two sentences. Thus we explore the following strategies for segments: (1) using the A segment for all sentences (1-seg); (2) using the A segment for the original input and the B segment for the demonstrations (2-seg); (3) using different segment embeddings for each sentence ($n$-seg), e.g., for SNLI, we use different segments for each premise and hypothesis in both the original input and the demonstrations, which leads to a total of 8 segment embeddings. This introduces new segment embeddings (randomly initialized and learned during fine-tuning), as the pre-trained BERT only has two.
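The three strategies can be illustrated by the segment id each sentence would receive. This is a sketch under stated assumptions: ids are assigned per sentence rather than per token, the original input comes first, and the function name and arguments are hypothetical. The SNLI example assumes one demonstration per class (three demonstrations), which is consistent with the total of 8 segments mentioned above.

```python
def segment_ids(num_demos, sentences_per_example, strategy):
    """Return one segment id per sentence: original input first, then demonstrations."""
    total = (1 + num_demos) * sentences_per_example
    if strategy == "1-seg":
        # Every sentence shares segment A.
        return [0] * total
    if strategy == "2-seg":
        # Segment A for the original input, segment B for all demonstrations.
        return [0] * sentences_per_example + [1] * (num_demos * sentences_per_example)
    if strategy == "n-seg":
        # A distinct segment for every sentence; ids >= 2 need new embeddings.
        return list(range(total))
    raise ValueError(f"unknown strategy: {strategy}")

# SNLI-style input: premise + hypothesis, with three demonstrations assumed.
snli_one = segment_ids(3, 2, "1-seg")
snli_two = segment_ids(3, 2, "2-seg")
snli_n = segment_ids(3, 2, "n-seg")
```

Only the `n-seg` variant produces ids beyond the two that pre-trained BERT provides, which is why it requires randomly initialized segment embeddings.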

Table D.1 shows that prompt-based fine-tuning with demonstrations also works for BERT, and that 2-seg works best when incorporating demonstrations. Still, we take RoBERTa-large as our main model, because RoBERTa performs much better than BERT and saves the trouble of tuning the usage of segment embeddings.

Appendix E Generated Prompts

We show the top 3 automatically generated templates and label words for all tasks in Table E.1. In general, most automatic templates are reasonable and grammatically correct. For the label words, the generated results look intuitive for most single-sentence tasks. For other tasks, the automatic ones can be counterintuitive in some cases. It is still unclear why the language model picks these words, yet they sometimes work well. We leave this for future study.