Learning To Retrieve Prompts for In-Context Learning

Ohad Rubin, Jonathan Herzig, Jonathan Berant

Introduction

The striking language skills and world knowledge embedded in large pre-trained language models (LMs) (Devlin et al., 2019; Petroni et al., 2019; Raffel et al., 2020; Brown et al., 2020) have recently led to in-context learning, a new paradigm in natural language understanding. Under this paradigm, a language model is given a prompt, which typically contains a few training examples, as well as a test instance as input, and generates the output for the test instance directly, without any update to its parameters. This approach was first introduced in GPT-3 Brown et al. (2020), but has quickly spread to other LMs Lieber et al. (2021); Du et al. (2021); Rae et al. (2021).

An attractive property of in-context learning is that it provides a single model for multiple language understanding tasks. However, Liu et al. (2021a) showed that downstream performance can vary widely depending on the choice of in-context examples. This has sparked interest in prompt retrieval (see Fig. 1), where given a test instance, training examples are chosen for the prompt based on some similarity metric. Recent work has either used off-the-shelf unsupervised similarity metrics, or trained a prompt retriever to select examples based on surface similarity Das et al. (2021).

In this work, we propose using language models themselves to label examples that can serve as good prompts, and train a prompt retriever from this signal. To train the retriever (see Fig. 2), we assume access to a training set of input-output pairs and to a scoring LM, i.e., a language model that will be used to score prompts. For each training example $(x,y)$ , we go over other candidate training examples, and estimate the probability, according to the scoring LM, of $y$ conditioned on $x$ and the candidate prompt. We label training examples that lead to high probability as positive and low probability as negative and train a prompt retriever from this data using contrastive learning. We argue that using an LM for labeling examples is a better proxy for training a retriever compared to previously-proposed surface similarity heuristics. Importantly, when creating the training data, we have access to the gold label $y$ , which can be used to obtain a high-quality set of candidate prompts. This leads to good positive examples and hard negative examples, which are beneficial for training with a contrastive objective.

Using a scoring LM to train an efficient retriever for a potentially different test time inference LM is beneficial in two scenarios. First, when the scoring LM is smaller than the inference LM and serves as a proxy for it. This results in cheap and efficient data generation for the retriever, accessible to a wide range of researchers. Second, our approach can be used even when the scoring and inference LMs are identical (e.g., both are GPT-3). This is beneficial when we do not have access to model parameters and can only use it as a service, an increasingly popular paradigm. In this case, we use the LM to train a light-weight retriever that is only tasked with learning a similarity function. More generally, given that the scale of LMs is likely to keep increasing in the foreseeable future, one can view our approach for Efficient Prompt Retrieval, or EPR, as a method for interfacing and learning to interact with large LMs.

We empirically test EPR on three structured sequence-to-sequence tasks, where input natural language utterances are mapped to a meaning representation: MTop (Li et al., 2021) and SMCalFlow (Andreas et al., 2020), which focus on task-oriented dialogue, and Break (Wolfson et al., 2020), a benchmark for mapping questions to a language-based meaning representation. We observe that EPR substantially improves performance compared to prior work on prompt retrieval. When the scoring LM and inference LM are identical (using GPT-Neo (Black et al., 2021)), performance compared to the best baseline improves from 26% to 31.9% on Break, from 57% to 64.2% on MTop, and from 51.4% to 54.3% on SMCalFlow. When using GPT-Neo as a proxy for larger LMs (GPT-J, GPT-3, and Codex), we observe similar gains, where performance improves substantially in all cases.

To conclude, we propose an approach for retrieving training examples for in-context learning in large language models, and show it substantially outperforms prior methods. Given recent developments in scaling LMs, designing efficient methods for interacting with LMs is an important direction for future research. Our code and data are publicly available at https://github.com/OhadRubin/EPR.

Background: Prompt Retrieval

Given a training set $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{n}$ of input-output sequences, and a test example $x_{\text{test}}$ , our goal is to train a retriever, $R(x_{\text{test}},\mathcal{D})$ , that will retrieve a subset of training examples $\mathcal{P}=\{(x_{j},y_{j})\}_{j=1}^{m}\subset\mathcal{D}$ , where $m\ll n$ . We succinctly refer to $\mathcal{P}$ as the prompt. The term prompt often refers to a natural language template filled by an input example (Liu et al., 2021b), but here it denotes the sequence of training examples provided as input to the LM.

Given an inference LM, $g$ , a good prompt should lead to the target output sequence when the test example $x_{\text{test}}$ is concatenated to the prompt $\mathcal{P}$ and passed as a prefix to $g$ . Specifically, decoding from the LM $g([\mathcal{P};x_{\text{test}}])$ should yield $y_{\text{test}}$ . In this work, we focus on structured tasks, such as semantic parsing, where $x$ is a natural language utterance and $y$ is a meaning representation for that utterance.

Prior work

Liu et al. (2021a) investigated the effect of different prompts on the performance of GPT-3 and demonstrated that the choice of in-context examples strongly affects downstream performance. They used an unsupervised sentence encoder to encode training examples, and retrieved for every test instance its nearest neighbors.

Das et al. (2021) trained a supervised prompt retriever for knowledge-base question answering. The retriever was trained with supervision that is tailored for knowledge-base queries, and relies on surface similarity between formal queries. Conversely, our approach takes advantage of the generative LM itself and is thus more general.

Shin et al. (2021) used GPT-3 to select examples for the prompt for few-shot semantic parsing. However, rather than training a retriever, they randomly sample a large set of utterance-program pairs from the training set, and choose those that are similar to the target instance question according to GPT-3. This results in an expensive inference procedure, where GPT-3 is run hundreds of times for each test instance, unlike our approach, which is based on a light-weight sub-linear retriever.

Efficient Prompt Retriever

We now describe our method for training EPR, an efficient prompt retriever for in-context learning. We first describe how to generate labeled data (§3.1), and then how to use the training data for training and inference (§3.2). Fig. 2 provides an overview of the training procedure.

Our approach relies on finding which training examples can serve as good prompts for other training examples. Scoring all pairs of training examples is quadratic in $|\mathcal{D}|$ , and thus prohibitive. Hence, we present a method for choosing a set of candidate examples $\bar{\mathcal{E}}\subset\mathcal{D}$ , from which we will choose positive and negative examples for training. Importantly, since we are not at test time and are only generating data for training, we can use the target sequence $y$ to retrieve a good set of candidates. This can be approached using a simple retrieval method, given that our goal is to retrieve examples that are similar to the input in terms of their output sequence, $y$ .

To obtain a high-quality candidate set of training examples, we take advantage of an unsupervised retriever, $\bar{\mathcal{E}}=R_{u}((x,y),\mathcal{D})$ . For the choice of the unsupervised retriever, we experiment with BM25 Robertson and Zaragoza (2009), a sparse retriever that relies on surface text similarity, and SBERT (Reimers and Gurevych, 2019), which is based on dense sentence encoding. For both, we experimented with passing the retriever the training pair $(x,y)$ or the target sequence $y$ only, and found that using $y$ leads to slightly higher performance.

Once we retrieve the set of candidates $\bar{\mathcal{E}}=\{\bar{e}_{1},\cdots,\bar{e}_{L}\}$ for a training example $(x,y)$ (we omit the dependence of $\bar{\mathcal{E}}$ on $(x,y)$ for simplicity), we score each candidate $\bar{e}_{l}\in\bar{\mathcal{E}}$ independently with a scoring LM, $\hat{g}$ , which serves as a proxy for the inference LM, $g$ . Specifically, the score for a candidate prompt is

$$s(\bar{e}_{l})=\operatorname{Prob}_{\hat{g}}(y\mid\bar{e}_{l},x),$$

which is the probability under the LM, $\hat{g}$ , of the output sequence conditioned on the candidate prompt and input sequence. This indicates how helpful this candidate is for decoding the target (independent of all other candidates). We argue this score is a better proxy for the utility of a training example at inference time compared to prior approaches.

We apply this scoring function to all training examples, and define for each training example a set of positive examples $\mathcal{E}_{\text{pos}}$ , which includes the top- $k$ candidates in $\bar{\mathcal{E}}$ according to $s(\bar{e}_{l})$ , and a set of negative examples $\mathcal{E}_{\text{neg}}$ , which includes the bottom- $k$ candidates in $\bar{\mathcal{E}}$ according to $s(\bar{e}_{l})$ . This should lead to relevant positive examples, assuming that the set of candidates $\bar{\mathcal{E}}$ includes good prompt candidates, and to hard negatives, since all candidates have high similarity with $(x,y)$ according to $R_{u}(y,\mathcal{D})$ . With positive and negative examples at our disposal, we can now apply contrastive learning, which we describe next.
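The labeling step can be illustrated with a minimal sketch. It assumes the candidate scores $s(\bar{e}_{l})$ have already been computed; the names `label_candidates` and `score_fn`, and the toy scores, are hypothetical and stand in for the scoring LM.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input utterance x, output sequence y)


def label_candidates(
    candidates: Sequence[Example],
    score_fn: Callable[[Example], float],
    k: int = 5,
) -> Tuple[List[Example], List[Example]]:
    """Return (positives, negatives): the top-k and bottom-k candidates by score."""
    if 2 * k > len(candidates):
        raise ValueError("need at least 2k candidates so the two sets do not overlap")
    # Sort from highest to lowest score.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    positives = ranked[:k]
    negatives = ranked[-k:]
    return positives, negatives


# Toy usage: made-up scores in place of probabilities from a scoring LM.
toy_scores = {("a", "A"): 0.9, ("b", "B"): 0.1, ("c", "C"): 0.5, ("d", "D"): 0.7}
pos, neg = label_candidates(list(toy_scores), lambda e: toy_scores[e], k=1)
```

Because both sets are drawn from the same retrieved candidate set, the negatives are still similar to $(x,y)$ on the surface, which is what makes them hard.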

Training and Inference

Our training procedure proceeds exactly like the contrastive learning procedure from DPR Karpukhin et al. (2020). This procedure results in an input encoder $E_{X}(\cdot)$ , which receives the sequence of input tokens, $x$ , and a prompt encoder $E_{P}(\cdot)$ , which receives a candidate prompt, namely, a concatenation of the tokens in an input-output pair. Both encoders are initialized with BERT-base Devlin et al. (2019), and the output vector representation is given by the CLS token, as usual. The goal of training is to learn a similarity metric such that given a test example $x_{\text{test}}$ , it will be similar to training examples that lead to decoding of $y_{\text{test}}$ .

An advantage of this approach is that for batch size $B$ the effective batch size is of order $B^{2}$ , with the in-batch negatives trick (Henderson et al., 2017).
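The in-batch negatives idea can be sketched as follows. This is a simplified DPR-style objective, assuming inner-product similarity and plain Python lists in place of encoder outputs; the function names and toy vectors are illustrative, and explicitly mined hard negatives are omitted.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(inputs: Sequence[Vector], prompts: Sequence[Vector]) -> float:
    """Mean negative log-likelihood of the matching prompt for each input.

    prompts[i] is the positive for inputs[i]; every other prompts[j] in the
    batch is treated as a negative, so B pairs yield B * B scored pairs.
    """
    if len(inputs) != len(prompts):
        raise ValueError("inputs and prompts must have the same batch size")
    losses = []
    for i, x in enumerate(inputs):
        logits = [dot(x, p) for p in prompts]
        # Numerically stable log-sum-exp over all prompts in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs: well-aligned encodings give a loss close to zero.
aligned = in_batch_contrastive_loss([[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
```

With all-zero input encodings every prompt is equally likely, so the loss equals $\log B$; training pushes it below that value.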

Inference

After training the input encoder and prompt encoder, we encode the entire set of training examples with $E_{P}(\cdot)$ in a pre-processing step using FAISS (Johnson et al., 2017). At test time, given an input sequence, $x_{\text{test}}$ , we compute its encoding $E_{X}(x_{\text{test}})$ , and then use maximum inner-product search over the training data to find the $L$ most similar training examples, sorted by their inner product (from high to low): $\mathcal{P}=(e_{1},\dots,e_{L})$ . The final prompt $\mathcal{P}^{\prime}$ is determined by $C$ , the maximal context size supported by the inference LM, $g$ . Specifically, $L^{\prime}\leq L$ is the largest $L^{\prime}$ such that $\sum_{i=1}^{L^{\prime}}|e_{i}|+|x_{\text{test}}|+|y^{\prime}|\leq C$ , where $|y^{\prime}|$ is the desired maximal length of the generated output. Finally, we return the output of greedy decoding on $g([e_{L^{\prime}};e_{L^{\prime}-1};\dots;e_{1};x_{\text{test}}])$ .
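The prompt construction rule above can be sketched as follows. The helper names are hypothetical, and whitespace token counting is an assumption standing in for the inference LM's tokenizer.

```python
from typing import Callable, List, Sequence


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens approximate the LM's token count.
    return len(text.split())


def build_prompt(
    ranked_examples: Sequence[str],
    x_test: str,
    context_size: int,
    max_output_len: int,
    length_fn: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Keep the longest prefix e_1..e_L' of the ranked examples that fits in the
    context, then order it so the most similar example is closest to x_test."""
    budget = context_size - length_fn(x_test) - max_output_len
    kept: List[str] = []
    used = 0
    for example in ranked_examples:  # sorted from most to least similar
        cost = length_fn(example)
        if used + cost > budget:
            break  # prefix sums only grow, so this is the largest L'
        kept.append(example)
        used += cost
    # Reverse: e_L', ..., e_1, followed by the test input.
    return list(reversed(kept)) + [x_test]


# Toy usage: budget is 10 - 2 - 3 = 5 tokens, so only the first two examples fit.
prompt = build_prompt(["a b c", "d e", "f g h i"], "q r", context_size=10, max_output_len=3)
```

Placing $e_{1}$ last puts the highest-ranked example immediately before the test input.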

We note that while at training time we score each training example independently, at test time the language model observes a prompt, i.e., a sequence of examples. We leave modeling the dependence between different training examples to future work.

Experimental Results

We now compare EPR to a wide range of unsupervised and supervised baselines, both when the scoring LM, $\hat{g}$ , is smaller than the inference LM, $g$ , and when they are identical.

We focus on tasks that map utterances to meaning representations, where in-context examples can be used to learn the mapping from inputs to outputs. Examples from each dataset and the number of examples are in Table 1.

Break (Wolfson et al., 2020): A dataset mapping complex natural language questions into a language-based meaning representation, where a question is decomposed into an ordered list of atomic steps. We use the low-level Break subset, containing 44K/7K/8K examples in its training/development/test sets.

MTop (Li et al., 2021): A semantic parsing dataset, focused on task-oriented dialogue, where commands are mapped to complex nested queries across 11 domains. Similar to past work Pasupat et al. (2021), we use the English subset of MTop, containing 16K/2K/4K examples in its training/development/test sets.

SMCalFlow (Andreas et al., 2020): A large English-language task-oriented dataset that covers tasks such as calendar, weather, places, and people. The meaning representation is a dataflow program, which includes API calls, function composition and complex constraints. SMCalFlow includes 15K development set examples and 134K training examples, from which we sample a random set of 44K examples for training.

Baselines and Oracles

We consider the following unsupervised baselines, which are applied at test time only.

Random: we randomly sample examples from the training set $\mathcal{D}$ .

SBERT: We use SentenceTransformers (https://www.sbert.net/index.html), a library providing BERT-based sentence embeddings. Specifically, we use paraphrase-mpnet-base-v2, a 110M-parameter model, to encode the test utterance $x_{\text{test}}$ and retrieve the examples with the most similar utterances as in-context examples.

BM25: We use the classical sparse retrieval method BM25 Robertson and Zaragoza (2009), which is an extension of TF-IDF, to retrieve for each test utterance $x_{\text{test}}$ the training examples with the most similar utterance.

BruteForce: We apply the prompt selection method for few-shot semantic parsing from Shin et al. (2021). Given a test example $x_{\text{test}}$ , we sample 200 training examples. For each training example $(x_{i},y_{i})$ , compute $\operatorname{Prob}_{g}(x_{\text{test}}\mid x_{i})$ , and use the highest scoring examples for the prompt. Similar to us, this approach uses the inference LM to choose prompts. However, it does so at test time, which results in slow inference.
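To make the BM25 baseline concrete, here is a minimal sketch of Okapi BM25 ranking over training utterances. The paper does not state which implementation or parameters were used, so the values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and lowercase whitespace tokenization are all assumptions, as are the function names and toy utterances.

```python
import math
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def bm25_rank(query: str, documents: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices sorted from highest to lowest BM25 score."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Toy usage with made-up utterances.
train_utterances = ["set an alarm for seven", "what is the weather today", "play some jazz music"]
ranking = bm25_rank("weather for today", train_utterances)
```

The same function can rank by output sequences instead of utterances by passing the $y$ strings as `documents`.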

Next, we describe baselines that use the training set, $\mathcal{D}$ , to train a prompt retriever. All supervised methods share the following template. First, a candidate set $\bar{\mathcal{E}}$ of $L=50$ examples is retrieved with the unsupervised retriever $R_{u}(y,\mathcal{D})$ . We use BM25 as an unsupervised retriever, since it outperformed SBERT (see §4.4). We then score each candidate prompt $\bar{e}_{l}\in\bar{\mathcal{E}}$ with some scoring function, and label the top- $k$ prompts as positive examples and the bottom- $k$ as negative examples ( $k=5$ ). Different supervised methods only differ in the scoring function itself. Results for $k\in\{1,5,10\}$ and $L\in\{50,100\}$ are in Appendix A.

DR-BM25: Here, we use the original BM25 scores for labeling positive and negative examples and training a dense retriever.

Case-based Reasoning (CBR): We adapt the scoring function from Das et al. (2021), which focused on knowledge-base question answering. They define the weight for a pair of logical forms to be the F1 score between the two sets of relations appearing in those logical forms, and use this weight to softly label their data. Since in our setting we do not assume logical forms, we define the score between two output sequences $y_{i}$ and $y_{j}$ to be the F1 between the two sets of tokens in $y_{i}$ and $y_{j}$ , omitting stop words.

Efficient Prompt Retrieval (EPR): Our full approach from §3.
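The token-set F1 used as the CBR scoring function can be sketched as follows. The stop-word list and whitespace tokenization are illustrative assumptions, since the exact choices are not specified here, and the example sequences are made up.

```python
from typing import FrozenSet, Set

# Assumption: a small illustrative stop-word list.
STOP_WORDS: FrozenSet[str] = frozenset({"a", "an", "the", "of", "in", "to", "and", "or", "is", "are"})


def token_set(sequence: str) -> Set[str]:
    return {t for t in sequence.lower().split() if t not in STOP_WORDS}


def token_f1(y_i: str, y_j: str) -> float:
    """F1 between the sets of non-stop-word tokens of two output sequences."""
    a, b = token_set(y_i), token_set(y_j)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


# Toy usage: 3 shared tokens out of 5 and 4 gives F1 = 2*3 / (5+4) = 2/3.
score = token_f1("return flights #1 from denver", "return flights #1 to boston")
```

The score is symmetric in its two arguments, so it does not matter which sequence is treated as the reference.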

Last, we also consider two oracle models.

BM25-Oracle: We score test examples with BM25 using the gold output sequence $R_{\text{BM25}}(y_{\text{test}},\mathcal{D})$ . This provides an upper-bound on what can be learned by DR-BM25. EPR can potentially outperform this oracle, since its training signal goes beyond surface text similarity.

LM-Oracle: We use the procedure for labeling training data at test time. Given a test example $(x_{\text{test}},y_{\text{test}})$ , we first retrieve $L$ candidate training examples with $R_{\text{BM25}}(y_{\text{test}},\mathcal{D})$ , and then sort the candidates with the scoring LM $\hat{g}$ , estimating the probability of $y_{\text{test}}$ given $x_{\text{test}}$ and the candidate prompt. This provides an upper bound for EPR, since EPR is trained to emulate this behaviour.

Experimental Details

In this work, we only train a dense retriever, but use scoring and inference LMs. For our scoring LM, $\hat{g}$ , we use GPT-Neo (Black et al., 2021), a 2.7B-parameter LM trained on The Pile (Gao et al., 2021), an 825 GB English text corpus, constructed from a wide range of high-quality resources.

In addition, we consider the following inference LMs: (a) GPT-J Wang and Komatsuzaki (2021): a 6B-parameter LM, also trained on The Pile. The advantage in this setup is that GPT-J was trained on the same corpus as GPT-Neo. (b) GPT-3 Brown et al. (2020): A 175B-parameter model, trained mostly on a filtered subset of Common Crawl. (c) Codex Chen et al. (2021): A GPT-3 175B-parameter model fine-tuned on code from GitHub. Since our tasks involve mapping from utterances to programs or meaning representations, Codex might potentially perform well at in-context learning.

For all LMs, we use a maximum context size of $C=$ 2,048 tokens.

Evaluation

On Break, we evaluate performance on the development set with LF-EM (Hasson and Berant, 2021), which is a better metric compared to Normalized Exact Match (NEM), the official metric, as it measures whether two meaning representations are semantically equivalent. On the test set, we use NEM. On MTop and SMCalFlow, we evaluate with Exact Match (EM), i.e., whether the string output by the inference LM is identical to the reference string.

We evaluate EPR in two settings: (a) LM-as-a-service, and (b) LM-as-a-proxy. In the first setting, we use GPT-Neo as both the scoring LM and inference LM. In this setting, we evaluate on the full development sets of Break, MTop, and SMCalFlow. In the latter setting, as we access GPT-3 and Codex through a paid API, we sample a random subset of 1,000 development examples from each dataset and evaluate each model once on this subset.

Results

Table 2 reports results where the scoring and inference LMs are identical. EPR substantially outperforms all other baselines. Specifically, when comparing to the best baseline, it improves performance from 26.0 to 31.9 on Break, from 57.0 to 64.2 on MTop, and from 51.4 to 54.3 on SMCalFlow. This shows that using the LM itself to label examples is an effective approach for obtaining a strong prompt retriever. Table 3 shows test results on Break and MTop corroborating that EPR substantially improves performance compared to BM25 and CBR.

For the unsupervised methods, the Random baseline shows that random sampling of training examples leads to poor performance. BM25 outperforms SBERT for prompt retrieval, and consequently we use BM25 in all of our supervised approaches to retrieve the set of candidates, $\bar{\mathcal{E}}$ . Last, BruteForce performs worse than BM25. We assume this is because the training sets are large ( $\sim$ 14-120K examples), and sampling 200 examples does not cover examples that are useful for GPT-Neo.

Interestingly, EPR outperforms BM25-Oracle on MTop and SMCalFlow and is comparable on Break. This is surprising since BM25-Oracle has access to the output sequence $y_{\text{test}}$ at test time, illustrating that the signal provided by the scoring LM for training goes beyond surface text similarity. The performance of LM-Oracle is substantially higher than EPR, showing that the supervision provided by the scoring LM is strong, and training a better retriever from this signal can substantially enhance performance.

We further evaluate our models in the one-shot setup, i.e., when the prompt given to the inference LM includes the highest scoring example only. In this setup, the inference LM is applied in the same setting as when we generate labeled data, where we go over each prompt candidate independently. Since train and test time are now closer, we can expect the advantage of EPR to be more pronounced.

Table 4 shows the results. Indeed, EPR outperforms the best baseline by 8.5%, and BM25-Oracle by 5%. In addition, we examine AnyCorrect-Oracle, which tests whether any of the candidates returned by BM25 leads to the correct output. AnyCorrect-Oracle reaches 53.6%, 20 points above LM-Oracle. This shows the high quality of candidates provided by BM25 (applied to $y$ ), as one can reach more than 50% LF-EM with just a single prompt. Moreover, it hints that a better scoring function can potentially further improve performance.

LM-as-a-proxy

Table 5 shows results when the scoring LM is GPT-Neo and the inference LM is a larger LM. First, the trends are similar to the LM-as-a-service setup, i.e., EPR substantially outperforms prior baselines, including our best unsupervised baseline, BM25, and the best supervised baseline, CBR, by 2-8 points on all datasets and all pre-trained models. Thus, GPT-Neo serves as a good proxy for choosing training examples.

To further validate this finding, we evaluate the performance of GPT-J on Break with GPT-Neo as the scoring LM compared to using GPT-J itself as the scoring LM. We find performance improves slightly from 31.5 to 33.6. Analogously, when using Codex as the scoring LM and inference LM performance remains roughly the same: 29.5 $\rightarrow$ 29.3. Thus, using a smaller LM (GPT-Neo) is an effective strategy for training a retriever that will be applied on other LMs. Zooming in on different inference LMs, GPT-J performs slightly better than GPT-Neo across the board, since it was trained on the same data and using the same procedure as GPT-Neo. Codex outperforms GPT-3, which can be explained by the fact that it was trained on code, and our datasets involve mapping to programs or meaning representations. Surprisingly, GPT-J outperforms Codex (except on MTop) and GPT-3 despite being 30x smaller. This can perhaps be explained by the fact that GPT-J was trained on a different dataset (The Pile (Gao et al., 2021)).

Analysis

Table 6 shows an example from Break where EPR decodes the correct output, while CBR does not. All training examples retrieved by EPR perform an argmax (argmin in the original utterance), and return in the final step “a code”, while the third example retrieved by CBR does not perform an argmax or argmin, and does not involve “a code”. We provide additional examples in App. A.

Figure 3 shows a t-SNE (Hinton and Roweis, 2002) projection of the embeddings learned by EPR for the training examples of Break, with a link to an interactive version, where we applied the OPTICS Ankerst et al. (1999); Schubert and Gertz (2018) clustering algorithm. Examining clusters shows that EPR captures both lexical and structure similarity. Examples for clusters are also available in App. A.

Prompt copying

We analyze how the LM utilizes in-context prompts. Specifically, is the target output copied from one of the prompts, or is it a composition of different prompt fragments, resulting in generalization to new structures?

To achieve this, we define two types of copying: (a) exact copying, which measures whether the generated output exactly matches one of the examples in the prompt, and (b) abstract copying, which quantifies whether the structure of the decoded output matches any of the structures seen in the prompt. Specifically, we eliminate the effect of non-structural elements such as entities and function arguments. We replace every sequence of words in the logical form that appears in the input utterance with the string [MASKED] for both the target utterance and in-context examples. If the masked logical form that the LM decoded appears in the set of masked examples defined by the prompt, we say that the LM copied that abstract pattern.
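A rough sketch of the abstract copying check follows. It simplifies the masking rule by treating any maximal run of logical-form tokens that also occur in the utterance as one masked span, with whitespace tokenization; the function names and the toy utterances and logical forms are hypothetical.

```python
from typing import List, Sequence, Tuple

MASK = "[MASKED]"


def mask_logical_form(utterance: str, logical_form: str) -> str:
    """Replace runs of logical-form tokens that occur in the utterance with MASK."""
    utterance_tokens = set(utterance.lower().split())
    masked: List[str] = []
    for token in logical_form.split():
        if token.lower() in utterance_tokens:
            # Collapse consecutive masked tokens into a single MASK.
            if not masked or masked[-1] != MASK:
                masked.append(MASK)
        else:
            masked.append(token)
    return " ".join(masked)


def is_abstract_copy(
    test_utterance: str,
    decoded_output: str,
    prompt_examples: Sequence[Tuple[str, str]],
) -> bool:
    """True if the masked decoded output equals some masked prompt example."""
    target = mask_logical_form(test_utterance, decoded_output)
    return any(mask_logical_form(x, y) == target for x, y in prompt_examples)


# Toy usage with made-up examples in a bracketed logical-form style.
prompt_examples = [("set an alarm for noon tomorrow", "[IN:CREATE_ALARM [SL:DATE_TIME for noon tomorrow ] ]")]
decoded = "[IN:CREATE_ALARM [SL:DATE_TIME for 7 am ] ]"
copied = is_abstract_copy("set an alarm for 7 am", decoded, prompt_examples)
```

In this toy case the decoded output is not an exact copy of the prompt example, but both reduce to the same masked structure.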

Table 7 presents the results on the validation set for each of our three datasets, as well as the accuracy on each subset. We observe that the rate of copying is much higher in MTop and SMCalFlow compared to Break, where in MTop and SMCalFlow abstract copying reaches more than 80%. Moreover, accuracy on examples where copying occurred is much higher compared to accuracy where no copying happened. For example, on MTop, 84.5% of the examples were abstractly copied, and on that subset of examples, EPR achieves 71.6% EM, compared to 64.2% on the entire validation set. Nevertheless, even though accuracy is much lower in cases where no copying occurred, accuracy is not negligible, which shows that some form of generalization to new structures is taking place.

Another follow-up question is whether the model copies patterns from prompts uniformly or attends mostly to the ones with a high retrieval score. To answer this, we look at the subset of examples where copying occurred. We then identify for each example the highest-ranking prompt that was copied from, and define the distance of that prompt by dividing the rank by the number of prompts that fit in that example. Figure 4 shows the distribution over distances for the Break dataset. We observe that copying happens mostly from highly-ranked prompts.

Related Work

Our understanding of in-context learning has grown substantially recently. Saunshi et al. (2021) suggest that by conditioning on a prompt, the task of predicting the next word approaches linear separability. Xie et al. (2021) suggest that in-context learning occurs when the model infers a shared latent concept between examples in a prompt. Levine et al. (2021) present a pre-training scheme theoretically motivated by the bias of in-context learning, that gives significant improvements. Recently, Min et al. (2022) showed that the model does not rely on the ground truth input-label mapping provided in the demonstrations as much as previously thought.

Retrieval

Research on training dense retrievers has skyrocketed recently, propelled by interest in open-domain question answering Chen et al. (2017); Lee et al. (2019); Karpukhin et al. (2020); Guu et al. (2020); Khattab and Zaharia (2020); Qu et al. (2021). Work on retrieval-based methods has also spread more widely to other knowledge-intensive tasks Lewis et al. (2020), e.g., fact verification Samarinas et al. (2021).

Similar to us, Pasupat et al. (2021) proposed to use retrieval in semantic parsing. However, they focus on controlling the output generated by a model. Retrieval methods have also been successfully used in language modeling Khandelwal et al. (2020); Borgeaud et al. (2021); Alon et al. (2022) and machine translation Khandelwal et al. (2021).

Prompts

Developing methods for interacting with LMs and extracting desired behaviours has attracted considerable attention, under the umbrella term prompting. In this work, prompts are a set of in-context training examples, but substantial effort has also been devoted to casting natural language tasks as language modeling by phrasing the target task in natural language (see survey in Liu et al. (2021b)). Such approaches include prompt engineering through manual patterns Petroni et al. (2019); Schick and Schütze (2021), decoding methods (Min et al., 2021; Zhao et al., 2021; Holtzman et al., 2021), and methods for extracting either hard Shin et al. (2020); Haviv et al. (2021) or soft Li and Liang (2021); Zhong et al. (2021); Qin and Eisner (2021) prompts automatically.

Prompt retrieval for supervised models

In parallel to this work, adding training examples as additional input has been shown to be useful for supervised models as well. Wang et al. (2022) and Xu et al. (2021) used BM25 to retrieve and augment the input with similar examples from the training set. Fine-tuning the model with the additional inputs improved performance on tasks such as summarization and question answering. Such methods can also potentially benefit from a stronger retriever.

Conclusions

Large pre-trained LMs are becoming an inseparable part of the natural language understanding eco-system. However, accessing their weights or updating them can be prohibitive for many researchers. In this work, we propose EPR, a method for learning to retrieve good prompts for in-context learning, by using language models themselves as the scoring function. This allows us to train a light-weight retriever and substantially improve performance on three challenging tasks.

More broadly, given that large LMs are likely to play a prominent role in developing language understanding models, it is important to develop approaches for interacting with such models effectively. EPR can be viewed as a step in this direction.

Acknowledgement

We thank Ori Ram and Itay Itzhak for helpful suggestions and meaningful discussions. This research was supported in part by The Yandex Initiative for Machine Learning, and The European Research Council (ERC) under the European Union Horizon 2020 research and innovation programme (grant ERC DELPHI 802800). This work was completed in partial fulfillment for the Ph.D. degree of Ohad Rubin.

References

Appendix A

Since the selection procedure for in-context examples is dynamic, the number of in-context examples differs for each test instance. In Figure 5, we plot the histogram of the number of examples we fit in $C=2,048$ tokens.

Effect of hyperparameters

We test the effect of $k$ , the number of prompts labeled as positive or negative, and $L$ , the number of prompts retrieved by the unsupervised retriever. Table 8 shows that performance is generally robust w.r.t. these hyperparameters.

Training details

To train EPR, we use the Adam optimizer Kingma and Ba (2015) with batch size 120 and learning rate 1e-4 on eight RTX 3090 GPUs. We run training for 30 epochs. We used the default DPR hyperparameters without tuning. We used the final epoch of the model to perform model selection, and applied minimal learning rate tuning on the validation set of Break.

Risk assessment

Large language models have been shown to exhibit various kinds of bias (Bender et al., 2021). Since EPR is trained on the signal obtained from such large LMs, it might also exhibit these biases.

Additional examples

Tables 9, 10, and 11 provide more examples for cases where EPR is correct while CBR is incorrect along with the top-3 prompts for each method.