Reframing Instructional Prompts to GPTk's Language

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, Hannaneh Hajishirzi

Introduction

Prompting language models (LMs) Liu et al. (2021a) has made NLP modules accessible to non-expert users through plain text instructions of NLP tasks. (We focus on instructional prompts Efrat and Levy (2020), as opposed to exemplar prompts, which are already well studied Brown et al. (2020); Lu et al. (2021).) Such task instructions written by non-expert users are often long and contain abstract descriptions that are not easy for LMs to follow, as evidenced by their low performance Efrat and Levy (2020); Mishra et al. (2022). However, it is not clear whether this is due to the inherent difficulty of the target tasks or is an artifact of the complex phrasing of their language instructions.

In this analysis, we aim to understand the sensitivity of LMs to the framing of instructional prompts. In particular, we study several reframing techniques that rephrase instructional prompts so that LMs better understand the task. These techniques are motivated by empirical intuitions, such as the ease of understanding concise and concrete instructions and instructions that contain few abstract statements about human commonsense or background knowledge. For example, Fig.1 shows a reframing example that decomposes a task into multiple sub-tasks. The intended task here is writing questions that require entity coreference Dasigi et al. (2019). While GPT3 fails to solve the original task instruction (the yellow box at the top), it succeeds when the task is decomposed into four simpler sub-tasks.

We provide analysis for five diverse reframing techniques. These include incorporating low-level patterns about the target task, decomposing and itemizing instructions, stating the task constraints, and providing specialized instructions (examples in Table 2).

We analyze reframed instructions over 12 tasks from Natural Instructions Mishra et al. (2022), which contains a variety of NLP tasks and their instructions. Empirically, we compare the quality of LMs (GPT2/3 Radford et al. 2019; Brown et al. 2020) in two settings: raw vs. reframed instructions. In particular, we observe that the reframed prompts have notable performance gains over raw instructions (the gap between the red and blue trends in Fig.2), with average absolute gains of 14% and 17% when using GPT3-instruct in the few-shot and zero-shot setups, respectively. Furthermore, the average gains across tasks remain consistent across different models, hinting at the consistency of reframed prompts across architectures. This is in contrast to the widely used fine-tuning approaches, which need to be performed separately for each model. Reframing prompts by model designers can be particularly effective for large LMs (such as GPT3), where fine-tuning can be prohibitively expensive. In particular, we observe that reframed prompts on GPT3-instruct score roughly $17\%$ higher than GPT2Large supervised with $1k$ instances (i.e., $200\times$ more data).

While reframing instructions is not algorithmic, we view this systematic analysis as a preliminary stepping stone in this direction. We hope that this study will lead to the development of better algorithmic few-shot learning methods that generalize across models, thereby leading to more effective ways of reaping the investments already poured into creating massive LMs.

Contributions: (a) This work is inspired by the sensitivity of LMs to the framing of their instructional prompts. Driven by many empirical analyses, we identify several guidelines for model designers to reframe instructional prompts and provide illustrative use cases associated with each type of reframing technique. (b) Extensive experiments on diverse tasks show that reframing gives rise to superior performance and improved sample complexity over raw task instructions, across a range of model sizes. (c) Our experiments quantify the contribution of the prompting techniques and analyze various parameters that contribute to their success.

Related Work

Our work is related to designing discrete prompts and tuning continuous prompts in recent literature.

Discrete Prompts Constructing effective discrete prompts for language models to perform NLP tasks is an active area of research Schick and Schütze (2021); Le Scao and Rush (2021); Tam et al. (2021); Logan IV et al. (2021); Reynolds and McDonell (2021). Most such works focus on light-weight changes to the original prompt Liu et al. (2021a). Unlike the earlier literature, we focus on framings of complex instructions, which often lead to reframed prompts that are very different from the original raw instructions. While our proposed prompt reframing is not quite algorithmic, the principles behind it are relatively simple, which can hopefully motivate algorithmic solutions in the future.

Our goal is fundamentally different from the meta-training with instructions Mishra et al. (2022); Sanh et al. (2022); Wei et al. (2022). Such approaches depend on labeled data (language prompts for thousands of tasks) which can be costly to collect. Additionally, they require fine-tuning models which can be costly for larger LMs. Exploring effective framings of language instructions can provide alternative ways of utilizing LMs.

Continuous Prompts Tuning continuous prompts yields models that are more space-efficient than those obtained by fine-tuning model parameters Liu et al. (2021b); Lester et al. (2021). Despite being algorithmic, these methods require propagating gradient information across the whole architecture, leading to high computational costs, which is a key bottleneck for large LMs such as GPT3. While our proposal requires human intervention, it provides model designers with several relatively easy rules of thumb for writing language prompts that work effectively with large LMs.

This section describes our reframing principles and then describes the guidelines to operationalize them. Reframing principles are obtained by probing instructions of various tasks in the training split of Natural Instructions Mishra et al. (2022) to understand different failure modes associated with prompting in GPT3.

We observe that GPT3 fails to follow instructions when it is provided with long prompts that contain repeated information, abstract notions, analogies, or complex statements requiring human commonsense or domain knowledge (see examples in Table 2 and 4). Humans typically find these helpful for describing their tasks. However, some of this content, such as text intended to motivate the task or repetition for the sake of emphasis, might be unnecessary or even redundant for a model.

3.1 Reframing Principles

We observe that short prompts that contain concrete statements and avoid terms associated with background knowledge improve GPT3’s response to instructions. We recursively apply this observation and provide a set of reframing principles that address GPT3’s various failure modes with prompting, backed by extensive empirical analysis on GPT3. The principles loosely resemble how basic tasks are formulated and taught to kids.

Use Low-level Patterns: Instead of using terms that require background knowledge to understand, use various patterns about the expected output.

Itemizing Instructions: Turn descriptive attributes into bulleted lists. If there are any negation statements, turn them into assertion statements.

Break it Down: Break down a task into multiple simpler tasks, wherever possible.

Enforce Constraint: Add explicit textual statements of output constraints.

Specialize the Instruction: Customize the instructions so that they directly speak to the intended output.

We operationalize each of the above principles in terms of 5 reframing techniques. The degree of reframing (the amount of change applied to the raw instructions) varies significantly across the reframing techniques: the simplest one adds an enforcement statement at the end whereas the other extreme involves completely changing the task as a whole (e.g., decomposing it into multiple tasks).

3.2 Reframing Techniques

We explain each of the reframing techniques in three parts: (1) model failure states a potential weakness of LMs, with reference to examples in Table 4; (2) approach describes our suggested approach and the intuition behind it, according to our empirical observations; (3) example illustrates the application of the suggested technique, with reference to Table 2. In designing these techniques, we used a development set that contains all the positive examples included as part of the instructions of each task in Natural Instructions.

3.2.1 Pattern Reframing

Model failure: While humans have an incredible ability to understand and act on abstract descriptions, LMs tend to ignore most of them or just repeat the content of such instructions in their output (copy instruction in Table 4).

Approach: Find low-level patterns among the dev set examples and extrapolate those by adding similar patterns ($C_{1}$).

Example: Table 2 (row 1) illustrates the CosmosQA Huang et al. (2019) question generation task. The raw task instruction consists of various high-level statements such as “commonsense”, “complex”, “interesting”, “easy for humans and hard for AI machines”, whereas the reframed task consists of various low-level patterns about the expected output such as “what may happen”, “in the future, will..”, “why might”, which generally improve GPT3’s performance in generating valid questions.

3.2.2 Itemizing Reframing

Model failure: LMs cannot follow long paragraphs stating multiple requirements (first instruction bias in Table 4) and do not perform well when the requirements are formulated as negative statements (negation challenge in Table 4).

Approach: Turn long descriptions into bulleted lists of several statements ($C_{2}$). Additionally, turn negative statements into positive ones. For example, reformulate “don’t create questions which are not answerable from the paragraph” into “create questions which are answerable from the paragraph”.

Example: Table 2 (row 2) illustrates the WinoGrande Sakaguchi et al. (2020) sample generation task, where the raw instructions contain several requisites (do’s and don’ts) that are hard for models to follow. Reframing the instructions into a structured list improves the model response.
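As a minimal sketch of the itemizing step, the snippet below formats a set of already-positive requirement statements as a bulleted list. The function name `itemize`, the header text, and the second requirement are hypothetical illustrations and are not taken from the reframed prompts in Table 2; rewriting negations into assertions is assumed to be done by the model designer beforehand.

```python
def itemize(statements, header="Follow these requirements:"):
    """Format requirement statements as a bulleted list under a header.

    Assumption: each statement is already phrased as a positive assertion.
    """
    # Drop empty entries and normalize trailing periods and whitespace.
    cleaned = [s.strip().rstrip(".") for s in statements if s.strip()]
    bullets = "\n".join(f"- {s}." for s in cleaned)
    return f"{header}\n{bullets}"


requirements = [
    # Positive form of the negated statement quoted in the text.
    "Create questions which are answerable from the paragraph",
    # Hypothetical second requirement, for illustration only.
    "Write each question as a single sentence. ",
]
print(itemize(requirements))
```

Keeping one requirement per bullet is the point of the technique: each statement can then be checked against the model output separately.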

3.2.3 Decomposition Reframing

Model failure: Tasks with implicit multi-step reasoning are challenging for models, even after itemizing reframing (3.2.2) (multi-step task challenge in Table 4).

Approach: Wherever possible, decompose a task into multiple sub-tasks that can be executed either sequentially or in parallel ($C_{3}$), making each of them relatively easier for models.

Example: In Table 2 (row 3), the task is to generate samples for the WinoGrande Sakaguchi et al. (2020) dataset. Decomposition of the task into 5 sequential steps improves GPT3’s response.
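The sequential case of decomposition can be sketched as a simple chain in which the output of one sub-task becomes the input of the next. This is an illustrative sketch only: `run_decomposed`, the `{input}` placeholder convention, and the stub `toy_model` are assumptions introduced here, and `complete` stands in for whatever LM call a model designer uses.

```python
from typing import Callable, List


def run_decomposed(subtask_prompts: List[str], task_input: str,
                   complete: Callable[[str], str]) -> List[str]:
    """Run sub-task prompts in order, feeding each output to the next step.

    Each template must contain the placeholder "{input}" (a convention
    assumed for this sketch).
    """
    outputs = []
    context = task_input
    for template in subtask_prompts:
        prompt = template.format(input=context)
        context = complete(prompt).strip()
        outputs.append(context)
    return outputs


def toy_model(prompt: str) -> str:
    """Stub standing in for an LM: echoes the last prompt line plus '!'."""
    return prompt.splitlines()[-1] + "!"


steps = ["Step 1 instruction\n{input}", "Step 2 instruction\n{input}"]
print(run_decomposed(steps, "abc", toy_model))
```

Sub-tasks that do not depend on each other could instead be issued independently, which corresponds to the parallel case mentioned above.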

3.2.4 Restraining Reframing

Model failure: A common mistake of GPT3 occurs when the task definition deviates from its pre-training objective (predicting next words) (conventional-task bias in Table 4). For example, when predicting question types, GPT3 often answers the question instead of generating its type. Similarly, in reading comprehension tasks, GPT3 sometimes answers a question based on its background knowledge instead of answering from the given passage.

Approach: Append a statement to the task instruction that expresses a constraint about the output generation ($C_{4}$).

Example: Table 2 (row 4) illustrates the DROP Dua et al. (2019) answer type generation task, where the objective is to generate a valid answer type among “Number”, “Date” and “Span” for a given question. Adding an enforcement statement tends to improve the model output by constraining it to the provided types.
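A minimal sketch of restraining reframing follows, assuming the answer types named in the text. The wording of the appended enforcement statement and the helper names `restrain` and `is_valid` are hypothetical; the actual statement used in Table 2 may differ.

```python
ALLOWED_TYPES = ("Number", "Date", "Span")  # answer types named in the text


def restrain(instruction, allowed):
    """Append an enforcement statement listing the permitted outputs.

    The phrasing of the statement is an assumption made for this sketch.
    """
    options = ", ".join(f'"{a}"' for a in allowed)
    return f"{instruction.rstrip()}\nAnswer with exactly one of: {options}."


def is_valid(output, allowed):
    """Check whether a model output respects the stated constraint."""
    return output.strip() in allowed


prompt = restrain("Generate the answer type of the question.", ALLOWED_TYPES)
print(prompt)
print(is_valid(" Date\n", ALLOWED_TYPES), is_valid("1999", ALLOWED_TYPES))
```

The validity check is not part of the reframing itself; it is one way a designer might measure how often the constraint is respected on development examples.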

3.2.5 Specialization Reframing

Model failure: LMs ignore generic instructions such as “answer the following question” and sometimes misconceive the output format when the given instruction contains redundant text (misconceive output format in Table 4).

Approach: Reformulate the instructions so that they directly describe the low-level task that needs to be done, and drop all repeated and generic statements ($C_{5}$).

Example: Table 2 (row 5) illustrates a task of numerical reasoning problems that involve natural language sentences describing additions and subtractions. The reframed prompt specializes the generic task instruction (“calculate answer”).

We evaluate the proposed reframing techniques on the evaluation tasks from Natural Instructions Mishra et al. (2022), which consists of 12 tasks categorized into 6 categories. Following the original setup, we use ROUGE-L Lin (2004) as the evaluation metric in our experiments. Table 2 contains the list of evaluation tasks used in this study.
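For readers unfamiliar with the metric, the following is a simplified sketch of a sentence-level ROUGE-L F-measure based on the longest common subsequence (LCS). It assumes lowercased whitespace tokenization and equal weighting of precision and recall; these are simplifications made here, and the scoring scripts used for the reported numbers may tokenize, stem, or weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 over LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat", "the cat sat down"))
```

In this example the LCS has length 3, so precision is 3/3 and recall is 3/4.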

Empirical Results

For evaluation, we use various models of the GPT family: GPT2, GPT2Large, GPT2XL, GPT3 and GPT3-instruct Brown et al. (2020); Radford et al. (2019) (see https://beta.openai.com/docs/engines/), as well as BART-base Lewis et al. (2020). We evaluate the models according to the following setups:

GPT$k$ w/ raw instructions: We follow the setup of Mishra et al. (2022), who experiment with GPT3-instruct on their raw instructions. Overall, the prompts provided to the model consist of three segments (in this order): (a) task instructions, (b) examples (inputs and outputs) and (c) a new input for which we expect the model’s response. We experiment with three variants of the baselines, depending on the number of examples in their prompts: (i) Few-Shot: we experiment with 5 examples, which is a more realistic few-shot setup. (These 5 positive examples are part of the instructions in each task of Natural Instructions, and sometimes the number of positive examples is less than 5.) (ii) Max. ex.: in another variant, we use as many examples as fit within GPT’s token limit. (iii) Zero-Shot: in this setup, we do not incorporate any examples while prompting the models with the instructions. Finally, we build variants of these baselines by conducting ‘schema selection’, where we experiment with 12 different encodings of the instruction Mishra et al. (2022) and select the best-performing one for each task.

GPT$k$ w/ reframed instructions: The model designer applies various reframing techniques (Section 3.2) on tasks in Natural Instructions. Similar to the raw instructions baseline, we use 5 examples in our reframed tasks. In our setup, the model designer is an author who follows the guidelines (§3.2) by observing 5 examples in the development set and reframes the instructions. This process was done in interaction with GPT3-instruct via the development examples and took roughly $15$ minutes per task and per reframing type. Similar to the setup with raw instructions, the final encoded prompts contained a concatenation of the following (in this order): reframed instructions, positive examples and the instance input.

GPT$k$ w/ calibration: This method extends the recent calibration approach introduced by Zhao et al. (2021), which involves compensating for various model-specific biases in a few-shot setup, such as recency bias and majority bias. Zhao et al. (2021) perform calibration by masking input instances with ‘N/A’ tokens, estimating the bias using model prediction probabilities and then compensating for the bias while feeding the input instance during prediction. We extend calibration to our instruction setup by masking the input instance in our instruction encoding with an ‘N/A’ token and calibrating biases associated with GPT3-instruct.

Supervised baseline: While the conventional setup of supervised learning has been successful for reasonably sized models, it is prohibitively expensive for large models like GPT3. We train medium-sized LMs (e.g., BART-base Lewis et al., 2020) on $5k$ examples of each task and evaluate on unseen instances of the corresponding task.
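The three-segment prompt layout used in the raw and reframed setups (instructions, then examples, then the new input) can be sketched as below. The `Input:`/`Output:` field labels and the function name `build_prompt` are assumptions made for illustration; they are only one of many possible encodings and are not claimed to be the schema used in the experiments.

```python
def build_prompt(instructions, examples, new_input, k=5):
    """Concatenate instructions, up to k examples, and the new input.

    k=5 mirrors the few-shot setup; k=0 gives the zero-shot setup.
    The "Input:"/"Output:" labels are a hypothetical encoding.
    """
    parts = [instructions.strip()]
    for ex_input, ex_output in examples[:k]:
        parts.append(f"Input: {ex_input}\nOutput: {ex_output}")
    # The final segment leaves the output open for the model to complete.
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)


examples = [("2 + 3", "5"), ("7 - 4", "3")]  # toy examples, not from the dataset
few_shot = build_prompt("Compute the result.", examples, "6 + 1")
zero_shot = build_prompt("Compute the result.", examples, "6 + 1", k=0)
print(few_shot)
```

Because tasks sometimes have fewer than 5 positive examples, slicing with `examples[:k]` simply uses whatever is available.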

A summary of our experiments is provided in Fig.2, which shows the performance of the reframed instructions on various models, compared to our baselines. (Scripts to reproduce our results are public.) Furthermore, Table 3 provides a more granular comparison of few-shot, zero-shot and supervised models per task category, all on GPT3-instruct and in terms of ROUGE-L. Below are several takeaways from these experiments.

Table 3 shows that reframing outperforms the original raw instruction baseline with 14% (44% $\rightarrow$ 58%) and 17% absolute gains (33% $\rightarrow$ 50%) in few-shot and zero-shot setups, respectively. Additionally, it outperforms the schema selection baseline with 11% (47% $\rightarrow$ 58%) and 13% absolute gains (37% $\rightarrow$ 50%) in few-shot and zero-shot setups, respectively. It also outperforms the calibration baseline and the max-examples with schema selection baseline by 12% (46% $\rightarrow$ 58%) and 8% (50% $\rightarrow$ 58%), respectively. The gains are spread across task categories, with the highest gains in the Answer Generation (AG), Classification (CF), and Verification (VF) categories.
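Since the gains above are absolute (percentage-point) differences, a short check of the arithmetic may help. The scores are copied from the paragraph above; the dictionary keys are labels chosen here for readability.

```python
# (baseline, reframed) average ROUGE-L percentages quoted in the text.
scores = {
    "few-shot vs. raw": (44, 58),
    "zero-shot vs. raw": (33, 50),
    "few-shot vs. schema selection": (47, 58),
    "zero-shot vs. schema selection": (37, 50),
    "few-shot vs. calibration": (46, 58),
    "few-shot vs. max. ex. + schema selection": (50, 58),
}

# Absolute gain = reframed score minus baseline score, in percentage points.
gains = {name: reframed - base for name, (base, reframed) in scores.items()}

for name, gain in gains.items():
    base, _ = scores[name]
    # Relative gain is shown only for contrast with the absolute figure.
    print(f"{name}: +{gain} points absolute, {100 * gain / base:.1f}% relative")
```

The absolute gains come out to 14, 17, 11, 13, 12 and 8 points, matching the figures reported above.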

As Fig.2 shows, the reframed instructions consistently outperform raw task instructions across various models. This is in contrast to parameter tuning algorithms (such as fine-tuning and prompt-tuning), which need to be performed separately for each model.

The average performance of the supervised baselines is higher than that of the reframing method. However, in the Answer Generation (AG) and Incorrect Answer Generation (IAG) categories, reframing in the few-shot setup outperforms the supervised baselines by 11% and 4% absolute, respectively. A similar observation can be made in Fig.2, where reframed prompts with GPT3-instruct have notably higher performance than the supervised mid-size model (GPT2Large), which uses $200\times$ more data.

5.2 Analyses

Fig.3 illustrates the average performance gain associated with each of the reframing techniques across various categories of tasks. We apply various reframing techniques on each task of Natural Instructions. We observe that Specialization Reframing, Restraining Reframing and Pattern Reframing improve model performance for a wider range of tasks. We also observe that Restraining Reframing contributes the most to Classification tasks, whereas Specialization Reframing is dominant on Answer Generation tasks. Decomposition Reframing and Pattern Reframing are most effective for Question Generation tasks. Since the dominant reframing techniques vary across task categories, we recommend that users experiment with all five reframing techniques for their tasks.

We observe that reframed instructions are usually shorter than the original instructions. A natural question that might arise is whether there is a correlation between the length reduction and the performance improvement, as a result of applying reframing. Fig.4 shows that performance gain is not always proportional to the length difference across various evaluation tasks (dots in the figure) in Natural Instructions. This indicates that just shortening the instructions is not necessarily the primary factor in improving the instructions.
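One way to quantify the relationship discussed above is a Pearson correlation between per-task length reduction and per-task performance gain. The sketch below only provides the computation; the numbers in the usage lines are made-up placeholders that demonstrate the call and are not values from Fig.4 or from any experiment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# PLACEHOLDER values for illustration only (not measured data).
length_reduction = [10, 40, 25, 60]  # e.g., tokens removed per task
performance_gain = [12, 5, 20, 9]    # e.g., ROUGE-L points gained per task
print(pearson(length_reduction, performance_gain))
```

A coefficient near zero would be consistent with the observation that shortening alone does not explain the improvement; a coefficient near one would suggest the opposite.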

We analyze the failures of GPT3 on raw vs. reframed instructions. We sample 100 examples across various tasks for the analysis. Fig.5 illustrates the distribution of errors. As can be seen, reframing introduces few additional errors (4%), while correcting a major portion of the mistakes on raw instructions (24%). We further manually analyze this subset (mistakes of raw instructions corrected by reframing) to better understand the dominant error patterns and the reframing that corrects them (Table 4). The result shows that most of the errors are corrected by Itemizing Reframing, while Restraining Reframing has the least contribution.

Concluding Remarks

Inspired by GPT3’s poor performance in following task instructions, we study reframing them. We introduce five approaches that reformulate task instructions to make them easier, while maintaining their human readability. Manually applying reframing on 12 tasks, we study its benefits compared to using raw instructions or fine-tuning mid-sized models. Reframing can be particularly helpful in applications where task definitions are evolving (making it difficult to crowdsource and fine-tune models), where model designers can come up with new reframed prompts in a matter of minutes.

We hope that this study will inspire further investigation of potentially-unconventional approaches to exploit the knowledge harnessed by increasingly large LMs where fine-tuning and its alternatives are prohibitively expensive.

We thank OpenAI for providing academic access to the GPT3 API, the Beaker team for their support with experiments and the anonymous reviewers for their helpful feedback. The support of DARPA SAIL-ON, DARPA CHESS program, NSF IIS-2044660, ONR N00014-18-1-2826, and Paul G. Allen Foundation is gratefully acknowledged.

Appendix A Supplemental Material

Table 5 contains examples of error patterns where model performance improves with reframing over raw instructions. Table 5 exemplifies each type of error mentioned in Table 4.

In our qualitative analysis (Section 5.2 and Figure 5), we find that 4% of the errors are caused by reframing of raw instructions and 31% of the errors are failures of raw instructions that are retained by reframing. Table 6 shows the dominant patterns among such errors.

A.2 GPT3-instruct Outputs to Raw and Reframed Instructions

We explain each of the reframing techniques by illustrating how they solve various error patterns produced by raw instructions.

Table 7 shows how the raw instruction in its detailed form cannot help GPT3 produce valid questions for the CosmosQA question generation task. Table A.2.1 illustrates how reducing the raw instruction content (retaining only the Definition) still does not help the model perform the task, and how reframing helps the model perform the task. Table 9 and A.2.1 show similar behavior for the MCTACO question generation task.