Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations

Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, Taeuk Kim

Introduction

Large-scale language models Rae et al. (2021); Chowdhery et al. (2022); Smith et al. (2022); Thoppilan et al. (2022) have shaped the NLP scene by introducing in-context learning (ICL) Brown et al. (2020) as a novel approach to adapt language models for downstream tasks without explicit fine-tuning. ICL enables language models to learn and predict from task-specific prompts that contain demonstrations in natural language format, even though the language models were only trained to predict the next word token. Inspired by this discovery, a flurry of recent work has investigated ways to explain and exploit the ICL mechanism (Schick and Schütze (2021a); Lu et al. (2022); inter alia), but the mechanism remains elusive.

Min et al. (2022b) have recently re-evaluated the role of input-label correspondence in demonstrations for ICL. Specifically, the authors have shown that the correct mapping between an input and its label contributes less to the final performance than previously thought, compared to other aspects such as the format of demonstrations and the awareness of the input and label space. This finding is intriguing and has been sensational, as it runs counter to the expectation of how statistical learning typically works in supervised settings, and it therefore shows the potential of exploiting (few-shot) in-context learning given no real training data. For example, prior work established the strong impact of example ordering (Zhao et al., 2021); it therefore seems contradictory that in-context learning would be less sensitive to the correctness of label demonstrations, which forms the basis of supervised learning.

However, we encountered cases where our observations are inconsistent with the recent finding on the matter (Figure 1). Specifically, we found that the difference between the performance from the ground-truth label demonstration and that from entirely incorrect labels was as large as 80% (accuracy) for the hate speech dataset de Gibert et al. (2018) on GPT-J Wang and Komatsuzaki (2021). Similar observations were found with the larger GPT-3 Brown et al. (2020) model and other datasets (TREC Li and Roth (2002)). These cases illustrate how sensitive in-context learning can be to label demonstrations depending on the ICL settings. Thus, we cast doubt on whether the trend generalizes to diverse configurations, raising a call for an in-depth analysis of the phenomenon.

In this paper, we revisit the findings of Min et al. (2022b) and take a closer look into the importance of ground-truth labels for in-context learning. First, we point out limitations of the existing work. Then, we introduce novel metrics, namely Label-Correctness Sensitivity and Ground-Truth Label Effect Ratio (GLER), to reveal that the input-label correspondence plays a more vital role in contextual demonstration than previously considered. Furthermore, we show that the trend contradictory to the previous discovery becomes salient if we diverge the experimental settings (e.g., datasets, metrics, and templates) from the previous work. We observe the same trend in various language models, such as GPT-J and GPT-3 Brown et al. (2020).

In addition, this paper uses statistics to provide a systematic and complementary perspective to the existing findings on the label-demonstration impact. To be specific, we combine linear regression and auxiliary metrics to conduct all-around and deeper analyses on how the ICL classification performance changes against label-demonstration corruption. To do so, we define the notion of sensitivity to quantify the degree to which the downstream classification performance changes when a model is subject to a fixed amount of label corruption. As a result, we demonstrate several noticeable patterns that support the claim that there is a considerable relationship between the performance and label correctness. It is worth noting that this trend was not clearly visible in the previous work, where the results of each dataset are macro-averaged rather than individually analyzed.

However, insensitivity, or robustness, towards the incorrectness of label demonstrations is a useful property to have in many situations. For example, when augmenting an extremely small number of (e.g., fewer than four) examples using data augmentation techniques, exhibiting performance resilience towards prompt templates that consist of noisy synthetic examples as demonstrations is desirable. We further analyze how different factors of ICL, such as the inference method, the underlying language model, and the adoption of advanced ICL strategies, affect the performance sensitivity towards noise in input-label demonstrations, paving the way for a new approach to exploiting the demonstration insensitivity.

In summary, our contributions are as follows.

We re-examine the recent findings on the phenomenon that the ICL performance is insensitive towards input-label demonstrations.

We propose two new quantifiable metrics, sensitivity and GLER, to measure the impact of ground-truth label demonstrations on ICL.

We conduct a thorough examination of how different components of ICL could impact the model’s insensitivity towards label noises, allowing future work to exploit such property.

Looking Deeper into Ground-Truth Labels

Demonstrations of ground-truth labels, correctly paired with inputs, have been known to be a crucial factor of supervised learning, but a recent work by Min et al. (2022b) purportedly revealed the possibly counter-intuitive nature of label demonstrations in in-context learning (ICL). (Here, label demonstrations refer to the demonstration of input-label correspondence and not the demonstration of label space.) Specifically, the findings implied that the correctness of input-label correspondence in in-context demonstrations is not as important as we have thought. We name this phenomenon input-label insensitivity. Although the finding was supported by reasonably large-scale experiments, covering various experimental variables such as datasets, language models, in-context learning types, etc., we found that, through deeper analysis of the experiments, input-label insensitivity is not consistent across all experimental settings.

This section highlights the limitations of the existing work, proposes new metrics to quantify the impact of input-label correspondence, and finally presents deeper analyses of the ICL experiments utilizing the newly proposed metrics.

Min et al. (2022b) showed that replacing ground-truth labels in prompt demonstrations with incorrect labels marginally affects the mean-aggregated overall performance on selected datasets. Although the input-label insensitivity phenomenon was less prominent on GPT-J with the direct ICL method, ICL still performed better when entirely incorrect labels were given than in the absence of demonstrations (the zero-shot baseline), allegedly supporting the input-label insensitivity idea (Min et al., 2022b). However, we argue that there are mainly two limitations to the existing claim.

The existing claim suffers from over-generalization in two regards: (1) the mean-aggregated results fail to capture the insensitivity behavior in individual tasks and (2) the experimental settings proposed in the existing work are not general enough to fully support the claim. Mean-aggregation does not paint the full picture without information on the variance. Furthermore, individual analyses on large-scale tasks are needed to obtain precise insight into input-label sensitivity. Our deeper analyses on the ICL experiments (§2.4) provide more evidence of this claim.

The second over-generalization is supported by the existence of a counter-example: higher input-label sensitivity is observed under slightly varied but equally valid experimental settings (Figure 2). The subfigure on the left corresponds to the result of an existing set of experimental settings, where the Noisy Channel method Min et al. (2022a) was used for ICL, the macro-F1 score served as the evaluation metric, and the five classification datasets listed in the existing work were evaluated. The subfigure on the right was obtained using the Direct method with accuracy as the metric, and the results were aggregated from all 17 datasets listed in the existing work (see Appendix A).

Existing work relies on human judgement to determine the input-label sensitivity, which could be subjective. Furthermore, we are not only interested in whether the input-label insensitivity phenomenon exists but also how insensitive the ICL is towards the demonstrations, enabling us to exploit the phenomenon. Hence, a set of systematic quantification methods is needed to perform the deeper analyses.

2.2 Key Concepts

This subsection establishes key concepts and notations related to our analysis on the impact of input-label demonstrations and the downstream ICL performance. $x$ and $c$ denote the input and the label respectively. They exist in each respective input ($\mathcal{X}$) or label space ($\mathcal{C}$) associated with the dataset or task. A language model $P$ predicts the next token given the preceding tokens: $P(x_{t}|x_{<t})$. In ICL, a prompt $\mathcal{P}$ is designed to elicit particular behaviors from the language model. For example, to utilize the language model as a text classifier, a prompt template $\mathcal{T}$ takes a set of examples $\mathcal{D}_{\text{ex}}=\{(x_{1},c_{1}),...,(x_{k},c_{k})\}$ and a test input $x$ to produce the prompt $\mathcal{P}$. The prompt is then fed into the language model to produce the most plausible continuation: $\text{argmax}_{x^{\prime}}P(x^{\prime}|\mathcal{P})$. A task-specific verbalizer $\mathcal{V}$ is designed to interpret the generated output $x^{\prime}$ into the label space $\mathcal{C}$. We measure the performance $y$ of the language model $P$ and the prompt template $\mathcal{T}$ on a test set $\mathcal{D}_{\text{test}}$.

Our analyses mainly involve manipulating $\mathcal{T}$ and the example set $\mathcal{D}_{\text{ex}}$ to set up baselines and conduct ablation studies. Key experimental set-ups include the following. No Demo, also denoted as “zero-shot”, represents zero-shot predictions, where the prompt template $\mathcal{T}$ ignores $\mathcal{D}_{\text{ex}}$ and only uses the test input $x$: $P(c|x)$. The example set $\mathcal{D}_{\text{ex}}$ in $\alpha\%$-Correct consists of $k\times\alpha/100$ correct input-label pairs and $k\times(1-\alpha/100)$ incorrect pairs, where $0\leq\alpha\leq 100$. For Random Label, the labels $c$ in $\mathcal{D}_{\text{ex}}$ are replaced by uniform samples from the label space $\mathcal{C}$, and it is one of the key baselines of our studies. Additional details on the set-up variations are presented in Appendix A.
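The set-ups above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `corrupt_labels` and `build_prompt`, the seeding scheme, and the bare "input, newline, label" template are hypothetical and are not taken from the released implementation.

```python
import random

def corrupt_labels(examples, label_space, pct_correct, seed=0):
    """Return a copy of `examples` in which pct_correct% of pairs keep their true label."""
    rng = random.Random(seed)
    k = len(examples)
    n_correct = round(k * pct_correct / 100)
    # Positions that keep their ground-truth label.
    keep = set(rng.sample(range(k), n_correct))
    corrupted = []
    for i, (x, c) in enumerate(examples):
        if i not in keep:
            # Replace with any label other than the true one (an incorrect pair).
            c = rng.choice([label for label in label_space if label != c])
        corrupted.append((x, c))
    return corrupted

def build_prompt(examples, test_input):
    """Hypothetical minimal template: each demonstration is the input followed by its label."""
    blocks = [f"{x}\n{c}" for x, c in examples]
    blocks.append(test_input)  # the model is asked to continue after the test input
    return "\n\n".join(blocks)
```

Calling `corrupt_labels(examples, labels, 75)` on k = 16 examples leaves 12 pairs correct and makes 4 incorrect; passing an empty example list to `build_prompt` corresponds to the No Demo setting.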

2.3 Metrics for Measuring the Impact of Input-Label Demonstrations

This section proposes two new metrics to quantify the impact of input-label demonstrations in ICL.

We define label-correctness sensitivity, or sensitivity for short, as the degree to which the downstream classification performance changes when the model is subject to a fixed amount of label corruption. Sensitivity in the context of in-context learning demonstrations can be computed by conducting a single-variable linear regression analysis of a performance metric $y$ (e.g., accuracy or F1-score) against the percentage of examples that are labelled correctly ($s$):

$$y=\beta_{0}+\beta_{1}s$$

where $\beta_{0}$ is the bias and $\beta_{1}$ is the coefficient of label correctness. The scalar value of the weight parameter $\beta_{1}$ is interpreted as the sensitivity measure. The data points for linear regression were obtained by following the experimental protocol proposed by Min et al. (2022b). The sensitivity measure can be interpreted as a linearly interpolated measure of performance degradation for each unit decrease in label correctness.
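As a minimal sketch of the regression described above, the slope can be obtained with an ordinary least-squares fit of performance against the percentage of correct labels. The function name `sensitivity` and the accuracy values below are hypothetical illustrations, not results from the paper.

```python
def sensitivity(pct_correct, performance):
    """Least-squares fit y = beta_0 + beta_1 * s; beta_1 is the sensitivity."""
    n = len(pct_correct)
    mean_s = sum(pct_correct) / n
    mean_y = sum(performance) / n
    # Slope = covariance(s, y) / variance(s).
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(pct_correct, performance))
    var = sum((s - mean_s) ** 2 for s in pct_correct)
    beta_1 = cov / var
    beta_0 = mean_y - beta_1 * mean_s
    return beta_0, beta_1

# Hypothetical accuracies (%) measured at 0, 25, 50, 75 and 100% correct labels.
s = [0, 25, 50, 75, 100]
y = [40.0, 47.5, 55.0, 62.5, 70.0]
beta_0, beta_1 = sensitivity(s, y)
```

With these made-up points the fit is exact, so `beta_1` is 0.3: each percentage point of label corruption costs 0.3 points of accuracy.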

Another way to understand the impact of labels, namely correct or ground-truth labels, is to quantify how much the ground-truth labels improve the ICL performance compared to the random-label baseline. The higher the gap, the bigger the impact the ground-truth labels have on the performance. The gap is then normalized by the performance difference between ground-truth labels and the absence-of-demonstration baseline (zero-shot):

$$\text{GLER}=\frac{y_{\text{GT}}-y_{\text{RL}}}{y_{\text{GT}}-y_{\emptyset}}\tag{1}$$

where $y_{\text{GT}}$ is the ground-truth label performance, $y_{\text{RL}}$ the random-label baseline (Random-Label), and $y_{\emptyset}$ the zero-shot performance. The denominator in Equation 1 is intended to allow the GLER metric to be compared across different tasks. Additionally, we clip GLER to be bounded between 0 and 1.
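The ratio and its clipping can be written directly from the definition above. The function name `gler` is illustrative, and raising an error when the ground-truth and zero-shot performances coincide is an assumption of this sketch; the text does not say how that case is handled.

```python
def gler(y_gt, y_rl, y_zero):
    """Ground-Truth Label Effect Ratio, clipped to [0, 1]."""
    denom = y_gt - y_zero
    if denom == 0:
        # Assumption: the ratio is undefined when demonstrations give no gain over zero-shot.
        raise ValueError("GLER is undefined when y_GT equals the zero-shot performance")
    ratio = (y_gt - y_rl) / denom
    return min(1.0, max(0.0, ratio))
```

For example, with hypothetical accuracies of 70 (ground-truth), 55 (random labels), and 40 (zero-shot), half of the gain over zero-shot is attributed to the correct labels, so GLER is 0.5.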

2.4 Deeper Analyses

This subsection performs deeper analyses using the aforementioned metrics to reveal additional insights into input-label insensitivity.

All of our experiments mentioned in the rest of the paper generally follow the experimental settings in Min et al. (2022b), where $\alpha\%$-Correct is mainly utilized to conduct sensitivity analysis. However, there are key differences: (1) we do not employ label-length normalization (in our experiments, length normalization does not always increase the performance), and there are minor differences in the design of the template $\mathcal{T}$, including how the separator token interacts with the model and the dataset-specific implementation of the data preprocessor; (2) we use accuracy, instead of F1-score, as the primary evaluation metric for ICL performance. We do, however, report the full results in Appendix A, along with the full details of the setup.

2.4.2 Label Correctness Does Affect Performance

To analyze the overall sensitivity of performance under the variation of label correctness, we aggregate sensitivities across all 17 classification datasets; the results are shown in Table 1. The results show that the aggregated sensitivity is significantly high with good fit (in the range of 0.81-0.86) for all configurations. When tested on our specific setup, the sensitivity was as high as 0.309, implying that, on average, accuracy dropped by 0.309 percentage points for each percentage-point drop in label correctness.
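As a worked reading of this figure, and assuming the fitted line holds over the whole range of label correctness, moving from fully correct ($s=100$) to fully incorrect ($s=0$) demonstrations corresponds to the following predicted change in accuracy, in percentage points:

```latex
\Delta y = \beta_{1}\,\Delta s = 0.309 \times (0 - 100) = -30.9
```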

The trend of sensitivity, which is more apparent in our quantitative analysis, may have been overlooked due to the relative dwarfing effect from zero-shot (or “no demo”) results in prior studies. The results also show that the sensitivity is lower in the Channel method, suggesting that sensitivity can be significantly lowered with the employment of more advanced ICL methods. (We hypothesize that this observation is attributable to the fact that, while generating longer sentences, the prediction distribution from the Channel model is more affected by the pre-trained prior than by the current context.)

2.4.3 Label Demonstration Impact is Highly Varied Across Tasks and Settings

Although the aggregate analysis shows a general trend of sensitivity towards demonstration correctness, individual analyses offer deeper insight into the distribution of task sensitivities. Individual sensitivity plots are illustrated in Figure 4. Sensitivity can vary from small negative values (indicating increasing performance under increasing label corruption) to values as high as 0.815 (for the hate speech dataset), suggesting that summarizing the trend for all tasks and datasets may be difficult and that certain datasets may possess distributional properties that allow models to more easily exploit label demonstrations. This high-variance observation is valid for other metrics (GLER and the ground-truth label performance) as well. Further analyses are available in §3.

2.4.4 Sensitivity and Task Difficulty

Tasks where the model struggles to exploit in-context demonstrations may exhibit low sensitivity towards them, since understanding patterns in demonstrations is inherently linked with the ability to absorb demonstrative label-supervision. To confirm our theory, we conduct an analysis on the sensitivities of 17 datasets against the task difficulty. We define task difficulty as the relative performance of ground-truth label demonstrations compared to a baseline. Specifically, relative performance $y_{\text{rel}}$ is computed by $y_{\text{rel}}=y_{\text{GT}}-y_{\text{baseline}}$. We consider the random baseline.

Our analysis (Figure 16) shows that the model’s performance sensitivity is strongly related to the difficulty of the task. The tasks where the model exhibits low sensitivity (i.e., $<0.1$) struggle to achieve meaningful classification performance. This suggests that designing experiments with datasets that can be meaningfully solved using in-context learners may be more important than previously understood. Hence, the sensitivity measure by itself is insufficient for benchmarking the impact of input-label demonstrations.

When Do the Ground-Truth Labels Actually (Not) Matter?

As revealed in our deeper analyses (§2.4), many factors including datasets and the choice of the ICL method can significantly affect the label-sensitivity. Gaining more understanding of the mechanism by which the input-label correspondence impacts the downstream ICL performance could enable us to systematically exploit the label-insensitivity phenomenon. For example, few-shot ICL models can be improved to tolerate label noises from synthetic data samples generated in the joint input and label spaces Yoo et al. (2021).

To understand the conditions that reduce the label sensitivity, we conduct a series of experiments that investigate how different factors contribute to the phenomenon, quantified using the metrics proposed in §2.3. Namely, we consider the particular technical choice in carrying out ICL (whether to employ the noisy channel method (Min et al., 2022a) and the likelihood calibration (Zhao et al., 2021)), various properties of the prompt template (the number of in-context examples and the verbosity), and the model size.

Recall that the sensitivity measure is the nominal coefficient of the linear line fitted on the performance-versus-label-corruption data points. Since baselines can vary depending on the experimental setting, hyperparameters and the dataset, comparing the nominal sensitivity alone can be inconclusive, as the same degree of absolute improvement has different implications depending on the baseline level. For example, under the same conditions (GPT-J and Direct inference), the random-label accuracy baseline is 28.08 for TREC and 53.58 for SST2. To account for the variations in the characteristics of the task and the model, we consider GLER and the ground-truth label performance as the auxiliary measures in the following studies.

3.1 Techniques for In-context Learning

In-context learning, as first proposed by Brown et al. (2020), is a straightforward parameter-free approach, where the downstream task of interest is expressed as natural text demonstrations and used to conditionally generate from a language model. Recently, Min et al. (2022a) proposed Noisy Channel (denoted as Channel) that exploits the language generation capability of language models for discriminative tasks using the Bayes’ Theorem. We compare the two ICL methods on all three (sensitivity, GLER, and the ground-truth label ICL accuracy) measures.
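The difference between the two inference methods can be illustrated with a toy scorer. This is a sketch under assumptions: `log_prob(continuation, context)` stands in for a language-model call, the score table is made up, and the Channel rule shown here assumes a uniform prior over labels so that Bayes' rule reduces to comparing the likelihood of the input given each label.

```python
def direct_predict(log_prob, x, labels):
    """Direct: choose the label c that maximizes log P(c | x)."""
    return max(labels, key=lambda c: log_prob(c, x))

def channel_predict(log_prob, x, labels):
    """Channel: choose the label c that maximizes log P(x | c) (uniform label prior assumed)."""
    return max(labels, key=lambda c: log_prob(x, c))

# Made-up log-probabilities keyed by (continuation, context); not real model outputs.
TOY_SCORES = {
    ("positive", "great movie"): -0.9,
    ("negative", "great movie"): -0.5,
    ("great movie", "positive"): -2.0,
    ("great movie", "negative"): -6.0,
}

def toy_log_prob(continuation, context):
    return TOY_SCORES[(continuation, context)]

labels = ["positive", "negative"]
direct_choice = direct_predict(toy_log_prob, "great movie", labels)
channel_choice = channel_predict(toy_log_prob, "great movie", labels)
```

In this toy table the two rules disagree: the Direct scorer prefers "negative" because that label token happens to be more likely, while the Channel scorer prefers "positive" because it explains the input better.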

Results (Figure 7) show that Channel reduces the label sensitivity on average compared to the original Direct method while maintaining the accuracy at similar levels. The label-insensitivity effect is observed in both GPT-NeoX and GPT-J.

Another recent advance in ICL, namely Calibrate Before Use (CBU), involves calibrating the output likelihood of the word tokens that correspond to the labels Zhao et al. (2021). We conduct the same set of experiments with CBU applied and report all three metrics. As shown in Figure 9, the calibration technique reduces the label sensitivity while generally improving the ICL performance on both GPT-J and GPT-NeoX. Applying CBU can be an effective way to reduce label sensitivity while not sacrificing the performance.
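A rough sketch of the calibration idea follows. It assumes the common recipe of estimating the model's label bias from a content-free input (such as "N/A") and dividing it out; the function name and the plain renormalization used here are simplifications for illustration, not the exact procedure of Zhao et al. (2021) or of our implementation.

```python
def calibrate(label_probs, content_free_probs):
    """Divide each label probability by its content-free estimate, then renormalize."""
    scaled = [p / p_cf for p, p_cf in zip(label_probs, content_free_probs)]
    total = sum(scaled)
    return [v / total for v in scaled]

# Hypothetical numbers: the model favors label 0 even for a content-free input.
content_free = [0.75, 0.25]
raw = [0.6, 0.4]
calibrated = calibrate(raw, content_free)
```

Here the raw prediction is label 0, but once the prior bias towards label 0 is removed, the calibrated distribution is roughly [0.33, 0.67] and the prediction flips to label 1.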

3.2 Prompt Templates

Various design choices in in-context prompt templates have significant impact on the downstream ICL performance Reynolds and McDonell (2021). A well-designed and verbose prompt template (e.g., a prompt with detailed description of the task) could allow in-context label demonstrations to have relatively less impact on ICL, thereby reducing the label-demonstration sensitivity.

This section mainly explores (1) the number of in-context examples and (2) the level of task description details. To quantify the impact of the number of in-context examples, we conduct the same set of experiments with varying number of in-context examples, ranging from 1 to 16. Results (Figure 11(a)) unsurprisingly show that the number of prompt examples is positively linked to all three metrics. Although sensitivity rises with the number of examples, this is due to the final ICL performance and the impact of ground-truth labels improving with more demonstration examples.

We also hypothesize that the level of task detail contained in the prompt template serves to relatively weaken the label-demonstration impact. Results in Figure 11(b) confirm our hypothesis.

3.3 Model Sizes

The scale of the language model could influence how susceptible the model is to label noise within input-label demonstrations. The larger the model is, the more prior knowledge the model could leverage to reduce label sensitivity. To study whether this is the case, we analyze five different sizes of GPT-style language models, ranging from GPT-2 XL to GPT-3. (Note that the general trend along the model scale persists with mixed language model architectures, as reported by Srivastava et al. (2022).) The choice of models and the corresponding number of parameters are listed in Figure 12. Results show that sensitivity is generally correlated with the model size, but we also observe a plateauing phenomenon after the GPT-J 6B scale. However, the results on the ICL performance with ground-truth label demonstrations show that the performance scales well beyond the 6B mark.

Discussion

This section provides additional evidence that the demonstration of ground-truth labels can be more important than the previous finding suggests and that existing interpretation of the experimental results may have been obfuscated by the entanglement of various aspects of demonstrations.

Input-label correspondence is just one of the aspects of possible in-context label demonstrations, the others including label-space demonstration. However, it is unclear whether label-space and input-label correspondence can complement each other in the absence of explicit demonstration of the other. For example, pretrained language models may be able to deduce sentence-sentiment mappings from the mentions of sentiment labels alone through inductive bias.

Prior work (Min et al., 2022b) showed significant performance degradation in the absence of both aspects of label demonstration, but the results beg the question: could the significant degradation have been caused by the complete lack of label demonstration? To find out, we conduct additional ablation studies on the performance under the demonstration of input-label pairings but not of the explicit label space, which we call prior-free label experiments.

Specifically, we study the case where class labels are replaced with prior-free labels while maintaining the correspondence between the input and the labels. For example, “positive” and “negative” labels in sentiment analysis can be replaced with “0” and “1” labels respectively, which do not reveal the information about the labels themselves. However, language models can still capture mild label-associations in abstract symbols through inductive bias (Ouyang et al., 2022). To diversify the “prior-free” choices, we consider (1) random tokens from the language model’s word space, (2) alphabets, and (3) numerical labels. (We exclude “0” since it is often associated with the state of nil.)
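The three prior-free schemes can be sketched as a label-remapping step applied before the prompt is built. The function name, the use of uppercase letters, numbering from 1, and the placeholder vocabulary are assumptions made for illustration only.

```python
import random
import string

def prior_free_mapping(label_space, scheme, vocab=None, seed=0):
    """Map each original label to an abstract symbol, keeping the mapping one-to-one."""
    n = len(label_space)
    if scheme == "alphabet":
        symbols = list(string.ascii_uppercase[:n])
    elif scheme == "number":
        # Numbering starts at 1, since "0" is excluded.
        symbols = [str(i) for i in range(1, n + 1)]
    elif scheme == "random_token":
        # `vocab` is a hypothetical list of tokens from the model's vocabulary.
        symbols = random.Random(seed).sample(vocab, n)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return dict(zip(label_space, symbols))

mapping = prior_free_mapping(["positive", "negative"], "number")
demos = [("a delightful film", "positive"), ("a dull mess", "negative")]
# The input-label correspondence is preserved; only the label surface form changes.
remapped = [(x, mapping[c]) for x, c in demos]
```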

As shown in Figure 13, results on prior-free labels outperform those of the random labels (with random input-label mappings), indicating that language models are capable of capturing the input-label correspondence even in the absence of label-space demonstrations. Among the prior-free results, we note that the alphabetical and numerical labels outperform random-token labels. This could be explained by the fact that random word tokens may introduce unintended biases through misleading associations with unrelated word semantics, so abstract labels provide a better prior-free environment.

4.2 Change in label distribution may result in higher sensitivity

The distribution of labels in demonstrations is one of the critical factors for the prediction Zhao et al. (2021). When data imbalance exists, corrupting the labels causes a distributional shift, which may lead to a performance change regardless of the input-label mappings. High sensitivity in imbalanced datasets may be due to this unintentional distributional shift. To analyze the impact of the distributional shift, we conducted additional experiments using label-balanced demonstrations for imbalanced datasets (hate_speech18, ethos-race, ethos-national_origin, ethos-religion).
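One way to draw label-balanced demonstrations is to sample the same number of examples from each class. The sketch below is an assumption about how this could be done, not a description of the released code: it assumes k is divisible by the number of classes and that every class has enough training examples.

```python
import random
from collections import defaultdict

def sample_balanced(train, k, seed=0):
    """Sample k demonstrations with an equal number of examples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, c in train:
        by_label[c].append((x, c))
    labels = sorted(by_label)
    per_class = k // len(labels)  # assumes k is a multiple of the number of classes
    demos = []
    for c in labels:
        demos.extend(rng.sample(by_label[c], per_class))
    rng.shuffle(demos)  # avoid grouping demonstrations by label in the prompt
    return demos
```

With k = 16 and a binary dataset, this yields 8 demonstrations per class regardless of how skewed the training distribution is.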

As shown in Figure 15, using balanced demonstrations degrades the performance and sensitivity compared to demonstrations sampled from the data distribution, which supports our suspicion. On the other hand, the average sensitivities are 0.189 and 0.308 (for GPT-NeoX and GPT-J, respectively) even in the balanced-demonstration setting, which supports the importance of input-label demonstrations.

Related Work

As the scale of language models becomes larger Rae et al. (2021); Chowdhery et al. (2022); Smith et al. (2022); Thoppilan et al. (2022), fine-tuning becomes prohibitively expensive due to the space and time complexities. As an alternative, in-context learning (ICL) Brown et al. (2020) has been shown to be an effective parameter-free learning strategy by prompting language models with task-specific prompt templates. Since then, a plethora of works has investigated the properties of the learning mechanism Schick and Schütze (2021b); Reynolds and McDonell (2021); Kim et al. (2021); Zhao et al. (2021); Lu et al. (2022); Min et al. (2022b). Although numerous efficient fine-tuning strategies have been proposed in the past Li and Liang (2021); Hu et al. (2022); Lester et al. (2021), the absence of an explicit training step in ICL has enabled it to retain its own class of adapting large-scale language models.

Conclusion and Future Work

In this work, we took a closer look at how input-label relationships affect the in-context learning performance. To quantitatively analyze the impact of input-label mappings in in-context learning, we proposed novel metrics, GLER and input-label sensitivity. Through extensive experiments, we found that the integrity of the input-label mapping is a crucial factor in performing ICL. We also conducted ablation studies to reveal various conditions that allow ICL to improve insensitivity towards label corruptions (while still maintaining a healthy performance). For future work, based on the current findings, we will investigate whether we could exploit data augmentation for extremely low-resource situations for ICL.

Limitations

It is widely known that the performance of PLMs is highly sensitive to the choice of prompts Brown et al. (2020); Lu et al. (2022); Zhao et al. (2021). Prompt engineering to find the optimal prompt was not feasible considering the number of datasets and settings in our experiments. The findings from this work may differ depending on the choice of prompts. However, to minimize this limitation, the templates and prompts are adopted from well-studied previous works as much as possible.

According to the full analysis from Min et al. (2022b), other components of demonstrations not covered in this paper (e.g., input-space demonstrations) exhibit even stronger impacts on ICL. Although our experiments were designed to analyze solely the impact of input-label correspondence, disentangling diverse aspects of demonstrations is highly difficult, as mentioned in Section 4. Other factors such as label distribution may have unexpectedly influenced the results.

We use the Huggingface implementation of GPT-NeoX. To our knowledge, the current version of GPT-NeoX in Huggingface underperforms compared to the original implementation from Black et al. (2022).

Acknowledgement

This work was mainly supported by SNU-NAVER Hyperscale AI Center and partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) [No.2020-0-01373, Artificial Intelligence Graduate School Program (Hanyang University), No.2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)]. Last but not least, we would like to express gratitude to Yejin Choi for the insightful discussions and feedback.

References

Appendix A Details on Our Experimental Settings

We mainly experiment with GPT-NeoX 20B Black et al. (2022) and GPT-J 6B Wang and Komatsuzaki (2021), which are publicly released, decoder-only, dense LMs. However, in Section 3.3 we also include GPT2-XL 1.5B Radford et al. (2019), GPT-Neo 2.7B Black et al. (2021), and GPT-3 175B Brown et al. (2020).

A.2 Full Dataset

We evaluate on 17 text classification datasets covering diverse tasks, including sentiment analysis, paraphrase detection, natural language inference, and hate speech detection, and diverse domains, including science, social media, finance, and more. All datasets are from Huggingface datasets Lhoest et al. (2021). The full list and details about the datasets are provided in Table 2.

As mentioned in Section 2.4.4, sensitivity highly depends on relative performance. To effectively capture the correlation between sensitivity and the diverse factors studied in Section 3, we evaluate on a subset of 8 datasets with high relative performance in Section 3. The 8 datasets are glue-sst2, glue-rte, super_glue-cb, trec, financial_phrasebank, medical_questions_pairs, sick, and tweet_eval-hate. Due to limited resources, we only run experiments on 6 datasets in Section 3.3.

A.3 Metric

We use accuracy as our primary metric. Accuracy is a commonly used metric in multi-class classification that intuitively shows how well the model performs. The F1 score takes into account how the data is distributed; thus, it is useful for data with imbalanced classes. However, F1 is less intuitive since it measures the trade-off between precision and recall. Moreover, the F1 score can vary depending on the averaging method in multi-class classification.

A.4 Template

We use 3 types of templates that differ in engineering cost and verbosity. First, as a baseline, we use the Minimal template following Ye et al. (2021); Min et al. (2022b). We use the Minimal template throughout the paper. For the ablation in Section 3.2, we also evaluate the Manual and Verbose templates. Templates are adopted from prior works Brown et al. (2020); Zhao et al. (2021); Min et al. (2022b); Bach et al. (2022) where possible. Details and examples regarding the templates are in Table 3. Additionally, for the CBU experiment in Section 3.1, we use the Manual template as the baseline, since in our preliminary experiments applying CBU with the Minimal template degraded the performance in some cases.

Even though we use the same minimal template as Min et al. (2022b), there are minor differences in the dataset-specific implementation of the data preprocessor (e.g., input sentences of the glue_mrpc dataset used in Min et al. (2022b) have the prefix "sentence1: "). Therefore, LMs may behave slightly differently on the same dataset.

A.5 Other details

Unless otherwise specified, we use k = 16 examples as demonstrations, which are sampled uniformly from the training data. We run all experiments 5 times using different seeds. Due to limited resources, we only run experiments once for GPT-3. For all models except GPT-3, we used the implementation and models from the Huggingface transformers library Wolf et al. (2020). For GPT-3, we used the OpenAI API, assuming that the model "davinci" is GPT-3 175B. When calculating the probability of label tokens, we do not normalize the score by the length of the tokens, unlike in Min et al. (2022b). Our implementation is available at https://github.com/juny116/ICL-DeepSpeed.
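To make the length-normalization choice concrete, the sketch below scores a label from the log-probabilities of its tokens. It assumes that "normalizing by length" means averaging the per-token log-probabilities; the function name and the numbers are hypothetical.

```python
def label_score(token_log_probs, length_normalize=False):
    """Score a (possibly multi-token) label from its per-token log-probabilities."""
    total = sum(token_log_probs)
    if length_normalize:
        # Average per-token log-probability (assumed meaning of length normalization).
        return total / len(token_log_probs)
    return total  # unnormalized: log-probability of the whole label string

# Hypothetical labels: one single-token label and one three-token label.
short_label = [-2.0]
long_label = [-1.0, -1.0, -1.0]
```

Without normalization the short label wins (-2.0 versus -3.0), while with normalization the long label wins (-1.0 versus -2.0), so the choice can change predictions when labels differ in token length.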

A.6 Corrupting input-label mapping

To see the detailed impact of the ground-truth input-label mapping, we revisit the experiments from Min et al. (2022b). Specifically, we replace a fixed amount of correct labels with incorrect labels in demonstrations and compare the end-task performance.

No demonstrations is a zero-shot prediction made via $\text{argmax}_{y\in C}P(y|x)$, where $x$ is the test input and $C$ is a small discrete set of possible labels. Verbalizers are used for mapping tokens to classes.

Demonstrations w/ $a\%$ correct labels consist of $k\times a/100$ correct pairs and $k\times(1-a/100)$ incorrect pairs, where $0\leq a\leq 100$. A concatenation of $k$ input-label pairs in which $a\%$ of the labels are correct is used to make a prediction via $\text{argmax}_{y\in C}P(y|x_{1},y_{1},...,x_{k},y_{k},x)$.

Demonstrations w/ random label are formed by replacing correct labels with random labels sampled uniformly from $C$. Since the labels are sampled uniformly from $C$, the distribution of labels in the demonstrations may differ from that of the sampled inputs.

Demonstrations w/ shuffled label are formed by randomly shuffling the correct labels among the sampled k inputs. The distribution of labels in the demonstrations does not change from that of the sampled inputs.
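The contrast between the random-label and shuffled-label settings can be sketched as follows. The function names are illustrative, and the plain shuffle shown here may leave some labels in their original position; whether the actual implementation forces every label to move is not stated, so this is an assumption.

```python
import random

def random_labels(examples, label_space, seed=0):
    """Replace every label with a uniform sample from the label space."""
    rng = random.Random(seed)
    return [(x, rng.choice(label_space)) for x, _ in examples]

def shuffled_labels(examples, seed=0):
    """Permute the existing labels across inputs, preserving the label distribution."""
    rng = random.Random(seed)
    labels = [c for _, c in examples]
    rng.shuffle(labels)
    return [(x, c) for (x, _), c in zip(examples, labels)]
```

On an imbalanced sample, `random_labels` tends to pull the label distribution towards uniform, whereas `shuffled_labels` keeps exactly the same label counts as the sampled demonstrations.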

Majority class baseline is the ratio of the majority class within the test data. Since some datasets have distributional imbalance, this can be a good indicator of how well the in-context learning is working.
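The majority class baseline amounts to the accuracy of always predicting the most frequent test label, which can be computed as in this small sketch (the function name is illustrative):

```python
from collections import Counter

def majority_baseline(test_labels):
    """Accuracy obtained by always predicting the most frequent label in the test set."""
    counts = Counter(test_labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(test_labels)
```

For instance, a test set with three "noHate" examples and one "hate" example gives a baseline of 0.75, so an ICL accuracy near 0.75 on such data would indicate little beyond predicting the majority class.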

Appendix B Full Results

Full experiment results on 17 datasets with GPT-NeoX are in Table 4 and results on 17 datasets with GPT-NeoX are in Table 5.

Appendix C More Results on the Sensitivity vs Task Difficulty Plot

Figure 16 shows scatter plots of sensitivities of 17 datasets against the corresponding task difficulties measured using the relative performance with respect to accuracy and F-1 scores. The Direct approach is colored in orange and the Channel approach is colored in blue. The dashed vertical line indicates a neutral performance level where there is no difference with the random baselines. The best-fit linear lines show a general trend of increasing sensitivity with less task difficulty. Low sensitivity is strongly related to high task difficulty. Also, the Channel approach helps in alleviating hyper-sensitivity towards task difficulty.

Appendix D Label-Correctness Correlation

The first step of understanding the interaction between performance and input-label demonstration is quantifying the correlation between the two variables. Although we considered this metric as one of the foundational quantifying measures, we omit the analysis results from the main text due to space constraints. The Pearson correlation analysis on GPT-J and the Direct approach (Figure 17) shows that the label-correctness correlation is strong (i.e., larger than 0.9) for most tasks on all performance measures. The macro-average correlation across 18 tasks is 0.895 with a p-value of 0.057, strongly supporting the linkage.
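For reference, the Pearson correlation between label correctness and performance can be computed as in the following sketch. The function name and the data points are hypothetical and are not taken from Figure 17.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical accuracies (%) at 0, 25, 50, 75 and 100% correct labels.
pct_correct = [0, 25, 50, 75, 100]
accuracy = [41.0, 46.0, 56.0, 61.0, 71.0]
r = pearson(pct_correct, accuracy)
```

A value of `r` close to 1 indicates that performance rises almost linearly with label correctness, which is the pattern the sensitivity regression then quantifies through its slope.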