On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model

Seongjin Shin, Sang-Woo Lee, Hwijeen Ahn, Sungdong Kim, HyoungSeok Kim, Boseop Kim, Kyunghyun Cho, Gichang Lee, Woomyoung Park, Jung-Woo Ha, Nako Sung

Introduction

The NLP community has been surprised by the emergence of the in-context learning ability of large-scale language models (LMs) such as GPT-3 Brown et al. (2020), despite no duplication between downstream task data and the pretraining corpus. In-context learning uses a natural language description and a few examples to prime a language model, which can then predict the answer for a new example without any update to its parameters. Since the release of GPT-3, various large-scale in-context language models have been proposed Black et al. (2021); Kim et al. (2021); Zeng et al. (2021); Rae et al. (2021); Hoffmann et al. (2022); Chowdhery et al. (2022).

Despite these successful reports, many questions remain about language models’ in-context learning capability. For example, the relationship between the choice of a pretraining corpus and downstream in-context learning accuracy is unknown. Previous studies argue that pretraining with a corpus similar to the downstream task improves downstream performance, but these observations are often limited to the case where a pretrained language model is finetuned for the downstream task Gururangan et al. (2020); Lee et al. (2020); Micheli et al. (2020).

In addition, the relation between the validation perplexity of a language model and its in-context learning performance has been little investigated. Previous research on in-context learning implicitly assumes that perplexity is predictive of in-context learning performance by showing the scaling-law property of their models Kaplan et al. (2020); Brown et al. (2020); Kim et al. (2021). Rae et al. (2021) also use perplexity for hyperparameter selection on corpus reweighting in the pretraining of their in-context learner. However, the explicit correlation between the two has rarely been examined.

Motivated by this lack of in-depth analysis on the relationship between in-context learning and corpus properties, we vary the sources and sizes of pretraining corpora and analyze their impact on in-context learning, using HyperCLOVA, a Korean-centric large LM Kim et al. (2021). We mainly explore in-context few-shot learning, as in the previous work Kim et al. (2021), but also explore in-context zero-shot learning. We use the HyperCLOVA corpus, a large-scale pretraining corpus mainly in Korean collected by Kim et al. (2021), as the base corpus from which we derive the pretraining corpora for our experiments. Our main findings are as follows.

Corpus Source: In-context learning performance depends heavily on the corpus source, and with some sources in-context learning does not work effectively. For example, the model trained only on the blog subcorpus (Blog) achieves competitive in-context few-shot learning performance, but training on the community website subcorpus (Cafe) or the online news subcorpus (News) hardly yields in-context few-shot learning ability.

Corpus Combination: In-context learning ability can emerge by fusing two corpora, even when each on its own does not result in in-context learning. For example, in-context few-shot learning ability was not observed when training only on the KiN corpus, which consists of QnA websites, or only on the Ency corpus, which consists of encyclopedia websites, but it emerges when training on both corpora.

Domain Relevance: Pretraining with a corpus related to a downstream task seems to help in-context zero-shot learning performance, but does not indicate competitive in-context few-shot learning performance. For example, training on only the News corpus yields relatively good in-context zero-shot learning ability on a news-related downstream task, e.g., KLUE-YNAT Park et al. (2021), which classifies the topic of a news article from its title, but does not yield in-context few-shot learning ability.

Perplexity: Although perplexity and in-context learning accuracy correlate well during the training of a single model, perplexity alone does not reflect the difference in in-context learning accuracy across different language models. This is particularly prominent when the models were trained on different pretraining corpora. For example, the Cafe model, which is trained on the Cafe corpus, has the second lowest validation perplexity on various domain sources after the Blog model, but in-context few-shot learning fails to emerge in it.

Related Work

Brown et al. (2020) demonstrate the concept of in-context learning, where a few training examples and/or task descriptions are provided together with a new input for a large-scale LM to produce a target of this input, without requiring any parameter update. A few training examples are used in the in-context few-shot learning setting, whereas no training example is used in the in-context zero-shot setting. A few follow-up studies have tried to improve the in-context learning ability Zhao et al. (2021); Holtzman et al. (2021). On the other hand, another group of papers tries to explain the mechanism of in-context few-shot learning Min et al. (2022); Xie et al. (2022).

2.2 Domain Relevance on Pretraining Corpus

Previous studies argue that better downstream accuracy is observed with a pretraining corpus more similar to the downstream task corpus Gururangan et al. (2020); Lee et al. (2020); Micheli et al. (2020). However, these observations are limited to the case where a pretrained language model is finetuned for the downstream task.

There are a few studies on the effect of different corpora on the relationship between pretraining and in-context learning. A notable example is Codex, where GPT-3 is trained on a GitHub corpus so that the model can generate code from comments Chen et al. (2021a). However, the corpus used for Codex is limited to code comments and the corresponding code. We study the effect of the pretraining corpus on in-context learning performance using various domains.

2.3 Quantity and Quality of Pretraining Corpus

There have been several studies on the quantity and quality of pretraining data. Raffel et al. (2020) conduct an ablation study of different pretraining corpora on T5, and their filtered C4 corpus makes T5 perform better on downstream tasks. As with GPT-3, researchers generally improve the quality of their language models through data filtering Brown et al. (2020); Kim et al. (2021). Our research differs from the existing work in that we focus on an in-depth analysis of how the amount of data and the corpus source affect in-context learning.

2.4 Multi-task Learning

Multi-task learning approaches, which explicitly finetune on the in-context learning objective using numerous NLP tasks, have recently been proposed to tackle zero/few-shot transfer to unseen tasks at test time Wei et al. (2021); Sanh et al. (2021); Chen et al. (2021b); Min et al. (2021).

Unlike in the finetuning paradigm, many properties of in-context learning related to the pretraining corpus are still unknown. Just as previous multi-task studies show that diverse tasks improve in-context learning ability, our study shows that diverse pretraining corpora strengthen in-context learning ability.

Task Definition

We use variants of HyperCLOVA with various parameter sizes and pretraining corpora. We mainly experiment with models with 1.3B parameters, but we also include results for 6.9B-sized models. All models have a maximum sequence length of 2,048.

We emphasize that all models use the same vocabulary across all our experiments. We use the morpheme-aware byte-level BPE tokenizer trained with HyperCLOVA corpus Kim et al. (2021) for all models. We train multiple models with different portions of HyperCLOVA corpus to investigate the effects of the source and size of the corpus on in-context learning ability.

3.2 Pretraining with Different Corpus

We analyze the effect of seven subcorpora in the HyperCLOVA corpus: Blog, Cafe, News, Comments, KiN, Modu, and Ency. Table 1 summarizes the characteristics of the subcorpora. Blog, Cafe, and News are taken from blogs, community sites, and online news articles, respectively, of NAVER (https://www.naver.com/), a Korean web portal service. Comments consists of the comment threads related to the three subcorpora mentioned above. KiN comes from NAVER’s online community QnA service, which is similar to Quora. Ency is a collection of encyclopedic texts including Korean Wikipedia. Modu consists of five public datasets constructed by the National Institute of the Korean Language (https://corpus.korean.go.kr/), including 3.2B tokens of news, 2.1B tokens of written language, 0.4B tokens of spoken language, 0.2B tokens of web corpus, and 0.02B tokens of messenger text. Others is excluded so that we can investigate the explicit effects of domain corpus sources on in-context learning, because Others mixes subcorpora taken from multiple heterogeneous sources. Tables 12 and 13 in the Appendix show examples from the seven pretraining corpora in Korean and English, respectively. ALL denotes the original HyperCLOVA corpus including Others.

For corpora with fewer than 150B tokens, we assign 99% of each corpus to the pretraining corpus and randomly extract 10,000 examples from the remaining 1% as the validation corpus for measuring validation perplexity. For corpora with more than 150B tokens, we reduce the training corpus to 150B tokens via random sampling and construct a validation set of 10,000 examples randomly sampled from the remainder. As a result, the maximum training set size of each corpus is 150B tokens.
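The split described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes the 99%/1% split is taken over documents (the text does not say whether it is over documents or tokens), and `split_corpus`, `token_budget`, and the `(doc_id, n_tokens)` representation are hypothetical names introduced here.

```python
import random

def split_corpus(docs, token_budget, n_valid, train_fraction=0.99, seed=0):
    """Split docs, a list of (doc_id, n_tokens), into train and validation ids.

    Assumption: small corpora are split 99%/1% by document count; large
    corpora are randomly subsampled until the token budget is reached.
    """
    rng = random.Random(seed)
    shuffled = docs[:]
    rng.shuffle(shuffled)
    total_tokens = sum(n for _, n in shuffled)
    if total_tokens <= token_budget:
        # Small corpus: keep 99% of the documents for training.
        cut = int(len(shuffled) * train_fraction)
    else:
        # Large corpus: take shuffled documents until the budget is full.
        cut, used = 0, 0
        while cut < len(shuffled) and used + shuffled[cut][1] <= token_budget:
            used += shuffled[cut][1]
            cut += 1
    train, rest = shuffled[:cut], shuffled[cut:]
    # Validation examples are drawn only from documents not used for training.
    valid = rng.sample(rest, min(n_valid, len(rest)))
    return [d for d, _ in train], [d for d, _ in valid]

# Toy usage: 1,000 documents of 10 tokens each.
docs = [(i, 10) for i in range(1000)]
train_ids, valid_ids = split_corpus(docs, token_budget=5000, n_valid=5)
print(len(train_ids), len(valid_ids))
```

In both branches the validation examples come from the held-out remainder, so the training and validation sets never share a document.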

The validation set for each corpus consists of 10,000 examples and is used for early stopping of the models trained on that corpus. For calculating perplexity, however, we combine all seven validation sets into a single validation set of 70,000 examples covering the seven domains, as described in Section 3.5.

3.3 Downstream Tasks

We evaluate the in-context learning performance of each corpus-specific model on four Korean downstream task datasets used in Kim et al. (2021): NSMC (https://github.com/e9t/nsmc), KorQuAD Lim et al. (2019), AI Hub translation (https://aihub.or.kr/aidata/87), and YNAT Park et al. (2021). NSMC is a binary sentiment classification dataset of movie reviews. KorQuAD is a machine reading comprehension dataset similar to SQuAD 1.0 Rajpurkar et al. (2016). The AI Hub translation dataset consists of Korean-English parallel sentences from news, government websites, legal documents, etc. YNAT is a topic classification problem with seven classes.

We think that three of the downstream datasets are closely related to the HyperCLOVA corpus. The passages that make up KorQuAD are taken from Korean Wikipedia, which is also a part of Ency. YNAT is a topic classification task on news headlines, so it is deeply related to the News corpus. A significant portion of the parallel sentences in the AI Hub translation dataset also comes from news articles. The KiN corpus is also related to the translation task: about 2.5% of the QnA data in KiN consists of Korean questions about English as a foreign language, and these question-style passages often include Korean-English sentence pairs. The vocabulary overlap between the downstream tasks and the HyperCLOVA corpus is depicted in Figure 1.

3.4 Experimental Details

We follow the hyperparameters of Kim et al. (2021) as closely as possible, including global batch size, number of training steps, maximum sequence length, learning rate, and so on. In our experiments, the models are trained for 72K steps with a global batch size of 1,024. We note that under this setting, the number of tokens actually used in pretraining is 150B. Therefore, we set the maximum size of the training corpus to 150B tokens, as in Section 3.2.
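The 150B figure can be checked with a short calculation. The sketch below assumes that "72K" means exactly 72,000 steps and that every sequence in a batch is filled to the maximum length of 2,048 tokens; both are assumptions made here for illustration.

```python
# Tokens seen during pretraining = steps x sequences per step x tokens per sequence.
steps = 72_000              # assumed exact value of "72K steps"
global_batch_size = 1_024   # sequences per optimization step
max_seq_len = 2_048         # assumed: every sequence is packed to full length

tokens_seen = steps * global_batch_size * max_seq_len
print(f"{tokens_seen / 1e9:.1f}B tokens")  # prints "151.0B tokens"
```

Under these assumptions the total is about 151B tokens, consistent with the roughly 150B stated in the text.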

In most experiments, validation perplexity decreases monotonically as training proceeds. Thus, we use the checkpoint at 72K steps. The only exception is the Ency model, which reaches its minimum validation loss at 12K steps, most likely because it overfits the pretraining data owing to the small size of the corpus. Therefore, we report results from the early-stopping checkpoint at 12K steps for this model.

For optimization, we use AdamW Loshchilov and Hutter (2019) with a learning rate of 2.0e-4 and cosine learning rate scheduling. We use mixed precision training. Models are trained on an Nvidia Superpod, which consists of 1,024 A100 GPUs spread across 128 nodes. On the Superpod, it takes around 18 hours to train a 1.3B model for 72K steps.

For classification tasks such as NSMC and YNAT, we use a rank classification approach Wei et al. (2021), where we compare pre-defined outputs (e.g., “positive” and “negative” for NSMC) and take the one with the higher probability. KorQuAD and AI Hub are free-form completion tasks, where we directly generate output tokens using greedy decoding.
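Rank classification can be illustrated with a minimal sketch. The function `rank_classify` and the scorer `toy_log_prob` are hypothetical names introduced here; a real scorer would return the language model's log-probability of the candidate given the prompt. Whether scores are length-normalized is not stated in the text, so this sketch simply compares raw scores.

```python
def rank_classify(prompt, candidates, log_prob_fn):
    """Return the candidate with the highest score, plus all scores.

    log_prob_fn(prompt, candidate) is assumed to return the log-probability
    the language model assigns to `candidate` as a continuation of `prompt`.
    """
    scores = {c: log_prob_fn(prompt, c) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical stand-in for a language model scorer (fixed toy values).
TOY_LOG_PROBS = {"positive": -0.4, "negative": -1.6}

def toy_log_prob(prompt, completion):
    return TOY_LOG_PROBS[completion]

label, scores = rank_classify(
    "Review: great movie\nSentiment:", ["positive", "negative"], toy_log_prob
)
print(label)  # "positive" under the toy scores above
```

Because the model only ranks a fixed set of outputs, it can never produce a label outside the pre-defined candidates, unlike the free-form generation used for KorQuAD and AI Hub.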

In the few-shot experiments, the number of shots is set to 70, 4, 4, and 70 for NSMC, KorQuAD, AI Hub, and YNAT, respectively. Downstream tasks are run 12, 1, 3, and 6 times with different random seeds for NSMC, KorQuAD, AI Hub, and YNAT, respectively, and we report the average performance. The random seed influences the sampling of shots from the training data and their order. The reason KorQuAD has only one random seed is described in Appendix D, which also includes examples of the few-shot prompts used in our experiments. All of these few-shot experimental settings, from the number of shots to the number of random trials, basically follow Kim et al. (2021). However, we change the number of trials for YNAT from 3 to 6, because we found that the standard deviation on YNAT is relatively high.
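The role of the random seed can be sketched as below. This is an illustrative assumption about how seeded trials might be organized, not the evaluation code used in the experiments; `sample_shots`, `evaluate_over_seeds`, and `toy_score` are hypothetical names, and `toy_score` stands in for building the prompt, running the model, and scoring the test set.

```python
import random
from statistics import mean, stdev

def sample_shots(train_examples, n_shots, seed):
    """Draw n_shots examples; the seed fixes both the selection and the order."""
    rng = random.Random(seed)
    return rng.sample(train_examples, n_shots)

def evaluate_over_seeds(train_examples, n_shots, seeds, score_fn):
    """Return (mean, standard deviation) of score_fn over one trial per seed."""
    scores = [score_fn(sample_shots(train_examples, n_shots, s)) for s in seeds]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# Hypothetical stand-in for "build the prompt, run the model, score the test set".
def toy_score(shots):
    return sum(label for _, label in shots) / len(shots)

train = [(f"review {i}", i % 2) for i in range(1000)]
avg, spread = evaluate_over_seeds(train, n_shots=70, seeds=range(12), score_fn=toy_score)
print(f"mean={avg:.3f} std={spread:.3f}")
```

Reporting the spread alongside the mean is what motivates using more trials for a task with high variance across seeds.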

3.5 Measuring Validation Perplexity

We report validation perplexity in various tables and figures to verify our argument. We use the term “PPL” to denote validation perplexities on the validation set. The validation set consists of 70,000 examples from seven corpus sources, as described in Section 3.2. We emphasize that, for calculating PPL, all experiments use the same vocabulary and validation set.
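For reference, perplexity can be computed from per-token log-probabilities as in the sketch below. It uses the standard definition, the exponential of the average negative log-likelihood per token, and assumes natural logarithms with tokens pooled over all validation examples; the text does not spell out the exact aggregation, and `perplexity` and `pooled_perplexity` are names introduced here.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over a flat list of tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pooled_perplexity(examples):
    """Pool tokens across examples (assumed aggregation), then compute perplexity."""
    flat = [lp for example in examples for lp in example]
    return perplexity(flat)

# Toy check: if every token has probability 1/4, the perplexity is 4.
toy_examples = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
print(round(pooled_perplexity(toy_examples), 6))
```

Since perplexity is measured per token, comparing values across models is only meaningful when the tokenizer and the validation set are identical, which is why the shared vocabulary matters.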

In Tables 2 and 4, we use italic font for the results of a multi-domain model, i.e., a model pretrained on a mixture of two or more corpora. Since a multi-domain model is trained on more domains than a single-domain model, the PPLs of multi-domain models are generally lower than those of single-domain models. The italic font is meant to keep readers from directly comparing PPLs between a single-domain and a multi-domain model.

Experimental Results

We perform intensive experiments to answer these four main questions:

How much do the source and the size of pretraining corpora affect the emergence of in-context learning ability? (Sections 4.2 and 4.3)

What is the effect of combining various corpora? (Section 4.4)

How much does the domain relevance of a corpus influence model performance on the downstream task? (Section 4.5)

How strong is the correlation between validation perplexity and in-context learning of language models? (Section 4.6)

Tables 2 and 4 show the in-context few-shot results for various pretraining corpus sources and different corpus combinations, respectively. Tables 3 and 5 show the in-context zero-shot results for some of the models in Tables 2 and 4, respectively. All results in Tables 2, 3, 4, and 5 come from models with 1.3B parameters. Tables 8 and 9 in Appendix A present the standard deviations of the results in Tables 2 and 4.

In Tables 2, 3, 4, and 5, purple underline denotes a score below the mean of the ALL and Majority baseline performances in Table 2, and teal bold denotes a score above it. We use this mean of Majority and ALL as the performance basis so that the in-context learning performance of each model is not distorted by the high baseline performance of the two classification tasks, NSMC and YNAT.
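The performance basis can be written out as a short sketch. The function names are hypothetical, the numbers are made-up values for illustration only (they are not results from the tables), and the handling of a score exactly equal to the basis is an assumption since the text does not specify it.

```python
def performance_basis(all_score, majority_score):
    """Mean of the ALL model score and the Majority baseline score."""
    return (all_score + majority_score) / 2

def mark(score, basis):
    """'above' corresponds to teal bold, 'below' to purple underline.

    Assumption: a score exactly equal to the basis is treated as 'below'.
    """
    return "above" if score > basis else "below"

# Made-up illustrative numbers, not taken from the paper's tables.
basis = performance_basis(all_score=80.0, majority_score=50.0)
print(basis, mark(70.0, basis), mark(55.0, basis))
```

Using the midpoint rather than the Majority baseline alone means a model must close at least half of the gap to ALL before it is counted as showing in-context learning ability.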

Tables 2 and 6 include in-context few-shot results for various pretraining corpus sizes. In Table 6, for example, 56B and 6B correspond to 1/10 and 1/100 of the original HyperCLOVA corpus with 560B tokens, respectively. The 56B-token and 6B-token models are trained for around 3 and 25 epochs, respectively, so that both models are trained for 72K training steps. On the other hand, Table 2 compares 27B-token models trained on different corpus sources to show the results under a controlled corpus size.
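The epoch counts follow from dividing the number of tokens seen in training by the corpus size. The sketch below takes the 150B training-token figure stated in the experimental details and the rounded corpus sizes quoted above as inputs; `epochs` is a name introduced here.

```python
def epochs(tokens_seen, corpus_tokens):
    """Number of passes over a corpus given the total tokens seen in training."""
    return tokens_seen / corpus_tokens

TOKENS_SEEN = 150_000_000_000  # about 150B tokens over 72K steps, as stated in the text

for name, size in [("56B", 56_000_000_000), ("6B", 6_000_000_000)]:
    print(name, round(epochs(TOKENS_SEEN, size), 2))
```

With these rounded inputs the results are about 2.68 and 25 epochs, matching the "around 3 and 25" in the text.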

4.2 Effect of Corpus Source

In-context learning ability emerges differently depending on the pretraining corpus source, as shown in Tables 2, 3, 4, and 5. For example, the Blog model achieves in-context few-shot learning performance competitive with the ALL model, while the Cafe and News models each hardly show in-context few-shot learning ability in Table 2. It is also notable that the Modu model performs better than the Cafe and News models although the Modu corpus is less than 1/10 the size of the Cafe or News corpus, showing that corpus size is not the only factor that predicts in-context learning performance. Likewise, the Cafe+News model also performs poorly despite having the same training size as Blog and ALL, as shown in Table 4.

These differences in in-context learning are dramatic compared to the finetuning results we expect in general. For a comparative experiment between in-context learning and finetuning in our setting, we also finetuned the experimented models with LoRA Hu et al. (2021). As Table 11 in Appendix C shows, the performance differences in finetuning are much smaller than in the case of in-context learning.

4.3 Effect of Corpus Size

Table 6 shows that reducing the corpus size from 150B to 56B tokens does not severely decrease performance, even though 56B tokens amount to only 1/10 of the original corpus. However, the performance degradation of the 6B-token model is remarkable compared to the ALL model. Nevertheless, the 6B-token model still performs much better than the Cafe+News model, which is trained on 150B tokens of the Cafe and News corpora.

We see similar results for the three Blog models of different sizes in Table 2. Blog and Blog 54B achieve similar performance. However, as with ALL 6B, Blog 27B performs considerably worse than Blog 54B.

Figure 3 shows the comparison between 1.3B-sized model and 6.9B-sized model. In the 6.9B-sized models, the in-context few-shot performance with 56B tokens does not decrease significantly compared to 150B tokens, as in the 1.3B-sized models.

4.4 Effect of Combining Corpora

One of our main goals is to investigate the effect of combining multiple corpora from various sources on in-context learning performance. Table 4 shows that in-context few-shot learning ability can emerge by combining two corpora, even if neither corpus provides in-context few-shot learning ability on its own. For example, the KiN+Ency model shows in-context learning ability on most tasks, while the KiN and Ency models each fail on most tasks. Likewise, the Cafe+KiN model shows in-context few-shot learning ability, while the Cafe and KiN models each fail on most tasks. The in-context zero-shot abilities of these models follow similar patterns, as shown in Table 5.

This phenomenon is related to the argument that in-context learning emerges through multi-task learning. According to this argument, because the language modeling objective requires a language model to learn a variety of next-word prediction tasks, the resulting generalization promotes in-context learning ability on unseen tasks. In the example of the KiN+Ency model, the model may acquire in-context learning ability on the MRC task by learning the next-word prediction tasks of both Ency (Wikipedia) and KiN (QnA).

Unlike these positive cases, we also observe that combining corpora does not guarantee the emergence of competitive in-context learning. For example, in the case of Cafe+News in Table 4, even though the mixed-corpus model performs slightly better on KorQuAD than either single-corpus model, its in-context few-shot performance on NSMC, KorQuAD, and YNAT is still below the basis. Furthermore, its performance on NSMC and YNAT even decreases.

4.5 Effect of Domain Relevance

For the few-shot results, Table 2 shows that a close relationship between a pretraining corpus and a downstream task does not always guarantee in-context few-shot learning ability on that task. KiN and Ency do not perform well on KorQuAD, although KorQuAD is an MRC task built from Korean Wikipedia, Ency includes Korean Wikipedia, and KiN consists of question-answer pairs. Likewise, News does not perform well on YNAT, although YNAT consists of news headline queries. Table 4 further shows that the News+KiN+Ency model has a lower F1 score on YNAT than KiN+Ency, even though a large amount of News corpus is added in the News+KiN+Ency model.

For further investigation, we analyze the vocabulary statistics of each corpus. Figure 1 shows the vocabulary overlap ratio between the pretraining corpora and the downstream tasks. The result shows that high vocabulary overlap between a pretraining corpus and a downstream task does not indicate high downstream task performance. Although the Modu corpus has a large vocabulary overlap ratio with AI Hub, the in-context learning performance of the Modu model on the translation tasks is much lower than that of Blog and KiN.
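One plausible way to compute such an overlap ratio is sketched below. The exact definition behind Figure 1 is not given in this section, so the choice here (the fraction of the downstream task's vocabulary types that also occur in the pretraining corpus) is an assumption, and `vocab_overlap_ratio` is a hypothetical name.

```python
def vocab_overlap_ratio(task_tokens, corpus_tokens):
    """Fraction of distinct task tokens that also appear in the corpus.

    Assumed definition: overlap is measured over vocabulary types (unique
    tokens), relative to the size of the task vocabulary.
    """
    task_vocab = set(task_tokens)
    corpus_vocab = set(corpus_tokens)
    if not task_vocab:
        return 0.0
    return len(task_vocab & corpus_vocab) / len(task_vocab)

# Toy usage with made-up token lists.
task = ["news", "title", "topic", "sports"]
corpus = ["news", "title", "blog", "post", "news"]
print(vocab_overlap_ratio(task, corpus))  # 0.5 for these toy lists
```

A ratio of this kind only captures surface-level lexical similarity, which is consistent with the observation that high overlap does not by itself predict downstream performance.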

A counterexample to the above is the AI Hub performance of the KiN model. The KiN model has learned the pattern of Korean-English sentence pairs, since the corpus includes many Korean questions about the English language. While the KiN model does not work well on the other downstream tasks, its performance on AI Hub translation is competitive, and it achieves the best Ko→En performance among the seven pretraining corpora.

In the zero-shot setting, on the other hand, domain relevance seems to have a more positive effect. For example, training on the News corpus consistently helps in-context zero-shot learning on KLUE-YNAT. As shown in Tables 3 and 5, the models whose training corpus includes the News corpus (i.e., News, Cafe+News, and News+KiN+Ency) even perform better than the model trained on the whole HyperCLOVA corpus.

In the case of KiN and AI Hub, the zero-shot performance gain of the KiN model on the AI Hub tasks is less significant than the few-shot gain. However, adding the KiN corpus to the pretraining corpus in the experiments of Table 5 (i.e., KiN+Ency, Cafe+KiN, and News+KiN+Ency) yields a consistent performance increase, and these models outperform ALL.

4.6 Perplexity and Downstream Task

Figure 2 presents scatter plots of PPL (x-axis) against in-context few-shot learning performance (y-axis) on five downstream tasks for the single-corpus models and the ALL model. In Figure 2, we normalize in-context few-shot learning performance by dividing it by the ALL model performance, in order to calibrate the different task metrics. Because we observe only a weak tendency toward correlation between validation perplexity and in-context performance, we argue that it is difficult to hypothesize that better perplexity ensures the emergence of in-context few-shot learning ability.
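The normalization, together with one way a reader might quantify the tendency visible in such a scatter plot, can be sketched as follows. The text does not state that a correlation coefficient was computed, so the Pearson coefficient here is an illustrative choice; the model names and numbers are made-up placeholders, not results from the paper.

```python
import math

def normalize_by_all(scores, all_score):
    """Divide each model's task score by the ALL model's score."""
    return {name: s / all_score for name, s in scores.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up placeholder values for three hypothetical models plus ALL.
ppl = {"A": 100.0, "B": 150.0, "C": 200.0, "ALL": 90.0}
acc = {"A": 80.0, "B": 40.0, "C": 60.0, "ALL": 85.0}

norm = normalize_by_all(acc, acc["ALL"])
names = sorted(ppl)
r = pearson([ppl[n] for n in names], [norm[n] for n in names])
print(f"r = {r:.3f}")
```

After normalization the ALL model sits at 1.0 on every task, so scores from different metrics (accuracy, F1, BLEU) can be placed on a common axis.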

According to Table 2, the Blog model shows both the lowest PPL and the best in-context learning performance, and the Ency model shows both the highest PPL and the worst in-context learning performance. On the other hand, while the Cafe and KiN models show the second and third lowest PPL, in-context few-shot learning ability was not observed in them. These results show that perplexity does not serve as a strong predictor of in-context few-shot learning performance when comparing models trained on different corpora. Table 2 also shows that the corpus size affects in-context few-shot learning performance more than it affects PPL: Blog 27B performs notably worse than Blog, but its PPL changes comparatively little.

For the zero-shot results, Table 3 suggests that in-context zero-shot learning performance correlates with perplexity more than few-shot performance does. Nevertheless, Modu still has both relatively high perplexity and relatively high in-context zero-shot learning performance.

Table 7 shows the validation perplexity scores for each subcorpus. Each row corresponds to a model and each column corresponds to a subcorpus of the validation set. Each validation set in Table 7 except All consists of 10,000 instances and is a part of our main validation set, which consists of 70,000 instances.

On the other hand, Figure 4 shows that PPL and in-context learning performance correlate well over the course of training a single model. That is, the correlation trend within a single training run differs from the trend across models trained on different corpus domains.

Discussion

Our findings can be used to improve in-context learning performance when the corpus is small and/or more data needs to be collected. XGLM Lin et al. (2021), a concurrent work on multilingual GPT-3, achieves better in-context learning performance for many languages. However, it does not reach the performance of a single-language model. We hope our observations give insight into what types of pretraining data should be collected further, for both multilingual and low-resource language models.

Another notable example comes from Gopher Rae et al. (2021), a concurrent work on a state-of-the-art in-context learner. Rae et al. (2021) determine the ratio between subcorpora based on the perplexity of the validation corpus. They implicitly claim that this ratio results in better downstream task performance, but do not provide explicit evidence for this. In contrast, our results lead us to doubt that there is a strong correlation between perplexity and in-context learning, especially in the few-shot setting. We hope our findings contribute to building better in-context learners along with other research.

Conclusion

This paper investigates the effects of the source and the size of the training corpus on in-context learning ability, using the HyperCLOVA corpus. Our discoveries include that corpus sources play a crucial role in whether or not in-context learning ability will emerge in a large-scale language model.

One direction for future work is to investigate linguistic properties of corpus sources which make a competitive in-context learning model. For example, quantifying the difference between two corpora can shed light on how to select suitable corpora for NLP practitioners who build large-scale language models. In addition, intensive studies on different corpus sources other than the HyperCLOVA corpus can help understand the properties of in-context learning.

Broader Impact Statement

We present multiple pieces of evidence that models using only a part of the pretraining corpus are comparable with those trained with the entire corpus in terms of in-context performances. Although we leave the validation on larger-scale models, such as tens of billion parameters, to future work, our analysis presents a hint to effectively training LMs with smaller corpora. This approach can contribute to alleviating severe energy consumption issues caused by large-scale LMs.

Meanwhile, our study relates to the misuse and fairness of large-scale LMs. For example, reweighting a domain-specific corpus might cause LMs to inherit the biases of that domain corpus. Therefore, alleviating domain corpus bias would be a valuable future direction.

Acknowledgment

The authors thank all the members of CLOVA, AI Lab for their devoted support and discussion. In particular, they thank Joonsuk Park and Seok Ho Yoon for proofreading.

References

Appendix A Details on Experimental Results

Tables 8 and 9 show the standard deviations for Tables 2 and 4. Table 10 shows the score differences from ALL in addition to the in-context learning scores of Table 2. Figure 5, supporting Table 7, shows the validation perplexity of the different models on the different corpora.

Appendix B Details on Pretraining Corpus

Tables 12 and 13 show example instances from the seven pretraining corpora in Korean and English, respectively.

As described in Section 3.1, our pretraining corpus is the HyperCLOVA corpus that is also used in Kim et al. (2021). Therefore, we share the preprocessing steps of Kim et al. (2021). Appendix A in Kim et al. (2021) describes their preprocessing methods, covering data description, data cleaning, data anonymization, and data postprocessing.

We additionally describe the deduplication process of the HyperCLOVA corpus used in Kim et al. (2021). Deduplication was applied when constructing the HyperCLOVA corpus to prevent explicit duplication within and between subcorpora Kim et al. (2021). According to the response of Kim et al. (2021), they use an in-house search engine and an in-house engineering trick to detect document pairs that are very similar to each other. There are two pipelined steps: (1) removing duplicates within subparts of the corpus, and then (2) removing duplicates between subparts of the corpus. Therefore, documents with high overlap do not remain anywhere in the corpus. Here, the number of subparts is 29. These 29 subparts are categorized into the eight domains we deal with in the paper (i.e., Blog, News, Cafe, Comments, KiN, Modu, Ency, and Others). Overall, there is no explicit overlap between the corpora, since very similar documents have already been removed. The overlap between the eight HyperCLOVA subcorpora is quite small: there were many overlaps within each subpart of the corpus, but the overlap between subparts was only 0.024% of the total, according to the counts in the second pipelined step of deduplication between subparts.

Appendix C Experiments on LoRA

Table 11 shows the results of LoRA Hu et al. (2021) finetuning on some models in Tables 2 and 4.

Appendix D Examples of Few-shot Prompt

Tables 14, 16, 18, and 19 show the example few-shot prompt of NSMC, KorQuAD, AI Hub, and YNAT, respectively. Tables 15, 17, and 20 show the translated version for NSMC, KorQuAD, and YNAT, respectively.

For KorQuAD, on the other hand, the number of random seeds is one. We explain why evaluating KorQuAD with many random seeds is difficult from the perspective of prompt design. We introduce randomness across trials by changing the few-shot examples in the prompt. However, both in Kim et al. (2021) and in our case, there are no alternative examples to put into the prompt. A KorQuAD prompt consists of one document and a few question-answer pairs, not a few document-question-answer triples. In other words, each KorQuAD prompt contains only one document, which is used both for the few-shot question-answer pairs and for the query question at inference. In KorQuAD, each document has five corresponding question-answer pairs. In our experimental setting and that of Kim et al. (2021), four question-answer pairs are put into the prompt and one question is used for the test. Therefore, there are no other question-answer pairs with which to replace the four pairs.

Appendix E Generalization to Other Languages

One may ask whether our results extend to other languages, including English. We leave experiments on non-Korean languages as future work. However, we offer some explanations below to justify our experiments on the Korean language and to discuss why experiments on other languages are practically non-trivial.

First, we think our findings are basically generalizable to other languages. From the perspective of pretraining and in-context learning, few fundamental differences between Korean and English have been reported. For example, XGLM Lin et al. (2021), a concurrent work on multilingual GPT-3, also does not show critical evidence of language-specific properties.

Second, it is non-trivial to control the various aspects of corpora for our purpose. Most corpora for in-context few-shot learners come from crawled websites, whose original sources are not easy to distinguish. For example, 82% of the OpenAI GPT-3 corpus Brown et al. (2020) is a filtered version of Common Crawl. In this regard, we use a relatively well-refined corpus that consists of several subcorpora from a single web service. On the other hand, we are interested in extending our work to the Pile dataset Gao et al. (2020) in the future, by controlling its subcorpora in the direction our study pursues.