When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi

Introduction

Large language models (LMs; Brown et al. 2020; Raffel et al. 2020) have been shown to be competitive on diverse NLP tasks, including knowledge-intensive tasks that require fine-grained memorization of factual knowledge Chowdhery et al. (2022); Yu et al. (2022). Meanwhile, LMs have also been shown to have limited memorization for less frequent entities Kandpal et al. (2022), are prone to hallucinations Shuster et al. (2021), and suffer from temporal degradation Kasai et al. (2022); Jang et al. (2022). Incorporating non-parametric knowledge (i.e., retrieved text chunks) largely helps address those issues stemming from reliance on LMs’ parametric knowledge—knowledge stored in their parameters Izacard et al. (2022b)—but it is unclear whether it is strictly superior or complementary to parametric knowledge. Understanding when we should not trust LMs’ outputs is also crucial to safely deploying them in real-world applications Kadavath et al. (2022).

This work conducts a large-scale knowledge probing of LMs on factual knowledge memorization, to understand when we should and should not rely on LMs’ parametric knowledge, and how scaling and non-parametric memories (e.g., retrieval-augmented LMs) can help. In particular, we aim to address the following research questions:

How much factual knowledge is memorized by LMs and what factors affect the memorization? (Section 4)

To what extent can non-parametric memories alleviate the shortcomings of parametric memories of LMs? (Section 5)

Can we build a system to adaptively combine non-parametric and parametric memories? (Section 6)

We hypothesize that factual knowledge frequently discussed on the web is easily memorized by LMs, while knowledge that is less discussed may not be well captured and thus requires retrieving external non-parametric memories. We evaluate ten large LMs of three families (i.e., GPT-Neo, OPT, and GPT-3) with varying scales on the open-domain question answering (QA) task in a zero- or few-shot prompting manner. We construct a new dataset, PopQA, consisting of 14k questions to cover factual information in the long tail that might have been missed in popular QA datasets Kwiatkowski et al. (2019). We use Wikipedia page views as a measure of popularity and convert knowledge triples from Wikidata, with diverse levels of popularity, into natural language questions, anchored to the original entities and relationship types. We also use EntityQuestions Sciavolino et al. (2021), an open-domain QA dataset with a long-tail distribution.

On both datasets, LMs’ memorization ( $RQ_{1}$ ) is often limited to popular factual knowledge, and even GPT-3 davinci-003 fails to answer the majority of the long-tail questions. Moreover, on such questions, scaling up models does not significantly improve the performance (e.g., for the 4,000 least popular questions in PopQA, GPT-J 6B has 16% accuracy and GPT-3 davinci-003 has 19% accuracy). This also suggests that we can predict if LMs memorize certain knowledge based on the information presented in the input question only.

We next investigate whether a semi-parametric approach that augments LMs with retrieved evidence can mitigate the low performance on questions about less popular entities ( $RQ_{2}$ ). Non-parametric memories largely improve performance on long-tail distributions across models. Specifically, we found that retrieval-augmented LMs are particularly competitive when subject entities are not popular: GPT-Neo 2.7B augmented with a neural dense retriever Izacard et al. (2022a) outperforms GPT-3 davinci-003 on the 4,000 least popular questions. Surprisingly, we also find that retrieval augmentation can hurt the performance of large LMs on questions about popular entities, as the retrieved context can be misleading.

As a result, we devise a simple-yet-effective retrieval-augmented LM method, Adaptive Retrieval, which adaptively combines parametric and non-parametric memories based on popularity ( $RQ_{3}$ ). This method further improves performance on PopQA by up to 10%, while significantly reducing the inference costs, especially with larger LMs (e.g., reducing GPT-3 API costs by half), indicating the potential for future research in more efficient and powerful retrieval-augmented LMs.

Related Work

Petroni et al. (2019) demonstrate that large pre-trained LMs such as BERT Devlin et al. (2019) memorize a significant amount of world knowledge in their parameters (parametric knowledge), and Roberts et al. (2020) show that fine-tuned T5 without any reference documents (closed-book QA) can achieve competitive performance on open-domain QA. More recent and powerful LMs Brown et al. (2020); Chowdhery et al. (2022) further improve performance on diverse knowledge-intensive tasks, leveraging their strong parametric memories Kandpal et al. (2022); Yu et al. (2022). However, relying solely on their parameters to encode a wealth of world knowledge requires a prohibitively large number of parameters, and the knowledge can become obsolete quickly Kasai et al. (2022); Jang et al. (2022). Recent work shows that augmenting LMs with non-parametric memories (i.e., retrieved text chunks) enables much smaller models to match the performance of larger models Izacard et al. (2022b); Khandelwal et al. (2020); Min et al. (2022), although Chen et al. (2022) and Longpre et al. (2021) show that even those models can ignore non-parametric knowledge and rely on parametric knowledge.

Understanding memorization.

Several prior works establish a positive relationship between string frequency in pre-training corpora and memorization Carlini et al. (2022); Razeghi et al. (2022). Concurrent to our work, Kandpal et al. (2022) show that the co-occurrence of the question and answer entities in pretraining corpora has a positive correlation with models’ QA accuracy on popular open-domain QA benchmarks such as Natural Questions Kwiatkowski et al. (2019). This work, instead, attempts to predict memorization using the variables available in the input question only and uses popularity to obtain a proxy for how frequently an entity is likely to be discussed on the web. Importantly, by constructing a new dataset, we can conduct fine-grained controlled experiments across a wide range of popularities, allowing the investigation of hypotheses that might have been missed in prior analysis using existing open QA datasets. We further analyze the effectiveness and limitations of retrieval-augmented LMs and introduce Adaptive Retrieval. Prior work investigates the effectiveness of deciding when to use non-parametric memories at the token level in $k$ NN LM He et al. (2021); Drozdov et al. (2022). This work is the first to study the effectiveness of deciding whether to retrieve for each query and to show its effectiveness in retrieval-augmented LM prompting.

Evaluation Setup

We evaluate LMs’ ability to memorize factual knowledge through closed-book QA tasks with few-shot samples. We evaluate LMs on our new dataset, PopQA (Figure 2), and EntityQuestions, both of which have long-tail distributions (Figure 3).

Focus: factual knowledge. Among diverse types of world knowledge, this work focuses on factual knowledge Adams (2015) of entities—knowledge about specific details of the target entities. We define factual knowledge as a triplet of (subject, relationship, object) as in Figure 2 left.

Task format: open-domain QA. We formulate the task as open-domain QA Roberts et al. (2020): given a question, a model predicts an answer without any pre-given ground-truth paragraph. (Some work conducts knowledge probing of encoder-only models by filling out [MASK] tokens Petroni et al. (2019). We use decoder-only models and thus do not use this fill-in-the-blank scheme.) As in Kandpal et al. (2022), we study few-shot settings and prompt LMs without any parameter updates, instead of fine-tuning them on QA datasets such as in Roberts et al. (2020).

Metrics: accuracy. We mark a prediction as correct if any substring of the prediction is an exact match of any of the gold answers.
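The substring-match rule above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: the function names are hypothetical, and the case-insensitive comparison is an assumption, since the text does not say how case or whitespace are normalized.

```python
from typing import Sequence


def is_correct(prediction: str, gold_answers: Sequence[str], ignore_case: bool = True) -> bool:
    """True if any gold answer occurs verbatim as a substring of the prediction."""
    if ignore_case:  # assumption: the text does not say how case is handled
        prediction = prediction.lower()
        gold_answers = [g.lower() for g in gold_answers]
    # Skip empty gold strings, which would trivially match every prediction.
    return any(g != "" and g in prediction for g in gold_answers)


def accuracy(predictions: Sequence[str], gold_sets: Sequence[Sequence[str]]) -> float:
    """Fraction of predictions marked correct under the substring rule."""
    if not predictions:
        return 0.0
    hits = sum(is_correct(p, g) for p, g in zip(predictions, gold_sets))
    return hits / len(predictions)
```

Because the rule accepts any substring match, a long generation that merely mentions a gold answer is counted as correct, which is worth keeping in mind when comparing models that differ in verbosity.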

Dimensions of Analysis

We hypothesize that factual knowledge that is less frequently discussed on the web may not be well-memorized by LMs. Previous research often uses the term frequency of object entities in pretraining corpora to understand memorization Févry et al. (2020); Kandpal et al. (2022); Razeghi et al. (2022). Instead, we investigate whether it’s possible to predict memorization based on the input information only, and then apply the findings for modeling improvements, unlike prior analyses. Therefore, our work focuses on the other two variables in a factual knowledge triple: the subject entity and the relationship type.

Subject entity popularity. We use the popularity of the entities measured by Wikipedia monthly page views as a proxy for how frequently the entities are likely to be discussed on the web, instead of using the occurrence of entities or strings in the pretraining corpus Carlini et al. (2022); Kandpal et al. (2022); Razeghi et al. (2022). Calculating frequencies over large pretraining corpora requires massive computations to link entities over billions of tokens, or can result in noisy estimations. (Moreover, several recent models like GPT-3 do not release their pretraining corpora, and it is an open question whether the frequencies in pretraining corpora reflect the frequencies in their private corpora.) Our initial studies show that this is much cheaper (we can get page views by calling the Wikipedia API) and aligns well with our intuition.

Relationship type. We also consider the relationship types as key factors for factual knowledge memorization. For example, even given the same combinations of the subject and object entities, model performance can depend on the relationship types; widely discussed relationship types can be easier to memorize, while types that are less discussed may not be memorized much.

Benchmarks

PopQA. In our preliminary studies, we found that existing common open-domain QA datasets such as Natural Questions (NQ; Kwiatkowski et al. 2019) are often dominated by subject entities with high popularity, and it is often hard to identify relationship types due to diverse question surface forms. To enable a fine-grained analysis of memorization based on the aforementioned analysis dimensions, we construct PopQA, a new large-scale entity-centric open-domain QA dataset about entities with a wide variety of popularity, as shown in Figure 3.

To construct PopQA, we randomly sample knowledge triples of 16 diverse relationship types from Wikidata and convert them into natural language questions, using a natural language template (depicted in Figure 2). We verbalize a knowledge triple $(S,R,O)$ into a question that involves substituting the subject $S$ into a template manually written for the relationship type $R$ . The full list of templates is found in Table 2 of the Appendix. The set of acceptable answers to the question is the set of entities $E$ such that $(S,R,E)$ exists in the knowledge graph. We tried various templates and found that the results were fairly robust to the templates. Since PopQA is grounded to a knowledge base, links to Wikidata entities allow for reliable analysis of popularity and relationship types.
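The verbalization step described above can be sketched as follows. The two templates and the helper names are hypothetical placeholders for illustration only (the actual templates are the ones listed in Table 2 of the Appendix), and the triples in the usage note are made up.

```python
from collections import defaultdict

# Hypothetical templates, for illustration; not the list from Table 2.
TEMPLATES = {
    "occupation": "What is {subject}'s occupation?",
    "director": "Who was the director of {subject}?",
}


def verbalize(subject, relation):
    """Substitute the subject S into the template written for relation R."""
    return TEMPLATES[relation].format(subject=subject)


def build_questions(triples):
    """Turn (S, R, O) triples into questions.

    The acceptable answers for a question are all E such that (S, R, E)
    appears among the triples, so objects are grouped by (S, R) first.
    """
    answers = defaultdict(set)
    for subject, relation, obj in triples:
        answers[(subject, relation)].add(obj)
    return [
        {
            "subject": subject,
            "relation": relation,
            "question": verbalize(subject, relation),
            "answers": sorted(objects),
        }
        for (subject, relation), objects in answers.items()
    ]
```

For example, two triples that share the subject "Film X" and the relation "director" yield a single question with two acceptable answers.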

EntityQuestions. We test on another popular open-domain QA dataset, EntityQuestions Sciavolino et al. (2021), which also covers a long-tail entity distribution. They use Wikipedia hyperlink counts as a proxy of the frequency of entities and sample knowledge triples from Wikidata, from the frequency distributions. Unlike PopQA, EntityQuestions doesn’t provide entity annotations, so we only use 82% of the questions, where the mention of the subject entity has a unique match with a Wikidata entity.

Memorization Depends on Popularity and Relationship Type

We evaluate a range of LMs with varying numbers of parameters, to quantify how much factual knowledge they memorize and how different factors affect those memorization behaviors ( $RQ_{1}$ ).

We evaluate ten models with a varying scale of model size: OPT (Zhang et al. 2022; 1.3, 2.7, 6.7, and 13 billion), GPT-Neo (Black et al. 2022; 1.3, 2.7, 6, and 20 billion), and GPT-3 (Brown et al. 2020; davinci-002, davinci-003) on our benchmark without any fine-tuning. (We did not explore widely-used encoder-decoder models such as T5, as their supervised pretraining consists of QA.)

Instructions and demonstrations.

We use a simple template “Q: <question> A:” to format all of our questions for generative prediction. More sophisticated instructions were attempted in preliminary experiments but they did not improve upon the simple template significantly enough to merit using them, especially given that they may overfit to the model. While we use zero-shot prompting for GPT-3 to reduce API costs, we use 15-shot prompting for all GPT-Neo and OPT models. (Using 15-shot prompts for GPT-3 would cost upwards of $3000 for the combination of vanilla, Contriever, BM25, and GenRead evaluations on davinci-002 and davinci-003.)
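A minimal sketch of how such a prompt could be assembled for both the zero-shot and the few-shot case is given here. The helper names are hypothetical, and joining demonstrations with a newline is an assumption; the text only specifies the "Q: ... A:" pattern itself.

```python
def format_example(question, answer=None):
    """Render one example; the answer slot is left open for the test question."""
    text = f"Q: {question} A:"
    return text if answer is None else f"{text} {answer}"


def build_prompt(question, demonstrations=()):
    """Concatenate (question, answer) demonstrations, then the open test question.

    An empty `demonstrations` gives the zero-shot prompt; passing 15 pairs
    gives a 15-shot prompt. The newline separator is an assumption.
    """
    lines = [format_example(q, a) for q, a in demonstrations]
    lines.append(format_example(question))
    return "\n".join(lines)
```

The model's continuation after the final "A:" is then taken as its prediction.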

Results

The top left column of Figure 4 illustrates the overall performance on PopQA. As shown, even without using in-context examples, larger LMs exhibit reasonable performance: GPT-3 achieves 35% accuracy, and GPT-Neo 20B achieves 25% accuracy. This indicates that large LMs memorize factual knowledge in their parameters to some extent. This section examines which types of knowledge are better memorized and what factors influence memorization.

Subject entity popularity predicts memorization.

Figure 4 (bottom) shows that there is a positive correlation between subject entity popularity and models’ accuracy for almost all relationship types. This supports our hypothesis that subject entity popularity can be a reliable indicator of LMs’ factual knowledge memorization. In general, the correlations between subject entity popularity and accuracy are stronger for larger LMs; GPT-3 davinci-003 shows the highest positive correlation (roughly 0.4) while GPT-Neo 1.3B shows relatively weak positive correlations (approximately 0.1).
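One way such a correlation could be computed is sketched below. It is an assumption that the statistic is a Pearson correlation between log-scaled popularity and per-question 0/1 correctness; the text does not state the exact estimator, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


def popularity_accuracy_correlation(popularities, correct):
    """Correlate log10(page views) with binary correctness (assumed estimator).

    Popularities must be strictly positive for the logarithm to be defined.
    """
    log_pop = [math.log10(p) for p in popularities]
    outcome = [1.0 if c else 0.0 for c in correct]
    return pearson(log_pop, outcome)
```

Running this separately for each relationship type and each model would give one coefficient per cell, matching the granularity described for Figure 4.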

Relationship type affects memorization.

We find that models have a higher average performance for some relationship types than for others. While this is evidence that factual knowledge of some relationship types is more easily memorized than that of others, we also observe that questions of certain relationship types can be easily guessed without memorizing the knowledge triple. Specifically, certain relationship types (e.g., nationalities) allow models to exploit surface-level artifacts in subject entity names Poerner et al. (2020); Cao et al. (2021). Additionally, models often output the most dominant answer entities for questions about relationship types with fewer answer entities (e.g., red for the color relationship type). In Figure 4, relationships with lower correlation (e.g., country, sport) often show higher accuracy, indicating that on those relationship types, models may exploit surface-level clues. On the other hand, for relationship types with relatively low accuracy (e.g., occupation, author, director), larger LMs often show a high correlation. Further details are in Appendix C.1.

Scaling may not help with tail knowledge.

As seen in the left column of Figure 4, there are clear overall performance improvements with scale on the PopQA dataset. However, Figure 5 shows that on both PopQA and EntityQuestions, most of scaling’s positive effect on parametric knowledge comes from questions with high popularity. Specifically, for the questions about the entities whose $\log_{10}{(\rm popularity)}$ is larger than 4, there is an improvement in accuracy as model size increases (red and yellow lines), while performance on questions with lower popularity remains relatively constant (blue and green lines). For the 4,000 least popular questions, GPT-Neo 6B, 20B, and GPT-3 davinci-003 have 15%, 16%, and 19% accuracy, respectively.
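The per-popularity breakdown discussed above amounts to bucketing questions by $\log_{10}$ of their popularity and computing accuracy inside each bucket. The sketch below assumes unit-width bins and a hypothetical function name; the exact bin edges used for Figure 5 are not given in the text.

```python
import math
from collections import defaultdict


def accuracy_by_popularity_bin(popularities, correct, bin_width=1.0):
    """Return {lower edge of log10(popularity) bin: accuracy within that bin}.

    Popularities must be strictly positive. The bin width is an assumption.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for popularity, is_right in zip(popularities, correct):
        lower_edge = math.floor(math.log10(popularity) / bin_width) * bin_width
        totals[lower_edge] += 1
        hits[lower_edge] += bool(is_right)
    return {edge: hits[edge] / totals[edge] for edge in sorted(totals)}
```

Comparing the resulting dictionaries across model sizes makes the pattern in the text concrete: bins with lower edge 4 or above would be expected to improve with scale, while lower bins stay roughly flat.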

This somewhat dampens prior works’ findings that scaling up models significantly improves their factual knowledge memorization Roberts et al. (2020); Kandpal et al. (2022). We hypothesize that this is because their evaluations are often conducted on QA datasets with popular entities. In sum, scaling lowers the threshold of popularity for knowledge to be reliably memorized, but is not projected to move the threshold far into the long tail for practical model scales. (30 PopQA and 26 EntityQuestions questions had popularity less than the smallest popularity bin, and are excluded to avoid showing results for small sample sizes.)

Relationship type results breakdown.

Figure 6 provides a closer look at the relationship between popularity, accuracy, and relationship type; it shows model accuracy over the popularity distributions for director and country. For director, we can see a clear positive trend between popularity and accuracy across models, and as the model size gets larger, the LMs memorize more. On the other hand, in the “country” relationship type, no models show trends, while overall the accuracy is high, indicating the LMs often exploit artifacts to answer less popular questions. We show example models’ predictions in Appendix Section C.3.

Non-parametric Memory Complements Parametric Memory

Our analysis indicates that even the current state-of-the-art LMs struggle with less popular subjects or certain relationship types, and increasing the model size does not lead to further performance improvements. In light of this, we extend our analysis to non-parametric sources of knowledge, as outlined in ( $RQ_{2}$ ). Specifically, we investigate the effectiveness of retrieval-augmented LMs Borgeaud et al. (2022); Lewis et al. (2020), which leverage non-parametric memories (i.e., retrieved text) to improve performance.

Augmenting input. In this work, we try a simple retrieval-augmented LM approach, where we run an off-the-shelf retrieval system off-line to retrieve context from Wikipedia relevant to a question (we use the Wikipedia dump from December 2018), and then we concatenate the retrieved context with the original question. Although increasing the context size often leads to performance gains Izacard and Grave (2021); Asai et al. (2022), we only use the top one retrieved paragraph for simplicity.
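The concatenation step can be sketched as below. The "Context:" prefix, the newline layout, and the function name are all assumptions made for illustration; the text only states that the top retrieved paragraph is concatenated with the question.

```python
def augment_question(question, passages, top_k=1):
    """Prepend the top-k retrieved passages to the question prompt.

    `passages` is assumed to be ranked best-first by an off-line retriever.
    With top_k=1 only the single highest-ranked paragraph is used.
    """
    context = "\n".join(passages[:top_k])
    prompt = f"Q: {question} A:"
    # Fall back to the plain prompt when nothing was retrieved.
    return f"Context: {context}\n{prompt}" if context else prompt
```

Keeping `top_k` as a parameter makes it easy to test whether larger contexts help, at the price of longer inputs and higher latency.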

Retrieval models. We use two widely-used retrieval systems: BM25 Robertson et al. (2009) and Contriever Izacard et al. (2022a). BM25 is a static term-based retriever without training, while Contriever is pretrained on large unlabeled corpora, followed by fine-tuning on MS MARCO Bajaj et al. (2016). We also experiment with a parametric augmentation method, GenRead Yu et al. (2022), which prompts LMs to generate rather than retrieve a contextual document to answer a question. We use the ten LMs in Section 4, resulting in 40 LMs and retrieval-augmented LMs.

Results

Figure 7 shows that augmenting LMs with non-parametric memories significantly outperforms unassisted vanilla LMs. A much smaller LM (e.g., GPT-Neo 2.7B) augmented by the Contriever retrieval results outperforms vanilla GPT-3. Large LMs such as GPT-3 also enjoy the benefits of non-parametric memories. Contriever gives 7% accuracy gains on top of GPT-3 davinci-003. GenRead shows little-to-no performance improvement over vanilla parametric knowledge for smaller models, while the technique shows sizeable gains for GPT-3, especially davinci-003. In addition to its limited effectiveness with smaller LMs, GenRead has potentially prohibitive inference time costs, with GPT-NeoX 20B taking 70 seconds per query.

Non-parametric memories are effective for less popular facts.

How does retrieval augmentation lead to such significant improvements? Figure 9 shows the relationship between the entity popularity and models’ QA performance. It can be seen that retrieval-augmented LMs guided by Contriever or BM25 have a clear advantage over unassisted vanilla LMs, especially on less popular entities, resulting in a significant performance gain. Overall, Contriever-guided LMs outperform BM25-based ones on PopQA, while the BM25-based models perform better on the least popular entities, consistent with the findings from Sciavolino et al. (2021). On the other hand, for more popular entities, parametric knowledge shows equal or higher accuracy, indicating that the state-of-the-art LMs have already memorized the answers, and augmenting input with retrieved-context doesn’t help much or even hurts the performance. Interestingly, GenRead generally outperforms vanilla LMs despite relying on LMs’ parametric memory. This demonstrates the effectiveness of elicitive prompting Wei et al. (2022); Sun et al. (2022) as observed in prior work. However, like vanilla LMs, GenRead shows low performance on less popular entities.

Non-parametric memories can mislead LMs.

We conduct an in-depth analysis of why retrieval-augmented models suffer on more popular entities. We hypothesize that retrieval results may not always be correct or helpful, and can mislead LMs. To test this hypothesis, we group the questions based on two axes: whether unassisted GPT-3 davinci-003 predicts correctly or not, and whether retrieval-augmented predictions are correct or not. For each of the four categories, we calculate recall@1 (whether a gold answer is included in the top 1 document; Karpukhin et al. 2020).

Table 1 shows recall@1 for each group with percentages of the questions falling into each of the categories. For 10% of questions, retrieval-augmentation causes the LM to incorrectly answer a question it could otherwise answer correctly. We found that on those questions, recall@1 is significantly lower than the overall recall@1 (0.14 vs 0.42 overall), indicating that failed retrieval can result in performance drops. Conversely, for the 17% of questions for which retrieval causes the LM to correctly answer a question it would otherwise have failed to answer, the recall@1 is 0.88. We include examples of both cases in Appendix Section C.3.
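The four-way grouping and the per-group recall@1 can be sketched as follows. The record field names are hypothetical, and the case-insensitive substring test for "gold answer is included in the top 1 document" is an assumption about the matching rule.

```python
from collections import defaultdict


def recall_at_1(top_doc, gold_answers):
    """True if any gold answer string appears in the top retrieved document."""
    doc = top_doc.lower()  # assumption: case-insensitive containment
    return any(g and g.lower() in doc for g in gold_answers)


def recall_by_outcome(records):
    """Group questions by (vanilla correct, retrieval-augmented correct).

    Each record is a dict with the hypothetical keys `vanilla_correct`,
    `retrieval_correct`, `top_doc`, and `answers`. Returns, per group, its
    share of all questions and the mean recall@1 inside the group.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (bool(rec["vanilla_correct"]), bool(rec["retrieval_correct"]))
        groups[key].append(recall_at_1(rec["top_doc"], rec["answers"]))
    total = len(records)
    return {
        key: {"share": len(vals) / total, "recall@1": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }
```

The group keyed `(True, False)` corresponds to questions that retrieval breaks, and `(False, True)` to questions that retrieval fixes.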

Adaptive Retrieval: Using Retrieval Only Where It Helps

While incorporating non-parametric memories helps in long-tail distributions, powerful LMs have already memorized factual knowledge for popular entities, and retrieval augmentation can be harmful. As outlined in ( $RQ_{3}$ ), can we achieve the best of both worlds? We propose a simple-yet-effective method, Adaptive Retrieval, which decides when to retrieve passages only based on input query information and augments the input with retrieved non-parametric memories only when necessary. We show that this is not only more powerful than LMs or retrieval-augmented LMs always retrieving context, but also more efficient than the standard retrieval-augmented setup.

Adaptive Retrieval is based on our findings: as the current best LMs have already memorized more popular knowledge, we can use retrieval only when they do not memorize the factual knowledge and thus need to find external non-parametric knowledge. In particular, we use retrieval for questions whose popularity is lower than a threshold (popularity threshold), and for more popular entities, do not use retrieval at all.

Using a development set, the threshold is chosen to maximize the adaptive accuracy, which we define as the accuracy attained by taking the predictions of the retrieval-augmented system for questions below the popularity threshold and the predictions based on parametric knowledge for the rest. We determine the popularity threshold independently for each relationship type.
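The threshold selection can be sketched as a brute-force search over candidate thresholds. This is an illustration under stated assumptions rather than the authors' implementation: the tuple layout and function names are hypothetical, candidate thresholds are taken to be the observed popularity values plus infinity, and ties are broken in favor of the smallest threshold (fewest retrievals).

```python
def adaptive_accuracy(examples, threshold):
    """Accuracy when retrieving only for questions below the threshold.

    `examples` holds (popularity, vanilla_correct, retrieval_correct) tuples.
    """
    if not examples:
        return 0.0
    hits = sum(ret if pop < threshold else van for pop, van, ret in examples)
    return hits / len(examples)


def best_threshold(examples):
    """Brute-force search for the threshold maximizing adaptive accuracy.

    The smallest observed popularity means "never retrieve"; infinity means
    "always retrieve". `max` keeps the first maximum, so ties favor the
    smallest threshold (an assumption).
    """
    candidates = sorted({pop for pop, _, _ in examples}) + [float("inf")]
    return max(candidates, key=lambda t: adaptive_accuracy(examples, t))


def thresholds_per_relation(dev_by_relation):
    """Fit one threshold independently for each relationship type."""
    return {rel: best_threshold(rows) for rel, rows in dev_by_relation.items()}
```

At inference time, a question of relationship type `rel` with subject popularity `p` would use retrieval exactly when `p < thresholds[rel]`.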

Results

Figure 9 shows the results when we adaptively retrieve non-parametric memories based on the per-relationship type thresholds. We can see that adaptively retrieving non-parametric memories is effective for larger models. The best performance on PopQA is using GPT-3 davinci-003 adaptively with GenRead and Contriever, yielding 46.5% accuracy, 5.3% higher than any non-adaptive method.

The threshold shifts with LM scale.

While Adaptive Retrieval shows performance gains for larger models, smaller models do not realize the same benefits; as shown in Figure 9, the performance gain from Adaptive Retrieval is much smaller when we use models smaller than 10 billion parameters. Why does this happen? Figure 10 shows that smaller LMs almost always retrieve, indicating that there are not many questions for which small LMs’ parametric knowledge is more reliable than non-parametric memory. In contrast, large models typically retrieve much less. For example, GPT-3 davinci-003 only retrieves for 40% of questions when paired with BM25, and even the much smaller GPT-NeoX 20B does not retrieve documents on more than 20% of the questions. On EntityQuestions (Appendix Figure 15) all of the LMs retrieve much more, as the questions are mostly about less popular entities.

Adaptive Retrieval reduces inference-time costs.

We also found that Adaptive Retrieval improves efficiency; if we know we do not need to retrieve documents, we can skip retrieval components and the input length becomes shorter, which improves latency in both retrieval and language model components. Figure 11 shows the inference latency of GPT-J 6B and GPT-NeoX 20B, and API costs of GPT-3. Especially for larger LMs, concatenating retrieved context results in significantly increased latency (e.g., for GPT-J 6B, the inference time latency almost doubles). Adaptive Retrieval enables reducing inference time by up to 9% compared with standard retrieval. We also observe cost reduction on EntityQuestions, as shown in Figure 12.

Discussion and Conclusions

This work conducts large-scale knowledge probing to examine the effectiveness and limitations of relying on LMs’ parameters to memorize factual knowledge and to understand what factors affect factual knowledge memorization. Our results show that memorization has a strong correlation with entity popularity and that scaling up models on long-tail distributions may only provide marginal improvements. We also demonstrate that non-parametric memories can greatly aid LMs on these long-tail distributions, but can also mislead LMs on questions about well-known entities, as powerful LMs have already memorized them in their parameters. Based on those findings, we devise simple-yet-effective Adaptive Retrieval, which only retrieves when necessary, using a heuristic based on entity popularity and relationship types. Our experimental results show that this method is not only more powerful than LMs or previous retrieval-augmented LMs but also more efficient.

Limitations

This work focuses on entity-centric factual knowledge and demonstrates that LMs’ memorization is heavily affected by the popularity of the entities and the aspect of the entities being asked in the questions. It is important to emphasize that for running controlled experiments, we have relied on two synthetic datasets, and the extent to which our results apply to naturally occurring factual knowledge has not been firmly established. While we can be fairly confident about the relationship between scaling, retrieval, popularity, relationship type, and performance for the kinds of knowledge studied here, the effectiveness of Adaptive Retrieval will depend on many details of the question answering pipeline. Moreover, our work depends on a definition of popularity that is time-dependent and may not perfectly reflect how frequently entities are discussed on the web. Wikipedia page views are one possible definition of popularity for which we observe our results, and we invite others to improve upon it in future work. Further research can expand upon this simple approach, perhaps drawing on insights from Kadavath et al. (2022) to improve the effectiveness of Adaptive Retrieval.

It is an open question whether the same findings are applicable to other types of world knowledge such as commonsense. We conjecture that the concept of the subject topic (entity), as well as the aspect (relationship type), can be applied with some minor modifications, and that future work can quantify memorization following our scheme.

Ethical Considerations

Recent work Huang et al. (2022) shows that LMs memorize personal information available on the web, which has significant security issues. Our evaluation focuses on the memorization of general entity-centric knowledge, but our findings can be applicable to those areas. Our findings suggest that LMs are likely to have less reliable knowledge of minority groups. Parrish et al. (2022) established that models often rely on stereotypes to answer in uncertain cases, so our results indicate that LMs are likely to rely on stereotypes disproportionately for minority groups. Future work could investigate whether retrieval augmentation reduces bias in these cases.

Acknowledgements

We thank the UW NLP group members for their helpful discussions, and Joongwon Kim, Wenya Wang, and Sean Welleck for their insightful feedback on this paper. This research was supported by NSF IIS-2044660, ONR N00014-18-1-2826, ONR MURI N00014-18-1-2670, and an Allen Distinguished Award. AM is funded by a Goldwater Scholarship and AA is funded by the IBM PhD Fellowship.

References

Appendix

Appendix A Details of PopQA Constructions

In this work, we use the following 16 relationship types, and the authors of this paper manually annotated templates to verbalize knowledge triples into natural language questions. We show the final list of the templates used to create PopQA in Table 2.

Figure 3 shows the distribution of subject popularity of PopQA and EntityQuestions versus the popular NQ benchmark. NQ may have multiple entities, so the distribution of the least popular entity per question is shown. Subject entities from NQ were extracted using TagMe Ferragina and Scaiella (2010) on the NQ-open development set with a score threshold of 0.22. TagMe returns the title of a Wikidata entity which can be directly used to find popularity.

Knowledge triples sampling.

In the construction of the PopQA dataset, knowledge triples are sampled with higher weight given to more popular entities; otherwise, the distribution would be dominated by the tail and we would not have enough high-popularity entities to complete our analysis. Specifically, when considering whether to sample a particular knowledge triple, we include the knowledge triple if and only if $f>\exp(8R-6)$ , where $R\sim U(0,1)$ is a unit uniform pseudo-random number and $f$ is the exact match term frequency of the subject entity’s aliases in an 800 MB random sample of C4. To increase diversity, once 2000 knowledge triples of a particular relation type have been sampled, triples of that type are no longer sampled.
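The acceptance rule $f>\exp(8R-6)$ can be sketched directly, together with the keep probability it implies. Rearranging the inequality gives $P(\text{keep}) = P\big(R < (\ln f + 6)/8\big)$, which is $(\ln f + 6)/8$ clipped to $[0,1]$. The function names below are hypothetical, and the per-relation cap of 2000 triples is left out of the sketch.

```python
import math
import random


def keep_triple(f, rng):
    """Apply the rule f > exp(8R - 6) with a fresh R ~ U(0, 1).

    `f` is the term frequency of the subject's aliases in the corpus sample.
    """
    r = rng.random()
    return f > math.exp(8 * r - 6)


def keep_probability(f):
    """Closed form implied by the rule: clip((ln f + 6) / 8, 0, 1)."""
    if f <= 0:
        return 0.0  # exp(.) is always positive, so f <= 0 is never kept
    return min(1.0, max(0.0, (math.log(f) + 6) / 8))
```

Under this rule the keep probability grows linearly in $\ln f$: a subject with $f=1$ is kept with probability 0.75, and any subject with $f \geq e^{2} \approx 7.4$ is always kept.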

Appendix B Experimental Details

GPT-3 API usage totaled $275. We ran 14,282 questions through two GPT-3 davinci models using four different methods: vanilla experiments cost $13 ($0.46 per 1000 questions), Contriever-augmented experiments cost $88 ($3.08 per 1000 questions), BM25-augmented experiments cost $81 ($2.80 per 1000 questions), and GenRead experiments cost $93 ($3.25 per 1000 questions).
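As a rough back-of-the-envelope check on how these per-1000-question costs interact with Adaptive Retrieval, the sketch below blends the vanilla and retrieval-augmented costs by the fraction of questions that trigger retrieval. The function name is hypothetical and the model is deliberately simple: it assumes each question costs either the vanilla or the retrieval-augmented average, and it reuses the 40% retrieval rate reported in the main text for GPT-3 davinci-003 with BM25, even though the per-1000 costs above are averaged over two davinci models.

```python
def blended_cost_per_1000(vanilla_cost, retrieval_cost, retrieve_fraction):
    """Expected cost per 1000 questions when only a fraction use retrieval."""
    if not 0.0 <= retrieve_fraction <= 1.0:
        raise ValueError("retrieve_fraction must lie in [0, 1]")
    return retrieve_fraction * retrieval_cost + (1.0 - retrieve_fraction) * vanilla_cost


# Vanilla: $0.46 per 1000; BM25-augmented: $2.80 per 1000; retrieval on 40% of questions.
bm25_adaptive = blended_cost_per_1000(0.46, 2.80, 0.40)
saving = 1.0 - bm25_adaptive / 2.80
```

Under these assumptions the blended cost comes to about $1.40 per 1000 questions, roughly half of the always-retrieve BM25 cost. This is an estimate derived from the figures above, not a measured result.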

To run experiments using LMs larger than two billion parameters, we use a single V100 Volta GPU with 32GB of GPU memory. We use int8bit Zeng et al. (2022) quantization with the OPT 13 billion and GPT-Neo 20 billion models to make them fit our GPUs. In our preliminary experiments using GPT-Neo 6 billion, we did not observe a notable performance drop from using the quantization.

Constructing few-shot contexts.

For PopQA, we sample few-shot examples stratified by relationship type to diversify the samples: for each of the 15 relationship types other than the one in the test question, we sample one random question-answer pair to include in the context. For EntityQuestions, we take a simple random sample of 15 question-answer pairs because there are more than 16 relationship types.
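The stratified sampling for PopQA can be sketched as follows. The pool layout and function name are hypothetical, and the demonstrations are returned in sorted relation order; the text does not say how they are ordered in the prompt.

```python
import random


def sample_demonstrations(pool, test_relation, rng):
    """Pick one random (question, answer) pair per relation other than the test one.

    `pool` maps each relationship type to a list of (question, answer) pairs.
    With 16 relationship types this yields 15 demonstrations.
    """
    demos = []
    for relation in sorted(pool):
        if relation == test_relation or not pool[relation]:
            continue
        demos.append(rng.choice(pool[relation]))
    return demos
```

Excluding the test question's own relationship type keeps the demonstrations from leaking the expected answer format or a dominant answer for that type.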

Details of deciding thresholds.

We use 75% of PopQA to determine a popularity threshold for each relation type. Using brute force search, we select the threshold to maximize the adaptive accuracy, which we define as the accuracy attained by taking the predictions of the retrieval-augmented system for questions below the popularity threshold and the predictions based on parametric knowledge for the rest.

We then evaluate adaptive accuracy using the learned thresholds on the remaining 25% of PopQA, and repeat with 100 different random splits and take the mean to obtain the reported adaptive accuracy measurement.
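The repeated-split protocol can be sketched end to end as below. Several details are assumptions for illustration: the tuple layout and function names are hypothetical, the split is drawn over the whole dataset rather than per relation, candidate thresholds are the observed development-set popularities plus infinity, and a relation that happens to be absent from a development split falls back to always retrieving.

```python
import random
from collections import defaultdict
from statistics import mean


def fit_thresholds(dev):
    """Brute-force one popularity threshold per relation on the dev portion.

    `dev` holds (relation, popularity, vanilla_correct, retrieval_correct).
    """
    rows_by_relation = defaultdict(list)
    for relation, pop, van, ret in dev:
        rows_by_relation[relation].append((pop, van, ret))
    thresholds = {}
    for relation, rows in rows_by_relation.items():
        candidates = sorted({pop for pop, _, _ in rows}) + [float("inf")]
        thresholds[relation] = max(
            candidates,
            key=lambda t: sum(ret if pop < t else van for pop, van, ret in rows),
        )
    return thresholds


def evaluate(test, thresholds):
    """Adaptive accuracy on held-out questions using the learned thresholds."""
    hits = 0
    for relation, pop, van, ret in test:
        # Assumption: relations unseen in the dev split always retrieve.
        threshold = thresholds.get(relation, float("inf"))
        hits += bool(ret if pop < threshold else van)
    return hits / len(test)


def repeated_split_accuracy(examples, n_splits=100, dev_fraction=0.75, seed=0):
    """Mean held-out adaptive accuracy over repeated random 75/25 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * dev_fraction)
        scores.append(evaluate(shuffled[cut:], fit_thresholds(shuffled[:cut])))
    return mean(scores)
```

Averaging over many splits reduces the variance that a single 25% test portion would introduce, since each learned threshold depends on which popularity values fall into the development portion.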

Appendix C Detailed Results

Figure 16 shows the full result of per-relationship type accuracy for all relationship types in PopQA. Figure 17 shows the correlations for all relation types. Figures 19 and 18 show the same results for the EntityQuestions dataset.

Negative correlations of capital on EntityQuestions.

As shown in Figure 19, the capital relationship type shows negative correlations in EntityQuestions, while on PopQA, this relationship shows relatively high correlations. We found that in EntityQuestions, this capital relationship type has many low-popularity questions whose answers are included in subject entity names (e.g., subject="canton of Marseille-Belsunce", object="Marseille"). This causes performance to have a U-shaped relationship with popularity for the capital relationship type, so if most of the questions sampled come from the top half of popularity, the linear correlation will be positive, and vice versa.

C.2 Retrieval-augmented LM results

Figure 13 shows the overall performance of 40 LMs and retrieval-augmented LMs on PopQA. Retrieval augmentation largely improves performance across different LMs, and much smaller models (GPT-Neo 1.3B) can perform on par with GPT-3. Figure 14 shows the results on EntityQuestions. Due to computational and time constraints, we were only able to run vanilla and Contriever results for most models.

Adaptive Retrieval for EntityQuestions.

Figure 15 shows the proportion of questions above the retrieval threshold for various models using Adaptive Retrieval on EntityQuestions. Because EntityQuestions has a large quantity of low-popularity questions, models (especially smaller ones) must rely heavily on retrieval.

Full results on all relationship types.

Figure 20 shows the full results on PopQA of the retrieval-augmented LMs and unassisted LMs on 16 relationship types using three different LMs as backbones. Figure 21 shows these results for GPT-3 davinci-003 on EntityQuestions.

C.3 Qualitative Results

Table 3 shows several examples on PopQA, where GPT-3 davinci-003 answers correctly while the Contriever-augmented version fails to answer. Along with the low recall@1 of 0.14 for this group, Table 3 suggests that the most common reason retrieval can be harmful is that it retrieves a document about a mistaken entity, such as a person with the same name as the subject, or an entity that simply is not relevant to the question (as in the case of “Noel Black”).

Table 4 shows several examples on PopQA, where GPT-3 davinci-003 answers correctly only when augmented with Contriever. The recall@1 for this case is 0.88, which is significantly higher than the overall recall. Note that in the second example, the retrieval caused the LM to answer correctly, but only by coincidence: the subject entity “Pierre” actually refers to the city in South Dakota, not the basketball player. Otherwise, retrieval appears to be helpful because it provides the relevant information directly.