Generate rather than Retrieve: Large Language Models are Strong Context Generators

Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, Meng Jiang

Introduction

Knowledge-intensive tasks, such as open-domain question answering (QA) and fact checking, require access to a large amount of world or domain knowledge (Petroni et al., 2021). These tasks are challenging even for humans without access to an external knowledge source such as Wikipedia. A common thread among existing methods for knowledge-intensive tasks is a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from Wikipedia and then conditions the prediction of the answer on these documents along with the question (Karpukhin et al., 2020; Lewis et al., 2020; Izacard & Grave, 2021). Nevertheless, these methods suffer from three main drawbacks. First, candidate documents for retrieval are chunked (e.g., 100 words) and fixed, so the retrieved documents might contain noisy information that is irrelevant to the question. Second, the representations of questions and documents are typically obtained independently in modern two-tower dense retrieval models (Karpukhin et al., 2020), so only shallow interactions between them are captured (Khattab et al., 2021). Third, document retrieval over a large corpus requires the retriever model to first encode all candidate documents and store representations for each document. These two operations limit the parameter count of dense retrievers and the size of embedding vectors, so such retrievers cannot benefit from the world knowledge or deduction capabilities of large language models (Levine et al., 2022).

In this paper, we propose to leverage large language models, such as InstructGPT (Ouyang et al., 2022), to directly generate contextual documents for a given question, instead of retrieving relevant documents from an external corpus, such as Wikipedia. Our approach has two main advantages. First, we show that generated contextual documents contain the correct answer more often than the top retrieved documents. We believe this is because large language models generate contextual documents by performing deep token-level cross-attention between all the question and document contents, resulting in generated documents that are more specific to the question than retrieved documents. Second, we show that our approach significantly outperforms directly generating answers from large language models despite not incorporating any new external information. This is mainly because the task of generating document-level contexts is close to the objective of causal language modeling pre-training, so the world knowledge stored in the model parameters can be better utilized.

We show, on multiple datasets, that generated documents are more likely to contain correct answers than the top retrieved documents. Notably, in dense retrieval methods, as more documents are retrieved, the recall of documents containing the correct answer increases (Karpukhin et al., 2020). However, the recall performance does not scale as well with generated documents because even with sampling methods, generated documents tend to contain duplicate information. In order to improve the recall performance of generated documents, we propose a novel clustering-based prompt method. We synthesize a prompt with in-context demonstrations of question-document pairs sampled from diverse clusters. These prompts result in generated documents that cover different perspectives of the question and improve the scaling of performance as more documents are generated per question.

In contrast to the retrieve-then-read pipeline, our method is essentially a generate-then-read pipeline. Specifically, it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated document to produce the final answer. The reader can still be a large model (e.g., InstructGPT (Ouyang et al., 2022)) used under a zero-shot setting, or a small one (e.g., FiD (Izacard & Grave, 2021)) fine-tuned with generated documents on the training split of the target dataset. We evaluate our proposed method on three different knowledge-intensive tasks and demonstrate its effectiveness on both zero-shot and supervised settings.

Overall, our main contributions can be summarized as follows:

We propose a novel generate-then-read pipeline for solving knowledge-intensive tasks, i.e., replacing the process of retrieving documents from Wikipedia or searching for related documents on Google, by prompting a large language model to generate relevant contextual documents.

We propose a novel clustering-based prompting approach to generate multiple diverse contextual documents that increases the likelihood of covering the correct answer. We demonstrate this approach can significantly improve performance on end QA and other downstream tasks.

We conduct extensive experiments with three knowledge-intensive NLP tasks under both zero-shot and supervised settings. Notably, our method can match or even outperform retrieve-then-read pipeline methods, without retrieving any documents from any external knowledge source.

Related Work

Mainstream methods for solving knowledge-intensive NLP tasks employ a retrieve-then-read model pipeline. Given a question, this model first leverages a retriever over a large evidence corpus (e.g. Wikipedia) to fetch a set of relevant documents that may contain the answer. A reader is then used to peruse the retrieved documents and predict an answer. Recent follow-up work has mainly focused on improving the retriever (Karpukhin et al., 2020; Qu et al., 2021; Sachan et al., 2022) or the reader (Izacard & Grave, 2021; Cheng et al., 2021; Yu et al., 2022), or training the system end-to-end (Lewis et al., 2020; Singh et al., 2021). Early retrieval methods mainly employed sparse retrievers, such as BM25 (Chen et al., 2017). Recently, ORQA (Lee et al., 2019) and DPR (Karpukhin et al., 2020) have revolutionized the field by utilizing dense contextualized vectors for document indexing, leading to superior performance over traditional approaches. We propose an alternative approach which forgoes retrieval, instead extracting the knowledge from the model parameters of a large language model. We show that our approach can be combined with dense retrievers to outperform both methods used independently. Our method can also be combined with any reader mechanism, allowing generated context documents to be plugged into any current knowledge-intensive NLP pipeline.

2.2 Generator as Retriever for Obtaining Contextual Documents

Recent works have investigated using auto-regressive language models to generate identifier strings for documents, as an intermediate target for retrievals, such as entity names (De Cao et al., 2020) or distinctive n-grams that can be mapped to full passages (Bevilacqua et al., 2022). However, one needs to create the identifiers, hence the structure was not thoroughly evaluated on a large-scale benchmark (Bevilacqua et al., 2022). Other works have demonstrated that the knowledge stored in the parameters of pre-trained language models could be “retrieved” to some extent by directly generating text (Petroni et al., 2019; Roberts et al., 2020). However, the previous work only used generation for query expansion (Mao et al., 2021), which did not exploit the potential of directly generating contextual documents for open-domain questions. Different from the above approaches that aimed to train a generator model to produce contextual document identifiers (which is still using the original Wikipedia text) or provide data augmentation to retrievers, our work directly generates contextual documents for given questions.

2.3 NLP Models Enhanced by Large Language Model Outputs

A line of recent work has shown that relevant knowledge can be elicited from large language models, especially for those domains that lack appropriate knowledge bases with sufficient coverage (Liu et al., 2022b; Fang et al., 2022). For example, Liu et al. (2022b) proposed leveraging GPT-3 to generate relevant contexts, then providing the contexts as additional input when answering a commonsense question. Another line of work focused on prompting a large language model to generate a series of intermediate reasoning steps, often referred to as chain-of-thought (Wei et al., 2022b; Kojima et al., 2022; Li et al., 2022). The prompt consists of an instruction (e.g., Let’s think step by step!), a few demonstrations that are fixed for each task, and a new-question placeholder. The demonstrations are human-written, and each consists of a question in the style of the task and a series of intermediate reasoning steps that is helpful for answering the question. Our work does not require any human annotation, but adds to this line of work of leveraging model generated text to guide further generations. In our case, we apply this approach to knowledge-intensive tasks, which have not been explored by previous work.

Proposed Method

In this section, we present details of our proposed novel generate-then-read (GenRead) pipeline for solving various knowledge-intensive tasks. Specifically, it first prompts a large language model to generate contextual documents with respect to a given query, then reads the generated documents to predict the final answer. The reader can either be a large model (e.g., InstructGPT) used for the zero-shot setting, or a small one (e.g., FiD) fine-tuned with generated documents on the training split of the target dataset. We introduce the zero-shot setting in $\S$ 3.1 and supervised setting in $\S$ 3.2.

Under the zero-shot setting, there is no training data – neither questions nor contextual documents. When tested on the open-domain QA task, most existing large language models directly encode the given question and predict the answer (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022). Specifically, the question $q$ , associated with some text prompt, is input to the model, which then generates the answer, denoted as $p(a|q,\theta)$ , where $\theta$ represents the pre-trained model parameters. In practice, the maximum a posteriori estimation (MAP) is the final answer, i.e., $\hat{a}=\operatorname*{arg\,max}_{a}p(a|q,\theta)$ . However, this way of directly asking large language models to output answers often leads to poor performance, as it leaves a considerable amount of additional world knowledge unexploited (Levine et al., 2022). On the contrary, the zero-shot retrieve-then-read pipeline first uses an off-the-shelf retriever to fetch relevant documents from an external knowledge source such as Wikipedia, then asks the large language model to read the documents and predict the answer.

In this work, we improve the performance by introducing an additional auxiliary generated document variable $d$ , and then extend the model to have the form $p(a|q)=\sum_{i}p(a|d_{i},q)p(d_{i}|q)$ . In practice, we cannot sum over all possible documents $d$ . Therefore, the most common approach is to compute the MAP estimate $\hat{d}=\operatorname*{arg\,max}\hat{p}(d)$ using beam search, and then to approximate the sum over $d$ with this single value. We refer to this two-step approach as a generate-then-read pipeline.
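The single-document approximation described above can be written out step by step. This is a sketch of the reasoning rather than a derivation taken from the paper; it reads $\hat{p}(d)$ as the generator's distribution over documents given $q$, which is an assumption about the notation.

```latex
\begin{align*}
p(a \mid q) &= \sum_{i} p(a \mid d_{i}, q)\, p(d_{i} \mid q)
  && \text{(marginalize over documents)} \\
            &\approx p(a \mid \hat{d}, q)\, p(\hat{d} \mid q)
  && \text{(keep only the MAP document } \hat{d}\text{)} \\
\hat{a}     &= \operatorname*{arg\,max}_{a}\; p(a \mid \hat{d}, q)
  && \text{(} p(\hat{d} \mid q) \text{ does not depend on } a\text{)}
\end{align*}
```

Under this approximation the factor $p(\hat{d} \mid q)$ is constant with respect to $a$, so the final answer only requires conditioning the reader on the question and the one generated document.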

Step 1: Generate. In this step, we first prompt a large language model (e.g., InstructGPT (Ouyang et al., 2022)) to generate documents based on the given question. For example, the input to the language model could be “Generate a background document to answer the given question. {question placeholder}”. Any decoding strategy can be used (e.g., greedy decoding, beam search), but we used greedy decoding throughout the zero-shot experiments for simplicity and reproducibility.

Step 2: Read. In the second step, we use the generated document $\hat{d}$ along with the input question to produce the final answer from the large language model. This is the same setting as “zero-shot” reading comprehension, as widely studied in existing works (Brown et al., 2020; Lazaridou et al., 2022). We choose appropriate prompts from P3 (Bach et al., 2022), such as “Refer to the passage below and answer the following question. Passage: {background placeholder} Question: {question placeholder}”. Finally, the language model is fed the prompted text to generate an answer.
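The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a completion (for instance, a wrapper around a greedy-decoding API call), and `toy_llm` is a stand-in so that the example runs without any model access. Only the two prompt strings come from the text.

```python
from typing import Callable

# Prompt templates quoted from the text; the placeholder names are illustrative.
GENERATE_PROMPT = (
    "Generate a background document to answer the given question. {question}"
)
READ_PROMPT = (
    "Refer to the passage below and answer the following question. "
    "Passage: {background} Question: {question}"
)


def generate_then_read(question: str, llm: Callable[[str], str]) -> str:
    """Zero-shot generate-then-read: generate one document, then read it.

    `llm` is a hypothetical prompt -> completion callable (an assumption).
    """
    # Step 1: generate a single contextual document for the question.
    document = llm(GENERATE_PROMPT.format(question=question)).strip()
    # Step 2: ask the same model to answer, conditioned on that document.
    answer = llm(READ_PROMPT.format(background=document, question=question))
    return answer.strip()


def toy_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs offline; not a real language model."""
    if prompt.startswith("Generate a background document"):
        return "Zeus was the patron god of the city of Olympia."
    return "Olympia" if "Olympia" in prompt else "unknown"


print(generate_then_read("What city was Zeus the patron god of?", toy_llm))
# prints "Olympia"
```

Passing the model as a callable keeps the generator and the reader interchangeable, which mirrors the point that the same large language model can play both roles.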

3.2 Supervised Setting

Although large language models demonstrate impressive zero-shot learning abilities, their performance still lags behind the supervised setting. Therefore, we also explore how the generated documents from large language models can benefit the supervised setting. As directly fine-tuning large language models on downstream datasets could be prohibitively expensive, we leverage a small reader model such as FiD to peruse the generated documents under the supervised setting.

Under the supervised setting, scaling the number of retrieved documents can lead to better performance (Karpukhin et al., 2020; Izacard & Grave, 2021). This is mainly because retrieving more documents can cover more relevant information and knowledge, i.e., a higher recall score. Nevertheless, asking a large language model to generate multiple high-quality contextual documents is a challenging task. Dense retrieval methods can fetch multiple documents covering different perspectives of the answer. Compared to dense retrievers, simply prompting a large language model to generate multiple contextual documents often leads to low knowledge coverage, since the contents generated by multiple decoding passes from the same input tend to be similar. Sampling decoding methods, such as nucleus sampling (Holtzman et al., 2020), can diversify the generation process to some extent, but the knowledge content of generated texts still tends to be highly repetitive when used to generate documents for a given question. (We treated nucleus sampling as a baseline to generate multiple documents, with $p=0.95$ .) We further propose two novel solutions, diverse human prompts and clustering-based prompts, which are elaborated on in this section.
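For readers unfamiliar with the nucleus sampling baseline, the following sketch shows the core top-$p$ step on a toy next-token distribution. It is a plain-Python illustration of the general technique, not the decoding code used in the experiments; the function names and the toy probabilities are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def nucleus_filter(
    probs: Sequence[float], p: float = 0.95
) -> List[Tuple[int, float]]:
    """Keep the smallest set of most-probable tokens whose mass reaches p.

    Returns (token_id, renormalized_probability) pairs.
    """
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return [(token_id, prob / total) for token_id, prob in kept]


def sample_token(
    probs: Sequence[float], p: float = 0.95, rng: Optional[random.Random] = None
) -> int:
    """Sample one token id from the truncated, renormalized distribution."""
    rng = rng or random.Random()
    kept = nucleus_filter(probs, p)
    ids = [token_id for token_id, _ in kept]
    weights = [weight for _, weight in kept]
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy distribution over four tokens (illustrative values only).
toy_probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(toy_probs, p=0.9))
print(sample_token(toy_probs, p=0.9, rng=random.Random(0)))
```

Because every sampled document starts from the same prompt, the truncated distributions remain similar across runs, which is consistent with the repetitiveness noted above.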

To avoid similar token distributions under a single prompt, we ask human annotators to provide different prompts, so that the generated documents become more diverse. This method is simple, but can effectively vary the token distribution during generation. In the experiments, we empirically found that this method improves the retrieval performance (Figure 2). However, this method suffers from two drawbacks. On one hand, it requires human annotators to write different prompts, which cannot be easily generalized to different knowledge-intensive tasks. On the other hand, different large language models might be sensitive to different prompt words, so a set of prompt words that works well for one large language model might not work for another.

3.2.2 Clustering-based Prompts

To increase knowledge coverage in generated documents, we propose a novel clustering-based prompt method. It first clusters the representations of a set of documents into $K$ classes ( $K=2$ in Figure 1), where the number of classes is equal to the number of documents that need to be generated in the end. Next, it randomly selects $n$ question-document pairs ( $n=5$ in Figure 1) from each cluster. Lastly, the $n$ question-document pairs from each cluster are presented to a large language model as in-context demonstrations for generating a document for a given question. In this way, the large language model is conditioned on different distributions of examples, resulting in generated documents that cover different perspectives. We show this in Figure 1 and describe the details of each step as follows.

Step 1: Get One Initial Document Per Question. Similar to the zero-shot setting, we first ask a large language model to generate one contextual document $d$ for each question $q\in\mathcal{Q}$ , where $\mathcal{Q}$ is the set of questions in the training split. Alternatively, we can use an unsupervised retriever (e.g., BM25) to obtain a document from Wikipedia. We now have a question-document pair set $\{q_{i},d_{i}\}_{i=1}^{|\mathcal{Q}|}$ .

Step 2: Encode each Document, Do K-means Clustering. We then use a large language model (i.e., GPT-3) to encode each question-document pair, i.e., $\textbf{e}_{i}=\text{GPT-3}([q_{i},d_{i}])$ , resulting in a 12,288-dimensional vector per document. Then, we use K-means to cluster all embedding vectors $\{\textbf{e}_{i}\}_{i=1}^{|\mathcal{Q}|}$ into $K$ sets, so each question-document pair is assigned a unique cluster id $c\in\{1,\text{...},K\}$ . We vary $K$ in the experiments, as illustrated in Figure 2.
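The clustering step can be illustrated with a small, dependency-free K-means (Lloyd's algorithm). This is only a sketch under stated assumptions: the embeddings here are toy 2-dimensional vectors rather than 12,288-dimensional GPT-3 encodings, the function names are hypothetical, and cluster ids are 0-based instead of running from 1 to $K$. In practice an off-the-shelf K-means implementation would likely be used.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    vectors: Sequence[Sequence[float]], k: int, iters: int = 50, seed: int = 0
) -> List[int]:
    """Assign each vector a cluster id in {0, ..., k-1} via Lloyd's algorithm."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(list(vectors), k)]
    assignments = [-1] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Toy stand-ins for the embeddings e_i of four question-document pairs.
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(kmeans(toy_embeddings, k=2))
```

The returned list plays the role of the cluster id $c$ attached to each question-document pair.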

Step 3: Sample and Generate $K$ Documents. Lastly, we sample $n$ question-document pairs from each cluster $c$ , denoted as $\{q_{c1},d_{c1};q_{c2},d_{c2};\text{...};q_{cn},d_{cn}\}$ , in which $n$ is a hyperparameter (in the experiments, we set $n=5$ and found that increasing $n$ does not bring extra improvement). Then, the $n$ sampled question-document pairs from the same cluster serve as in-context demonstrations for the large language model to generate a contextual document. For example, the input to the large language model could be “{ $q_{c1}$ placeholder} { $d_{c1}$ placeholder} … { $q_{cn}$ placeholder} { $d_{cn}$ placeholder} {input question placeholder}”. By enumerating the sampled documents in these $K$ clusters, we finally obtain $K$ generated documents. By conditioning on different sampled in-context demonstrations collected from different clusters, the large language model is biased toward different perspectives. Although these different perspectives exist in a latent manner, we empirically show that the method works well in practice, by comparing it with sampling methods, diverse human prompts (Figure 2 and Table 2) and randomly sampling $n$ pairs from the entire dataset (Table 11).
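Step 3 can be sketched as a prompt-construction routine: one prompt per cluster, each holding $n$ sampled demonstrations followed by the input question. The sketch assumes cluster ids have already been computed (Step 2) and follows the space-separated prompt layout quoted above; the function name, the `seed` parameter, and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (question, document)


def build_cluster_prompts(
    pairs: Sequence[Pair],
    cluster_ids: Sequence[int],
    question: str,
    n: int = 5,
    seed: int = 0,
) -> List[str]:
    """Build one in-context prompt per cluster for the given input question."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[Pair]] = defaultdict(list)
    for pair, cluster_id in zip(pairs, cluster_ids):
        by_cluster[cluster_id].append(pair)

    prompts = []
    for cluster_id in sorted(by_cluster):
        members = by_cluster[cluster_id]
        # Sample n demonstrations (or all of them if the cluster is small).
        demos = rng.sample(members, min(n, len(members)))
        parts = [f"{q} {d}" for q, d in demos]
        parts.append(question)  # the new question goes last
        prompts.append(" ".join(parts))
    return prompts


# Toy question-document pairs with precomputed cluster ids (illustrative).
toy_pairs = [("q1", "d1"), ("q2", "d2"), ("q3", "d3"), ("q4", "d4")]
toy_cluster_ids = [0, 0, 1, 1]
for prompt in build_cluster_prompts(toy_pairs, toy_cluster_ids, "new question?", n=1):
    print(prompt)
```

Feeding each of the $K$ prompts to the language model once would then yield the $K$ generated documents for that question.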

Experiments

In this section, we conduct comprehensive experiments on three knowledge-intensive NLP tasks, including open-domain QA (NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and WebQ (Berant et al., 2013)), fact checking (FEVER (Thorne et al., 2018) and FM2 (Eisenschlos et al., 2021)) and open-domain dialogue system (WoW (Dinan et al., 2019)). More detailed dataset information can be found in Appendix A.1. To evaluate the model performance, we use exact match (EM) score for evaluating open-domain QA (Zhu et al., 2021). An answer is considered correct if and only if its normalized form has a match in the acceptable answer list. We also employ Recall@K (R@K) as an intermediate evaluation metric, measured as the percentage of top-K retrieved or generated documents that contain the answer. This metric is commonly used in evaluations of previous works (Karpukhin et al., 2020; Izacard & Grave, 2020; Sachan et al., 2022). For other knowledge-intensive tasks, we follow the KILT benchmark (Petroni et al., 2021) to use accuracy (ACC) for fact checking and F1 / Rouge-L (R-L) score for open-domain dialogue system.
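The two QA metrics can be sketched as follows. Several details are assumptions rather than facts from the paper: the normalization (lowercasing, stripping punctuation and English articles) is the widely used SQuAD-style recipe, answer containment is tested by normalized substring matching, and Recall@K is computed in the question-level form common in dense retrieval work, i.e., the fraction of questions for which at least one of the top-K documents contains an acceptable answer.

```python
import re
import string
from typing import Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answers: Sequence[str]) -> bool:
    """True iff the normalized prediction equals some normalized answer."""
    return normalize_answer(prediction) in {normalize_answer(a) for a in answers}


def recall_at_k(
    docs_per_question: Sequence[Sequence[str]],
    answers_per_question: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of questions whose top-k documents contain any answer."""
    hits = 0
    for docs, answers in zip(docs_per_question, answers_per_question):
        top_k = [normalize_answer(d) for d in docs[:k]]
        normalized = [normalize_answer(a) for a in answers]
        normalized = [a for a in normalized if a]  # skip empty strings
        if any(a in d for d in top_k for a in normalized):
            hits += 1
    return hits / len(docs_per_question)


# Toy data (illustrative only).
toy_docs = [["Zeus was worshipped at Olympia.", "Unrelated text."], ["Nothing here."]]
toy_answers = [["Olympia"], ["Paris"]]
print(exact_match("The Olympia!", ["Olympia"]))   # prints True
print(recall_at_k(toy_docs, toy_answers, k=1))    # prints 0.5
```

If a different reading of Recall@K is intended (for example, a per-document rate), only the aggregation inside `recall_at_k` would need to change.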

We first compare our proposed GenRead approach with various large language models proposed in recent years, including GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), FLAN (Wei et al., 2021), GLaM (Du et al., 2022), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022) and InstructGPT (Ouyang et al., 2022). Due to the space limitation, we only put the best performance on each dataset in Table 1, in which the line is called previous SoTA methods. In addition, their corresponding model parameters and performance are listed in Table 9 in Appendix. All of these baseline methods use the same input formats, i.e., [prompt words; question].

GenRead is based on InstructGPT with 175B parameters. In order to fully evaluate the effectiveness of our proposed method, we also compare with InstructGPT augmented with retrieved documents from Wikipedia or Google search. The baseline methods (1) BM25 / Contriever + InstructGPT; (2) Google + InstructGPT; (3) DPR + InstructGPT have the same input format as our GenRead, i.e., [prompt words; contextual document; question]. BM25 is a traditional sparse retrieval method. Contriever (Izacard et al., 2022a) is a state-of-the-art unsupervised dense retrieval model. DPR (Karpukhin et al., 2020) is a supervised dense retrieval model directly trained on NQ, TriviaQA and WebQ datasets. We note that comparing with the above three methods is challenging because our method only relies on the large language model itself, without using any external corpus.

In the experiments, we use InstructGPT as our backbone model. As shown in Table 1, compared with state-of-the-art large language models, our proposed GenRead with the InstructGPT reader improves its performance by generating contextual documents and conditioning on the generated documents, even though no new data is introduced, and the generator and reader have the exact same parameters. Specifically, GenRead can improve the EM score by +6.9 on three open-domain QA benchmarks, compared to the original InstructGPT. We also make a similar observation on fact checking and open-domain dialogue system. Our proposed GenRead can consistently outperform the baseline InstructGPT model without retrieving any contextual documents.

To further validate the effectiveness of GenRead , we compare against zero-shot retrieve-then-read pipeline models, which first use a retrieval model or the Google search engine to get a relevant contextual document, then use InstructGPT to read the texts and produce the final answer. As shown in Table 1, GenRead can achieve on-par performance with zero-shot retrieve-then-read pipeline models on the NQ and FM2 datasets, and outperform them on all other benchmarks. The knowledge learned by the large language models can be retrieved via autoregressive text generation. Without seeing any examples from these datasets, GenRead can outperform using the supervised retrieval model (i.e., DPR) to recover relevant contextual documents.

4.2 Supervised Setting Experiments

We compare our proposed GenRead with retrieve-then-read models, including DPR (Karpukhin et al., 2020), RAG (Lewis et al., 2020), and FiD (Izacard & Grave, 2021). In addition, we compared with obtaining relevant documents from the internet using the Google search engine.

For our proposed method, we replace the retriever with a large language model to directly generate contextual documents. In the experiments, we use InstructGPT (Ouyang et al., 2022). After contextual documents are retrieved or generated, we employ a FiD reader with 770M parameter models (i.e., FiD-l) and 3B parameter models (i.e., FiD-xl) that are fine-tuned on the training split of target datasets. We note that we only use 10 documents during reading for the following reasons.

Why do we choose to use only 10 documents instead of 100 when reading?

As noted in Section 6.2 in DPR (Karpukhin et al., 2020) and Figure 3 in FiD (Izacard & Grave, 2021), increasing the number of documents can lead to better model performance and achieve state-of-the-art results when using 100 documents. However, there are two major drawbacks to using 100 documents during the reading step. First, the operation is very expensive, leading to a significant increase in memory consumption and training time. As reported by Izacard & Grave (2021), the training process requires 64 Tesla V100 32GB GPUs running for around one day. Second, generating documents by using a large language model is slow and expensive, so only using 10 documents can be a significant cost saving in our method. Therefore, in our experiments, we choose to use 10 documents during the reading process. When using FiD-770M (i.e., FiD-large), the training process can be easily performed even on a single Tesla V100 32GB GPU. Meanwhile, when only using 10 documents, we can also increase the size of the FiD model from 770M to 3B, which takes about the same amount of GPU memory as using 100 documents on a 770M model, but at the same time significantly shortens the training time. We note that training the T5-3B model needs a bigger cluster such as 8 Tesla V100 or A100 GPUs.

4.2.2 Experimental Results on Open-domain QA

We first use Recall@K to compare the retrieval accuracy of different models. As shown in Figure 2, GenRead can significantly outperform DPR and Google search when fewer than 10 documents are retrieved or generated. Among the different GenRead variants, including nucleus sampling, human-written prompts, and clustering-based prompts, clustering-based prompts achieve the best performance. At the same time, we notice that the slope of the language model's curve inevitably decreases as the number of generated documents increases. On one hand, this is due to the similarity of token distributions when large language models generate multiple documents. On the other hand, due to the shallow interaction characteristics of the dense retrieval model itself, the retrieved documents might not be completely relevant to the given question, so that the increase in recall might come from false positive documents, as also mentioned by Sachan et al. (2022).

As shown in Table 2, we first observe that the FiD model performs the best among all baseline models. Using FiD-xl with only 10 documents achieves comparable performance with using FiD-l with 100 documents. The average gap is less than 1% on three benchmarks. Compared with both closed-book models and Wikipedia-based retrieve-then-read pipelines, our proposed GenRead can achieve state-of-the-art performance. Furthermore, compared with using sampling methods to generate documents, the clustering-based prompt method can improve the EM score by +2.2 on average. This indicates that the clustering-based prompt method effectively increases the knowledge coverage of generated documents, which also leads to better downstream QA performance. We also show that GenRead can outperform Google search on all benchmarks. We observe both our method and Google search perform worse than DPR, mainly due to the significant portion of time-dependent questions in the dataset, which is described in the following analysis.

4.2.3 Experimental Results on Other Tasks

We report the experimental results in Table 3. Under the supervised setting, GenRead can achieve on-par performance on the fact checking task and superior performance on the dialogue system task, indicating that large language models can be seen as strong knowledge generators.

The main reason that GenRead performs worse than the dense retriever for fact checking is that the task provides sufficient semantic information to reach strong performance on this binary decision task. So, there is a smaller semantic gap between the given factual statement and contextual documents than that of question and document pairs in open-domain QA, which is an easier retrieval setting for modern dense retrieval methods that are mainly based on vector similarity.

4.3 Observations and Experimental Analysis

Generated documents can be combined with retrieved documents to outperform both. Even with a very large number of retrieved documents, including a few samples of generated knowledge leads to large improvements. As shown in Table 2, merging retrieved documents with generated documents can achieve state-of-the-art performance compared to all baseline methods listed in the table. Specifically, it improves by +5.7 on average on three open-domain QA benchmarks compared to DPR alone, and by +4.4 on average compared to the large language model alone.

4.3.2 Coverage Analysis over All Possible Answers

The improvement in open-domain QA performance is due to the fact that correct answers are included more frequently in the generated text than in the retrieved documents. (Recall@K is the most commonly used metric in existing works to measure retrieval performance; it computes the percentage of top-K retrieved or generated documents that contain any possible answer at least once.) However, as many questions have multiple correct answers, Recall@K cannot fully reflect the diversity of generated or retrieved documents. On average, each question has 2.39 correct answers in WebQ, 1.79 in NQ, and 14.02 (including all entity aliases) in TriviaQA. NQ and WebQ do not include alias names in the labels.

In this section, we also report the answer coverage performance of different models in Table 6. Answer coverage measures the percentage of the number of answers that are contained in the documents over all possible answers. Coverage analysis showed that generated text tends to have lower coverage than retrieved documents because generated documents tend to have little diversity compared to retrieved documents. To improve coverage, we propose GenRead with clustering, where we include examples in the prompt from different clusters of the training data to elicit more diverse generations.
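Answer coverage, as defined above, can be sketched per question as the share of acceptable answers that appear in at least one document. The sketch assumes case-insensitive substring matching and a per-question score that would then be averaged over the dataset; both choices, as well as the function name and toy data, are illustrative assumptions.

```python
from typing import Sequence


def answer_coverage(docs: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of the possible answers found in at least one document."""
    lowered_docs = [d.lower() for d in docs]
    found = sum(
        1 for answer in answers
        if any(answer.lower() in doc for doc in lowered_docs)
    )
    return found / len(answers)


# Toy question with three acceptable answers (illustrative only).
toy_docs = ["Paris is the capital of France.", "Lyon is a city in France."]
toy_answers = ["Paris", "Lyon", "Marseille"]
print(answer_coverage(toy_docs, toy_answers))
```

Unlike Recall@K, this score keeps increasing only when additional documents surface answers that earlier documents did not, which is why it is sensitive to the diversity of the generated set.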

4.4 Readability Analysis of Retrieved and Generated Documents

After we manually compare some retrieved documents from DPR and generated documents from InstructGPT, we observe that the readability of different documents, when they contain the correct answer string, is different. In other words, documents containing answers might also contain noisy information that is irrelevant to the question, which could affect both the model and human reading.

In order to further validate the readability of retrieved documents and generated documents, we extracted a subset of data examples from NQ, TriviaQA and WebQ datasets, in which both retrieved and generated documents contain the correct answer. As shown in Table 5, when both retrieved and generated documents contain the correct answer, the FiD reader can produce more correct answers when reading the generated documents from large language models (e.g., InstructGPT).

We also provide some case studies in Tables 16-19. For example, in Table 18, the question is “What city was Zeus the patron god of?”. The first document retrieved by DPR is “Like the other Panhellenic Games, the ancient Olympic Games were a religious festival, held at the sanctuary of Zeus at Olympia.”. Although it contains the correct answer, it is hard to infer the answer “Olympia” from it. On the contrary, InstructGPT generates the document “Zeus was the patron god of the city of Olympia, which was located in the northwestern Peloponnese region of Greece. Olympia was the site of the Olympic Games, held every four years in honor of Zeus.”, which is much easier to read.

Epilogue

Conclusion. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing the dense retrieval models with large language model generators. We call it generate-then-read, which first prompts a large language model to generate contextual documents, then reads the generated documents to infer the final answer. Notably, without retrieving any documents, it reaches 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the current retrieval-reader model DPR-FiD, with strong results on two other knowledge-intensive tasks as well.

Limitation and Future Work. Despite the strong performance on the presented datasets, our approach is limited in its ability to update its knowledge state and adapt to new domains. A major feature of retrieve-then-read is the ability to swap in new documents when new information is learned, such as temporally more recent documents, or to add documents from a new domain to quickly adapt to a new downstream task. Our approach relies on a large language model to contain all this knowledge, and adding new knowledge would likely require some retraining. Future work will explore how to efficiently incorporate new knowledge into our generate-then-read method. Besides, generated documents might suffer from hallucination errors, resulting in incorrect predictions. We present a case study in Table 15. Combining our method with recent approaches (Creswell & Shanahan, 2022) to boost generative faithfulness is also a direction worthy of future research.

Ethics Statement

Large language models have a wide range of beneficial applications for society, but they also have potentially harmful applications. Previous work has shown various forms of bias, such as racial and gender bias, in large language models like GPT-3, even after explicit efforts to reduce toxic language (Chan, 2022). The importance of addressing these societal harms is acknowledged by OpenAI themselves in their 2020 paper introducing GPT-3 (Brown et al., 2020), which states on page 34: “we focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 … and issues of bias, fairness, and representation within models like GPT-3.”

The goal of this paper is to utilize knowledge stored in the parameters of large language models to answer open-domain questions and solve knowledge-intensive tasks. Unlike retrieve-then-read, where an external corpus can be curated to be trustworthy, the use of a model to generate contextual documents may further propagate existing biases in common models. First, our work shows that generated documents suffer from stale information inherited from outdated documents used for training. Second, we show that generated documents tend to be less diverse, potentially biasing answers towards more common entities and terms from the training data. Finally, we conducted experiments on only three large language models. It is possible that some of our conclusions or observations do not hold for other models trained with different data or objectives.

Regarding ethical solutions, future work includes (i) further exploring potential bias and intentional or unintentional harm that may result from using generated contextual documents; (ii) better aligning language models with user intent to generate less biased contents and fewer fabricated facts.

Acknowledgements

This work was supported in part by NSF IIS-2119531, IIS-2137396, IIS-2142827, CCF-1901059, and ONR N00014-22-1-2507. Wenhao is supported in part by the Bloomberg Data Science Ph.D. Fellowship.

References

Appendix A Appendix

– TriviaQA (TQA) (Joshi et al., 2017) contains a set of trivia questions with answers that were originally scraped from trivia and quiz-league websites.

– WebQuestions (WebQ) (Berant et al., 2013) consists of questions selected using Google Suggest API, where the answers are entities in Freebase.

– Natural Questions (NQ) (Kwiatkowski et al., 2019) were mined from real Google search queries and the answers are spans in Wikipedia articles identified by human annotators.

We use the same train / dev / test splits for the open-domain QA setting as Izacard & Grave (2021) and Karpukhin et al. (2020). For TriviaQA, GPT-3 / GLaM / PaLM (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022) evaluate on the Wikipedia dev set of 7,993 examples, so we ran an additional evaluation on that dev set in order to compare with their performance.

– FEVER (Thorne et al., 2018) is one of the largest datasets for fact checking; it requires retrieving evidence from an external corpus to determine whether a statement is supported or refuted.

– Fool Me Twice (FM2) (Eisenschlos et al., 2021) is a challenging fact checking dataset collected by gamification. Players write challenging claims either entailed or refuted by evidence from Wikipedia. They are then tasked to spot the refuted claim among a group.

– Wizard of Wikipedia (WoW) (Dinan et al., 2019) is an open-domain dialogue task for training agents that can converse knowledgeably about open-domain topics. One speaker in the conversation must ground their utterances in a specific knowledge sentence from a Wikipedia page.

We use the same train / dev / test splits as the KILT challenge (Petroni et al., 2021) for the FEVER and WoW datasets. Their test labels are hidden, so performance can only be evaluated through https://ai.facebook.com/tools/kilt. For FM2, we use its official dataset splits.

A.2 Implementation Details

We use T5-770M (Raffel et al., 2020) and T5-3B as our backbone models to implement FiD (Izacard & Grave, 2021). We use AdamW as the optimizer, with 2,000 warm-up steps. We set the dropout probability to $0.1$ and the weight decay to $0.01$. We use one A100 GPU to run T5-770M with a batch size of 16. We use 8 A100 GPUs to run T5-3B with a per-GPU batch size of 2, for a total batch size of 16. We searched over learning rates ranging from $5e{-6}$ to $4e{-5}$, and found that $3e{-5}$ to $6e{-5}$ performed best under the T5-3B setting and $5e{-5}$ to $1e{-4}$ performed best under the T5-770M setting. Further implementation details are given in Table 7.
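For readers who want to mirror this setup, the hyperparameters above can be collected in one place. The following is only a minimal sketch: the class and field names are hypothetical, the learning rates are single values picked from the reported best-performing ranges, and the linear shape of the warm-up is an assumption, since the text states only the number of warm-up steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiDTrainConfig:
    """Hyperparameters quoted in the text; names are illustrative only."""
    backbone: str
    num_gpus: int
    per_gpu_batch_size: int
    learning_rate: float          # one value chosen from the reported range
    warmup_steps: int = 2000
    dropout: float = 0.1
    weight_decay: float = 0.01    # used with the AdamW optimizer

    @property
    def total_batch_size(self) -> int:
        return self.num_gpus * self.per_gpu_batch_size


def warmup_lr(step: int, cfg: FiDTrainConfig) -> float:
    """Learning rate at a given step, assuming a linear warm-up to the peak."""
    if step >= cfg.warmup_steps:
        return cfg.learning_rate
    return cfg.learning_rate * step / cfg.warmup_steps


# One A100 for T5-770M; eight A100s with a per-GPU batch of 2 for T5-3B.
T5_770M = FiDTrainConfig("T5-770M", num_gpus=1, per_gpu_batch_size=16,
                         learning_rate=5e-5)
T5_3B = FiDTrainConfig("T5-3B", num_gpus=8, per_gpu_batch_size=2,
                       learning_rate=3e-5)
```

Both configurations resolve to the same total batch size of 16, which is the property the text emphasizes.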

We implement other baseline methods by using repositories:

– BM25: https://github.com/castorini/pyserini

– DPR: https://github.com/facebookresearch/DPR

– Contriever: https://github.com/facebookresearch/contriever

A.3 Reproducibility via Open Source Large Language Models

We note that reproducing experiments on the OpenAI API, though publicly available, costs money. For this reason, we further add an evaluation on two open-source large language models OPT (Zhang et al., 2022) and Codex (OpenAI, 2022). As shown in Table 8, OPT performed worse than InstructGPT, but still achieved comparable performance with DPR; OpenAI Codex achieved the best performance on both TriviaQA and WebQ.

A.4 Scaling with Number of Large Language Model Parameters

Figure 4 shows the scaling of performance with InstructGPT generator parameters, including Ada-150M, Babbage-1.3B, Curie-6.7B and Davinci-175B. We note that for both FiD and our GenRead, we use FiD-xl with 10 input documents, either retrieved from Wikipedia or generated by InstructGPT. Performance on both TriviaQA and WebQ continues to improve as the number of generator parameters increases, as does the slope. Only with the largest InstructGPT can GenRead outperform DPR-FiD. This indicates that using a large language model to generate contextual documents is an “emergent ability” of scaling, which is absent in smaller models and present only in larger language models (Wei et al., 2022a).

A.5 Additional Numbers for Tables in the Main Paper

– Table 9 contains additional evaluation results for Table 1. It reports zero-shot open-domain QA performance, compared to recent large language models.

– Figure 5 contains additional retrieval performance evaluation for Figure 3, covering the experiments that combine DPR-retrieved documents with documents generated by a large language model.

– Table 10 contains additional retrieval performance, evaluated by Recall@K, of baselines and different GenRead variants. Some numbers in the table overlap with those in Figure 2.

A.6 Discussion on Inference Cost of DPR and InstructGPT

We now compare the costs of using DPR and InstructGPT to retrieve or generate contextual documents. We consider DPR using the BERT-base (Devlin et al., 2019) version with 110M parameters and InstructGPT using its largest version with 175B parameters. For simplicity, we use the FLOPs-per-token estimates for Transformer-based language models, which were introduced by Kaplan et al. (2020). It should be noted that FLOPs are not a direct measure of real-world computing costs, as latency, power consumption, and other costs can vary widely based on other factors (Liu et al., 2022a).

For the DPR model, all Wikipedia documents (around 21M) only need to be encoded once. Therefore, as the number of input questions increases, the amortized computational cost per question gradually decreases. For a fair comparison, we first use DPR to encode all 21M Wikipedia documents once. Encoding all Wikipedia documents requires $110e6$ (BERT-base parameters) $\times$ $21e6$ (total number of documents) $\times$ $100$ (tokens per document) $=2.3e17$ FLOPs. Once the embeddings of all candidate documents are produced, retrieving documents for a given question requires $110e6$ (BERT-base parameters) $\times$ $20$ (tokens per question) $+21e6$ (total number of documents) $\times$ ( $768+768-1$ ) $=3.2e10$ FLOPs.

For InstructGPT, it requires $175e9$ (InstructGPT parameters) $\times$ $10$ (number of documents) $\times$ $55$ (generated tokens per document) $=$ $9.6e13$ FLOPs to generate 10 documents for a given question.

Therefore, the total cost $Y_{\text{DPR-cost}}$ to retrieve 10 documents using DPR as a function of the number of input questions $X$ is: $Y_{\text{DPR-cost}}=3.2e10X+2.3e17$ . Similarly, the total cost $Y_{\text{GPT3-cost}}$ to generate 10 documents using InstructGPT as a function of the number of input questions $X$ is: $Y_{\text{GPT3-cost}}=9.6e13X$ . When $Y_{\text{DPR-cost}}=Y_{\text{GPT3-cost}}$ , $X\approx 2473$ . In conclusion, if the number of input questions is less than $2473$ , the total cost of InstructGPT is lower than that of DPR; if the number of input questions is greater than $2473$ , the total cost of InstructGPT exceeds that of DPR.
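The cost comparison above can be replayed as a short calculation. The sketch below uses only the quantities quoted in the text; the constant and function names are hypothetical. Because it keeps intermediate values unrounded, the crossover it computes depends on rounding and may differ somewhat from the figure quoted above, so it is best read as an order-of-magnitude estimate.

```python
# Quantities quoted in the text (FLOPs-per-token style estimates).
BERT_BASE_PARAMS = 110e6
INSTRUCTGPT_PARAMS = 175e9
NUM_WIKI_DOCS = 21e6
TOKENS_PER_DOC = 100
TOKENS_PER_QUESTION = 20
EMBEDDING_DIM = 768
DOCS_PER_QUESTION = 10
GENERATED_TOKENS_PER_DOC = 55


def dpr_index_flops() -> float:
    """One-off cost of encoding every Wikipedia document with DPR."""
    return BERT_BASE_PARAMS * NUM_WIKI_DOCS * TOKENS_PER_DOC


def dpr_query_flops() -> float:
    """Per-question cost: encode the question, then score all documents.

    Each inner product is counted as (768 + 768 - 1) operations, as in the text.
    """
    encode = BERT_BASE_PARAMS * TOKENS_PER_QUESTION
    search = NUM_WIKI_DOCS * (EMBEDDING_DIM + EMBEDDING_DIM - 1)
    return encode + search


def generation_flops() -> float:
    """Per-question cost of generating 10 documents with InstructGPT."""
    return INSTRUCTGPT_PARAMS * DOCS_PER_QUESTION * GENERATED_TOKENS_PER_DOC


def break_even_questions() -> float:
    """Number of questions X at which the two total costs are equal."""
    return dpr_index_flops() / (generation_flops() - dpr_query_flops())


if __name__ == "__main__":
    print(f"DPR indexing:       {dpr_index_flops():.3e} FLOPs")
    print(f"DPR per question:   {dpr_query_flops():.3e} FLOPs")
    print(f"Generation per q.:  {generation_flops():.3e} FLOPs")
    print(f"Break-even X:       {break_even_questions():.0f} questions")
```

The structure makes the trade-off explicit: DPR pays a large fixed indexing cost and a small per-question cost, while generation has no fixed cost but a per-question cost several orders of magnitude larger.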

A.7 Error Analysis and Case Studies on the NQ dataset

As stated in Zhang & Choi (2021), NQ contains a significant proportion, roughly 16.5%, of questions that have time-dependent answers. Similarly, Izacard et al. (2022b) observed that using the latest version of Wikipedia (12 / 2021) could lead to a drop of 4.4 points in EM score, compared to the Wikipedia version (12 / 2018) from which the NQ questions were created. We provide case studies in Table 13.

We conducted case studies on 100 examples from the NQ dataset. The results are shown in Table 12. Among these 100 examples, we found that 29 have data collection and annotation mistakes, mainly the temporal question issue (13 / 29) and the incomplete answer issue (16 / 29). A typical time-dependent question is one for which no specific time condition is provided. For example, “Who won the MVP for the National League?” could have different answers in different years: in 2017 the MVP was Giancarlo Stanton, and in 2018 the MVP was Christian Yelich. Besides, some answer labels provided in the NQ dataset are not complete. For example, person names in the NQ dataset usually consist of first, middle, and last names, but most names in the generated documents consist of first and last names only. For the question “who played lionel in as time goes by?”, the labeled answer is “Geoffrey Dyson Palmer”. DPR-FiD produces “Geoffrey Dyson Palmer” but GenRead produces “Geoffrey Palmer”, both of which should be considered correct. More examples are provided in Table 14.

In addition, GenRead produced correct answers for 49 questions. Among the 22 incorrect predictions, 12 of them could be classified as retrieval errors (i.e., step-I error) and 12 as reading errors (i.e., step-II error). In all cases of retrieval errors, none of the generated documents contains the correct answer. In all cases of reading errors, at least one generated document contains the correct answer but the reader model failed to infer the correct answer from the documents.
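The distinction between step-I and step-II errors can be expressed as a simple decision rule. The sketch below is an illustration rather than the procedure used for the analysis above: the function names are hypothetical, the answer normalization (lowercasing, stripping punctuation and articles) is a common convention assumed here, and answer containment is approximated by substring matching on normalized text.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def contains_answer(document: str, answers: List[str]) -> bool:
    """True if any acceptable answer appears in the normalized document."""
    doc = normalize(document)
    return any(normalize(a) in doc for a in answers if normalize(a))


def classify_prediction(prediction: str, answers: List[str],
                        documents: List[str]) -> str:
    """Label a prediction as correct, a step-I error, or a step-II error."""
    if normalize(prediction) in {normalize(a) for a in answers}:
        return "correct"
    # Step-I (retrieval/generation) error: no document contains the answer.
    if not any(contains_answer(d, answers) for d in documents):
        return "step-I error"
    # Step-II (reading) error: the answer was available but not extracted.
    return "step-II error"
```

Under this rule every incorrect prediction falls into exactly one of the two error types, and incomplete labels such as a missing middle name would still be counted as errors unless the answer list includes the shorter form.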

Appendix B Prompt Choices

– (1) “{query} $\backslash$ n $\backslash$ nThe answer is” (no space between {query} and $\backslash$ n)

– (2) “{query} $\backslash$ n $\backslash$ n The answer is” (performance reported in Table 1)

For fact checking and dialogue system, we used the following prompts.

– Fact Checking “{claim} $\backslash$ n $\backslash$ n Is the claim true or false?”

– Open-domain Dialogue System “{query} $\backslash$ n $\backslash$ n”

B.1.2 Prompts for Background Generation (Step-1)

– Open-domain Question Answering “Generate a background document from Wikipedia to answer the given question. $\backslash$ n $\backslash$ n {query} $\backslash$ n $\backslash$ n”

– Fact checking “Generate a background document from Wikipedia to support or refute the statement. $\backslash$ n $\backslash$ n Statement: {claim} $\backslash$ n $\backslash$ n”

– Open-domain Dialogue System “Generate a background document from Wikipedia to answer the given question. $\backslash$ n $\backslash$ n {utterance} $\backslash$ n $\backslash$ n”

B.1.3 Prompts for Reading Comprehension (Step-2)

We collected the prompts from P3 (Bach et al., 2022), which includes over 2,000 open-source prompts for roughly 170 datasets. For zero-shot QA, we experimented with three different reading comprehension prompts. We report the performance of each prompt in Table 20.

– (1) “Refer to the passage below and answer the following question with just a few words. Passage: {background} $\backslash$ n $\backslash$ n Question: {query} $\backslash$ n $\backslash$ n The answer is”

– (2) “Passage: {background} $\backslash$ n $\backslash$ n Question: {query} $\backslash$ n $\backslash$ n Referring to the passage above, the correct answer (just one entity) to the given question is”

– (3) “Refer to the passage below and answer the following question with just one entity. $\backslash$ n $\backslash$ n Passage: {background} $\backslash$ n $\backslash$ n Question: {query} $\backslash$ n $\backslash$ n The answer is”

For fact checking and dialogue system, we chose the simplest prompt from P3.

– Fact Checking “{background} $\backslash$ n $\backslash$ n claim: {claim} $\backslash$ n $\backslash$ n Is the claim true or false?”

– Open-domain Dialogue System “{background} $\backslash$ n $\backslash$ n {utterance} $\backslash$ n $\backslash$ n”
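The two prompting steps for zero-shot QA can be chained as in the sketch below. It uses the open-domain QA background prompt and reading comprehension prompt (1) listed above; the exact whitespace around the line breaks is an assumption, and `llm` is a hypothetical callable that maps a prompt string to a completion string, standing in for whichever model API is used.

```python
from typing import Callable

# Step 1: background generation prompt for open-domain QA.
BACKGROUND_PROMPT = (
    "Generate a background document from Wikipedia to answer the given "
    "question.\n\n{query}\n\n"
)

# Step 2: reading comprehension prompt (1).
READING_PROMPT = (
    "Refer to the passage below and answer the following question with just "
    "a few words. Passage: {background}\n\nQuestion: {query}\n\nThe answer is"
)


def generate_then_read(query: str, llm: Callable[[str], str]) -> str:
    """Generate a background document, then read it to answer the query."""
    background = llm(BACKGROUND_PROMPT.format(query=query)).strip()
    answer = llm(READING_PROMPT.format(background=background, query=query))
    return answer.strip()
```

Passing the model as a callable keeps the sketch independent of any particular API; fact checking or dialogue would only require swapping in the corresponding templates.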

B.2 Human Prompt Annotations (for Section 3.2.1)

To obtain prompts that lead large language models to generate better contextual documents, we asked 30 students in the computer science department to write different prompts. We first constructed a small validation set of 200 examples by combining 50 random question-answer pairs from NQ, 100 random pairs from TriviaQA and 50 random pairs from WebQ. When an annotator wrote down a prompt, our system immediately evaluated it on the validation set and returned the performance to the annotator. The annotator could then modify the prompt until the recall performance reached a threshold, which was set to 50 in our experiments. Finally, we obtained 29 distinct prompts from the human annotators, because two of the submitted prompts were identical. We used the top-10 prompts (shown in Table 21 and Table 22) in the human prompt setting, as described in $\S$ 3.2.1.
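The annotation feedback loop can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, `llm` is a stand-in callable from prompt to generated document, annotator prompts are assumed to contain a `{query}` placeholder, and recall is taken to mean the percentage of validation questions whose generated document contains at least one gold answer, which the text does not define explicitly.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, List[str]]  # (question, acceptable answers)


def build_validation_set(nq: Sequence[Example], tqa: Sequence[Example],
                         webq: Sequence[Example], seed: int = 0) -> List[Example]:
    """Sample 50 NQ, 100 TriviaQA and 50 WebQ pairs (200 examples in total)."""
    rng = random.Random(seed)
    return (rng.sample(list(nq), 50) + rng.sample(list(tqa), 100)
            + rng.sample(list(webq), 50))


def prompt_recall(prompt_template: str, examples: Sequence[Example],
                  llm: Callable[[str], str]) -> float:
    """Percentage of questions whose generated document contains an answer."""
    hits = 0
    for question, answers in examples:
        document = llm(prompt_template.format(query=question)).lower()
        if any(answer.lower() in document for answer in answers):
            hits += 1
    return 100.0 * hits / len(examples)


def reaches_threshold(score: float, threshold: float = 50.0) -> bool:
    """An annotator may stop revising once recall reaches the threshold."""
    return score >= threshold
```

In use, each revised prompt would be scored with `prompt_recall` on the fixed validation set and returned to the annotator until `reaches_threshold` holds.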