Query2doc: Query Expansion with Large Language Models

Liang Wang, Nan Yang, Furu Wei

Introduction

Information retrieval (IR) aims to locate relevant documents from a large corpus given a user issued query. It is a core component in modern search engines and researchers have invested for decades in this field. There are two mainstream paradigms for IR: lexical-based sparse retrieval, such as BM25, and embedding-based dense retrieval (Xiong et al., 2021; Qu et al., 2021). Although dense retrievers perform better when large amounts of labeled data are available (Karpukhin et al., 2020), BM25 remains competitive on out-of-domain datasets (Thakur et al., 2021).

Query expansion (Rocchio, 1971; Lavrenko and Croft, 2001) is a long-standing technique that rewrites the query based on pseudo-relevance feedback or external knowledge sources such as WordNet. For sparse retrieval, it can help bridge the lexical gap between the query and the documents. However, query expansion methods like RM3 (Lavrenko and Croft, 2001; Lv and Zhai, 2009) have only shown limited success on popular datasets (Campos et al., 2016), and most state-of-the-art dense retrievers do not adopt this technique. In the meantime, document expansion methods like doc2query (Nogueira et al., 2019) have proven to be effective for sparse retrieval.

In this paper, we demonstrate the effectiveness of LLMs (Brown et al., 2020) as query expansion models by generating pseudo-documents conditioned on few-shot prompts. Given that search queries are often short, ambiguous, or lack necessary background information, LLMs can provide relevant information to guide retrieval systems, as they memorize an enormous amount of knowledge and language patterns by pre-training on trillions of tokens.

Our proposed method, called query2doc, generates pseudo-documents by few-shot prompting LLMs and concatenates them with the original query to form a new query. This method is simple to implement and does not require any changes in training pipelines or model architectures, making it orthogonal to the progress in the field of LLMs and information retrieval. Future methods can easily build upon our query expansion framework.

For in-domain evaluation, we adopt the MS-MARCO passage ranking (Campos et al., 2016), TREC DL 2019 and 2020 datasets. Pseudo-documents are generated by prompting an improved version of GPT-3 text-davinci-003 from OpenAI (Brown et al., 2020). Results show that query2doc substantially improves the off-the-shelf BM25 algorithm without fine-tuning any model, particularly for hard queries from the TREC DL track. Strong dense retrievers, including DPR (Karpukhin et al., 2020), SimLM (Wang et al., 2023), and E5 (Wang et al., 2022) also benefit from query2doc, although the gains tend to be diminishing when distilling from a strong cross-encoder based re-ranker. Experiments in zero-shot OOD settings demonstrate that our method outperforms strong baselines on most datasets. Further analysis also reveals the importance of model scales: query2doc works best when combined with the most capable LLMs while small language models only provide marginal improvements over baselines. To aid reproduction, we release all the generations from text-davinci-003 at https://huggingface.co/datasets/intfloat/query2doc_msmarco.

Method

Given a query qq, we employ few-shot prompting to generate a pseudo-document dd^{\prime} as depicted in Figure 1. The prompt comprises a brief instruction “Write a passage that answers the given query:” and kk labeled pairs randomly sampled from a training set. We use k=4k=4 throughout this paper. Subsequently, we rewrite qq to a new query q+q^{+} by concatenating with the pseudo-document dd^{\prime}. There are slight differences in the concatenation operation for sparse and dense retrievers, which we elaborate on in the following section.

Sparse Retrieval Since the query qq is typically much shorter than pseudo-documents, to balance the relative weights of the query and the pseudo-document, we boost the query term weights by repeating the query nn times before concatenating with the pseudo-document dd^{\prime}:

Here, “concat” denotes the string concatenation function. q+q^{+} is used as the new query for BM25 retrieval. We find that n=5n=5 is a generally good value and do not tune it on a dataset basis.

Dense Retrieval The new query q+q^{+} is a simple concatenation of the original query qq and the pseudo-document dd^{\prime} separated by [SEP]:

For training dense retrievers, several factors can influence the final performance, such as hard negative mining (Xiong et al., 2021), intermediate pre-training (Gao and Callan, 2021), and knowledge distillation from a cross-encoder based re-ranker (Qu et al., 2021). In this paper, we investigate two settings to gain a more comprehensive understanding of our method. The first setting is training DPR (Karpukhin et al., 2020) models initialized from BERTbase{}_{\text{base}} with BM25 hard negatives only. The optimization objective is a standard contrastive loss:

The second setting is to build upon state-of-the-art dense retrievers and use KL divergence to distill from a cross-encoder teacher model.

pcep_{\text{ce}} and pstup_{\text{stu}} are the probabilities from the cross-encoder and our student model, respectively. α\alpha is a coefficient to balance the distillation loss and contrastive loss.

Comparison with Pseudo-relevance Feedback Our proposed method is related to the classic method of pseudo-relevance feedback (PRF) (Lavrenko and Croft, 2001; Lv and Zhai, 2009). In conventional PRF, the feedback signals for query expansion come from the top-k documents obtained in the initial retrieval step, while our method prompts LLMs to generate pseudo-documents. Our method does not rely on the quality of the initial retrieval results, which are often noisy or irrelevant. Rather, it exploits cutting-edge LLMs to generate documents that are more likely to contain relevant terms.

Experiments

Evaluation Datasets For in-domain evaluation, we utilize the MS-MARCO passage ranking (Campos et al., 2016), TREC DL 2019 (Craswell et al., 2020a) and 2020 (Craswell et al., 2020b) datasets. For zero-shot out-of-domain evaluation, we select five low-resource datasets from the BEIR benchmark (Thakur et al., 2021). The evaluation metrics include MRR@10, R@k (k{50,1k}\text{k}\in\{\text{50},\text{1k}\}), and nDCG@10.

Hyperparameters For sparse retrieval including BM25 and RM3, we adopt the default implementation from Pyserini (Lin et al., 2021). When training dense retrievers, we use mostly the same hyperparameters as SimLM (Wang et al., 2023), with the exception of increasing the maximum query length to 144144 to include pseudo-documents. When prompting LLMs, we include 44 in-context examples and use the default temperature of 11 to sample at most 128128 tokens. For further details, please refer to Appendix A.

2 Main Results

In Table 1, we list the results on the MS-MARCO passage ranking and TREC DL datasets. For sparse retrieval, “BM25 + query2doc” beats the BM25 baseline with over 15%\text{15}\% improvements on TREC DL 2019 and 2020 datasets. Our manual inspection reveals that most queries from the TREC DL track are long-tailed entity-centric queries, which benefit more from the exact lexical match. The traditional query expansion method RM3 only marginally improves the R@1k metric. Although the document expansion method docT5query achieves better numbers on the MS-MARCO dev set, it requires training a T5-based query generator with all the available labeled data, while “BM25 + query2doc” does not require any model fine-tuning.

For dense retrieval, the model variants that combine with query2doc also outperform the corresponding baselines on all metrics. However, the gain brought by query2doc tends to diminish when using intermediate pre-training or knowledge distillation from cross-encoder re-rankers, as shown by the “SimLM + query2doc” and “E5 + query2doc” results.

For zero-shot out-of-domain retrieval, the results are mixed as shown in Table 2. Entity-centric datasets like DBpedia see the largest improvements. On the NFCorpus and Scifact datasets, we observe a minor decrease in ranking quality. This is likely due to the distribution mismatch between training and evaluation.

Analysis

Scaling up LLMs is Critical For our proposed method, a question that naturally arises is: how does the model scale affect the quality of query expansion? Table 3 shows that the performance steadily improves as we go from the 1.3B model to 175B models. Empirically, the texts generated by smaller language models tend to be shorter and contain more factual errors. Also, the “davinci-003” model outperforms its earlier version “davinci-001” by using better training data and improved instruction tuning. The recently released GPT-4 (OpenAI, 2023) achieves the best results.

Performance Gains are Consistent across Data Scales Figure 2 presents a comparison between two variants of DPR models, which differ in the amount of labeled data used. The results show that the “DPR + query2doc” variant consistently outperforms the DPR baseline by approximately 1%, regardless of the amount of data used for fine-tuning. This observation highlights that our contribution is orthogonal to the continual scaling up of supervision signals.

How to Use Pseudo-documents In this paper, we concatenate the original query and pseudo-documents as the new query. Alternatively, one can solely use the pseudo-documents, as done in the approach of HyDE (Gao et al., 2022). The results presented in Table 4 demonstrate that the original query and pseudo-documents are complementary, and their combination leads to substantially better performance in sparse retrieval.

Case Analysis In Table 5, we show two queries along with their respective pseudo-documents and groundtruth. The pseudo-documents, which are generated by LLMs, offer detailed and mostly accurate information, thereby reducing the lexical mismatch between the query and documents. In some cases, the pseudo-documents are sufficient to meet the user’s information needs, rendering the retrieval step unnecessary. However, it is worth noting that the LLM generations may contain factual errors. For instance, in the second query, the theme song "It’s a Jungle Out There" was used as of season two in 2003, not 2002 Refer to https://en.wikipedia.org/wiki/It’s_a_Jungle_Out_There_(song). Although such errors may appear subtle and difficult to verify, they pose a significant challenge to building trustworthy systems using LLMs.

Related Work

Query Expansion and Document Expansion are two classical techniques to improve retrieval quality, particularly for sparse retrieval systems. Both techniques aim to minimize the lexical gap between the query and the documents. Query expansion typically involves rewriting the query based on relevance feedback (Lavrenko and Croft, 2001; Rocchio, 1971) or lexical resources such as WordNet (Miller, 1992). In cases where labels are not available, the top-k retrieved documents can serve as pseudo-relevance feedback signals (Lv and Zhai, 2009). Liu et al. fine-tunes an encoder-decoder model to generate contextual clues.

In contrast, document expansion enriches the document representation by appending additional relevant terms. Doc2query (Nogueira et al., 2019) trains a seq2seq model to predict pseudo-queries based on documents and then adds generated pseudo-queries to the document index. Learned sparse retrieval models such as SPLADE (Formal et al., 2021) and uniCOIL (Lin and Ma, 2021) also learn document term weighting in an end-to-end fashion. However, most state-of-the-art dense retrievers (Ren et al., 2021; Wang et al., 2023) do not adopt any expansion techniques. Our paper demonstrates that strong dense retrievers also benefit from query expansion using LLMs.

Large Language Models (LLMs) such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023) are trained on trillions of tokens with billions of parameters, exhibiting unparalleled generalization ability across various tasks. LLMs can follow instructions in a zero-shot manner or conduct in-context learning through few-shot prompting. Labeling a few high-quality examples only requires minimal human effort. In this paper, we employ few-shot prompting to generate pseudo-documents from a given query. A closely related recent work HyDE (Gao et al., 2022) instead focuses on the zero-shot setting and uses embeddings of the pseudo-documents for similarity search. HyDE implicitly assumes that the groundtruth document and pseudo-documents express the same semantics in different words, which may not hold for some queries. In the field of question answering, RECITE (Sun et al., 2022) and GENREAD (Yu et al., 2022) demonstrate that LLMs are powerful context generators and can encode abundant factual knowledge. However, as our analysis shows, LLMs can sometimes generate false claims, hindering their practical application in critical areas.

Conclusion

This paper presents a simple method query2doc to leverage LLMs for query expansion. It first prompts LLMs with few-shot examples to generate pseudo-documents and then integrates with existing sparse or dense retrievers by augmenting queries with generated pseudo-documents. The underlying motivation is to distill the LLMs through prompting. Despite its simplicity, empirical evaluations demonstrate consistent improvements across various retrieval models and datasets.

Limitations

An apparent limitation is the efficiency of retrieval. Our method requires running inference with LLMs which can be considerably slower due to the token-by-token autoregressive decoding. Moreover, with query2doc, searching the inverted index also becomes slower as the number of query terms increases after expansion. This is supported by the benchmarking results in Table 6. Real-world deployment of our method should take these factors into consideration.

References

Appendix A Implementation Details

For dense retrieval experiments in Table 1, we list the hyperparameters in Table 7. When training dense retrievers with distillation from cross-encoder, we use the same teacher score released by Wang et al.. The SimLM and E5 checkpoints for initialization are publicly available at https://huggingface.co/intfloat/simlm-base-msmarco and https://huggingface.co/intfloat/e5-base-unsupervised. To compute the text embeddings, we utilize the [CLS] vector for SimLM and mean pooling for E5. This makes sure that the pooling mechanisms remain consistent between intermediate pre-training and fine-tuning. The training and evaluation of a dense retriever take less than 10 hours to finish.

When prompting LLMs, we include 4 in-context examples from the MS-MARCO training set. To increase prompt diversity, we randomly select 4 examples for each API call. A complete prompt is shown in Table 11. On the budget side, we make about 550550k API calls to OpenAI’s service, which costs nearly 55k dollars. Most API calls are used to generate pseudo-documents for the training queries.

For GPT-4 prompting, we find that it has a tendency to ask for clarification instead of directly generating the pseudo-documents. To mitigate this issue, we set the system message to “You are asked to write a passage that answers the given query. Do not ask the user for further clarification.”.

Regarding out-of-domain evaluations on DBpedia (Hasibi et al., 2017), NFCorpus (Boteva et al., 2016), Scifact (Wadden et al., 2020), Trec-Covid (Voorhees et al., 2021), and Touche2020 (Bondarenko et al., 2022), SimLM’s results are based on the released checkpoint by Wang et al..

For ablation experiments in Figure 2, we fine-tune for 40 epochs or 18k steps, whichever is reached first.

Appendix B Exploring Other Prompting Strategies

Instead of generating pseudo-documents in one round, recent work (Press et al., 2022) proposes to iteratively prompt the LLM to improve the generation quality. We explore this intuition by asking GPT-4 to rewrite its own generated pseudo-documents with the following prompt template:

You are asked to rewrite the passage that potentially answers the given query. You should only correct the factual errors in the passage, do not ask for clarification or make unnecessary changes. Query: {{query}} # Begin of passage {{passage}} # End of passage

Empirically, we find that GPT-4 makes very few changes to the generated pseudo-documents, which suggests that the pseudo-documents are already of high quality or GPT-4 is not capable of correcting its own errors. The results are shown in Table 8.

Appendix C Results Across Multiple Runs

In our method, there are two sources of randomness: the selection of few-shot examples and the auto-regressive top-p sampling of LLMs. To quantify the variance of our method, we report the average and standard deviation of sparse retrieval results across 3 random runs in Table 10. One possible improvement is to select few-shot examples based on semantic similarity to the query. We leave this for future work.