Improving Passage Retrieval with Zero-Shot Question Generation

Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, Luke Zettlemoyer

Introduction

Text retrieval is a core sub-task in many NLP problems, for example, open-domain question answering where a document must be retrieved and then read to answer an input query. Queries and documents are typically embedded in a shared representation space to enable efficient search, before using a task-specific model to perform a deeper, token-level document analysis (e.g. a document reader that selects an answer span). We show that adding a zero-shot re-ranker to the retrieval stage of such models leads to large gains in performance, by doing deep token-level analysis with no task-specific data or tuning.

We focus on open-domain question answering and introduce a re-ranker based on zero-shot question generation with a pre-trained language model. Our re-ranker, which we call Unsupervised Passage Re-ranker (UPR), re-scores the retrieved passages by computing the likelihood of the input question conditioned on a retrieved passage. (In this paper, we use the words documents and passages interchangeably; we consider the retrieval units to be short passages and not entire documents.) This simple method enables task-independent cross-attention between query and passage that can be applied on top of any retrieval method (e.g. neural or keyword-based) and is highly effective in practice (Figure 1).

In part, UPR is inspired by the traditional models of query scoring with count-based language models Zhai and Lafferty (2001). However, instead of estimating a language model from each passage, UPR uses pre-trained language models (PLMs). More recent work on re-rankers has finetuned PLMs on question-passage pairs to generate relevance labels Nogueira et al. (2020), sometimes to jointly generate question and relevance labels Nogueira dos Santos et al. (2020); Ju et al. (2021). In contrast, UPR uses off-the-shelf PLMs, does not require any training data or finetuning, and still leads to strong performance gains (Figure 1).

Comprehensive experiments across a wide range of datasets, retrievers, and PLMs highlight the strengths of UPR:

By re-ranking the top-1000 passages from Contriever (unsupervised), UPR obtains an absolute gain of 6%-18% in top-20 retrieval accuracy across four QA datasets. UPR also achieves new state-of-the-art results on the difficult SQuAD-Open and Entity Questions datasets, outperforming BM25 by 14% and 8%, respectively.

These performance gains are consistent across both different kinds of retrievers and PLMs. Ablation studies reveal that instruction-tuned models such as T0 perform the best as re-rankers.

On the open-domain QA task, just by performing inference with the re-ranked passages and a pre-trained reader, we obtain improvements of up to 3 EM points on three benchmark datasets.

To the best of our knowledge, this is the first work to show that a fully unsupervised pipeline (consisting of a retriever and re-ranker) can greatly outperform supervised dense retrieval models like DPR Karpukhin et al. (2020). As language models continue to improve rapidly Rae et al. (2021); Chowdhery et al. (2022), the performance of UPR may see corresponding gains over time. UPR requires no annotated data and uses only generic pre-trained models, which means it may be easy to apply to a wide range of retrieval problems.

Method

Figure 2 presents an overview of our approach for open-domain retrieval, which introduces a new unsupervised re-ranker (Sec 2.2) that can be applied to any existing text retriever (Sec 2.1).

Let $\mathcal{D}=\{\boldsymbol{d}_{1},\ldots,\boldsymbol{d}_{M}\}$ be a collection of evidence documents. Given a question ( $\boldsymbol{q}$ ), the retriever selects a subset of relevant passages $\mathcal{Z}\subset\mathcal{D}$ , one or more of which will ideally contain the answer to $\boldsymbol{q}$ . Our method will work with passages obtained from any retriever — either based on sparse representations like BM25 or dense representations like DPR. We only assume that the retriever provides the $K$ most relevant passages. We denote this set of top-K passages as $\mathcal{Z}=\{\boldsymbol{z}_{1},\ldots,\boldsymbol{z}_{K}\}$ .
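As a rough illustration of the first stage, the sketch below selects the $K$ highest-scoring passages for a query under an inner-product score, as a dense retriever would. It is a toy under stated assumptions: the function names (`inner_product`, `retrieve_top_k`) and the two-dimensional vectors are hypothetical, and a real system would use learned embeddings with an approximate nearest-neighbour index rather than an exhaustive scan.

```python
import heapq
from typing import List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the k highest-scoring passages."""
    scored = ((inner_product(query_vec, p), i) for i, p in enumerate(passage_vecs))
    # nlargest keeps only the k best (score, index) pairs, sorted by score descending.
    return [(i, s) for s, i in heapq.nlargest(k, scored)]


# Toy collection of M = 4 passage embeddings (hypothetical values).
passage_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
top2 = retrieve_top_k([0.9, 0.1], passage_vecs, k=2)
top2_ids = [i for i, _ in top2]
```

The returned list plays the role of $\mathcal{Z}$; any retriever that yields such a ranked candidate list could be substituted.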

2.2 Unsupervised Passage Re-ranking (UPR)

Given the top-K retrieved passages, the goal of the re-ranker is to reorder them such that a passage with the correct answer is ranked as highly as possible. The ordering is computed with a relevance score $p(\boldsymbol{z}_{i}\mid\boldsymbol{q})$ for each passage $\boldsymbol{z}_{i}\in\mathcal{Z}$ .

Our re-ranking approach is unsupervised, i.e., it does not use any task-specific training examples. We refer to it as UPR, for Unsupervised Passage Re-ranking. UPR uses a pre-trained language model to score the probability of generating the question $\boldsymbol{q}$ given the passage text $\boldsymbol{z}$ , as described below. The question generation model is zero-shot, allowing for dataset-independent re-ranking, and also incorporates cross-attention between the question and passage tokens while forcing the model to explain every token in the input question. UPR is, therefore, more expressive than using dense retrievers alone, even if both methods fundamentally build on top of the same (or very similar) pre-trained models.

More specifically, we estimate $p(\boldsymbol{z}_{i}\mid\boldsymbol{q})$ by computing the likelihood of question generation conditioned on the passage, i.e., the quantity $p(\boldsymbol{q}\mid\boldsymbol{z}_{i})$ . This also naturally emerges when applying Bayes’ rule to $p(\boldsymbol{z}_{i}\mid\boldsymbol{q})$ as

$$\log p(\boldsymbol{z}_{i}\mid\boldsymbol{q})=\log p(\boldsymbol{q}\mid\boldsymbol{z}_{i})+\log p(\boldsymbol{z}_{i})+c,$$

where $p(\boldsymbol{z}_{i})$ is the prior on the retrieved passage and $c$ is a common constant for all $\boldsymbol{z}_{i}$ .

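As a simplifying assumption, we assume that the passage prior $\log p(\boldsymbol{z}_{i})$ is uniform and can be ignored for re-ranking. With this, the above expression reduces to

$$\log p(\boldsymbol{z}_{i}\mid\boldsymbol{q})\propto\log p(\boldsymbol{q}\mid\boldsymbol{z}_{i}),\quad\forall\boldsymbol{z}_{i}\in\mathcal{Z}.$$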

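We estimate $\log p(\boldsymbol{q}\mid\boldsymbol{z}_{i})$ using a pre-trained language model (PLM) to compute the average log-likelihood of the question tokens conditioned on the passage:

$$\log p(\boldsymbol{q}\mid\boldsymbol{z}_{i})=\frac{1}{|\boldsymbol{q}|}\sum_{t}\log p(q_{t}\mid\boldsymbol{q}_{<t},\boldsymbol{z}_{i};\Theta),$$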

where $\Theta$ denotes the parameters of the PLM and $|\boldsymbol{q}|$ denotes the number of question tokens. We apply the PLM in a zero-shot fashion with no finetuning by simply appending the natural language instruction “Please write a question based on this passage” to the passage tokens as shown in Figure 2.
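The scoring rule just described can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes the PLM has already been run with teacher forcing and has produced one vector of vocabulary logits per question token. The helper names (`build_prompt`, `log_softmax`, `average_log_likelihood`) and the "Passage:" prefix in the prompt template are assumptions; only the instruction sentence comes from the text.

```python
import math
from typing import List, Sequence

INSTRUCTION = "Please write a question based on this passage."


def build_prompt(passage: str) -> str:
    """Append the instruction to the passage (the exact template is an assumption)."""
    return f"Passage: {passage} {INSTRUCTION}"


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def average_log_likelihood(
    step_logits: Sequence[Sequence[float]],
    question_ids: Sequence[int],
) -> float:
    """Mean over question tokens of log p(q_t | q_<t, z), given teacher-forced logits."""
    if not question_ids or len(step_logits) != len(question_ids):
        raise ValueError("need exactly one logit vector per question token")
    total = sum(
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, question_ids)
    )
    return total / len(question_ids)


# Toy check with a 3-word vocabulary and uniform logits at both decoding steps.
uniform_score = average_log_likelihood([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
prompt = build_prompt("Paris is the capital of France.")
```

Dividing by the number of question tokens keeps scores comparable across passages for the same question; it does not change the ranking for a fixed question, since $|\boldsymbol{q}|$ is constant across the candidates.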

The initial passage ordering is then sorted based on $\log p(\boldsymbol{q}\mid\boldsymbol{z})$ . This enables us to re-rank the passages by just performing inference using off-the-shelf language models avoiding the need to label question-passage pairs for finetuning. Because the question generation model is applied zero-shot, this overall approach can be applied to improve the retrieval accuracy of any test collection, with no dataset-specific models or tuning data.
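The re-ranking step itself is a sort over the candidate list. The sketch below assumes a hypothetical `score_fn(question, passage)` that returns an estimate of $\log p(\boldsymbol{q}\mid\boldsymbol{z})$; here it is replaced by a lookup table of made-up scores so the example runs without a language model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    question: str,
    passages: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Sort passages by score_fn(question, passage), highest score first."""
    scored = [(p, score_fn(question, p)) for p in passages]
    # sorted() is stable, so passages with equal scores keep their retriever order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in scores (hypothetical values); a real run would query the PLM instead.
toy_scores = {"p1": -3.2, "p2": -1.1, "p3": -2.0}
ranked = rerank("who wrote hamlet?", ["p1", "p2", "p3"], lambda q, z: toy_scores[z])
ranked_ids = [p for p, _ in ranked]
```

Because the scorer is passed in as a function, the same routine works unchanged with any first-stage retriever and any PLM.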

Experimental Setup

In this section, we describe the datasets, unsupervised and supervised retrievers, and language models used for our passage re-ranking experiments.

Following previous work on passage retrieval, we use the popular datasets of SQuAD-Open Rajpurkar et al. (2016), TriviaQA Joshi et al. (2017), Natural Questions (NQ; Kwiatkowski et al. (2019)), and WebQuestions (WebQ; Berant et al. (2013)). Their statistics are presented in Table 1.

We use the preprocessed English Wikipedia dump from December 2018 as released by Karpukhin et al. (2020) as our evidence passages. Each Wikipedia article is split into non-overlapping 100 word passages. There are over 21 million total passages.

3.2 Keyword-centric Datasets

To examine the robustness of UPR to keyword-centric datasets, we experiment with test collections where dense retrievers struggle and where the questions come from different domains.

Entity Questions contains 22K short questions about named entities based on facts from Wikipedia. Previous work on this dataset has shown that dense retrievers struggle to retrieve relevant passages while sparse approaches like BM25 are more successful Sciavolino et al. (2021).

BEIR is a test suite for benchmarking retrieval algorithms and consists of multiple datasets, where each dataset consists of test set queries, evidence documents, and document relevance annotations Thakur et al. (2021). These datasets contain different kinds of retrieval tasks, such as fact-checking and question answering, and span diverse domains including news, technical, and Wikipedia, making it a challenging benchmark.

3.3 Retrievers

In our re-ranking experiments, we retrieve passages from both unsupervised and supervised retrievers, as detailed below.

BM25 ranks passages based on the term frequency and inverse document frequency of the keywords present in the question and passage Robertson and Zaragoza (2009). Prior work Ma et al. (2021) has shown that BM25 is a strong baseline for the datasets we consider.
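To make the term-frequency and inverse-document-frequency scoring concrete, the following sketch implements one common BM25 variant over pre-tokenized passages. It is illustrative only: the experiments use pre-computed Pyserini outputs, whereas this toy uses whitespace-level tokens, a Lucene-style IDF, and parameter values `k1=0.9`, `b=0.4` that are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query_terms: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 0.9,
    b: float = 0.4,
) -> List[float]:
    """Score every tokenized document against the query with a BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "paris is the capital of france".split(),
    "berlin is the capital of germany".split(),
    "the eiffel tower".split(),
]
scores = bm25_scores(["capital", "france"], docs)
```

Only exact term matches contribute to the score, which is why such sparse scoring does well on keyword-heavy questions but cannot account for paraphrases.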

Masked Salient Spans (MSS) is a dense retriever trained by predicting masked salient spans like named entities with the help of a reader network Sachan et al. (2021a). MSS pre-training has also been shown to improve supervised retrieval performance.

Contriever uses momentum contrastive training to learn dense retrievers from text paragraphs Izacard et al. (2022). Such training has been shown to obtain strong zero-shot retrieval performance on many benchmarks.

3.3.2 Supervised Retrievers

DPR uses annotated question-context paragraphs and hard negative examples to train a supervised dense retriever Karpukhin et al. (2020).

MSS-DPR further improves DPR performance by first pre-training the dense retriever using MSS followed by DPR-style supervised finetuning Sachan et al. (2021a).

3.4 Pre-Trained Language Models (PLMs)

We use a range of pre-trained models for computing our re-ranking relevance scores.

The encoder-decoder PLMs consist of encoder and decoder transformers pre-trained by denoising input text sequences. We experiment with the T5 model Raffel et al. (2020), its language model adapted version (T5-lm-adapt; Lester et al. (2021)), and the T0 language model Sanh et al. (2022). T0 was trained by finetuning T5-lm-adapt with multiple tasks defined by instructions. We use the “xl” configurations that contain 3B parameters.

The decoder-only PLMs consist of a transformer decoder trained with the autoregressive language modeling objective. We use the GPT-neo model with 2.7B parameters Black et al. (2021).

3.5 Implementation Details

We run all the experiments on a cluster with V100-32GB GPUs. We use PyTorch Paszke et al. (2019) to implement the UPR approach and relevant baselines. To get the top-K retrieved passages, we use the open-source implementations of the retrievers and their checkpoints. For BM25, we use the pre-computed top-K passage outputs from the pyserini toolkit Lin et al. (2021) (https://github.com/castorini/pyserini/blob/master/docs/experiments-dpr.md). For MSS, DPR, and MSS-DPR retrievers, we use the open-source implementations from Sachan et al. (2021b) (https://github.com/DevSinghSachan/emdr2). For Contriever and PLMs, we use their checkpoints as hosted on Hugging Face Wolf et al. (2020).

For the dense retriever experiments, we use the base configuration, which consists of 12 attention heads, 12 layers, and 768 model dimensions. To experiment with supervised retrievers, we train DPR and MSS-DPR for 3 epochs on SQuAD-Open, 40 epochs on NQ and TriviaQA, and 20 epochs on WebQ. (In contrast to previous work on SQuAD-Open, we train DPR and MSS-DPR for 3 epochs to prevent overfitting.) Detailed hyperparameter settings are specified in Appendix A.1 and A.2.

Experiments: Passage Retrieval

We evaluate the performance of our proposed Unsupervised Passage Re-ranker (UPR), conduct ablations to better understand the approach, evaluate robustness on challenging test collections, and discuss run-time efficiency.

Our goal is to improve the rankings of top-{20, 100} passages. Hence, in the first stage, a larger candidate list is fetched by retrieving the top-1000 passages. Then, in the second stage, these passages are re-ranked with the T0-3B PLM unless specified otherwise. To evaluate UPR performance, we compute the conventional top-K retrieval accuracy metric. It is defined as the fraction of questions for which at least one passage within the top-K passages contains a span that matches the human-annotated answer to the question.
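The top-K retrieval accuracy defined above can be computed as in the following sketch. The function names and the matching rule are assumptions made for illustration: answers are matched here by case-insensitive substring search, which is a simplification of the normalized answer matching typically used in evaluation scripts.

```python
from typing import Sequence


def has_answer(passage: str, answers: Sequence[str]) -> bool:
    """True if any reference answer occurs in the passage (case-insensitive)."""
    text = passage.lower()
    return any(answer.lower() in text for answer in answers)


def top_k_accuracy(
    ranked_passages: Sequence[Sequence[str]],
    answers: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one answer-bearing passage in the top k."""
    hits = sum(
        any(has_answer(p, refs) for p in passages[:k])
        for passages, refs in zip(ranked_passages, answers)
    )
    return hits / len(answers)


# Two toy questions: the first is answered at rank 2, the second is never answered.
ranked_passages = [
    ["Berlin is in Germany.", "Paris is the capital of France."],
    ["Mount Fuji is in Japan.", "The Nile is a river."],
]
answers = [["Paris"], ["Everest"]]
acc_at_1 = top_k_accuracy(ranked_passages, answers, k=1)
acc_at_2 = top_k_accuracy(ranked_passages, answers, k=2)
```

Because the metric only asks whether some passage in the top K contains the answer, re-ranking changes it only by moving answer-bearing passages across the rank-K boundary.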

We experiment with the four datasets and five retrievers as introduced in §3.1 and §3.3, respectively, and perform re-ranking with the T0-3B model. Table 2 reports the top-20 and top-100 retrieval accuracy before and after re-ranking. UPR provides consistent improvements across all the retrievers and datasets, improving unsupervised models by 6%-18% absolute and supervised models by up to 12% in top-20 accuracy.

Re-ranked Contriever outperforms DPR by an average of 7% in top-20 and 4% in top-100 when considering all the datasets. This shows that a fully unsupervised pipeline of a retriever and re-ranker can outperform strong supervised models like DPR. Sparse representations still remain competitive, with BM25 outperforming Contriever and MSS on SQuAD-Open and TriviaQA re-ranking.

We also see that re-ranked MSS-DPR comes close to or matches the performance of state-of-the-art supervised retrievers (last row in Table 2). Because these supervised models are based on end-to-end training of the retriever and language model, they are memory-intensive and too expensive to train for very large models. As such, UPR offers a viable alternative to expensive joint training.

The question generation step in the re-ranker involves expressive cross-attention with the passage tokens. As a result, each question token attends to all the passage tokens in each decoder layer before predicting the next question token. This results in a more accurate estimation of the relevance (or log-likelihood) scores than the original retriever scores, thus leading to improved retrieval accuracy after re-ranking. This reasoning is further corroborated by our error analysis in Appendix A.3, where we present several examples where UPR improves over the incorrect BM25 retrievals.

4.2 Ablation Studies

To understand the importance of re-ranking based on question generation $p(\boldsymbol{q}\mid\boldsymbol{z})$ , we compare it with another unsupervised approach where re-ranking is based on passage generation conditioned on the question $p(\boldsymbol{z}\mid\boldsymbol{q})$ . This quantity can be estimated by computing the average log-likelihood of generating the passage tokens using a PLM and teacher-forcing as

$$\log p(\boldsymbol{z}\mid\boldsymbol{q})=\frac{1}{|\boldsymbol{z}|}\sum_{t}\log p(z_{t}\mid\boldsymbol{z}_{<t},\boldsymbol{q};\Theta),$$

where $\Theta$ denotes the parameters of the PLM and $|\boldsymbol{z}|$ denotes the number of passage tokens.

For this analysis, we work with the NQ development set and obtain the union of top-1000 passages from the BM25 and MSS retrievers. These passages are re-ranked with two PLMs: T0-3B and GPT-neo (2.7B). Our results in Figure 3 demonstrate that question generation obtains substantial improvements over BM25 and MSS, highlighting its usefulness in passage re-ranking. On the other hand, re-ranking based on passage generation leads to a drop in retrieval accuracy in comparison to the baseline retrievers, empirically confirming that this approach does not work well in practice.

4.2.2 Impact of Pre-trained Language Models

To understand how much the choice of PLM contributes to top-K accuracy, we compare the performance of T5 (3B), T5-lm-adapt (different sizes), T0-{3B, 11B}, and GPT-neo (2.7B) (as introduced in §3.4) on the NQ development set. We obtain the union of top-1000 passages retrieved from BM25 and MSS and then re-rank them with UPR. Results in Table 3 reflect that all the PLMs obtain significant improvements over the baseline retrievers, with the T0 models achieving the best results. Scaling up the PLM size, especially the T5-lm-adapt models, leads to consistent performance improvements.
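Forming the union of two retrievers' candidate lists amounts to de-duplicating passages by identifier before re-ranking. The sketch below shows one way to do this; the function name `union_candidates` and the use of integer passage ids are assumptions. The order of the merged list does not matter for the final result, since the re-ranker re-sorts all candidates.

```python
from typing import Hashable, List, Sequence


def union_candidates(*ranked_lists: Sequence[Hashable]) -> List[Hashable]:
    """Merge ranked id lists, keeping the first occurrence of each passage id."""
    seen = set()
    merged = []
    for ranked in ranked_lists:
        for passage_id in ranked:
            if passage_id not in seen:
                seen.add(passage_id)
                merged.append(passage_id)
    return merged


# Hypothetical passage ids returned by two different retrievers.
bm25_ids = [3, 1, 2]
mss_ids = [2, 4, 1]
candidates = union_candidates(bm25_ids, mss_ids)
```

With two lists of 1000 passages each, the merged pool contains between 1000 and 2000 candidates, depending on how much the retrievers overlap.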

When comparing across PLMs, we see that the performance of T5 suffers especially on top-{1, 5} accuracy levels. This might be because it was trained to predict corrupted spans, which is not ideal for text generation. On the other hand, autoregressive PLMs such as GPT-neo and T5-lm-adapt tend to be better re-rankers. Furthermore, T0 obtains large improvements on top-{1, 5, 20}, demonstrating that finetuning with instructions on unrelated tasks is also beneficial for re-ranking.

4.2.3 Passage Candidate Size vs Latency

We study the effect of the number of passage candidates to be re-ranked on the retrieval performance along with the time taken. For this, we consider the NQ development set, re-rank up to top-1000 passages obtained from BM25, and use top-20 accuracy as the evaluation criterion. Results in Figure 4 illustrate that a larger pool of passage candidates indeed helps to improve the performance. However, the gains tend to plateau as the number of passages is increased.

With more passages, the latency in re-ranking per question increases linearly, reflecting the trade-off between accuracy and throughput. The higher latency can be partly alleviated with approaches like weight quantization, efficient implementations of the transformer kernel, model distillation, caching passage embeddings, and using data parallelism. However, we leave these explorations to future work.

4.3 Zero-Shot Supervised Transfer

To gain a better understanding of the relative strengths of UPR and supervised (or finetuned) re-rankers, we perform zero-shot supervised transfer experiments and compare the results with UPR. We adopt the training method of Nogueira et al. (2020), henceforth referred to as monoT5, who finetune the T5 PLMs on the MS MARCO Bajaj et al. (2016) passage ranking dataset. To train, question and passage tokens are concatenated and fed to the T5 encoder. The decoder attends to the encoded sequence and the T5 PLM is finetuned to maximize the likelihood of the “true” label. To re-rank the passages during inference, the log-likelihood score of the “true” label is used as the relevance score.
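For contrast with UPR's question likelihood, the monoT5-style relevance score can be sketched as the log-probability of the “true” label. The example below assumes the decoder logits for the two label tokens are already available and normalizes over just those two tokens; that restriction, and the function name `mono_t5_score`, are assumptions for illustration rather than a description of the released checkpoints.

```python
import math


def mono_t5_score(true_logit: float, false_logit: float) -> float:
    """Log-probability of 'true' under a softmax over the two label logits."""
    m = max(true_logit, false_logit)
    log_z = m + math.log(math.exp(true_logit - m) + math.exp(false_logit - m))
    return true_logit - log_z


# Hypothetical logits: a confident "relevant" passage versus an uncertain one.
confident = mono_t5_score(2.0, -1.0)
uncertain = mono_t5_score(0.5, 0.5)
```

Unlike the UPR score, this quantity is only meaningful after the model has been finetuned to emit relevance labels, which is what requires annotated question-passage pairs.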

We use the open-source checkpoints of monoT5 (https://github.com/castorini/pygaggle) to re-rank the top-1000 passages retrieved by BM25 and report results on the NQ development set (Table 4). Interestingly, we see that supervised transfer improves the top-1 and top-5 retrieval accuracy by a large margin over UPR. However, when a larger set of retrieved passages is considered, such as the top 20-100, the results of UPR come close to or match the results of monoT5. As end tasks such as open-domain question answering rely on a larger set of passages to achieve good results (as demonstrated in Sec 5), this highlights the value of UPR relative to supervised models, as it does not require collecting annotated data for finetuning.

4.4 Evaluation on Keyword-centric Datasets

We re-rank the top-1000 passages from every retriever with UPR. As the training set is not provided, we use the checkpoints of DPR and MSS-DPR trained on NQ. Results are presented in Table 5. Re-ranking leads to a gain of 8-20% absolute in top-20 accuracy and 4-10% in top-100 accuracy, with BM25 achieving the best results after re-ranking. It also substantially narrows the gap between BM25 and dense retrievers. Re-ranking the union of BM25 and Contriever outputs outperforms the current best results by 6% and 3% in top-20 and top-100, respectively.

We also note that multi-vector approaches specially tailored towards the robust representation of textual entities de Jong et al. (2022) are promising alternatives to the dual-encoder retrievers as they offer an improved retrieval accuracy although at the expense of increased memory and compute requirements. However, we defer the application of UPR to these retrievers as a part of future work.

4.4.2 BEIR Benchmark

We re-rank the top-1000 documents from Contriever and BM25 with the T0-3B PLM. Following convention, we report the macro average of NDCG@10 and Recall@100 metrics in Table 6. Results demonstrate the effectiveness of UPR as NDCG@10 scores improve by 3-8% absolute and Recall@100 improves by 5-6%. We include performance numbers on individual datasets with fine-grained analysis in Appendix A.4.
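The two BEIR metrics can be sketched for a single query as follows. This is an illustrative implementation under assumptions: it uses linear gains with a $\log_2$ rank discount for NDCG, and the names `dcg_at_k`, `ndcg_at_k`, and `recall_at_k` are hypothetical; published numbers should be computed with the benchmark's own evaluation tooling.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (rank 1 has discount 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float], k: int = 100) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# One toy query with two relevant documents, "a" and "b".
relevance = {"a": 1.0, "b": 1.0}
ndcg_imperfect = ndcg_at_k(["a", "x", "b"], relevance, k=10)
ndcg_perfect = ndcg_at_k(["a", "b", "x"], relevance, k=10)
recall_top2 = recall_at_k(["a", "x", "b"], relevance, k=2)
```

Per-query values are averaged within each dataset, and the macro average reported in the text is then the unweighted mean of the per-dataset scores.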

Experiments: Question Answering

Finally, we show that UPR improves the performance of full open-domain QA systems.

An open-domain QA system consists of a retriever and a reader component. The reader attends to the retrieved passages to produce a final answer to the question. We use the Fusion-in-Decoder (FiD; Izacard and Grave (2021b)) model as the reader. In FiD, each retrieved passage is concatenated with the question and is then passed as an input to the T5 encoder Raffel et al. (2020). Then the encoded representations for all the passages are concatenated which the T5 decoder leverages for cross-attention.

We train the FiD reader using standard negative log-likelihood loss and teacher-forcing to generate an answer autoregressively. To understand the effect of UPR on answer generation, we then do inference with the previously trained reader and the re-ranked passages for each question.

5.2 Results

For training FiD models, we use the top-100 retrieved passages and a batch size of 64. Detailed training hyperparameters are provided in Appendix A.1. During inference, an answer is generated using greedy decoding. For our experiments, we train the FiD base and large models using the retrieved documents from MSS, DPR, and MSS-DPR retrievers. We re-rank the top-1000 passages with UPR using the T0-3B PLM and then perform inference with the top-100 re-ranked passages. We conduct experiments on SQuAD-Open, TriviaQA, and NQ datasets and report the exact match (EM) scores for evaluation. We employ the same set of evidence passages for all the datasets. (Previous work has often used the 2016 Wikipedia dump as evidence for SQuAD-Open. As our evidence set is larger and newer, some questions may be unanswerable, which renders a fair comparison difficult. However, to alleviate dataset-specific design choices, we adopt a common experimental setup.)
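The exact match (EM) metric can be sketched as below. The normalization steps (lower-casing, stripping punctuation and English articles, collapsing whitespace) follow the convention commonly used for SQuAD-style evaluation and are an assumption here, as are the function names; the source only states that EM scores are reported.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lower-case, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized reference answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def em_score(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Percentage of questions whose prediction exactly matches a reference."""
    matches = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * matches / len(predictions)


predictions = ["The Eiffel Tower.", "Paris"]
references = [["eiffel tower"], ["London", "Greater London"]]
em = em_score(predictions, references)
```

EM gives no credit for partially correct answers, so a generated answer that is right but phrased differently from every reference still counts as an error.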

Results are presented in Table 7. More accurate passages after re-ranking improve the performance of the pre-trained FiD models for all the retrievers. Performing inference on the FiD-large model with re-ranked MSS-DPR passages achieves new state-of-the-art results, outperforming the pre-trained FiD model by 1-3 EM points. Overall, this provides a simple approach for obtaining performance gains without the need to iteratively re-train Izacard and Grave (2021a) or perform expensive end-to-end training Sachan et al. (2021b).

Related Work

Our work is based on re-ranking passages for open-domain retrieval using pre-trained language models (PLMs) which we have covered in earlier sections. Here, we instead focus on covering previous work related to generative pre-training, query likelihood for document ranking, and open-domain QA.

Recently, there has been an increased adoption of the generative pre-trained transformer (GPT) series of models by the NLP community Radford et al. (2019). Among the interesting properties of GPT models is their ability to understand task instructions specified in natural language and then perform well on tasks in a zero-shot or few-shot manner Brown et al. (2020); Smith et al. (2022). The zero-shot performance of GPT models further improves when finetuning them on multiple different tasks using task-specific instructions, which is also known as instruction-tuning Sanh et al. (2022); Wei et al. (2022); Min et al. (2022).

In information retrieval, an appealing approach to rank documents is by utilizing language models to compute relevance scores for a query Ponte and Croft (1998). Prior approaches estimated a count-based language model for each document that was used to compute query likelihood scores for ranking Zhai and Lafferty (2001). However, these approaches suffer from issues such as data sparsity. More recent approaches utilize PLMs such as GPT or T5 to compute query likelihood Nogueira dos Santos et al. (2020). To improve ranking accuracy, they perform supervised finetuning using query-document pairs Ju et al. (2021). Our work also utilizes PLMs, but instead, we leverage a larger instruction-tuned language model and apply it in a zero-shot manner without finetuning.

Open-domain QA involves producing answers to information-seeking questions from large document collections. Typical approaches consist of retriever and reader networks, where the retriever identifies a small number of documents to aid the reader in producing answers Chen et al. (2017). To be scalable, retrievers are often modeled using dual-encoders Lee et al. (2019) or multi-vector encoders Zhou and Devlin (2021), and re-rankers are then employed to further improve retrieval accuracy Nogueira et al. (2020). Given retrieved documents, a reader is then trained to generate a short answer to the question Izacard and Grave (2021b); Sachan et al. (2021b).

Conclusions and Future Work

In this work, we propose UPR, an approach to perform unsupervised passage re-ranking for open-domain retrieval. To re-rank, UPR computes a relevance score for question generation conditioned on each retrieved passage using pre-trained language models. Extensive experiments across a wide range of QA datasets show that an unsupervised pipeline consisting of retriever and UPR greatly outperforms strong supervised retriever models. In addition, UPR further improves the performance of supervised retrievers. On the open-domain QA task, by just performing inference using re-ranked passages and a pre-trained reader model, we achieve new state-of-the-art results.

UPR presents several interesting directions for future work. First, its applications to other retrieval tasks such as improving source-code retrieval based on textual queries can be explored. Second, another promising direction would be to tune instructions according to the nature of the retrieval tasks. For instance, when retrieving similar sentences in the BEIR benchmark, variations of the instruction prompt used in UPR can be explored. Finally, it would also be interesting to investigate the extent to which specialized language models such as the ones finetuned to generate questions using passage-questions data would further help in improving retrieval.

Acknowledgements

This work was done during the first author’s internship at Meta AI Research. The authors would like to thank Dmytro Okhonko and the anonymous reviewers for providing useful suggestions and feedback about this work that helped us to improve the paper. We would also like to thank the administrators of the compute cluster at FAIR, Meta AI for their assistance in facilitating experimental runs.

Limitations

A limitation of UPR is that re-ranking a large pool of passages can have a high latency, as it involves performing cross-attention whose complexity is proportional to the product of the numbers of question and passage tokens and the number of layers of the pre-trained language model (PLM). We have also discussed this quantitatively in Sec 4.2.3. UPR also shares the inherent limitation associated with all re-ranking approaches in that its maximum possible performance is dependent on the first-stage retrieval. For example, when processing the top-1000 retrieved passages, the upper limit of top-100 re-ranking accuracy would be the top-1000 accuracy of the retrieved passages. Finally, we want to remark that UPR results might be sensitive to the training data used to train the PLM. As a result, in a domain-specific retrieval or question-answering task, PLMs trained on in-domain text Gururangan et al. (2020) are expected to be more accurate than those trained on broad-coverage text.

Ethics Statement

The experiments conducted in the paper demonstrate the usefulness of large language models for information retrieval tasks when using English Wikipedia as the evidence source. However, when deployed in production, our work shares the typical ethical risks associated with large language models. There are chances that the re-ranked results may not be fair to all communities. This can potentially lead to increased discrimination and exclusion of marginalized groups. These risks can also propagate to question-answering applications, for example through the generation of toxic or fake text as answers. Therefore, care should be taken before deploying our approach in real-world or customer-facing applications; it is advisable to conduct tests and benchmark the models covering these aspects.

References

Appendix A Appendix

For supervised retriever training, we use the Adam optimizer Kingma and Ba (2015), a batch size of 128, 1 hard negative example for each positive pair, a learning rate of 2e-5 with a linear decay, a weight decay of 0.1, and train for 3 epochs on SQuAD-Open, 40 epochs on NQ and TriviaQA, and 20 epochs on WebQ. Model training was performed on 16 GPUs.

For FiD reader training, we use the Adam optimizer Kingma and Ba (2015), a batch size of 64, a learning rate of 2e-5 with a linear decay, a weight decay of 0.1, gradient clipping with a maximum value of 1.0, and train for 3 epochs on SQuAD-Open and 10 epochs on NQ and TriviaQA. Model training was performed on 64 GPUs. For our experiments, we use the Fusion-in-Decoder model implementation from the open-source repository (https://github.com/DevSinghSachan/emdr2) Sachan et al. (2021b).

A.2 Instruction Prompt Selection

We cross-validate using several prompts formulated as natural language instructions to aid in question reconstruction. We re-rank top-1000 BM25 passages of NQ development set using different instructions including the case with no instruction. Results in Table 8 reveal that when prompted via instructions, PLMs perform better than the case when not given any instructions. We also note that simple but effective instructions can lead to a higher top-1 accuracy. Due to its better accuracy, we use the instruction "Please write a question based on this passage" for all the experiments in this paper.

A.3 Analysis

In Table 9, we present some examples of questions and their BM25 retrieved and UPR re-ranked top-1 passages. While BM25 retrieves passages with high lexical overlap, UPR, owing to its cross-attention mechanism, is better able to understand the relationships between tokens in the question and passage and thus leads to an improvement in passage rankings over the first-stage retriever. In the last example, we note that although the BM25 retrieved passage contains the ground-truth answer, it should be considered a false positive result. On the other hand, UPR leads to the correctly ranked passage but the exact match evaluation metric marks it as incorrect as it does not match the full ground-truth answer.

A.4 BEIR Benchmark Results

We re-rank the top-1000 documents from the BM25 and Contriever retrievers with the T0-3B pre-trained language model and evaluate performance with NDCG@10 and Recall@100 metrics. We present the results of the individual datasets included in the BEIR benchmark in Table 10. On both the metrics, the initial scores of BM25 are much higher than those of Contriever. After re-ranking, BM25 retriever obtains improvements on 12 out of 15 datasets while Contriever obtains improvements on 13 out of 15 datasets. On average, NDCG@10 scores improve by 3-8% and Recall@100 improves by 5-6%. The performance gap between BM25 and Contriever also narrows down after re-ranking.

Due to the diversity in datasets, there is a considerable variation in performance gains across them. In the case of BM25, the highest relative performance gains are obtained by UPR on datasets containing information-seeking questions such as FIQA-2018, NQ, MS-Marco, etc. Similarly, for Contriever, the relative gains are much higher for the datasets of Trec-Covid, NQ, HotpotQA, etc., where the queries are questions. On other datasets, the relative gains from re-ranking are moderate to little.

For both the retrievers, we also observe a drop in performance on the fact-verification datasets of Fever and Climate-fever (results highlighted in red color in Table 10). In addition, re-ranking BM25 also results in a drop in performance on the Touche-2020 dataset. We note that in these datasets, the queries are statements such as claims, which presents a different retrieval challenge for re-ranking. We anticipate that by experimenting with different prompt instructions in UPR to better suit the end-task and cross-validating with the number of top-K documents to be re-ranked, results can be improved on these datasets. However, we leave these explorations as a part of future work.

Appendix B Reproducibility Checklist

A clear description of the mathematical setting, algorithm, and/or model: This is provided in the main paper in Sec. 2.

A link to a downloadable source code, with specification of all dependencies, including external libraries: We are submitting the source code as a zip file.

A description of computing infrastructure used: We run experiments on a cluster containing V100 GPUs where each node’s specifications are: Number of CPUs: 256, Physical Memory: 1.2TB, GPU model: 8 x Nvidia V100, GPU architecture and memory: Volta/32GB, Arch: x86_64, and Disk size: 4TB. For experiments in Sec. 4.2.3, we used a single node of 8 x A100 GPUs of 40GB memory.

The average runtime for each model or algorithm (e.g., training, inference, etc.), or estimated energy cost: We discuss the average runtime of performing inference with UPR in Sec. 4.2.3. However, we want to highlight that our code was not carefully optimized to minimize runtime or to make optimal use of the hardware resources.

Number of parameters in each model: We provide these details in Sec. 3.4 and Table 7.

Corresponding validation performance for each reported test result: The re-ranking experiments do not require a validation set for model selection, as we only perform inference for each query using the language model and the retrieved passages. If the program committee or reviewers require the validation set performance, we will include it in the Appendix in the final version of the paper. Our ablations and analysis are conducted on the validation sets of the datasets. For the open-domain QA experiments, we also report the performance on the validation set.

Explanation of evaluation metrics used, with links to code: Our evaluation metrics are standard and widely used by the community. We provide their details in the main paper in Sec. 4. The code is submitted with the paper.

B.2 For all results involving multiple experiments, such as hyperparameter search

The exact number of training and evaluation runs: We provide training details for all models in Sec. 3.5.

Hyperparameter configurations for best-performing models: We provide the hyperparameter settings in Appendix A.1.

The bounds for each hyperparameter: As described in Appendix A.1, our model and training settings use standard hyperparameters such as dropout rates $\in[0,1)$ , warmup ratio of the optimizer $\in[0.01,0.05]$ , weight regularization $\in$ , and learning rate $\in[10^{-5},10^{-4}]$ .

Number of hyperparameter search trials: maximum 5.

The method of choosing hyperparameter values (e.g., uniform sampling, manual tuning, etc.) and the criterion used to select among them (e.g., accuracy): For the open-domain QA experiments, we performed manual hyperparameter tuning. We selected the best hyperparameter using EM results on the validation set.

Summary statistics of the results (e.g. mean, variance, error bars, etc.): The re-ranking experiments are based on performing inference with open-source PLMs using a single prompt. As such, these summary statistics are not applicable to UPR. The open-domain QA experiments are computationally expensive, use substantial CPU and GPU resources, and take tens of hours to run. Due to these computational and time constraints, performing multiple runs for each experiment was not feasible. We therefore used the same seed value (1234) for all the training runs.

B.3 For all datasets used

Details of train/validation/test splits: We use the standard training / dev / test splits whose details are provided in Sec. 3.1 and Table 1.

Relevant statistics such as number of examples and label distributions: We provide dataset statistics details in Table 1.

An explanation of any data that were excluded, and all pre-processing steps: We include the relevant details in Sec. 3.

For natural language data, the name of the language(s): All our datasets are in English.

A zip file containing data or link to a downloadable version of the data: All the datasets used in this work are openly available and widely used by the community. Please refer to the respective dataset papers for the download links.

For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control: This is not applicable to this work.