Autoregressive Search Engines: Generating Substrings as Document Identifiers

Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih, Sebastian Riedel, Fabio Petroni

Introduction

Surfacing knowledge from large corpora is a crucial step when dealing with knowledge intensive language tasks Levy et al. (2017); Dinan et al. (2019); Elsahar et al. (2018); Petroni et al. (2021), such as open-domain question answering Voorhees et al. (1999); Joshi et al. (2017); Yang et al. (2018); Kwiatkowski et al. (2019) and fact checking Thorne et al. (2018). A popular paradigm to approach such tasks is to combine a search engine with a machine reader component. The former retrieves relevant context, usually in the form of short passages, which the latter then examines to produce answers Chen et al. (2017); Lewis et al. (2020); Izacard and Grave (2021).

In recent years we have witnessed a surge of research and development in autoregressive language models Radford et al. (2019); Lewis et al. (2019); Raffel et al. (2019); Brown et al. (2020); Rae et al. (2021); Artetxe et al. (2021); Smith et al. (2022), with ever increasing size and natural language understanding (NLU) capabilities. Such models are currently the de-facto implementation of the machine reader component in retrieval-reader architectures, and have contributed to rapid progress on a wide range of benchmarks Joshi et al. (2017); Kwiatkowski et al. (2019); Petroni et al. (2021). However, these tremendous advances in aggressive modelling has yet to bring similar transformational changes in how retrieval is approached.

Transferring the NLU capabilities of modern autoregressive models to retrieval is non-trivial. Some works have demonstrated that knowledge stored in the parameters of these models can be retrieved to some extend by directly generating evidence given a query Petroni et al. (2019, 2020); Roberts et al. (2020). However, such approaches have been shown to be unreliable because of their tendency to hallucinate non-factual content Massarelli et al. (2019); Metzler et al. (2021); Ji et al. (2022). To alleviate this issue, previous work proposes to only use generation for query expansion in traditional search engines (Mao et al., 2021), but these solutions don’t exploit the full potential of autoregressive architecture, such as word order sensitivity and conditional probability modeling, and still lag behind vector-based approaches Karpukhin et al. (2020).

Recently, another line of work has investigated using autoregressive language models to generate identifier strings for documents, as an intermediate target for retrieval, such as Wikipedia page titles (De Cao et al., 2021b), or root-to-leaf paths in a hierarchical cluster tree (Tay et al., 2022). Employing identifiers, rather than generating evidence directly, induces some structure in the search space, (i.e., index documents by their title or their cluster tree) which can be easier to memorize, learn, and retrieve from, than full unstructured passages. Moreover, it is relatively easy to constrain beam search decoding with a prefix tree “index” so that only valid identifiers are generated. As a downside, if appropriate metadata (e.g. titles) are not available, one needs to create the identifiers, hence the structure (e.g. with hierarchical clustering), which has not been thoroughly evaluated on a large-scale benchmark.

In this work, we propose a solution that does not force any structure in the search space, but rather uses all the ngrams occurring in a document as its identifiers. Concretely, we introduce Search Engines with Autoregressive LMs (SEAL), a retrieval solution that combines an autoregressive model, i.e., BART (Lewis et al., 2019), with a compressed full-text substring index, i.e., the FM-Index (Ferragina and Manzini, 2000) — see Figure 1 for an high-level overview. This configuration comes with a twofold benefit: i) we can constrain BART’s generations with the FM-Index, hence preventing the generation of invalid identifiers (i.e., ngrams not occurring in any document); ii) the FM-Index provides information on all documents in the corpus containing a specific ngram (for every decoding step), thus allowing to retrieve them. This setup allows SEAL to generate any span from any position in the corpus, without needing to explicitly encode all substrings in a document. Moreover, we design a novel scoring function to intersect the results of multiple ngrams combining LM probabilities with FM-index frequencies (i.e., number of occurrences of the ngram in the whole corpus).

Our experimental evaluation shows that SEAL matches or outperforms recent retrieval solutions (including autoregressive ones) on Natural Questions Kwiatkowski et al. (2019), while requiring substantially less memory ( $\sim$ 2 to 7 times smaller in footprint). Moreover, SEAL’s intersection formulation improves the state-of-the-art on passage-level retrieval by more than 10 points on the KILT benchmark Petroni et al. (2021), contributing in establishing new state-of-the-art downstream results on multiple datasets when paired with existing reader technologies.

Related Work

One way to approach retrieval with autoregressive models makes use of identifiers, i.e., string pointers to documents that are in some way easier to generate than the full document itself. In tasks where such data is available or relevant, such as Wikipedia-based entity linking (a form of page-level retrieval), titles have been shown to work well as identifiers (De Cao et al., 2021b, a, 2022). However, even on Wikipedia-based benchmarks, titles on their own are not well-suited for retrieval at passage-level, given they can only identify an article (that might contain several passages). In a different direction, Tay et al. (2021) have used hierarchical clustering on contextualized embeddings to create identifiers for arbitrary spans of text. In contrast, in our work the identifiers are corpus string matches, which do not necessarily occur in just one document.

Term Weighting

Virtually all modern approaches to string-matching-based sparse retrieval make use of a bag-of-words assumption, indexing documents with an inverted index, a data structure mapping terms to documents or, more generally, locations in a corpus (Robertson and Zaragoza, 2009). Retrieval performance in this setting depends heavily on term-weighting schemes, with many recent works proposing sophisticated, contextualized weights for both queries, and documents(Dai and Callan, 2019; Gao et al., 2021; Lin and Ma, 2021; Mallia et al., 2021; Dai and Callan, 2020; Bai et al., 2020; Zhao et al., 2021; Formal et al., 2021b, a). Many of these methods are also able to weigh terms that are not present in the query, addressing so-called vocabulary mismatch. In contrast, SEAL generates (and assigns scores to) ngrams of arbitrary size, using the index for both generation and retrieval. Nevertheless, this line of work is partly orthogonal to our own, as many of the proposed techniques could be used to rescore higher-order ngrams.

Query/Document Expansion

A line of research which often involves autoregressive language models is that of document and query expansion. For example, one can augment stored documents by generating possible queries that might be answered by them (Nogueira et al., 2019; Nogueira and Lin, 2021). In the opposite direction, works like GAR (Mao et al., 2021) augment the query by predicting helpful additional terms, such as an answer, sentence containing the answer, or the title of a document where the answer may be found. We note that while query expansion bears a superficial resemblance with SEAL, the approaches are conceptually distinct. While query expansion methods rely on a stand-alone black-box retriever, in our work the boundary between generation and retrieval is blurred, since our identifiers are grounded passage spans.

Query Likelihood Models

Another connected strand of research is that of query likelihood models, which, in their latest incarnations, use autoregressive models to (re)rank passages according to the probability $P(q|p)$ of a query $q$ given the passage $p$ (Nogueira dos Santos et al., 2020; Zhuang and Zuccon, 2021; Lesota et al., 2021). In our case, the autoregressive architecture models the likelihood of an ngram given the query, i.e., $P(n|q)$ .

“Learning to Google”

Recently, language models have been shown to be able to directly generate search queries for modern web search engines either with finetuning on demonstrations (Komeili et al., 2021; Shuster et al., 2022) and human preferences (Nakano et al., 2021) or via prompting (Lazaridou et al., 2022). In our case, there is no black-box retrieval system that is queried. Rather, the white-box index determines both the generated ngrams and the search process.

Background

In retrieval, the automatic system is required to return an ordered list of documents $d_{1},d_{2},\dots,d_{n}$ from a retrieval corpus $\mathcal{R}$ , given a query $q$ . Both queries and documents are texts, i.e., lists of tokens $\langle t_{1},t_{2},\dots,t_{N}\rangle$ , where each token $t$ is drawn from a vocabulary $V$ . A span of tokens in a text is called an ngram; ngrams of size $1$ are known as unigrams. We denote with $F(n,\mathcal{R})$ the frequency of an ngram $n$ in $\mathcal{R}$ , i.e., the total number of times it appears in the whole retrieval corpus.

Our method requires a data structure that can support the efficient identification of occurring substrings to guarantee that all decoded sequences are located somewhere in the retrieval corpus. Moreover, to perform retrieval, we require the ability to identify which documents the generated ngrams appear in. Neither inverted indices, (which have no efficient way to to search for phrases of arbitrary length), nor prefix trees, (which would force us to explicitly encode all $k$ suffixes in a document), are viable options. The core data structure that satisfies our requirements is the FM-index (Ferragina and Manzini, 2000), i.e., a compressed suffix array that, as a self-index, requires no additional storage for the original text. FM-index space requirements are linear in the size of the corpus, and, with small vocabularies such as those used by modern subword-based language models, is thus usually significantly smaller than the uncompressed corpus. The FM-index can be used to count the frequency of any sequence of tokens $n$ in $O(|n|\text{log}|V|)$ , i.e., independently from the size of the corpus itself. For constrained decoding, the list of possible token successors can be obtained in $O(|V|\text{log}|V|)$ . Internally, the FM-index relies on the Burrows-Wheeler Transform (Burrows and Wheeler, 1994), or BWT, an invertible transformation that permutes a string to make it easier to compress, defined as follows: all the rotations of the string are sorted lexicographically and laid out in a matrix; the last column of the matrix is the strings’s BWT.Since our corpus contains multiple documents, we concatenate them with a separator token. For example, given the string $CABAC$ , the corresponding matrix would be:

where $\$ $is a special end-of-string token. The first (F) and last (L) columns are the only ones that will be explicitly stored in the FM-index; F is just an array of runs (i.e., sequences of repeated tokens), due to the rotations being sorted, so it can be represented with one count for each alphabet symbol; L, the string’s BWT, will be stored in a data structure known as the Wavelet Tree (Grossi et al., 2003), which allows efficient rank-select-access queries, while exploiting the compressibility induced by the transformation. FM-indices have the useful property that for each symbol, the relative rank stays the same: that is, the$ i $th occurrence of a symbol$ \sigma $in F points to the same location in the corpus of the$ i $th occurrence of$ \sigma $in L. Thanks to this property, we can locate any string$ \langle\sigma_{1},\sigma_{2},\dots,\sigma_{n}\rangle $in the index by starting from$ \sigma_{n} $and going backwards. First, we select the contiguous range of rows corresponding to the symbol$ \sigma_{n} $in F, then we check the ranks of the first and last occurrences of$ \sigma_{n-1} $in the same range of rows in$ L $. We use the ranks to select a new, smaller or equal range of rows looking up the symbol$ \sigma_{n-1} $in$ F$. The procedure can be applied iteratively to find ngrams of arbitrary size.

Method

In our retrieval methodology, SEAL, we generate multiple ngrams, conditioning on a query. The ngrams are then used to find the documents they appear in within the corpus, which are then returned to the user. In Figure 1 we show this process at a high-level. We use our indexing structure, i.e., the FM-index to constrain decoding so that each ngram occurs at least once in the retrieval corpus. Jointly, we use the FM-index to efficiently find matching documents. Documents are ranked using the scores of the generated ngrams.

We generate ngrams identifiers with constrained beam search, using the FM-index to identify the set of possible next tokens in at most $O(|V|\text{log}|V|)$ : tokens corresponding to unattested continuations are blocked by masking the logit to $-\infty$ . As a result, after a single decoding pass, we get a set of ngrams ( $K$ ), along with their autoregressively-computed probabilities according to the model. It is also trivial to find the positions in the corpus where the decoded ngrams appear, as constrained decoding already requires selecting the relevant range of rows in the FM-index. Note that autoregressive scoring entails monotonically decreasing scores—any string will be assigned a lower probability than any of its prefixes. To address this issue, we use fixed-length ngrams. Each document is assigned the score ( $P(n|q)$ ) of its most probable decoded occurring ngram. We refer to this as the LM scoring.

Factoring in FM-index frequencies

To counterbalance the monotonic probability decrease, we integrate in scoring unconditional ngram probabilities, computed as normalized index frequencies:

This also enables us to promote distinctive ngrams, i.e., those that have high probability according to the model and low probability according to the FM-index. We take inspiration from the theory behind TF-IDF and BM25 (Robertson and Zaragoza, 2009) and use the following scoring function:

This formulation addresses the problem of length, as the unconditional probability of an ngram will also be equal or lower than that of any of its prefixes. To make better use of the computational resources, we slightly modify the beam search implementation to keep track of all the partially decoded sequences that have been considered. Thanks to this, we score a larger number of ngrams than the size of the beam. We refer to this formulation as the LM+FM scoring.

An Intersective Scoring for Multiple Ngrams

One problem with the previous scoring formulations is that it is impossible to break ties among documents whose highest scoring ngram is the same, as they receive exactly the same score. Moreover, it might be difficult to capture all relevant information within a document by considering only a single ngram, for instance when salient ngrams are non-contiguous (e.g., separated by unrelated text). To address these issues we propose a novel scoring formulation that aggregates the contribution of multiple ngrams contained in the same document. To avoid repeated scoring of overlapping ngrams, for each document $d\in\mathcal{R}$ we only consider a subset of the generated ngrams $K^{(d)}\subset K$ . An ngram $n$ belongs to $K^{(d)}$ if there is at least one occurrence of $n$ in $d$ that does not overlap with an occurrence of another ngram $n^{\prime}$ such that a) $n^{\prime}\in K^{(d)}$ b) $w(n^{\prime},q)>w(n,q)$ . The document-level score, then, is the weighted sum of all ngrams in $K^{(d)}$ :

where $\alpha$ is a hyperparameter and the weight $\text{cover}(n,K)$ (controlled by the second hyperparameter $\beta$ ) is a function of how many ngram tokens are not included in the coverage set $C(n,K)\subset V$ , i.e., the union of all tokens in ngrams with a higher score. We define this coverage weight as follows:

The purpose of the coverage weight is to avoid the overscoring of very repetitive documents, where many similar ngrams are matched. Note that by saving the probability distribution at the first decoding step we can compute scores for all unigrams with no additional forward pass. We refer to this last approach, which can be thought of as a higher-order generalization of the bag-of-words assumption, as the LM+FM intersective scoring.

Experimental Setting

Our experimental setting evaluates SEAL on English knowledge-intensive NLP tasks. Each considered dataset is a collection of queries, each of which can be answered by looking for piece(s) of evidence in the corpus. We consider both an in vivo evaluation, in which we assess the model by looking at how well the document ranking matches with the ground truth, and, in addition, we perform a downstream evaluation, in which we feed the retrieved documents to a trained reader, that uses the documents to generate the answer.

Natural Questions (NQ) is dataset containing query-document pairs, where the query is a question (e.g., “who wrote photograph by ringo starr”), and the document is a Wikipedia page, in which a span is marked as an answer (Kwiatkowski et al., 2019). We experiment on both the customary retrieval setup used by, among others, Karpukhin et al. (2020) and Mao et al. (2021), and the substantially different setup used by Tay et al. (2022). We refer to these two settings as, respectively, NQ and NQ320 $k$ . In NQ, retrieval is performed on an entire Wikipedia dump, chunked in around $21$ M passages of 100 tokens. Performance is measured as accuracy@ $k$ , i.e., the fraction of instances for which at least one of the top- $k$ retrieved passages contains the answer. NQ320 $k$ is a much more restricted setting, in which the retrieval set is limited to the union of all ground truth document in the training, dev or test set. Different revisions of the same Wikipedia page count as different documents. Note that the exact splits used by Tay et al. (2022), the retrieval corpus and the preprocessing code have not been yet released at the time of writing. Therefore, we have tried to replicate the setting as closely as possible, but the exact numbers are not precisely comparable with those reported in the original paper. In NQ320 $k$ , performance is measured as hits@ $k$ , i.e, the fraction of instances for which at least one of the top- $k$ retrieved passages is in the ground truth.

KILT

is a comprehensive benchmark collecting different datasets including question answering, fact checking, dialogue, slot filling, and entity linking (Petroni et al., 2021). All these tasks are solvable by retrieving information from a unified corpus — a Wikipedia dump. In KILT, the evidence is usually the paragraph that contains the answer. Following Maillard et al. (2021), we have re-chunked KILT’s retrieval corpus, which is originally paragraph-based, in around $36$ M passages of 100 tokens. We do not use the entity linking and ELI5 KILT tasks, where a ground truth passage is not provided in the training set. KILT’s retrieval performance is measured with R-precision, a precision-oriented measure that considers only gold documents as correct answers, not just any document containing the answer. R-precision can be computed at either passage level or at page level.

2 SEAL configuration

We finetune BART large (Lewis et al., 2019) to generate ngrams of length $k=10$ from the ground truth document. Since there are $|d|-k$ ngrams in a document $d$ , we sample (with replacement) $10$ ngrams from it, biasing the distribution in favor of ngrams with a high character overlap with the query. We also add the title of the document to the set of training ngrams. To expose the model to more possible pieces of evidence, we also add different “unsupervised” examples for each document in the retrieval corpus to the training set. In each of these examples the model takes as input a uniformly sampled span from the document, and predicts either another sampled span, or the title of the page. We append special tokens to the input to signal to the model a) whether the pair comes from the supervised or unsupervised training pairs (in the same spirit as the co-training task prompts used by Tay et al. (2022)) b) whether a title or span is expected as output. On KILT we train SEAL on all datasets at once.

Training Hyperparameters

We finetune the model using fairseq. We use Adam (Kingma and Ba, 2015) with a learning rate of $3\cdot 10^{-5}$ , warming up for $500$ updates, then using polynomial decay for at $800k$ updates, evaluating every $15k$ steps. We stop the training run if the loss on the development set stops improving for $5$ evaluation passes. We use label smoothing ( $0.1$ ), weight decay ( $0.01$ ), and gradient norm clipping ( $0.1$ ). We train in batches of $4096$ tokens on $8$ GPUs.

Index

We use the C++ FM-index implementation in sdsl-lite. While the FM-index construction (which requires a sort of all rotations) takes around 6 hours in our single-threaded implementation, parallel algorithms are available (Labeit et al., 2017). Each document is encoded as the subword tokenization of the concatenation of the title and the passage, separated by a special token. We report in Table 1 the index statistics for Natural Questions. As can be seen, SEAL’s FM-index is more than 7 times lighter compared to DPR’s full document embeddings for exact inner product search, and needs neither a GPU for search on top of that, nor separate storage for the text itself. While vector compression methods can reduce dense retrievers’ index size, this still comes at the expense of performance (Yamada et al., 2021; Lewis et al., 2021a). In addition, our the size of our index is less than 50% of that of the well-optimized Lucene BM25 index used by pyserini, but also roughly 65% of the uncompressed plain text itself.

Inference

We decode for $10$ timesteps with a beam size of $15$ , and set the hyperparameters $\alpha$ , and $\beta$ to, respectively, $2.0$ and $0.8$ . The hyperparameters have been tuned on the Natural Questions development set (§5.1). In the constrained decoding stage, we force part of the generated ngrams to match document titles.

3 Retriever Baselines

We compare SEAL against well-established systems in the literature on each benchmark. On NQ and NQ320 $k$ we also compare against our BART-based replication of DSI (Tay et al., 2022, DSI-BART). On NQ320k, a page-level benchmark, we include our own replication of GENRE (De Cao et al., 2021b). Unless otherwise specified, we use pyserini to compute the BM25 baseline. For other systems, we either take figures from the literature, or use publicly released model predictions.

On NQ320 $k$ , bert-base-cased is used to compute the embeddings for the clustering. On regular NQ, we use the public precomputed DPR embeddings. To compare fairly against SEAL, we fine-tune the same encoder-decoder backbone, i.e., BART large.

4 Reader

For downstream results, we use the Fusion-in-Decoder abstractive reader (Izacard and Grave, 2021), which takes in the query along with 100 contexts and produces a task-specific answer. We train FiD on training set predictions.

Results

We report results on NQ320 $k$ in Table 2. SEAL outperforms BM25 and DSI-BART in hits@10 in all its formulations. When taking into account ngram frequencies (i.e., LM+FM), SEAL achieves even higher results than GENRE, despite the fact that this benchmark only requires page-level retrieval capabilities (that is the focus of GENRE). Finally, our intersective formulation achieves the highest results, both in hits@1 and @10, indicating that multiple ngrams identifiers might capture complementary information, which can be aggregated for stronger performances.

Natural Questions

We report in Table 3 the results of our evaluation on Natural Questions, a passage-level retrieval benchmark with a larger collection of documents (i.e., ~ $21$ M w.r.t. $200$ k in NQ320 $k$ ). In this setting, the gap in performance between DSI-BART and SEAL is larger, possibly because memorizing documents identifiers in the parameters of the model becomes more challenging with larger corpora. Remarkably, the intersective formulation of SEAL achieves results comparable or superior to more established retrieval paradigms (e.g., BM25, DPR and GAR). To better understand the generalization capabilities of our retrieval solution we use the question/answer overlap split of Lewis et al. (2021b). This study reveals that SEAL achieves the highest performance for question/answer pairs never seen during training (i.e., no overlap), suggesting a better ability to generalize to completely novel questions with novel answers (e.g., $3.5$ points better than GAR on average).

KILT

We report retrieval results at passage level on the KILT benchmark in Table 4.We report page-level and KILT-score results in the Appendix (§A). SEAL outperforms DPR by more than 10 points on average in passage-level R-precision, indicating that our method is more precise in surfacing ground truth evidence as the first result. Moreover, SEAL also performs better than MT-DPR (multi-task DPR) even when the latter is pretrained on tens of millions of questions from PAQ (Lewis et al., 2021c), a technique that can drastically improve results and that could potentially bring benefits to our method as well (a task we leave for future work). When it comes to downstream performances (Table 5), FiD with passages retrieved by intersective SEAL establishes a new state-of-the-art on 4 datasets out of 7 (FEVER, zsRE, NQ, HoPo), and achieves very competitive results on the remaining 3.

Speed and constrained decoding

The inference speed of SEAL is directly proportional to the beam size, with a limited overhead added by constrained decoding. On the Natural Questions test set, for instance, retrieval with the intersective scoring requires on our 1 GPU evaluation setup ~16 minutes and ~35 minutes with, respectively, a beam size of $5$ or $15$ . Mao et al. (2021) report a lower runtime for GAR (~5 minutes), and a comparable one for DPR (~30 minutes). Note that more efficient approaches to constrained decoding have been proposed (e.g., De Cao et al. (2021a)) and we leave their application to SEAL as future work.

Ablation studies

In Table 6 we report performances on Natural Questions for various configurations of SEAL. While, in general, performances increase with a larger beam, diminishing returns (or even a slight performance decrease) are encountered between a value of $10$ and $15$ . Disabling constrained decoding and discarding a posteriori all generated ngrams that don’t appear in the corpus, results in slightly lower performances.

Qualitative Analysis

In Table 7, we show examples of ngrams predicted by SEAL (trained on KILT) given the query “can you predict earthquakes”. SEAL is able to rephrase the query in ways that preserve its lexical material producing ngrams such as earthquakes can be predicted, used to predict earthquakes etc. Morevoer, the model is also able to explore more diverse regions of the output space, overcoming the vocabulary mismatch problem: ngrams contain related tokens like the subword seism- and the word forecast. SEAL’s LM+FM scoring is also able to assign a score below (and, thus, exclude from the search), unrelated ngrams that are considered by the beam because of their promising start, such as “Seismic risk in Malta @@”.

Discussion

With SEAL we present solution that could potentially find applications outside information retrieval (e.g., enforce generated substrings come from a white list of trusted sources). While we conduct our experiments with a model of ~ $400$ M parameters (i.e., BART) for fast iterations, we believe the use of larger models could considerably improve performance. Changing the model would not affect the size of the index nor the cost of using it — $O(|n|\text{log}|V|)$ for finding an ngram $n$ . Moreover, we believe that indexing very large corpora (e.g., the web) could be done more efficiently than existing attempts (e.g., Piktus et al. (2021)) given the light memory footprint. Finally, dynamic variants (Gerlach, 2007; Salson et al., 2009) could allow the update of the FM-index on the fly without the need of re-indexing. While out of the scope of the current paper, we plan to tackle some of these scaling challenges in future work.

Conclusion

In this paper we present SEAL, a novel retrieval system that combines an autoregressive language model with a compressed full-text substring index. Such combination allows to constraint the generation of existing ngrams in a corpus and to jointly retrieve all the documents containing them. Empirically, we show an improvement of more than $10$ points in average passage-level R-precision on KILT, and establish new state-of-the-art downstream performance on 4 out 7 datasets when paired with a reader model. While our results show that SEAL could already compete with more established retrieval systems, we believe there is potential in exploring the use of existing (or yet to come) larger autoregressive models.

Acknowledgements

We thank Aleksandra Piktus, Edoardo Barba, Niccolò Campolungo, and Pere-Lluis Huguet Cabot for their helpful comments and suggestions.

References

Appendix A Additional KILT results

We report in Table 8 page-level results on the KILT test set. On most datasets, SEAL obtains results which are comparable or better than other systems performing page-level retrieval. Furthermore, are results are within two points of the average performance of GENRE, i.e., a system that directly targets the page-level setting. Comparing KILT-scores (Table 9), i.e., a metric combining downstream performances and page-level R-precision, we achieve state-of-the-art results on 4 out of 7 datasets.

Appendix B Impact of unsupervised examples

SEAL is trained with both supervised and unsupervised examples. In Table 10 we report ablated results, by which we assess the importance of both kind of training examples. The addition of unsupervised examples improves purely supervised training by one point (A@100). Only training with unsupervised examples results in performances which are slightly below BM25’s.