Phrase Retrieval Learns Passage Retrieval, Too

Jinhyuk Lee, Alexander Wettig, Danqi Chen

Introduction

Dense retrieval aims to retrieve relevant contexts from a large corpus, by learning dense representations of queries and text segments. Recently, dense retrieval of passages (Lee et al., 2019; Karpukhin et al., 2020; Xiong et al., 2021) has been shown to outperform traditional sparse retrieval methods such as TF-IDF and BM25 in a range of knowledge-intensive NLP tasks Petroni et al. (2021), including open-domain question answering (QA) (Chen et al., 2017), entity linking (Wu et al., 2020), and knowledge-grounded dialogue Dinan et al. (2019).

One natural design choice of these dense retrieval methods is the retrieval unit. For instance, the dense passage retriever (DPR) Karpukhin et al. (2020) encodes a fixed-size text block of 100 words as the basic retrieval unit. On the other extreme, recent work Seo et al. (2019); Lee et al. (2021) demonstrates that phrases can be used as a retrieval unit. In particular, Lee et al. (2021) show that learning dense representations of phrases alone can achieve competitive performance in a number of open-domain QA and slot filling tasks. This is particularly appealing since the phrases can directly serve as the output, without relying on an additional reader model to process text passages.

In this work, we draw on an intuitive motivation that every single phrase is embedded within a larger text context and ask the following question: If a retriever is able to locate phrases, can we directly make use of it for passage and even document retrieval as well? We formulate phrase-based passage retrieval, in which the score of a passage is determined by the maximum score of phrases within it (see Figure 1 for an illustration). By evaluating DensePhrases Lee et al. (2021) on popular QA datasets, we observe that it achieves competitive or even better passage retrieval accuracy compared to DPR, without any re-training or modification to the original model (Table 1). The gains are especially pronounced for top- $k$ accuracy when $k$ is smaller (e.g., 5), which also helps achieve strong open-domain QA accuracy with a much smaller number of passages as input to a generative reader model Izacard and Grave (2021b).

To better understand the nature of dense retrieval methods, we carefully analyze the training objectives of phrase and passage retrieval methods. While the in-batch negative losses in both models encourage them to retrieve topically relevant passages, we find that phrase-level supervision in DensePhrases provides a stronger training signal than using hard negatives from BM25, and helps DensePhrases retrieve correct phrases, and hence passages. Following this positive finding, we further explore whether phrase retrieval can be extended to retrieval of coarser granularities, or other NLP tasks. Through fine-tuning of the query encoder with document-level supervision, we are able to obtain competitive performance on entity linking (Hoffart et al., 2011) and knowledge-grounded dialogue retrieval (Dinan et al., 2019) in the KILT benchmark (Petroni et al., 2021).

Finally, we draw connections to multi-vector passage encoding models (Khattab and Zaharia, 2020; Luan et al., 2021), where phrase retrieval models can be viewed as learning a dynamic set of vectors for each passage. We show that a simple phrase filtering strategy learned from QA datasets gives us a control over the trade-off between the number of vectors per passage and the retrieval accuracy. Since phrase retrievers encode a larger number of vectors, we also propose a quantization-aware fine-tuning method based on Optimized Product Quantization (Ge et al., 2013), reducing the size of the phrase index from 307GB to 69GB (or under 30GB with more aggressive phrase filtering) for full English Wikipedia, without any performance degradation. This matches the index size of passage retrievers and makes dense phrase retrieval a practical and versatile solution for multi-granularity retrieval.

Background

Given a set of documents $\mathcal{D}$ , passage retrieval aims to provide a set of relevant passages for a question $q$ . Typically, each document in $\mathcal{D}$ is segmented into a set of disjoint passages and we denote the entire set of passages in $\mathcal{D}$ as $\mathcal{P}=\{p_{1},\dots,p_{M}\},$ where each passage can be a natural paragraph or a fixed-length text block. A passage retriever is designed to return top- $k$ passages $\mathcal{P}_{k}\subset\mathcal{P}$ with the goal of retrieving passages that are relevant to the question. In open-domain QA, passages are considered relevant if they contain answers to the question. However, many other knowledge-intensive NLP tasks (e.g., knowledge-grounded dialogue) provide human-annotated evidence passages or documents.

While traditional passage retrieval models rely on sparse representations such as BM25 (Robertson and Zaragoza, 2009), recent methods show promising results with dense representations of passages and questions, and enable retrieving passages that may have low lexical overlap with questions. Specifically, Karpukhin et al. (2020) introduce DPR that has a passage encoder $E_{p}(\cdot)$ and a question encoder $E_{q}(\cdot)$ trained on QA datasets and retrieves passages by using the inner product as a similarity function between a passage and a question:

For open-domain QA where a system is required to provide an exact answer string $a$ , the retrieved top $k$ passages $\mathcal{P}_{k}$ are subsequently fed into a reading comprehension model such as a BERT model (Devlin et al., 2019), and this is called the retriever-reader approach (Chen et al., 2017).

Phrase retrieval

While passage retrievers require another reader model to find an answer, Seo et al. (2019) introduce the phrase retrieval approach that encodes phrases in each document and performs similarity search over all phrase vectors to directly locate the answer. Following previous work (Seo et al., 2018, 2019), we use the term ‘phrase’ to denote any contiguous text segment up to $L$ words (including single words), which is not necessarily a linguistic phrase and we take phrases up to length $L=20$ . Given a phrase $s^{(p)}$ from a passage $p$ , their similarity function $f$ is computed as:

where $E_{s}(\cdot)$ and $E_{q}(\cdot)$ denote the phrase encoder and the question encoder, respectively. Since this formulates open-domain QA purely as a maximum inner product search (MIPS), it can drastically improve end-to-end efficiency. While previous work Seo et al. (2019); Lee et al. (2020) relied on a combination of dense and sparse vectors, Lee et al. (2021) demonstrate that dense representations of phrases alone are sufficient to close the performance gap with retriever-reader systems. For more details on how phrase representations are learned, we refer interested readers to Lee et al. (2021).

Phrase Retrieval for Passage Retrieval

Phrases naturally have their source texts from which they are extracted. Based on this fact, we define a simple phrase-based passage retrieval strategy, where we retrieve passages based on the phrase-retrieval score:

where $\mathcal{S}(p)$ denotes the set of phrases in the passage $p$ . In practice, we first retrieve a slightly larger number of phrases, compute the score for each passage, and return top $k$ unique passages.In most cases, retrieving $2k$ phrases is sufficient for obtaining $k$ unique passages. If not, we try $4k$ and so on. Based on our definition, phrases can act as a basic retrieval unit of any other granularity such as sentences or documents by simply changing $\mathcal{S}(p)$ (e.g., $s^{(d)}\in\mathcal{S}(d)$ for a document $d$ ). Note that, since the cost of score aggregation is negligible, the inference speed of phrase-based passage retrieval is the same as for phrase retrieval, which is shown to be efficient in Lee et al. (2021). In this section, we evaluate the passage retrieval performance (Eq. (3)) and also how phrase-based passage retrieval can contribute to end-to-end open-domain QA.

We use two open-domain QA datasets: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), following the standard train/dev/test splits for the open-domain QA evaluation. For both models, we use the 2018-12-20 Wikipedia snapshot. To provide a fair comparison, we use Wikipedia articles pre-processed for DPR, which are split into 21-million text blocks and each text block has exactly 100 words. Note that while DPR is trained in this setting, DensePhrases is trained with natural paragraphs.We expect DensePhrases to achieve even higher performance if it is re-trained with 100-word text blocks. We leave it for future investigation.

Models

For DPR, we use publicly available checkpointshttps://github.com/facebookresearch/DPR. trained on each dataset (DPR♢) or multiple QA datasets (DPR♠), which we find to perform slightly better than the ones reported in Karpukhin et al. (2020). For DensePhrases, we train it on Natural Questions (DensePhrases♢) or multiple QA datasets (DensePhrases♠) with the code provided by the authors.DPR♠ is trained on NaturalQuestions, TriviaQA, CuratedTREC (Baudiš and Šedivỳ, 2015), and WebQuestions (Berant et al., 2013). DensePhrases♠ additionally includes SQuAD (Rajpurkar et al., 2016), although it does not contribute to Natural Questions and TriviaQA much. Note that we do not make any modification to the architecture or training methods of DensePhrases and achieve similar open-domain QA accuracy as reported. For phrase-based passage retrieval, we compute Eq. (3) with DensePhrases and return top $k$ passages.

Metrics

Following previous work on passage retrieval for open-domain QA, we measure the top- $k$ passage retrieval accuracy (Top- $k$ ), which denotes the proportion of questions whose top $k$ retrieved passages contain at least one of the gold answers. To further characterize the behavior of each system, we also include the following evaluation metrics: mean reciprocal rank at $k$ (MRR@ $k$ ) and precision at $k$ (P@ $k$ ). MRR@ $k$ is the average reciprocal rank of the first relevant passage (that contains an answer) in the top $k$ passages. Higher MRR@ $k$ means relevant passages appear at higher ranks. Meanwhile, P@ $k$ is the average proportion of relevant passages in the top $k$ passages. Higher P@ $k$ denotes that a larger proportion of top $k$ passages contains the answers.

Results

As shown in Table 1, DensePhrases achieves competitive passage retrieval accuracy with DPR, while having a clear advantage on top-1 or top-5 accuracy for both Natural Questions (+6.9% Top-1) and TriviaQA (+8.1% Top-1). Although the top-20 (and top-100, which is not shown) accuracy is similar across different models, MRR@20 and P@20 reveal interesting aspects of DensePhrases—it ranks relevant passages higher and provides a larger number of correct passages. Our results suggest that DensePhrases can also retrieve passages very accurately, even though it was not explicitly trained for that purpose. For the rest of the paper, we mainly compare the DPR♠ and DensePhrases♠ models, which were both trained on multiple QA datasets.

2 Experiment: Open-domain QA

Recently, Izacard and Grave (2021b) proposed the Fusion-in-Decoder (FiD) approach where they feed top 100 passages from DPR into a generative model T5 (Raffel et al., 2020) and achieve the state-of-the-art on open-domain QA benchmarks. Since their generative model computes the hidden states of all tokens in 100 passages, it requires large GPU memory and Izacard and Grave (2021b) used 64 Tesla V100 32GB for training.

In this section, we use our phrase-based passage retrieval with DensePhrases to replace DPR in FiD and see if we can use a much smaller number of passages to achieve comparable performance, which can greatly reduce the computational requirements. We train our model with 4 24GB RTX GPUs for training T5-base, which are more affordable with academic budgets. Note that training T5-base with 5 or 10 passages can also be done with 11GB GPUs. We keep all the hyperparameters the same as in Izacard and Grave (2021b).We also accumulate gradients for 16 steps to match the effective batch size of the original work.

As shown in Table 2, using DensePhrases as a passage retriever achieves competitive performance to DPR-based FiD and significantly improves upon the performance of original DensePhrases (NQ $=41.3$ EM without a reader). Its better retrieval quality at top- $k$ for smaller $k$ indeed translates to better open-domain QA accuracy, achieving +6.4% gain compared to DPR-based FiD when $k=5$ . To obtain similar performance with using 100 passages in FiD, DensePhrases needs fewer passages ( $k=25$ or $50$ ), which can fit in GPUs with smaller RAM.

A Unified View of Dense Retrieval

As shown in the previous section, phrase-based passage retrieval is able to achieve competitive passage retrieval accuracy, despite that the models were not explicitly trained for that. In this section, we compare the training objectives of DPR and DensePhrases in detail and explain how DensePhrases learns passage retrieval.

Both DPR and DensePhrases set out to learn a similarity function $f$ between a passage or phrase and a question. Passages and phrases differ primarily in characteristic length, so we refer to either as a retrieval unit $x$ .Note that phrases may overlap, whereas passages are usually disjoint segments with each other. DPR and DensePhrases both adopt a dual-encoder approach with inner product similarity as shown in Eq. (1) and (2), and they are initialized with BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2020), respectively.

These dual-encoder models are then trained with a negative log-likelihood loss for discriminating positive retrieval units from negative ones:

where $x^{+}$ is the positive phrase or passage corresponding to question $q$ , and $\mathcal{X}^{-}$ is a set of negative examples. The choice of negatives is critical in this setting and both DPR and DensePhrases make important adjustments.

In-batch negatives are a common way to define $\mathcal{X}^{-}$ , since they are available at no extra cost when encoding a mini-batch of examples. Specifically, in a mini-batch of $B$ examples, we can add $B-1$ in-batch negatives for each positive example. Since each mini-batch is randomly sampled from the set of all training passages, in-batch negative passages are usually topically negative, i.e., models can discriminate between $x^{+}$ and $\mathcal{X}^{-}$ based on their topic only.

Hard negatives

Although topic-related features are useful in identifying broadly relevant passages, they often lack the precision to locate the exact passage containing the answer in a large corpus. Karpukhin et al. (2020) propose to use additional hard negatives which have a high BM25 lexical overlap with a given question but do not contain the answer. These hard negatives are likely to share a similar topic and encourage DPR to learn more fine-grained features to rank $x^{+}$ over the hard negatives. Figure 2 (left) shows an illustrating example.

In-passage negatives

While DPR is limited to use positive passages $x^{+}$ which contain the answer, DensePhrases is trained to predict that the positive phrase $x^{+}$ is the answer. Thus, the fine-grained structure of phrases allows for another source of negatives, in-passage negatives. In particular, DensePhrases augments the set of negatives $\mathcal{X}^{-}$ to encompass all phrases within the same passage that do not express the answer.Technically, DensePhrases treats start and end representations of phrases independently and use start (or end) representations other than the positive one as negatives. See Figure 2 (right) for an example. We hypothesize that these in-passage negatives achieve a similar effect as DPR’s hard negatives: They require the model to go beyond simple topic modeling since they share not only the same topic but also the same context. Our phrase-based passage retriever might benefit from this phrase-level supervision, which has already been shown to be useful in the context of distilling knowledge from reader to retriever (Izacard and Grave, 2021a; Yang and Seo, 2020).

2 Topical vs. Hard Negatives

To address our hypothesis, we would like to study how these different types of negatives used by DPR and DensePhrases affect their reliance on topical and fine-grained entailment cues. We characterize their passage retrieval based on two metrics (losses): $\mathcal{L}_{\text{topic}}$ and $\mathcal{L}_{\text{hard}}$ . We use Eq. (4) to define both $\mathcal{L}_{\text{topic}}$ and $\mathcal{L}_{\text{hard}}$ , but use different sets of negatives $\mathcal{X}^{-}$ . For $\mathcal{L}_{\text{topic}}$ , $\mathcal{X}^{-}$ contains passages that are topically different from the gold passage—In practice, we randomly sample passages from English Wikipedia. For $\mathcal{L}_{\text{hard}}$ , $\mathcal{X}^{-}$ uses negatives containing topically similar passages, such that $\mathcal{L}_{\text{hard}}$ estimates how accurately models locate a passage that contains the exact answer among topically similar passages. From a positive passage paired with a question, we create a single hard negative by removing the sentence that contains the answer.While $\mathcal{L}_{\text{hard}}$ with this type of hard negatives might favor DensePhrases, using BM25 hard negatives for $\mathcal{L}_{\text{hard}}$ would favor DPR since DPR was directly trained on BM25 hard negatives. Nonetheless, we observed similar trends in $\mathcal{L}_{\text{hard}}$ regardless of the choice of hard negatives. In our analysis, both metrics are estimated on the Natural Questions development set, which provides a set of questions and (gold) positive passages.

Hard negatives for DensePhrases?

We test two different kinds of hard negatives in DensePhrases to see whether its performance can further improve in the presence of in-passage negatives. For each training question, we mine for a hard negative passage, either by BM25 similarity or by finding another passage that contains the gold-answer phrase, but possibly with a wrong context. Then we use all phrases from the hard negative passage as additional hard negatives in $\mathcal{X}^{-}$ along with the existing in-passage negatives. As shown in Table 3, DensePhrases obtains no substantial improvements from additional hard negatives, indicating that in-passage negatives are already highly effective at producing good phrase (or passage) representations.

Improving Coarse-grained Retrieval

While we showed that DensePhrases implicitly learns passage retrieval, Figure 3 indicates that DensePhrases might not be very good for retrieval tasks where topic matters more than fine-grained entailment, for instance, the retrieval of a single evidence document for entity linking. In this section, we propose a simple method that can adapt DensePhrases to larger retrieval units, especially when the topical relevance is more important.

We modify the query-side fine-tuning proposed by Lee et al. (2021), which drastically improves the performance of DensePhrases by reducing the discrepancy between training and inference time. Since it is prohibitive to update the large number of phrase representations after indexing, only the query encoder is fine-tuned over the entire set of phrases in Wikipedia. Given a question $q$ and an annotated document set $\mathcal{D}^{*}$ , we minimize:

Datasets

We test DensePhrases trained with $\mathcal{L}_{\text{doc}}$ on entity linking (Hoffart et al., 2011; Guo and Barbosa, 2018) and knowledge-grounded dialogue (Dinan et al., 2019) tasks in KILT (Petroni et al., 2021). Entity linking contains three datasets: AIDA CoNLL-YAGO (AY2) (Hoffart et al., 2011), WNED-WIKI (WnWi) (Guo and Barbosa, 2018), and WNED-CWEB (WnCw) (Guo and Barbosa, 2018). Each query in entity linking datasets contains a named entity marked with special tokens (i.e., [START_ENT], [END_ENT]), which need to be linked to one of the Wikipedia articles. For knowledge-grounded dialogue, we use Wizard of Wikipedia (WoW) (Dinan et al., 2019) where each query consists of conversation history, and the generated utterances should be grounded in one of the Wikipedia articles. We follow the KILT guidelines and evaluate the document (i.e., Wikipedia article) retrieval performance of our models given each query. We use R-precision, the proportion of successfully retrieved pages in the top R results, where R is the number of distinct pages in the provenance set. However, in the tasks considered, R-precision is equivalent to precision@1, since each question is annotated with only one document.

Models

DensePhrases is trained with the original query-side fine-tuning loss (denoted as $\mathcal{L}_{\text{phrase}}$ ) or with $\mathcal{L}_{\text{doc}}$ as described in Eq. (5). When DensePhrases is trained with $\mathcal{L}_{\text{phrase}}$ , it labels any phrase that matches the title of gold document as positive. After training, DensePhrases returns the document that contains the top passage. For baseline retrieval methods, we report the performance of TF-IDF and DPR from Petroni et al. (2021). We also include a multi-task version of DPR and DensePhrases, which uses the entire KILT training datasets.We follow the same steps described in Petroni et al. (2021) for training the multi-task version of DensePhrases. While not our main focus of comparison, we also report the performance of other baselines from Petroni et al. (2021), which uses generative models (e.g., RAG (Lewis et al., 2020)) or task-specific models (e.g., BLINK (Wu et al., 2020), which has additional entity linking pre-training). Note that these methods use additional components such as a generative model or a cross-encoder model on top of retrieval models.

Results

Table 4 shows the results on three entity linking tasks and a knowledge-grounded dialogue task. On all tasks, we find that DensePhrases with $\mathcal{L}_{\text{doc}}$ performs much better than DensePhrases with $\mathcal{L}_{\text{phrase}}$ and also matches the performance of RAG that uses an additional large generative model to generate the document titles. Using $\mathcal{L}_{\text{phrase}}$ does very poorly since it focuses on phrase-level entailment, rather than document-level relevance. Compared to the multi-task version of DPR (i.e., DPR♣), DensePhrases- $\mathcal{L}_{\text{doc}}$ ♣ can be easily adapted to non-QA tasks like entity linking and generalizes better on tasks without training sets (WnWi, WnCw).

DensePhrases as a Multi-Vector Passage Encoder

In this section, we demonstrate that DensePhrases can be interpreted as a multi-vector passage encoder, which has recently been shown to be very effective for passage retrieval Luan et al. (2021); Khattab and Zaharia (2020). Since this type of multi-vector encoding models requires a large disk footprint, we show that we can control the number of vectors per passage (and hence the index size) through filtering. We also introduce quantization techniques to build more efficient phrase retrieval models without a significant performance drop.

Since we represent passages not by a single vector, but by a set of phrase vectors (decomposed as token-level start and end vectors, see Lee et al. (2021)), we notice similarities to previous work, which addresses the capacity limitations of dense, fixed-length passage encodings. While these approaches store a fixed number of vectors per passage (Luan et al., 2021; Humeau et al., 2020) or all token-level vectors (Khattab and Zaharia, 2020), phrase retrieval models store a dynamic number of phrase vectors per passage, where many phrases are filtered by a model trained on QA datasets.

Specifically, Lee et al. (2021) trains a binary classifier (or a phrase filter) to filter phrases based on their phrase representations. This phrase filter is supervised by the answer annotations in QA datasets, hence denotes candidate answer phrases. In our experiment, we tune the filter threshold to control the number of vectors per passage for passage retrieval.

2 Efficient Phrase Retrieval

The multi-vector encoding models as well as ours are prohibitively large since they contain multiple vector representations for every passage in the entire corpus. We introduce a vector quantization-based method that can safely reduce the size of our phrase index, without performance degradation.

Since the multi-vector encoding models are prohibitively large due to their multiple representations, we further introduce a vector quantization-based method that can safely reduce the size of our phrase index, without performance degradation. We use Product Quantization (PQ) (Jegou et al., 2010) where the original vector space is decomposed into the Cartesian product of subspaces. Using PQ, the memory usage of using $N$ number of $d$ -dimensional centroid vectors reduces from $Nd$ to $N^{1/M}d$ with $M$ subspaces while each database vector requires $\log_{2}N$ bits. Among different variants of PQ, we use Optimized Product Quantization (OPQ) (Ge et al., 2013), which learns an orthogonal matrix $R$ to better decompose the original vector space. See Ge et al. (2013) for more details on OPQ.

Quantization-aware training

While this type of aggressive vector quantization can significantly reduce memory usage, it often comes at the cost of performance degradation due to the quantization loss. To mitigate this problem, we use quantization-aware query-side fine-tuning motivated by the recent successes on quantization-aware training (Jacob et al., 2018). Specifically, during query-side fine-tuning, we reconstruct the phrase vectors using the trained (optimized) product quantizer, which are then used to minimize Eq. (5).

3 Experimental Results

In Figure 4, we present the top-5 passage retrieval accuracy with respect to the size of the phrase index in DensePhrases. First, applying OPQ can reduce the index size of DensePhrases from 307GB to 69GB, while the top-5 retrieval accuracy is poor without quantization-aware query-side fine-tuning. Furthermore, by tuning the threshold $\tau$ for the phrase filter, the number of vectors per each passage (# vec / $p$ ) can be reduced without hurting the performance significantly. The performance improves with a larger number of vectors per passage, which aligns with the findings of multi-vector encoding models (Khattab and Zaharia, 2020; Luan et al., 2021). Our results show that having 8.8 vectors per passage in DensePhrases has similar retrieval accuracy with DPR.

Related Work

Text retrieval has a long history in information retrieval, either for serving relevant information to users directly or for feeding them to computationally expensive downstream systems. While traditional research has focused on designing heuristics, such as sparse vector models like TF-IDF and BM25, it has recently become an active area of interest for machine learning researchers. This was precipitated by the emergence of open-domain QA as a standard problem setting (Chen et al., 2017) and the spread of the retriever-reader paradigm (Yang et al., 2019; Nie et al., 2019). The interest has spread to include a more diverse set of downstream tasks, such as fact checking (Thorne et al., 2018), entity-linking (Wu et al., 2020) or dialogue generation (Dinan et al., 2019), where the problems require access to large corpora or knowledge sources. Recently, REALM (Guu et al., 2020) and RAG (retrieval-augmented generation) (Lewis et al., 2020) have been proposed as general-purpose pre-trained models with explicit access to world knowledge through the retriever. There has also been a line of work to integrate text retrieval with structured knowledge graphs (Sun et al., 2018, 2019; Min et al., 2020). We refer to Lin et al. (2020) for a comprehensive overview of neural text retrieval methods.

Conclusion

In this paper, we show that phrase retrieval models also learn passage retrieval without any modification. By drawing connections between the objectives of DPR and DensePhrases, we provide a better understanding of how phrase retrieval learns passage retrieval, which is also supported by several empirical evaluations on multiple benchmarks. Specifically, phrase-based passage retrieval has better retrieval quality on top $k$ passages when $k$ is small, and this translates to an efficient use of passages for open-domain QA. We also show that DensePhrases can be fine-tuned for more coarse-grained retrieval units, serving as a basis for any retrieval unit. We plan to further evaluate phrase-based passage retrieval on standard information retrieval tasks such as MS MARCO.

Acknowledgements

We thank Chris Sciavolino, Xingcheng Yao, the members of the Princeton NLP group, and the anonymous reviewers for helpful discussion and valuable feedback. This research is supported by the James Mi *91 Research Innovation Fund for Data Science and gifts from Apple and Amazon. It was also supported in part by the ICT Creative Consilience program (IITP-2021-0-01819) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation) and National Research Foundation of Korea (NRF-2020R1A2C3010638).

Ethical Considerations

Models introduced in our work often use question answering datasets such as Natural Questions to build phrase or passage representations. Some of the datasets, like SQuAD, are created from a small number of popular Wikipedia articles, hence could make our model biased towards a small number of topics. We hope that inventing an alternative training method that properly regularizes our model could mitigate this problem. Although our efforts have been made to reduce the computational cost of retrieval models, using passage retrieval models as external knowledge bases will inevitably increase the resource requirements for future experiments. Further efforts should be made to make retrieval more affordable for independent researchers.

Introduction

Background

Phrase retrieval

Phrase Retrieval for Passage Retrieval

Models

Metrics

Results

2 Experiment: Open-domain QA

A Unified View of Dense Retrieval

Hard negatives

In-passage negatives

2 Topical vs. Hard Negatives

Hard negatives for DensePhrases?

Improving Coarse-grained Retrieval

Datasets

Models

Results

DensePhrases as a Multi-Vector Passage Encoder

2 Efficient Phrase Retrieval

Quantization-aware training

3 Experimental Results

Related Work

Conclusion

Acknowledgements

Ethical Considerations

References