Learning Dense Representations of Phrases at Scale

Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, Danqi Chen

Introduction

Open-domain question answering (QA) aims to provide answers to natural-language questions using a large text corpus (Voorhees et al., 1999; Ferrucci et al., 2010; Chen and Yih, 2020). While the dominant approach is a two-stage retriever-reader pipeline (Chen et al., 2017; Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020), we focus on a recent paradigm based solely on phrase retrieval (Seo et al., 2019; Lee et al., 2020). Phrase retrieval relies on phrase representations and finds answers purely through similarity search in the vector space of phrases. (Following previous work (Seo et al., 2018), ‘phrase’ denotes any contiguous segment of text up to $L$ words, including single words, which is not necessarily a linguistic phrase.) Because it does not rely on an expensive reader model to process text passages, phrase retrieval has demonstrated great runtime efficiency at inference time.

Despite great promise, it remains a formidable challenge to build vector representations for every single phrase in a large corpus. Since phrase representations are decomposed from question representations, they are inherently less expressive than those of cross-attention models (Devlin et al., 2019). Moreover, the approach requires retrieving answers correctly out of billions of phrases (e.g., $6\times 10^{10}$ phrases in English Wikipedia), making the scale of the learning problem difficult. Consequently, existing approaches rely heavily on sparse representations for locating relevant documents and paragraphs while still falling behind retriever-reader models (Seo et al., 2019; Lee et al., 2020).

In this work, we investigate whether we can build fully dense phrase representations at scale for open-domain QA. First, we aim to learn strong phrase representations from the supervision of reading comprehension tasks. We propose to use data augmentation and knowledge distillation to learn better phrase representations within a single passage. We then adopt negative sampling strategies such as in-batch negatives (Henderson et al., 2017; Karpukhin et al., 2020) to better discriminate the phrases at a larger scale. Here, we present a novel method called pre-batch negatives, which leverages preceding mini-batches as negative examples to compensate for the need for large-batch training. Lastly, we present a query-side fine-tuning strategy that drastically improves phrase retrieval performance and allows for transfer learning to new domains, without re-building billions of phrase representations.

Together, these improvements lead to a much stronger phrase retrieval model that does not use any sparse representations (Table 1). We evaluate our model, DensePhrases, on five standard open-domain QA datasets and achieve much better accuracies than previous phrase retrieval models (Seo et al., 2019; Lee et al., 2020), with 15%–25% absolute improvement on most datasets. Our model also matches the performance of state-of-the-art retriever-reader models (Guu et al., 2020; Karpukhin et al., 2020). Due to the removal of sparse representations and careful design choices, we further reduce the storage footprint for the full English Wikipedia from 1.5TB to 320GB and drastically improve throughput.

Finally, we envision that DensePhrases acts as a neural interface for retrieving phrase-level knowledge from a large text corpus. To showcase this possibility, we demonstrate that we can directly use DensePhrases for fact extraction, without re-building the phrase storage. With only fine-tuning the question encoder on a small number of subject-relation-object triples, we achieve state-of-the-art performance on two slot filling tasks (Petroni et al., 2021), using less than 5% of the training data.

Background

We first formulate the task of open-domain question answering for a set of $K$ documents $\mathcal{D}=\{d_{1},\dots,d_{K}\}$ . We follow recent work (Chen et al., 2017; Lee et al., 2019) and treat all of English Wikipedia as $\mathcal{D}$ , hence $K\approx 5\times 10^{6}$ . However, most approaches, including ours, are generic and could be applied to other collections of documents.

The task aims to provide an answer $\hat{a}$ for the input question $q$ based on $\mathcal{D}$ . In this work, we focus on the extractive QA setting, where each answer is a segment of text, or a phrase, that can be found in $\mathcal{D}$ . We denote the set of phrases in $\mathcal{D}$ as $\mathcal{S}(\mathcal{D})$ , where each phrase $s_{k}\in\mathcal{S}(\mathcal{D})$ consists of contiguous words $w_{\texttt{start}(k)},\ldots,w_{\texttt{end}(k)}$ in its document $d_{\texttt{doc}(k)}$ . In practice, we consider all the phrases up to $L=20$ words in $\mathcal{D}$ , so $\mathcal{S}(\mathcal{D})$ comprises $6\times 10^{10}$ phrases. An extractive QA system returns a phrase $\hat{s}=\operatorname*{argmax}_{s\in\mathcal{S}(\mathcal{D})}f(s|\mathcal{D},q)$ where $f$ is a scoring function. The system finally maps $\hat{s}$ to an answer string $\hat{a}$ : $\texttt{TEXT}{(\hat{s})}=\hat{a}$ , and evaluation is typically done by comparing the predicted answer $\hat{a}$ with a gold answer $a^{*}$ .
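To make the phrase set $\mathcal{S}(\mathcal{D})$ concrete, the following minimal sketch enumerates every contiguous segment of up to $L$ words in a toy document. The function name `enumerate_phrases` and the whitespace tokenization are assumptions made for illustration only and are not part of the system described here.

```python
def enumerate_phrases(words, max_len):
    """Return inclusive (start, end) index pairs of all phrases up to max_len words."""
    spans = []
    for start in range(len(words)):
        # end is inclusive, so the phrase covers words[start..end]
        for end in range(start, min(start + max_len, len(words))):
            spans.append((start, end))
    return spans


words = "Barack Obama was born in Hawaii".split()  # toy whitespace tokenization
spans = enumerate_phrases(words, max_len=3)
phrases = [" ".join(words[s:e + 1]) for s, e in spans]
```

A document of $n$ words yields roughly $n\times L$ phrases under this definition, which is why the phrase set is far larger than the word set.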

Although we focus on the extractive QA setting, recent works propose to use a generative model as the reader (Lewis et al., 2020; Izacard and Grave, 2021), or learn a closed-book QA model (Roberts et al., 2020), which directly predicts answers without using an external knowledge source. The extractive setting provides two advantages: first, the model directly locates the source of the answer, which is more interpretable, and second, phrase-level knowledge retrieval can be uniquely adapted to other NLP tasks as we show in §7.3.

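A dominating paradigm in open-domain QA is the retriever-reader approach (Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020), which leverages a first-stage document retriever $f_{\text{retr}}$ and only reads top $K^{\prime}\ll K$ documents with a reader model $f_{\text{read}}$ . The scoring function $f(s\mid\mathcal{D},q)$ is decomposed as:

$$f(s\mid\mathcal{D},q)=f_{\text{retr}}(\{d_{j_{1}},\ldots,d_{j_{K^{\prime}}}\}\mid\mathcal{D},q)\times f_{\text{read}}(s\mid\{d_{j_{1}},\ldots,d_{j_{K^{\prime}}}\},q), \tag{1}$$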

where $\{j_{1},\ldots,j_{K^{\prime}}\}\subset\{1,\ldots,K\}$ and if $s\notin\mathcal{S}(\{d_{j_{1}},\ldots,d_{j_{K^{\prime}}}\})$ , the score will be 0. This formulation can easily be adapted to passages and sentences (Yang et al., 2019; Wang et al., 2019). However, this approach suffers from error propagation when incorrect documents are retrieved and can be slow, as it usually requires running an expensive reader model on every retrieved document or passage at inference time.

Phrase retrieval.

Seo et al. (2019) introduce the phrase retrieval approach that encodes phrase and question representations independently and performs similarity search over the phrase representations to find an answer. Their scoring function $f$ is computed as follows:

$$f(s\mid\mathcal{D},q)=E_{s}(s,\mathcal{D})^{\top}E_{q}(q), \tag{2}$$

where $E_{s}$ and $E_{q}$ denote the phrase encoder and the question encoder respectively. Because the representations $E_{s}(\cdot)$ and $E_{q}(\cdot)$ are decomposable, the approach supports maximum inner product search (MIPS) and improves the efficiency of open-domain QA models. Previous approaches (Seo et al., 2019; Lee et al., 2020) leverage both dense and sparse vectors for phrase and question representations by taking their concatenation: $E_{s}(s,\mathcal{D})=[E_{\text{sparse}}(s,\mathcal{D}),E_{\text{dense}}(s,\mathcal{D})].$ Seo et al. (2019) use sparse representations of both paragraphs and documents and Lee et al. (2020) use contextualized sparse representations conditioned on the phrase. However, since the sparse vectors are difficult to parallelize with dense vectors, their method essentially conducts sparse and dense vector search separately. The goal of this work is to only use dense representations, i.e., $E_{s}(s,\mathcal{D})=E_{\text{dense}}(s,\mathcal{D})$ , which can model $f(s\mid\mathcal{D},q)$ solely with MIPS, as well as close the gap in performance.
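As an illustration of scoring phrases purely by inner product, the sketch below performs an exhaustive maximum inner product search over a handful of toy dense vectors. The function names and the toy vectors are hypothetical; a practical system would replace the exhaustive loop with an approximate MIPS index.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve(phrase_vectors, question_vector, top_k=1):
    """Exhaustive maximum inner product search; returns phrase indices, best first."""
    scores = [inner_product(p, question_vector) for p in phrase_vectors]
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return ranked[:top_k], scores


# Toy dense phrase vectors E_s(s, D) and one question vector E_q(q).
phrase_vectors = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
question_vector = [1.0, 0.0]
top, scores = retrieve(phrase_vectors, question_vector, top_k=2)
```

Because the phrase vectors do not depend on the question, they can be computed once offline and reused for every query.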

DensePhrases

We introduce DensePhrases, a phrase retrieval model that is built on fully dense representations. Our goal is to learn a phrase encoder as well as a question encoder, so we can pre-index all the possible phrases in $\mathcal{D}$ , and efficiently retrieve phrases for any question through MIPS at testing time. We outline our approach as follows:

We first learn a high-quality phrase encoder and an (initial) question encoder from the supervision of reading comprehension tasks (§4.1), and incorporate effective negative sampling to better discriminate phrases at scale (§4.2, §4.3).

Then, we fix the phrase encoder and encode all the phrases $s\in\mathcal{S}(\mathcal{D})$ and store the phrase indexing offline to enable efficient search (§5).

Finally, we introduce an additional strategy called query-side fine-tuning (§6), which further updates the question encoder. (In this paper, we use the terms question and query interchangeably, as our question encoder can be naturally extended to “unnatural” queries.) We find this step to be very effective, as it reduces the discrepancy between training (the first step) and inference and supports transfer learning to new domains.

Before we present the approach in detail, we first describe our base architecture below.

Base Architecture

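Each phrase is represented by the concatenation of the contextualized vectors of its start and end words, produced by a passage language model $\mathcal{M}_{p}$ , and each question is represented by the concatenation $[\mathbf{q}^{\text{start}},\mathbf{q}^{\text{end}}]$ of two vectors produced by the question language models $\mathcal{M}_{q,\text{start}}$ and $\mathcal{M}_{q,\text{end}}$ . A great advantage of this representation is that we eventually only need to index and store all the word vectors (we use $\mathcal{W}(\mathcal{D})$ to denote all the words in $\mathcal{D}$ ), instead of all the phrases $\mathcal{S}(\mathcal{D})$ , which makes the index at least one order of magnitude smaller.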
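The reason word vectors suffice can be checked numerically. The sketch below assumes that a phrase vector is the concatenation of its start and end word vectors and that the question vector is the concatenation of a start and an end query vector. Under that assumption the phrase score splits into two word-level inner products. All names and values are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy contextualized word vectors h_1..h_4 (d = 2) and question vectors.
H = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8], [0.1, 0.5]]
q_start, q_end = [1.0, 0.0], [0.0, 1.0]


def phrase_score_concat(i, j):
    # Score using an explicit 2d-dimensional phrase vector [h_i, h_j].
    return dot(H[i] + H[j], q_start + q_end)


def phrase_score_decomposed(i, j):
    # Same score computed from the stored word vectors only.
    return dot(H[i], q_start) + dot(H[j], q_end)
```

Both functions return the same value for every span, so no per-phrase vector ever needs to be stored.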

Note that we use pre-trained language models to initialize $\mathcal{M}_{p}$ , $\mathcal{M}_{q,\text{start}}$ and $\mathcal{M}_{q,\text{end}}$ and they are fine-tuned with the objectives that we will define later. In our pilot experiments, we found that SpanBERT (Joshi et al., 2020) leads to superior performance compared to BERT (Devlin et al., 2019). SpanBERT is designed to predict the information in the entire span from its two endpoints; therefore, it is well suited for our phrase representations. In our final model, we use SpanBERT-base-cased as the base LM for $E_{s}$ and $E_{q}$ , and hence $d=768$ . Our base model is largely inspired by DenSPI (Seo et al., 2019), although we deviate from theirs as follows. (1) We remove coherency scalars and don’t split any vectors. (2) DenSPI uses a shared encoder for phrases and questions while we use 3 separate language models initialized from the same pre-trained model. (3) We use SpanBERT instead of BERT. See Table 5 for an ablation study.

Learning Phrase Representations

In this section, we start by learning dense phrase representations from the supervision of reading comprehension tasks, i.e., a single passage $p$ contains an answer $a^{*}$ to a question $q$ . Our goal is to learn strong dense representations of phrases for $s\in\mathcal{S}(p)$ , which can be retrieved by a dense representation of the question and serve as a direct answer (§4.1). Then, we introduce two different negative sampling methods (§4.2, §4.3), which encourage the phrase representations to be better discriminated at the full Wikipedia scale. See Figure 1 for an overview of DensePhrases.

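To learn phrase representations in a single passage along with question representations, we first maximize the log-likelihood of the start and end positions of the gold phrase $s^{*}$ where $\texttt{TEXT}{(s^{*})}=a^{*}$ . The training loss for predicting the start position of a phrase given a question is computed as:

$$z^{\text{start}}_{1},\ldots,z^{\text{start}}_{m}=[\mathbf{h}_{1}^{\top}\mathbf{q}^{\text{start}},\ldots,\mathbf{h}_{m}^{\top}\mathbf{q}^{\text{start}}],\quad P^{\text{start}}=\operatorname{softmax}(z^{\text{start}}_{1},\ldots,z^{\text{start}}_{m}),\quad \mathcal{L}_{\text{start}}=-\log P^{\text{start}}_{\texttt{start}(s^{*})}. \tag{5}$$

We can define $\mathcal{L}_{\text{end}}$ in a similar way and the final loss for the single-passage training is

$$\mathcal{L}_{\text{single}}=\frac{\mathcal{L}_{\text{start}}+\mathcal{L}_{\text{end}}}{2}. \tag{6}$$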

This essentially learns reading comprehension without any cross-attention between the passage and the question tokens, which fully decomposes phrase and question representations.
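A minimal sketch of this single-passage objective is given below: the start and end logits are inner products between passage word vectors and the question vectors, and each loss is a softmax cross-entropy over passage positions. Averaging the start and end terms is an assumption of this sketch, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def position_loss(word_vectors, question_vector, gold_index):
    """Negative log-likelihood of the gold position under a softmax over passage words."""
    logits = [dot(h, question_vector) for h in word_vectors]
    return -log_softmax(logits)[gold_index]


def single_passage_loss(word_vectors, q_start, q_end, gold_start, gold_end):
    # Averaging the two terms is an assumption of this sketch.
    l_start = position_loss(word_vectors, q_start, gold_start)
    l_end = position_loss(word_vectors, q_end, gold_end)
    return 0.5 * (l_start + l_end)
```

Note that the passage word vectors never see the question: the only interaction between the two sides is the inner product.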

Since the contextualized word representations $\mathbf{h}_{1},\dots,\mathbf{h}_{m}$ are encoded in a query-agnostic way, they are always inferior to query-dependent representations in cross-attention models (Devlin et al., 2019), where passages are fed along with the questions, concatenated by a special token such as [SEP]. We hypothesize that one key reason for the performance gap is that reading comprehension datasets only provide a few annotated questions in each passage, compared to the set of possible answer phrases. Such sparse supervision makes it hard to differentiate similar phrases in one passage (e.g., $s^{*}=$ Charles, Prince of Wales and another $s=$ Prince George for a question $q=$ Who is next in line to be the monarch of England?).

Following this intuition, we propose to use a simple model to generate additional questions for data augmentation, based on a T5-large model (Raffel et al., 2020). To train the question generation model, we feed a passage $p$ with the gold answer $s^{*}$ highlighted by inserting surrounding special tags. Then, the model is trained to maximize the log-likelihood of the question words of $q$ . After training, we extract all the named entities in each training passage as candidate answers and feed the passage $p$ with each candidate answer to generate questions. We keep the question-answer pairs only when a cross-attention reading comprehension model (SpanBERT-large, 88.2 EM on SQuAD) makes a correct prediction on the generated pair. The remaining generated QA pairs $\{(\bar{q}_{1},\bar{s}_{1}),(\bar{q}_{2},\bar{s}_{2}),\ldots,(\bar{q}_{r},\bar{s}_{r})\}$ are added directly to the original training set.
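The filtering step can be sketched as follows. The `reader` callable is a hypothetical stand-in for a cross-attention reading comprehension model, and the case-insensitive exact-match comparison is an assumption of this sketch rather than the criterion actually used.

```python
def filter_generated_pairs(passage, generated_pairs, reader):
    """Keep (question, answer) pairs whose answer the reader model reproduces."""
    kept = []
    for question, answer in generated_pairs:
        prediction = reader(passage, question)  # hypothetical reader interface
        if prediction.strip().lower() == answer.strip().lower():
            kept.append((question, answer))
    return kept


# Stand-in reader for demonstration only: a lookup table instead of a real model.
def toy_reader(passage, question):
    answers = {"Who wrote Hamlet?": "William Shakespeare"}
    return answers.get(question, "")


passage = "Hamlet is a tragedy written by William Shakespeare around 1600."
pairs = [("Who wrote Hamlet?", "William Shakespeare"),
         ("When was Hamlet banned?", "1600")]
kept = filter_generated_pairs(passage, pairs, toy_reader)
```

Only the first pair survives, since the toy reader cannot reproduce the answer to the second, ill-posed question.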

Distillation

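We also propose improving the phrase representations by distilling knowledge from a cross-attention model (Hinton et al., 2015). We minimize the Kullback–Leibler divergence between the probability distribution from our phrase encoder and that from a standard SpanBERT-base QA model. The loss is computed as follows:

$$\mathcal{L}_{\text{distill}}=\operatorname{KL}(P^{\text{start}}\,\|\,P^{\text{start}}_{c})+\operatorname{KL}(P^{\text{end}}\,\|\,P^{\text{end}}_{c}), \tag{7}$$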

where $P^{\text{start}}$ (and $P^{\text{end}}$ ) is defined in Eq. (5) and $P^{\text{start}}_{c}$ and $P^{\text{end}}_{c}$ denote the probability distributions used to predict the start and end positions of answers in the cross-attention model.
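The distillation term can be sketched as the sum of two KL divergences between position distributions. The direction of the divergence (student relative to teacher) and the small `eps` smoothing constant are assumptions of this sketch, and the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_start, student_end, teacher_start, teacher_end):
    # Direction KL(student || teacher) is an assumption of this sketch.
    return (kl_divergence(softmax(student_start), softmax(teacher_start))
            + kl_divergence(softmax(student_end), softmax(teacher_end)))
```

The loss is zero when the query-agnostic model reproduces the cross-attention model's start and end distributions exactly, and grows as they diverge.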

In-batch Negatives

Eventually, we need to build phrase representations for billions of phrases. Therefore, a bigger challenge is to incorporate more phrases as negatives so the representations can be better discriminated at a larger scale. While Seo et al. (2019) simply sample two negative passages based on question similarity, we use in-batch negatives for our dense phrase representations, which have been shown to be effective in learning dense passage representations (Karpukhin et al., 2020).

As shown in Figure 2 (a), for the $i$ -th example in a mini-batch of size $B$ , we denote the hidden representations of the gold start and end positions $\mathbf{h}_{\texttt{start}(s^{*})}$ and $\mathbf{h}_{\texttt{end}(s^{*})}$ as $\mathbf{g}^{\text{start}}_{i}$ and $\mathbf{g}^{\text{end}}_{i}$ , as well as the question representation as $[\mathbf{q}^{\text{start}}_{i},\mathbf{q}^{\text{end}}_{i}]$ . Let $\mathbf{G}^{\text{start}},\mathbf{G}^{\text{end}},\mathbf{Q}^{\text{start}},\mathbf{Q}^{\text{end}}$ be the $B\times d$ matrices whose $i$ -th rows are $\mathbf{g}^{\text{start}}_{i},\mathbf{g}^{\text{end}}_{i},\mathbf{q}^{\text{start}}_{i},\mathbf{q}^{\text{end}}_{i}$ respectively. We can then treat all the gold phrases from other passages in the same mini-batch as negative examples. We compute $\mathbf{S}^{\text{start}}={\mathbf{Q}^{\text{start}}}{\mathbf{G}^{\text{start}}}^{\intercal}$ and $\mathbf{S}^{\text{end}}={\mathbf{Q}^{\text{end}}}{\mathbf{G}^{\text{end}}}^{\intercal}$ ; the $i$ -th rows of $\mathbf{S}^{\text{start}}$ and $\mathbf{S}^{\text{end}}$ contain $B$ scores each, including a positive score and $B-1$ negative scores: $s^{\text{start}}_{1},\ldots,s^{\text{start}}_{B}$ and $s^{\text{end}}_{1},\ldots,s^{\text{end}}_{B}$ . Similar to Eq. (5), we can compute the loss function for the $i$ -th example as:

$$P^{\text{start}}_{i}=\operatorname{softmax}(s^{\text{start}}_{1},\ldots,s^{\text{start}}_{B}),\quad P^{\text{end}}_{i}=\operatorname{softmax}(s^{\text{end}}_{1},\ldots,s^{\text{end}}_{B}),\quad \mathcal{L}_{\text{neg}}=-\frac{\log P^{\text{start}}_{i}+\log P^{\text{end}}_{i}}{2}. \tag{8}$$

We also attempted using non-gold phrases from other passages as negatives but did not find a meaningful improvement.
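The in-batch computation can be sketched with plain lists standing in for the $B\times d$ matrices. Row $i$ of each score matrix compares question $i$ with the gold phrase vectors of every example in the batch, and entry $i$ of that row is the positive. Averaging the start and end terms and averaging over the batch are assumptions of this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def in_batch_loss(Q_start, Q_end, G_start, G_end):
    """Mean loss over a mini-batch; row i scores question i against all gold phrases."""
    B = len(Q_start)
    total = 0.0
    for i in range(B):
        s_start = [dot(Q_start[i], g) for g in G_start]  # i-th row of S^start
        s_end = [dot(Q_end[i], g) for g in G_end]        # i-th row of S^end
        # Entry i is the positive; the other B - 1 entries are negatives.
        total += -0.5 * (log_softmax(s_start)[i] + log_softmax(s_end)[i])
    return total / B
```

No extra encoder passes are needed: the negatives are the gold phrase vectors already computed for the other examples in the batch.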

Pre-batch Negatives

The in-batch negatives usually benefit from a large batch size (Karpukhin et al., 2020). However, it is challenging to further increase batch sizes, as they are bounded by the size of GPU memory. Next, we propose a novel negative sampling method called pre-batch negatives, which can effectively utilize the representations from the preceding $C$ mini-batches (Figure 2 (b)). In each iteration, we maintain a FIFO queue of $C$ mini-batches to cache phrase representations $\mathbf{G}^{\text{start}}$ and $\mathbf{G}^{\text{end}}$ . The cached phrase representations are then used as negative samples for the next iteration, providing $B\times C$ additional negative samples in total. (This approach is inspired by the momentum contrast idea proposed in unsupervised visual representation learning (He et al., 2020). Contrary to their approach, we have separate encoders for phrases and questions and back-propagate to both during training without a momentum update.)

These pre-batch negatives are used together with in-batch negatives and the training loss is the same as Eq. (8), except that the gradients are not back-propagated to the cached pre-batch negatives. After warming up the model with in-batch negatives, we simply shift from in-batch negatives ( $B-1$ negatives) to in-batch and pre-batch negatives (hence a total number of $B\times C+B-1$ negatives). For simplicity, we use $\mathcal{L}_{\text{neg}}$ to denote the loss for both in-batch negatives and pre-batch negatives. Since we do not retain the computational graph for pre-batch negatives, the memory consumption of pre-batch negatives is much more manageable while allowing an increase in the number of negative samples.
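A minimal sketch of the cache behind pre-batch negatives is shown below, using a bounded `collections.deque` as the FIFO queue. The class name `PreBatchCache` and the toy vectors are hypothetical. Storing plain copies mirrors the fact that no gradient flows into cached vectors.

```python
from collections import deque


class PreBatchCache:
    """FIFO cache of gold phrase vectors from the preceding C mini-batches."""

    def __init__(self, num_batches):
        self.queue = deque(maxlen=num_batches)  # oldest batch is dropped automatically

    def negatives(self):
        # Flatten cached batches into one list of negative vectors.
        return [vec for batch in self.queue for vec in batch]

    def push(self, gold_vectors):
        # Store plain copies so no gradient would flow into cached vectors.
        self.queue.append([list(v) for v in gold_vectors])


B, C = 84, 2
cache = PreBatchCache(num_batches=C)
for step in range(5):
    batch = [[float(step), float(k)] for k in range(B)]  # toy gold phrase vectors
    extra_negatives = cache.negatives()  # used alongside the B - 1 in-batch negatives
    cache.push(batch)
```

With $B=84$ and $C=2$, a full cache contributes $84\times 2=168$ negatives, which together with the $83$ in-batch negatives gives $251$ negatives per example.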

Training Objective

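Finally, we optimize all three losses together, on both annotated reading comprehension examples and generated questions from §4.1:

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{single}}+\lambda_{2}\mathcal{L}_{\text{distill}}+\lambda_{3}\mathcal{L}_{\text{neg}}, \tag{9}$$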

where $\lambda_{1},\lambda_{2},\lambda_{3}$ determine the importance of each loss term. We found that $\lambda_{1}=1$ , $\lambda_{2}=2$ , and $\lambda_{3}=4$ work well in practice. See Table 5 and Table 6 for an ablation study of different components.

Indexing and Search

Search

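For a given question $q$ , we can find the answer $\hat{s}$ as follows:

$$\hat{s}=\operatorname*{argmax}_{s_{(i,j)}}E_{s}(s_{(i,j)},\mathcal{D})^{\top}E_{q}(q)=\operatorname*{argmax}_{s_{(i,j)}}\left[(\mathbf{H}\mathbf{q}^{\text{start}})_{i}+(\mathbf{H}\mathbf{q}^{\text{end}})_{j}\right], \tag{10}$$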

where $s_{(i,j)}$ denotes a phrase with start and end indices as $i$ and $j$ in the index $\mathbf{H}$ . We can compute the $\operatorname*{argmax}$ of $\mathbf{H}\mathbf{q}^{\text{start}}$ and $\mathbf{H}\mathbf{q}^{\text{end}}$ efficiently by performing MIPS over $\mathbf{H}$ with $\mathbf{q}^{\text{start}}$ and $\mathbf{q}^{\text{end}}$ . In practice, we search for the top- $k$ start and top- $k$ end positions separately and perform a constrained search over their end and start positions respectively such that $1\leq i\leq j<i+L\leq|\mathcal{W}(\mathcal{D})|$ .
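One way to read the constrained search is sketched below with 0-based indices: take the top-$k$ start positions and pair each with every admissible end, take the top-$k$ end positions and pair each with every admissible start, then return the best-scoring pair. This is an illustrative interpretation with exhaustive scoring in place of a MIPS index, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)[:k]


def search_phrase(H, q_start, q_end, max_len, k=2):
    """Return ((i, j), score) maximizing start + end score with i <= j < i + max_len."""
    start_scores = [dot(h, q_start) for h in H]
    end_scores = [dot(h, q_end) for h in H]
    n = len(H)
    candidates = set()
    for i in top_k_indices(start_scores, k):      # best starts, any valid end
        for j in range(i, min(i + max_len, n)):
            candidates.add((i, j))
    for j in top_k_indices(end_scores, k):        # best ends, any valid start
        for i in range(max(0, j - max_len + 1), j + 1):
            candidates.add((i, j))
    best = max(candidates, key=lambda ij: start_scores[ij[0]] + end_scores[ij[1]])
    return best, start_scores[best[0]] + end_scores[best[1]]
```

The length constraint guarantees that every returned pair corresponds to a phrase of at most `max_len` words.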

Query-side Fine-tuning

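So far, we have created a phrase dump $\mathbf{H}$ that supports efficient MIPS. In this section, we propose a novel method called query-side fine-tuning, which only updates the question encoder $E_{q}$ to correctly retrieve a desired answer $a^{*}$ for a question $q$ given $\mathbf{H}$ . Formally speaking, we optimize the marginal log-likelihood of the gold answer $a^{*}$ for a question $q$ , which resembles the weakly-supervised QA setting in previous work (Lee et al., 2019; Min et al., 2019). For every question $q$ , we retrieve top $k$ phrases and minimize the objective:

$$\mathcal{L}_{\text{query}}=-\log\frac{\sum_{s\in\tilde{\mathcal{S}}(q),\,\texttt{TEXT}(s)=a^{*}}\exp\big(f(s\mid\mathcal{D},q)\big)}{\sum_{s\in\tilde{\mathcal{S}}(q)}\exp\big(f(s\mid\mathcal{D},q)\big)}, \tag{11}$$

where $\tilde{\mathcal{S}}(q)$ denotes the set of top $k$ phrases retrieved for $q$ .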

There are several advantages to doing this: (1) we find that query-side fine-tuning can reduce the discrepancy between training and inference, and hence improve the final performance substantially (§8). Even with effective negative sampling, the model only sees a small portion of passages compared to the full scale of $\mathcal{D}$ and this training objective can effectively fill in the gap. (2) This training strategy allows for transfer learning to unseen domains, without rebuilding the entire phrase index. More specifically, the model is able to quickly adapt to new QA tasks (e.g., WebQuestions) when the phrase dump is built using SQuAD or Natural Questions. We also find that this can transfer to non-QA tasks when the query is written in a different format. In Section 7.3, we show the possibility of directly using DensePhrases for slot filling tasks by using a query such as (Michael Jackson, is a singer of, x). In this regard, we can view our model as a dense knowledge base that can be accessed by many different types of queries and that returns phrase-level knowledge efficiently.
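The query-side objective can be sketched as a marginal log-likelihood over the retrieved phrases: every retrieved phrase whose text equals the gold answer counts as correct. Returning `None` when no retrieved phrase matches is an assumption of this sketch, as are the function name and the input format.

```python
import math


def query_side_loss(retrieved, gold_answer):
    """Negative marginal log-likelihood of the gold answer among top-k retrieved phrases.

    retrieved: list of (phrase_text, score) pairs returned by the fixed phrase index.
    Returns None when no retrieved phrase matches, since the loss is then undefined.
    """
    m = max(score for _, score in retrieved)  # stabilize the exponentials
    total = sum(math.exp(score - m) for _, score in retrieved)
    gold = sum(math.exp(score - m) for text, score in retrieved if text == gold_answer)
    if gold == 0.0:
        return None
    return -math.log(gold / total)
```

Only the scores depend on trainable parameters here, and they change only through the question vector, so the phrase index itself never needs to be rebuilt.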

Experiments

We use two reading comprehension datasets, SQuAD (Rajpurkar et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019), to learn phrase representations, in which a single gold passage is provided for each question. For the open-domain QA experiments, we evaluate our approach on five popular open-domain QA datasets: Natural Questions, WebQuestions (WQ) (Berant et al., 2013), CuratedTREC (TREC) (Baudiš and Šedivỳ, 2015), TriviaQA (TQA) (Joshi et al., 2017), and SQuAD. Note that we only use SQuAD and/or NQ to build the phrase index and perform query-side fine-tuning (§6) for other datasets.

We also evaluate our model on two slot filling tasks, to show how to adapt DensePhrases to other knowledge-intensive NLP tasks. We focus on two slot filling datasets from the KILT benchmark (Petroni et al., 2021): T-REx (Elsahar et al., 2018) and zero-shot relation extraction (Levy et al., 2017). Each query is provided in the form of “{subject entity} [SEP] {relation}” and the answer is the object entity. Appendix C provides the statistics of all the datasets.

Implementation details.

We denote the training datasets used for reading comprehension (Eq. (9)) as $\mathcal{C}_{\text{phrase}}$ . For open-domain QA, we train two versions of phrase encoders, on $\mathcal{C}_{\text{phrase}}=\{\text{SQuAD}\}$ and $\{\text{NQ},\text{SQuAD}\}$ , respectively. We build the phrase dump $\mathbf{H}$ for the 2018-12-20 Wikipedia snapshot and perform query-side fine-tuning on each dataset using Eq. (11). For slot filling, we use the same phrase dump as for open-domain QA, with $\mathcal{C}_{\text{phrase}}$ $=\{\text{NQ},\text{SQuAD}\}$ , and perform query-side fine-tuning on randomly sampled 5K or 10K training examples to see how rapidly our model adapts to the new query types. See Appendix D for details on the hyperparameters and Appendix A for an analysis of computational cost.

Experiments: Question Answering

To show the effectiveness of our phrase representations, we first evaluate our model in the reading comprehension setting for SQuAD and NQ and compare its performance with other query-agnostic models (Eq. (9) without query-side fine-tuning). This problem was originally formulated by Seo et al. (2018) as the phrase-indexed question answering (PIQA) task.

Compared to previous query-agnostic models, our model achieves the best performance of 78.3 EM on SQuAD by improving the previous phrase retrieval model (DenSPI) by $4.7\%$ (Table 2). Although it is still behind cross-attention models, the gap has been greatly reduced and serves as a strong starting point for the open-domain QA model.

Open-domain QA.

Experimental results on open-domain QA are summarized in Table 3. Without any sparse representations, DensePhrases outperforms previous phrase retrieval models by a large margin and achieves a $15\%$ – $25\%$ absolute improvement on all datasets except SQuAD. Training the model of Lee et al. (2020) on $\mathcal{C}_{\text{phrase}}=\{\text{NQ},\text{SQuAD}\}$ only increases the result from 14.5% to 16.5% on NQ, demonstrating that it does not suffice to simply add more datasets for training phrase representations. Our performance is also competitive with recent retriever-reader models (Karpukhin et al., 2020), while running much faster during inference (Table 1).

Experiments: Slot Filling

Table 4 summarizes the results on the two slot filling datasets, along with the baseline scores provided by Petroni et al. (2021). The only extractive baseline is DPR + BERT, which performs poorly in zero-shot relation extraction. On the other hand, our model achieves competitive performance on all datasets and achieves state-of-the-art performance on two datasets using only 5K training examples.

Analysis

Table 5 shows the ablation results of our model on SQuAD. Given our choice of architecture, augmenting the training set with generated questions (QG = ✓) and performing distillation from cross-attention models (Distill = ✓) improve performance up to EM = 78.3. We attempted adding the generated questions to the training of the SpanBERT-QA model but found only a 0.3% improvement, which validates that data sparsity is a bottleneck for query-agnostic models.

Effect of batch negatives.

We further evaluate the effectiveness of various negative sampling methods introduced in §4.2 and §4.3. Since it is computationally expensive to test each setting at the full Wikipedia scale, we use a smaller text corpus $\mathcal{D}_{\text{small}}$ of all the gold passages in the development sets of Natural Questions for the ablation study. Empirically, we find that results are generally well correlated when we gradually increase the size of $\mathcal{D}$ . As shown in Table 6, both in-batch and pre-batch negatives bring substantial improvements. While using a larger batch size ( $B=84$ ) is beneficial for in-batch negatives, the number of preceding batches in pre-batch negatives is optimal when $C=2$ . Surprisingly, the pre-batch negatives also improve the performance when $\mathcal{D}=\{p\}$ .

Effect of query-side fine-tuning.

We summarize the effect of query-side fine-tuning in Table 7. For the datasets that were not used for training the phrase encoders (TQA, WQ, TREC), we observe a 15% to 20% improvement after query-side fine-tuning. Even for the datasets that have been used (NQ, SQuAD), it leads to significant improvements (e.g., 32.6% $\rightarrow$ 40.9% on NQ for $\mathcal{C}_{\text{phrase}}$ = {NQ}), which clearly demonstrates that it can effectively reduce the discrepancy between training and inference.

Related Work

Learning effective dense representations of words is a long-standing goal in NLP (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2019). Beyond words, dense representations of many different granularities of text such as sentences (Le and Mikolov, 2014; Kiros et al., 2015) or documents (Yih et al., 2011) have been explored. While dense phrase representations have also been studied for statistical machine translation (Cho et al., 2014) or syntactic parsing (Socher et al., 2010), our work focuses on learning dense phrase representations for QA and any other knowledge-intensive tasks where phrases can be easily retrieved by performing MIPS.

This type of dense retrieval has also been studied for sentence and passage retrieval (Humeau et al., 2019; Karpukhin et al., 2020) (see Lin et al., 2020 for recent advances in dense retrieval). While DensePhrases is explicitly designed to retrieve phrases that can be used as an answer to given queries, retrieving phrases also naturally entails retrieving larger units of text, provided the datastore maintains the mapping between each phrase and the sentence and passage in which it occurs.

Conclusion

In this study, we show that we can learn dense representations of phrases at the Wikipedia scale, which are readily retrievable for open-domain QA and other knowledge-intensive NLP tasks. We learn both phrase and question encoders from the supervision of reading comprehension tasks and introduce two batch-negative techniques to better discriminate phrases at scale. We also introduce query-side fine-tuning that adapts our model to different types of queries. We achieve strong performance on five popular open-domain QA datasets, while reducing the storage footprint and improving latency significantly. We also achieve strong performance on two slot filling datasets using only a small number of training examples, showing the possibility of utilizing our DensePhrases as a knowledge base.

Acknowledgments

We thank Sewon Min, Hyunjae Kim, Gyuwan Kim, Jungsoo Park, Zexuan Zhong, Dan Friedman, Chris Sciavolino for providing valuable comments and feedback. This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021) and National Research Foundation of Korea (NRF-2020R1A2C3010638). It was also partly supported by the James Mi *91 Research Innovation Fund for Data Science and an Amazon Research Award.

Ethical Considerations

Our work builds on standard reading comprehension datasets such as SQuAD to build phrase representations. SQuAD, in particular, is created from a small number of Wikipedia articles sampled from top-10,000 most popular articles (measured by PageRanks), hence some of our models trained only on SQuAD could be easily biased towards the small number of topics that SQuAD contains. We hope that excluding such datasets during training or inventing an alternative pre-training procedure for learning phrase representations could mitigate this problem. Although most of our efforts have been made to reduce the computational complexity of previous phrase retrieval models (further detailed in Appendices A and E), leveraging our phrase retrieval model as a knowledge base will inevitably increase the minimum requirement for the additional experiments. We plan to apply vector quantization techniques to reduce the additional cost of using our model as a KB.

References

Appendix A Computational Cost

We describe the resources and time spent during inference (Tables 1 and A.1) and indexing (Table A.1). With our limited GPU resources (24GB $\times$ 4), it takes about 20 hours to index all the phrase representations. We also largely reduced the storage from 1,547GB to 320GB by (1) removing sparse representations and (2) using our sharing and split strategy. See Appendix E for the details on the reduction of storage footprint and Appendix B for the specification of our server for the benchmark.

Appendix B Server Specifications for Benchmark

To compare the complexity of open-domain QA models, we install all models in Table 1 on the same server using their public open-source code. Our server has the following specifications:

For DPR, due to its large memory consumption, we use a similar server with a 24GB GPU (TITAN RTX). For all models, we use 1,000 randomly sampled questions from the Natural Questions development set for the speed benchmark and measure #Q/sec. We set the batch size to 64 for all models except BERTSerini, ORQA and REALM, which do not allow a batch size of more than 1 in their open-source implementations. #Q/sec for DPR includes retrieving passages and running a reader model, and the batch size for the reader model is set to 8 to fit in the 24GB GPU (the retriever batch size is still 64). For other hyperparameters, we use the default settings of each model. We also exclude the time and the number of questions in the first five iterations for warming up each model. Note that despite our effort to match the environment of each model, their latency can be affected by various settings in their implementations such as the choice of library (PyTorch vs. TensorFlow).
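The measurement protocol can be sketched as a small timing harness that excludes warm-up iterations from both the elapsed time and the question count. The function name, the `answer_batch` callable, and the injectable `clock` parameter are assumptions made for illustration and do not describe the benchmark code actually used.

```python
import time


def questions_per_second(answer_batch, batches, warmup=5, clock=time.perf_counter):
    """Measure throughput, skipping the first `warmup` batches in both time and count."""
    elapsed, counted = 0.0, 0
    for step, batch in enumerate(batches):
        start = clock()
        answer_batch(batch)
        duration = clock() - start
        if step >= warmup:
            elapsed += duration
            counted += len(batch)
    return counted / elapsed if elapsed > 0 else float("nan")
```

Passing a custom `clock` makes the harness deterministic to test, while the default uses a monotonic high-resolution timer.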

Appendix C Data Statistics and Pre-processing

In Table C.3, we show the statistics of five open-domain QA datasets and two slot filling datasets. Pre-processed open-domain QA datasets are provided by Chen et al. (2017) except Natural Questions and TriviaQA. We use a version of Natural Questions and TriviaQA provided by Min et al. (2019) and Lee et al. (2019), which are pre-processed for the open-domain QA setting. Slot filling datasets are provided by Petroni et al. (2021). We use two reading comprehension datasets (SQuAD and Natural Questions) for training our model on Eq. (9). For SQuAD, we use the original dataset provided by the authors (Rajpurkar et al., 2016). For Natural Questions (Kwiatkowski et al., 2019), we use the pre-processed version provided by Asai et al. (2020) (https://github.com/AkariAsai/learning_to_retrieve_reasoning_paths). We use the short answer as a ground truth answer $a^{*}$ and its long answer as a gold passage $p$ . We also match the gold passages in Natural Questions to the paragraphs in Wikipedia whenever possible. Since we want to check the performance changes of our model with the growing number of tokens, we follow the same split (train/dev/test) used in Natural Questions-Open for the reading comprehension setting as well. During the validation of our model and baseline models, we exclude samples whose answers lie in a list or a table from a Wikipedia article.

Appendix D Hyperparameters

We use the Adam optimizer (Kingma and Ba, 2015) in all our experiments. For training our phrase and question encoders with Eq. (9), we use a learning rate of 3e-5 and the norm of the gradient is clipped at 1. We use a batch size of $B=84$ and train each model for 4 epochs for all datasets, where the loss of pre-batch negatives is applied in the last two epochs. We use SQuAD to train our QG model (the quality of generated questions from a QG model trained on Natural Questions is worse due to the ambiguity of information-seeking questions) and use spaCy (https://spacy.io/) for extracting named entities in each training passage, which are used to generate questions. The number of generated questions is 327,302 and 1,126,354 for SQuAD and Natural Questions, respectively. The number of preceding batches $C$ is set to 2.
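To make the role of $C$ and the epoch schedule concrete, the following is a minimal sketch of the bookkeeping behind pre-batch negatives. It assumes that the phrase vectors of the preceding $C$ mini-batches are cached and reused as negatives for the current mini-batch; the names `PreBatchCache` and `use_pre_batch` are hypothetical, and the loss computation itself is omitted.

```python
from collections import deque


class PreBatchCache:
    """Keeps the phrase vectors of the last `num_preceding` mini-batches."""

    def __init__(self, num_preceding=2):
        # A bounded deque drops the oldest mini-batch automatically.
        self._batches = deque(maxlen=num_preceding)

    def negatives(self):
        """Return all cached vectors, oldest mini-batch first."""
        return [vec for batch in self._batches for vec in batch]

    def push(self, phrase_vectors):
        """Cache the vectors of the mini-batch that was just processed."""
        self._batches.append(list(phrase_vectors))


def use_pre_batch(epoch, total_epochs=4, last_n=2):
    """True if the pre-batch loss is active in this 0-indexed epoch."""
    return epoch >= total_epochs - last_n
```

In a training loop under these assumptions, one would read `negatives()` before the forward pass of each mini-batch and call `push()` afterwards, so that a mini-batch never serves as its own pre-batch negative.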

For the query-side fine-tuning with Eq. (11), we use a learning rate of 3e-5 and the norm of the gradient is clipped at 1. We use a batch size of 12 and train each model for 10 epochs for all datasets. The top-$k$ for Eq. (11) is set to 100. While we use a single 24GB GPU (TITAN RTX) for training the phrase encoders with Eq. (9), query-side fine-tuning is relatively cheap and uses a single 12GB GPU (TITAN Xp). Using the development set, we select the best performing model (based on EM) for each dataset, which is then evaluated on the corresponding test set. Since SpanBERT only supports cased models, we also truecase the questions (Lita et al., 2003) that are originally provided in lowercase (Natural Questions and WebQuestions).

Appendix E Reducing Storage Footprint

As shown in Table 1, we have reduced the storage footprint from 1,547GB (Lee et al., 2020) to 320GB. We detail how we can reduce the storage footprint in addition to the several techniques introduced by Seo et al. (2019).

First, following Seo et al. (2019), we apply a linear transformation on the passage token representations to obtain a set of filter logits, which can be used to filter many token representations from $\mathcal{W}(\mathcal{D})$ . This filter layer is supervised by applying the binary cross entropy with the gold start/end positions (trained together with Eq. (9)). We tune the threshold for the filter logits on the reading comprehension development set to the point where the performance does not drop significantly while maximally filtering tokens. In the full Wikipedia setting, we filter about 75% of tokens and store 770M token representations.
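The threshold selection described above can be sketched as a simple search over candidate thresholds. This is an illustration under stated assumptions: the tolerance `max_drop` is a hypothetical stand-in for "does not drop significantly", and the function name and the tuple layout of `candidates` are introduced here for exposition only.

```python
def pick_filter_threshold(candidates, baseline_em, max_drop=0.5):
    """Pick the threshold that filters the most tokens within tolerance.

    `candidates` is a list of (threshold, dev_em, fraction_filtered)
    tuples measured on the reading comprehension development set.
    `max_drop` (in EM points) is an assumed tolerance, not a value
    taken from our experiments.
    """
    acceptable = [c for c in candidates if baseline_em - c[1] <= max_drop]
    if not acceptable:
        return None  # No threshold keeps the performance within tolerance.
    # Among acceptable thresholds, prefer the one that filters the most.
    return max(acceptable, key=lambda c: c[2])[0]
```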

Second, in our architecture, we use a base model (SpanBERT-base) for a smaller dimension of token representations ($d=768$) and do not use any sparse representations, including tf-idf or contextualized sparse representations (Lee et al., 2020). We also use scalar quantization to store float32 vectors as int4 during indexing.
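As an illustration of how scalar quantization shrinks each float32 value to 4 bits, the sketch below uses a uniform quantizer over a fixed range and packs two codes per byte. It is not the quantizer of our index: the uniform scheme, the range arguments `lo` and `hi`, and all function names are assumptions made for exposition.

```python
def quantize_int4(vector, lo, hi):
    """Map floats to integer codes 0..15 with a uniform step over [lo, hi]."""
    scale = (hi - lo) / 15
    # Values outside [lo, hi] are clipped to the nearest end code.
    return [min(15, max(0, round((x - lo) / scale))) for x in vector]


def dequantize_int4(codes, lo, hi):
    """Approximately recover the floats from their 4-bit codes."""
    scale = (hi - lo) / 15
    return [lo + c * scale for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte (zero-padded if odd length)."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)
    return bytes((codes[i] << 4) | codes[i + 1]
                 for i in range(0, len(codes), 2))
```

Under this scheme, a $d=768$ vector occupies 384 bytes instead of the 3,072 bytes needed for float32, at the cost of a reconstruction error of at most half a quantization step for values inside the range.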

Lastly, since the inference in Eq. (10) is purely based on MIPS, we do not have to keep the original start and end vectors, which take about 500GB. However, when we perform query-side fine-tuning, we need the original start and end vectors to reconstruct the top-$k$ phrase vectors when computing Eq. (11), since (the on-disk version of) the MIPS index only returns the top-$k$ scores and their indices, but not the vectors.