What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary

Ori Ram, Liat Bezalel, Adi Zicher, Yonatan Belinkov, Jonathan Berant, Amir Globerson

Introduction

Dense retrieval models based on neural text representations have proven very effective Karpukhin et al. (2020); Qu et al. (2021); Ram et al. (2022); Izacard et al. (2022a, b), improving upon strong traditional sparse models like BM25 Robertson and Zaragoza (2009). However, when applied off-the-shelf (i.e., in out-of-domain settings) they often experience a severe drop in performance Thakur et al. (2021); Sciavolino et al. (2021); Reddy et al. (2021). Moreover, the reasons for such failures are poorly understood, as the information captured in their representations remains under-investigated.

In this work, we present a new approach for interpreting and reasoning about dense retrievers, through the distributions over the vocabulary that their query and passage representations induce when projected to the vocabulary space (Figure 1). (Throughout the paper, we use query and question interchangeably.) Such distributions enable a better understanding of the representational nature of dense models and their failures, which paves the way to simple solutions that improve their performance.

We begin by showing that dense retrieval representations can be projected to the vocabulary space, by feeding them through the masked language modeling (MLM) head of the pretrained model they were initialized from without any further training. This operation results in distributions over the vocabulary, which we refer to as query vocabulary projections and passage vocabulary projections.

Surprisingly, we find these projections to be highly interpretable to humans (Figure 2; Table 1). We analyze these projections and draw interesting connections between them and well-known concepts from sparse retrieval (§5). First, we highlight the high coverage of tokens shared by the query and the passage in the top-$k$ of their projections. This observation suggests that the lexical overlap between queries and passages plays an important role in the retrieval mechanism. Second, we show that vocabulary projections of passages are likely to contain words that appear in queries about the given passage. Thus, they can be viewed as predicting the questions one would ask about the passage. Last, we show that the model implicitly implements query expansion Rocchio (1971). For example, in Figure 2 the query is “How many judges currently serve on the Supreme court?”, and the words in the query projection $Q$ include “justices” (the common way to refer to them) and “nine” (the correct answer).

The above findings are especially surprising due to the fact that these retrieval models are fine-tuned in a contrastive fashion, and thus do not perform any prediction over the vocabulary or make any use of their language modeling head during fine-tuning. In addition, these representations are the result of running a deep transformer network that can implement highly complex functions. Nonetheless, model outputs remain “faithful” to the original lexical space learned during pretraining.

We further show that our approach is able to shed light on the reasons for which dense retrievers struggle with simple entity-centric questions Sciavolino et al. (2021). Through the lens of vocabulary projections, we identify an interesting phenomenon: dense retrievers tend to “ignore” some of the tokens appearing in a given passage. This is reflected in the ranking assigned to such tokens in the passage projection. For example, the word “michael” in the bottom example of Figure 2 is ranked relatively low (even though it appears in the passage title), thereby hindering the model from retrieving this passage. We refer to this syndrome as token amnesia (§6).

We leverage this insight and suggest a simple inference-time fix that enriches dense representations with lexical information, addressing token amnesia. We show that lexical enrichment significantly improves performance compared to vanilla models on the challenging BEIR benchmark Thakur et al. (2021) and additional datasets. For example, we boost the performance of the strong MPNet model on BEIR from 43.1% to 44.1%.

Taken together, our analyses and results demonstrate the great potential of vocabulary projections as a framework for more principled research and development of dense retrieval models.

Background

In this work, we suggest a simple framework for interpreting dense retrievers, via projecting their representations to the vocabulary space. This is done using the (masked) language modeling head of their corresponding pretrained model. We begin by providing the relevant background.

Most language models based on encoder-only transformers Vaswani et al. (2017) are pretrained using some variant of the masked language modeling (MLM) task Devlin et al. (2019); Liu et al. (2019); Song et al. (2020), which involves masking some input tokens, and letting the model reconstruct them.

2.2 Dense Retrieval

To fine-tune retrievers, a similarity measure $s(q,p)$ is defined (e.g., the dot product between $\bm{e}_{q}$ and $\bm{e}_{p}$, or their cosine similarity) and the model is trained in a contrastive manner to maximize retriever accuracy Lee et al. (2019); Karpukhin et al. (2020). Importantly, in this process, the MLM head function does not change at all.
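As an illustration of contrastive training with in-batch negatives, the following is a minimal sketch of the loss, assuming dot-product similarity and one gold passage per query. The function names and the toy vectors are hypothetical, and the sketch is not taken from any of the cited training codebases.

```python
import math


def dot(a, b):
    """Dot-product similarity s(q, p) between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood of each query's own passage.

    passage_vecs[i] is the gold passage of query_vecs[i]; every other
    passage in the batch serves as a negative for that query.
    """
    losses = []
    for i, e_q in enumerate(query_vecs):
        scores = [dot(e_q, e_p) for e_p in passage_vecs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


# Toy batch (made-up vectors): aligned pairs vs. the same passages shuffled.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[2.0, 0.0], [0.0, 2.0]])
shuffled = in_batch_contrastive_loss(queries, [[0.0, 2.0], [2.0, 0.0]])
```

The loss is low when each query scores its own passage above the others in the batch, and high otherwise. Only the encoders receive gradients from this objective, which is why the MLM head is untouched by fine-tuning.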

Vocabulary Projections

We now describe our framework for projecting query and passage representations of dense retrievers to the vocabulary space. Given a dense retrieval model, we utilize the MLM head of the model it was initialized from to map from encoder output representations to distributions over the vocabulary (Eq. 1). For example, for DPR Karpukhin et al. (2020) we take BERT’s MLM head, as DPR was initialized from BERT. Given a query $q$, we use the query encoder $\text{Enc}_{Q}$ to obtain its representation $\bm{e}_{q}$ as in Eq. 2. Similarly, for a passage $p$ we apply the passage encoder $\text{Enc}_{P}$ to get $\bm{e}_{p}$. We then apply the MLM head as in Eq. 1 to obtain the vocabulary projections:

$$Q = \text{MLM}(\bm{e}_{q}), \qquad P = \text{MLM}(\bm{e}_{p})$$

Note that it is not clear a priori that $Q$ and $P$ will be meaningful in any way, as the encoder model has been changed since pretraining, while the MLM head function remains fixed. Moreover, the MLM function has not been trained to decode “pooled” sequence-level representations (i.e., the results of CLS or mean pooling) during pretraining. Contrary to this expectation, in this work we argue that $P$ and $Q$ are actually highly intuitive and can facilitate a better understanding of dense retrievers.
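To make the projection step concrete, here is a minimal sketch of a BERT-style MLM head applied to a pooled representation. Eq. 1 is not reproduced in this text, so the form used here (a dense transform, GELU, layer normalization, then dot products with token embeddings and a softmax) is an assumption based on the standard BERT head; biases and layer-norm gains are omitted, and all weights and the four-word vocabulary are hypothetical toy values.

```python
import math


def gelu(x):
    # Exact GELU, expressed through the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(v, eps=1e-12):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def matvec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def softmax(z):
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [x / total for x in exps]


def mlm_head(e, transform, token_embeddings):
    """Map a pooled representation e to a distribution over the vocabulary."""
    g = layer_norm([gelu(x) for x in matvec(transform, e)])  # non-linear g(.)
    return softmax(matvec(token_embeddings, g))


def top_k_tokens(dist, vocab, k):
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return [vocab[i] for i in order[:k]]


# Toy weights (hypothetical); a real head has roughly 30K x 768 entries.
vocab = ["court", "justices", "nine", "banana"]
transform = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
token_embeddings = [
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.5, 0.0],
    [0.0, 0.0, 2.0],
]
e_q = [1.5, 0.5, -1.0]  # stands in for the query encoder output

Q = mlm_head(e_q, transform, token_embeddings)
top_tokens = top_k_tokens(Q, vocab, k=2)
```

In practice the transform and token embeddings would be loaded from the pretrained checkpoint the retriever was initialized from, and `e_q` would come from the fine-tuned query encoder.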

Experiment Setup

To evaluate our framework and method quantitatively, we consider several dense retrieval models and datasets.

We now list the retrievers used to demonstrate our framework and method. All dense models share the same architecture and size (i.e., that of BERT-base; 110M parameters), and all were trained in a contrastive fashion with in-batch negatives—the prominent paradigm for training dense models Lee et al. (2019); Karpukhin et al. (2020); Chang et al. (2020); Qu et al. (2021); Ram et al. (2022); Izacard et al. (2022a); Ni et al. (2022); Chen et al. (2022). For the analysis, we use DPR Karpukhin et al. (2020) and BERT Devlin et al. (2019) as its pretrained baseline. For the results of our method, we also use S-MPNet Reimers and Gurevych (2019) and Spider Ram et al. (2022). Our sparse retrieval model is BM25 Robertson and Zaragoza (2009). We refer the reader to App. A for more details.

4.2 Datasets

We follow prior work Karpukhin et al. (2020); Ram et al. (2022) and consider six common open-domain question answering (QA) datasets for the evaluation of our framework: Natural Questions (NQ; Kwiatkowski et al. 2019), TriviaQA Joshi et al. (2017), WebQuestions (WQ; Berant et al. 2013), CuratedTREC (TREC; Baudiš and Šedivý 2015), SQuAD Rajpurkar et al. (2016) and EntityQuestions (EntityQs; Sciavolino et al. 2021). We also consider the BEIR Thakur et al. (2021) and the MTEB Muennighoff et al. (2022) benchmarks.

4.3 Implementation Details

Our code is based on the official repository of DPR Karpukhin et al. (2020), built on Hugging Face Transformers Wolf et al. (2020).

For the six QA datasets, we use the Wikipedia corpus standardized by Karpukhin et al. (2020), which contains roughly 21 million passages of a hundred words each. For dense retrieval over this corpus, we apply exact search using FAISS Johnson et al. (2021). For sparse retrieval we use Pyserini Lin et al. (2021).
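For readers unfamiliar with exact search, the following sketch shows what it computes: the inner product of the query with every passage vector, followed by a top-$k$ selection. This is a plain-Python illustration with hypothetical names and toy vectors, not the FAISS API, which performs the same computation far more efficiently.

```python
import heapq


def exact_search(query_vec, passage_vecs, k):
    """Return the indices of the k passages with the highest inner product."""
    scored = (
        (sum(q * p for q, p in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(passage_vecs)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Toy corpus of four 2-dimensional passage vectors (made-up values).
passages = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.9, 0.1]]
hits = exact_search([1.0, 0.2], passages, k=2)
```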

Analyzing Dense Retrievers via Vocabulary Projections

In Section 3, we introduced a new framework for interpreting representations produced by dense retrievers. Next, we describe empirical findings that shed new light on what is encoded in these representations. Via vocabulary projections, we draw connections between dense retrieval and well-known concepts from sparse retrieval like lexical overlap (§5.1), query prediction (§5.2) and query expansion (§5.3).

Tokens shared by questions and their corresponding gold passages constitute the lexical overlap signal in retrieval, used by sparse models like BM25. We start by asking: how prominent are they in vocabulary projections? Figure 3 illustrates the coverage of these tokens in $Q$ and $P$ for DPR after training, compared to its initialization before training (i.e., BERT with mean or CLS pooling). In other words, for each $k$ we check what percentage of shared tokens is ranked in the top-$k$ of $Q$ and $P$. Results suggest that after training, the model learns to rank shared tokens much higher than before. Concretely, 63% and 53% of the shared tokens appear in the top-20 tokens of $Q$ and $P$ respectively, compared to only 16% and 8% in BERT (i.e., before training). These numbers increase to 78% and 69% for the top-100 tokens of $Q$ and $P$. In addition, we observed that for 71% of the questions, the top-scored token in $Q$ appears in both the question and the passage (App. B). These findings suggest that even for dense retrievers, which do not operate at the lexical level, lexical overlap remains a highly dominant signal.
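The coverage statistic described above can be sketched as follows. This is an illustrative implementation with hypothetical names; in particular, pooling the counts over all examples (rather than averaging per-example fractions) is an assumption, since the aggregation is not specified in this text.

```python
def top_k_ids(dist, k):
    """Indices of the k highest-scoring vocabulary entries."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return set(order[:k])


def shared_token_coverage(examples, k):
    """Fraction of shared question-passage tokens ranked in the top-k.

    examples: list of (projection, question_token_ids, passage_token_ids),
    where projection is a distribution over the vocabulary (Q or P).
    """
    covered = total = 0
    for dist, q_ids, p_ids in examples:
        shared = set(q_ids) & set(p_ids)
        covered += len(shared & top_k_ids(dist, k))
        total += len(shared)
    return covered / total if total else 0.0


# Toy example over a six-token vocabulary (made-up numbers).
projection = [0.05, 0.40, 0.12, 0.30, 0.05, 0.08]
examples = [(projection, [1, 2, 5], [1, 2, 4])]  # shared tokens: {1, 2}
coverage_at_2 = shared_token_coverage(examples, k=2)
coverage_at_3 = shared_token_coverage(examples, k=3)
```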

5.2 Passage Encoders as Query Prediction

Our next analysis concerns the role of passage encoders. In §5.1, we show that tokens shared by the question and its gold passage are ranked high in both $Q$ and $P$. However, passages contain many tokens, and the shared tokens constitute only a small fraction of all tokens. We hypothesize that out of passage tokens, those that are likely to appear in relevant questions receive higher scores in $P$ than others. If this is indeed the case, it implies that passage encoders implicitly learn to predict which of the passage tokens will appear in relevant questions. To test our hypothesis, we analyze the ranks of question and passage tokens in passage vocabulary projections, $P$. Formally, let $\mathcal{T}_{q}$ and $\mathcal{T}_{p}$ be the sets of tokens in a question $q$ and its gold passage $p$, respectively. Table 2 shows the token-level mean reciprocal rank (MRR) of these sets in $P$. We observe that tokens shared by $q$ and $p$ (i.e., $\mathcal{T}_{q}\cap\mathcal{T}_{p}$) are ranked significantly higher than other passage tokens (i.e., $\mathcal{T}_{p}\setminus\mathcal{T}_{q}$). For example, in DPR the MRR of shared tokens is 26.1, while that of other passage tokens is only 3.0. In addition, the MRR of shared tokens in BERT is only 1.4. These findings support our claim that tokens that appear in relevant questions are ranked higher than others, and that this behavior is acquired during fine-tuning.
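A token-level MRR of this kind can be computed as in the sketch below. The helper names and the toy projection are hypothetical. The sketch returns values in $[0, 1]$; the figures quoted above are presumably scaled by 100, since a reciprocal rank cannot exceed 1.

```python
def token_ranks(dist):
    """Map each vocabulary index to its 1-based rank in the projection."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    return {tok: rank for rank, tok in enumerate(order, start=1)}


def token_mrr(dist, token_ids):
    """Mean reciprocal rank of a set of tokens within a projection."""
    ids = set(token_ids)
    if not ids:
        return 0.0
    rank = token_ranks(dist)
    return sum(1.0 / rank[t] for t in ids) / len(ids)


# Toy passage projection P over a six-token vocabulary (made-up numbers).
P = [0.05, 0.40, 0.12, 0.30, 0.05, 0.08]
T_q = {1, 2, 5}        # question tokens
T_p = {0, 1, 2, 3, 4}  # passage tokens

mrr_shared = token_mrr(P, T_q & T_p)  # tokens in both q and p
mrr_other = token_mrr(P, T_p - T_q)   # passage-only tokens
```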

5.3 Query Encoders Implement Query Expansion

To overcome the “vocabulary mismatch” problem (i.e., when question-document pairs are semantically relevant, but lack significant lexical overlap), query expansion methods have been studied extensively Rocchio (1971); Voorhees (1994); Zhao and Callan (2012); Mao et al. (2021). The main idea is to expand the query with additional terms that will better guide the retrieval process. We define a token as a query expansion if it does not appear in the query itself but does appear in the query projection $Q$, and also in the gold passage $p$ of that query (excluding stop words and punctuation marks). Figure 4 shows the percentage of queries with at least one query expansion token in the top-$k$ as a function of $k$ for DPR and the BERT baseline (i.e., before DPR training). We observe that after training, the model promotes query expansion tokens to higher ranks than before. In addition, we found that almost 14% of the tokens in the top-5 of $Q$ are query expansion tokens (cf. App. B).
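The definition of a query expansion token can be turned into a short check, sketched below. The stop-word list, the six-word vocabulary and the projection values are hypothetical, chosen to mirror the Supreme Court example from the introduction.

```python
# A tiny illustrative stop-word and punctuation list (an assumption).
STOP_WORDS = {"the", "a", "of", "on", "how", "many", "has", "?", "."}


def query_expansion_tokens(q_projection, vocab, query_tokens, passage_tokens, k):
    """Tokens in the top-k of Q that are absent from the query but present
    in the gold passage, excluding stop words and punctuation."""
    order = sorted(range(len(vocab)), key=lambda i: q_projection[i], reverse=True)
    top = [vocab[i] for i in order[:k]]
    in_query, in_passage = set(query_tokens), set(passage_tokens)
    return [
        tok for tok in top
        if tok not in in_query and tok in in_passage and tok not in STOP_WORDS
    ]


vocab = ["judges", "court", "justices", "nine", "the", "banana"]
Q = [0.30, 0.25, 0.20, 0.15, 0.06, 0.04]  # made-up query projection
query = ["how", "many", "judges", "serve", "on", "the", "supreme", "court", "?"]
passage = ["the", "supreme", "court", "has", "nine", "justices", "."]

expansions = query_expansion_tokens(Q, vocab, query, passage, k=4)
```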

We note that there are two interesting classes of query expansion tokens: (1) synonyms of question tokens, as well as tokens that share similar semantics with tokens in $q$ (e.g., “michigan” in the first example of Table 1); and (2) “answer tokens”, which contain the answer to the query (e.g., “motown” in the second example of Table 1). The presence of such tokens may suggest the model already “knows” the answer to the given question, either from pretraining or from similar questions seen during training Lewis et al. (2021).

Given these findings, we conjecture that the model “uses” these query expansion tokens to introduce a semantic signal to the retrieval process.

Token Amnesia

The analysis in Section 5 shows that vocabulary projections of passages (i.e., $P$) predict which of the input tokens are likely to appear in relevant questions. However, in some cases these predictions utterly fail. For example, in Figure 2 the token “michael” is missing from the top-$k$ of the passage projection $P$. We refer to such cases as token amnesia. Here we ask: do these failures in query prediction hurt retrieval?

Next, we demonstrate that token amnesia indeed correlates with well-known failures of dense retrievers (§6.1). To overcome this issue, we suggest a lexical enrichment procedure for dense representations (§6.2) and demonstrate its effectiveness on downstream retrieval performance (§6.3).

Dense retrievers have shown difficulties in out-of-domain settings Sciavolino et al. (2021); Thakur et al. (2021), where even sparse models like BM25 significantly outperform them. We now offer an intuitive explanation for these failures via token amnesia. We focus on setups where BM25 outperforms dense models and ask: why do dense retrievers fail to model lexical overlap signals? To answer this question, we consider subsets of NQ and EntityQs where BM25 is able to retrieve the correct passage in its top-5 results. We focus on these subsets as they contain significant lexical overlap between questions and passages (by definition, as BM25 successfully retrieved the correct passage). Let $q$ be a question and $p$ the passage retrieved by BM25 for $q$, and let $Q$ and $P$ be their corresponding vocabulary projections for some dense retriever. Also, let $\mathcal{T}\subseteq\mathcal{V}$ be the set of tokens that appear in both $q$ and $p$ (excluding stop words). Figure 5 shows the maximum (i.e., lowest) rank of tokens from $\mathcal{T}$ in the distributions $P$ (left) and $Q$ (right) as a function of whether DPR is able to retrieve this passage (i.e., the rank of $p$ in the retrieval results of DPR). Indeed, the median max-rank over questions for which DPR succeeds in fetching $p$ in its top-5 results (blue box) is much lower than that of questions for which DPR fails to retrieve the passage (red box). As expected (due to the fact that questions contain fewer tokens than passages), the ranks of shared tokens in question projections $Q$ are much higher (i.e., closer to the top). However, the trend is present in $Q$ as well. Additional figures (for EntityQs; as well as median ranks instead of max ranks) are given in App. C.
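The max-rank statistic used in this analysis can be sketched as follows. The function name, the token ids and the projection values are hypothetical; a large return value means that at least one shared token is buried deep in the projection, which is the signature of token amnesia.

```python
def max_rank_of_shared(dist, question_ids, passage_ids, stop_ids=frozenset()):
    """Worst (numerically largest) 1-based rank, within a projection, of
    the non-stop-word tokens that appear in both question and passage."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    rank = {tok: r for r, tok in enumerate(order, start=1)}
    shared = (set(question_ids) & set(passage_ids)) - set(stop_ids)
    return max(rank[t] for t in shared) if shared else None


# Toy passage projection over a six-token vocabulary (made-up numbers).
P = [0.05, 0.40, 0.12, 0.30, 0.05, 0.08]
worst = max_rank_of_shared(P, question_ids=[1, 5], passage_ids=[1, 3, 5])
```

Computing this value per question and comparing its median between questions where the dense retriever succeeds and those where it fails reproduces the kind of comparison described above.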

Overall, these findings indicate a correlation between token amnesia and failures of DPR. Next, we introduce a method to address token amnesia in dense retrievers, via lexical enrichment of dense representations.

6.2 Method: Lexical Enrichment

As suggested by the analysis in §6.1, dense retrievers have the tendency to ignore some of their input tokens. We now leverage this insight to improve these models. We refer to our method as lexical enrichment (LE) because it enriches text encodings with specific lexical items.

Intuitively, a natural remedy to the “token amnesia” problem is to change the retriever encoding such that it does include these tokens. For example, assume the query $q$ is “Where was Michael Jack born?” and the corresponding passage $p$ contains the text “Michael Jack was born in Folkestone, England”. According to Figure 2, the token “michael” is ranked relatively low in $P$, and DPR fails to retrieve the correct passage $p$. We would like to modify the passage representation $\bm{e}_{p}$ and get an enriched version $\bm{e}^{\prime}_{p}$ that does have this token in its top-$k$ projected tokens, while keeping most of the other projected tokens intact. This is our goal in LE, and we next describe the approach. We focus on enrichment of passage representations, as query enrichment works similarly. We first explain how to enrich representations with a single token, and then extend the process to multiple tokens.

Assume we want to enrich a passage representation $\bm{e}_{p}$ with a token $t$ (e.g., $t=$ “michael” in the above example). If there were no other words in the passage, we would simply want to find an embedding such that feeding it into the MLM head would produce $t$ as the top token. (Note that feeding the token input embedding $\bm{v}_{t}$ does not necessarily produce $t$ as the top token, as the MLM head applies a non-linear function $g$ (Eq. 1).) We refer to this embedding as the single-token enrichment of $t$, denote it by $\bm{s}_{t}$ and define it as:

$$\bm{s}_{t} = \arg\min_{\hat{\bm{s}}} \; -\log \text{MLM}(\hat{\bm{s}})[t]$$

This objective is equivalent to the cross-entropy loss between a one-hot vector on $t$ and the output distribution $\text{MLM}(\hat{\bm{s}})$.

In order to approximately solve the optimization problem in Eq. 4 for each $t$ in the vocabulary, we use Adam with a learning rate of 0.01 (for S-MPNet, we used a learning rate of $10^{-3}$). We stop when a (cross-entropy) loss threshold of 0.1 is reached for all tokens. We then apply whitening Jung et al. (2022), which has proven effective for dense retrieval.
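The optimization above can be illustrated on a toy problem. The sketch below makes several simplifying assumptions that differ from the setup described in the text: it uses a linear softmax head (no non-linear $g$), plain gradient descent in place of Adam, a step size of 0.5 chosen for the toy problem, no whitening, and a hypothetical five-token vocabulary. It keeps only the core idea, namely minimizing the cross-entropy toward a one-hot target until the loss drops below 0.1.

```python
import math


def softmax(z):
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [x / total for x in exps]


def project(s, token_embeddings):
    # Toy linear head: logits are dot products with the token embeddings.
    return softmax([sum(a * b for a, b in zip(row, s)) for row in token_embeddings])


def single_token_enrichment(t, token_embeddings, lr=0.5,
                            loss_threshold=0.1, max_steps=10_000):
    """Gradient descent on -log p(t | s), starting from the zero vector."""
    dim = len(token_embeddings[0])
    s = [0.0] * dim
    for _ in range(max_steps):
        probs = project(s, token_embeddings)
        if -math.log(probs[t]) < loss_threshold:
            break
        # d loss / d s = sum_i (p_i - 1[i == t]) * v_i
        grad = [0.0] * dim
        for i, row in enumerate(token_embeddings):
            coeff = probs[i] - (1.0 if i == t else 0.0)
            for j in range(dim):
                grad[j] += coeff * row[j]
        s = [x - lr * g for x, g in zip(s, grad)]
    return s


# Hypothetical embeddings: mostly distinct directions with some overlap.
token_embeddings = [[1.0 if i == j else 0.2 for j in range(5)] for i in range(5)]
s_t = single_token_enrichment(t=2, token_embeddings=token_embeddings)
probs = project(s_t, token_embeddings)
final_loss = -math.log(probs[2])
```

With a real MLM head, the gradient would be obtained by automatic differentiation through the non-linear transform rather than in closed form.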

Multi-Token Enrichment

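Now suppose we have an input $x$ (either a question or a passage) with tokens $x=[x_{1},\ldots,x_{n}]$, and we would like to enrich its representation with these tokens such that rare tokens are given higher weights than frequent ones (as in BM25). Then, we simply take its original representation $\bm{e}_{x}$ and add to it a weighted sum of the single-token enrichments (Eq. 4). Namely, we define:

$$\bm{e}_{x}^{\text{lex}} = \frac{1}{n}\sum_{i=1}^{n} w_{x_{i}}\,\bm{s}_{x_{i}}, \qquad \bm{e}^{\prime}_{x} = \bm{e}_{x} + \lambda\cdot\frac{\bm{e}_{x}^{\text{lex}}}{\lVert\bm{e}_{x}^{\text{lex}}\rVert}$$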

Here $\lambda$ is a hyper-parameter chosen via cross validation. We use the inverse document frequency Sparck Jones (1972) of tokens as their weights: $w_{x_{i}}=\text{IDF}(x_{i})$. The relevance score is then defined on the enriched representations.
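A minimal sketch of multi-token enrichment follows. Several details are assumptions rather than statements of the exact method: the lexical vector is taken to be the IDF-weighted average of single-token enrichments, it is unit-normalized before being scaled by $\lambda$ (both suggested by the ablation study), and the smoothed IDF formula, the value of `lam`, the document frequencies and all vectors are hypothetical toy choices.

```python
import math


def idf(token, doc_freq, num_docs):
    # One common smoothed IDF variant (an assumption, not the paper's formula).
    return math.log((num_docs + 1) / (doc_freq.get(token, 0) + 1))


def lexical_representation(tokens, single_token, weight):
    """Weighted average of the single-token enrichment vectors."""
    dim = len(next(iter(single_token.values())))
    acc = [0.0] * dim
    for tok in tokens:
        w = weight(tok)
        acc = [a + w * s for a, s in zip(acc, single_token[tok])]
    return [a / len(tokens) for a in acc]


def enrich(e_x, tokens, single_token, weight, lam):
    """Add the unit-normalized lexical vector, scaled by lam, to e_x."""
    lex = lexical_representation(tokens, single_token, weight)
    norm = math.sqrt(sum(v * v for v in lex)) or 1.0  # guard against zero
    return [e + lam * v / norm for e, v in zip(e_x, lex)]


def score(e_q, e_p):
    return sum(a * b for a, b in zip(e_q, e_p))


# Toy single-token enrichments and document frequencies (made-up values).
single_token = {"michael": [1.0, 0.0, 0.0], "born": [0.0, 1.0, 0.0],
                "the": [0.0, 0.0, 1.0]}
doc_freq = {"michael": 9, "born": 99, "the": 999}


def weight(tok):
    return idf(tok, doc_freq, num_docs=999)


e_p = [0.0, 0.5, 0.5]  # passage representation that "forgot" michael
e_p_enriched = enrich(e_p, ["michael", "born", "the"], single_token, weight, lam=1.0)
e_q = [1.0, 0.2, 0.0]  # query representation that emphasizes michael
```

In this toy setting the rare token receives the largest weight, so the enriched passage moves toward the direction of “michael” and its score against the query increases.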

6.3 Results

Our experiments demonstrate the effectiveness of our method for multiple models, especially in zero-shot settings. Table 3 shows the results of several models with and without our enrichment method, LE; additional results are given in App. D. LE improves every baseline model it is added to. Importantly, our method improves the performance of S-MPNet (the best base-sized model on the MTEB benchmark to date Muennighoff et al. (2022)) on MTEB and BEIR by 1.1% and 1.0%, respectively. When considering EntityQs (on which dense retrievers are known to struggle), we observe significant gains across all models, and S-MPNet and Spider obtain higher accuracy than BM25 that operates on the same textual units (i.e., BM25 with BERT vocabulary). This finding indicates that they are able to integrate semantic information (from the original representation) with lexical signals. Yet, vanilla BM25 is still better than LE models on EntityQs and SQuAD, which prompts further work on how to incorporate lexical signals in dense retrieval. Overall, LE improves retrieval accuracy over the baselines for all models and datasets in the zero-shot setting.

6.4 Ablation Study

We carry out an ablation study to test our design choices from §6.2. We evaluate four elements of our method: (1) the use of IDF to highlight rare tokens, (2) our approach for deriving single-token representations, (3) the use of whitening, and (4) the use of unit normalization.

In our method, we create lexical representations of questions and passages, $\bm{e}_{x}^{\text{lex}}$. These lexical representations are the average of token embeddings, each multiplied by its token’s IDF. We validate that IDF is indeed necessary: Table 4 demonstrates that setting $w_{x_{i}}=1$ in Eq. 5 leads to a significant degradation in performance on EntityQs. For example, top-20 retrieval accuracy drops from 65.2% to 57.7%.

Single-Token Enrichment

Eq. 4 defines our single-token enrichment: for each item in the vocabulary $v\in\mathcal{V}$, we find an embedding which gives a one-hot vector peaked at $v$ when fed to the MLM head. We confirm that this is necessary by replacing Eq. 4 with the static embeddings of the pretrained model (e.g., BERT in the case of DPR). We find that our approach significantly improves over BERT’s embeddings on EntityQs (e.g., the margin in top-20 accuracy is 3.4%).

Whitening & Normalization

Related Work

Projecting representations and model parameters to the vocabulary space has been studied previously mainly in the context of language models. The approach was initially explored by nostalgebraist (2020). Geva et al. (2021) showed that feed-forward layers in transformers can be regarded as key-value memories, where the value vectors induce distributions over the vocabulary. Geva et al. (2022) view the token representations themselves as inducing such distributions, with feed-forward layers “updating” them. Dar et al. (2022) suggest projecting all transformer parameters to the vocabulary space. Dense retrieval models, however, do not have any language modeling objective during fine-tuning, yet we show that their representations can still be projected to the vocabulary.

Despite the wide success of dense retrievers recently, interpreting their representations remains under-explored. MacAvaney et al. (2022) analyze neural retrieval models (not only dense retrievers) via diagnostic probes, testing characteristics like sensitivity to paraphrases, styles and factuality. Adolphs et al. (2022) decode the query representations of neural retrievers using a T5 decoder, and show how to “move” in representation space to decode better queries for retrieval.

Language models (and specifically MLMs) have been used for sparse retrieval in the context of term-weighting and lexical expansion. For example, Bai et al. (2020) and Formal et al. (2021) learn such functions over BERT’s vocabulary space. We differ by showing that dense retrievers implicitly operate in that space as well. Thus, these approaches may prove effective for dense models too. While we focus in this work on dense retrievers based on encoder-only models, our framework is easily extendable to retrievers based on autoregressive decoder-only (i.e., left-to-right) models like GPT Radford et al. (2019); Brown et al. (2020), e.g., Neelakantan et al. (2022) and Muennighoff (2022).

Conclusion

In this work, we explore projecting query and passage representations obtained by dense retrieval to the vocabulary space. We show that these projections facilitate a better understanding of the mechanisms underlying dense retrieval, as well as their failures. We also demonstrate how projections can help improve these models. This understanding is likely to help in improving retrievers, as our lexical enrichment approach demonstrates.

Limitations

We point to several limitations of our work. First, our work considers a popular family of models referred to as “dense retrievers”, but other approaches for retrieval include sparse retrievers Robertson and Zaragoza (2009); Bai et al. (2020); Formal et al. (2021), generative retrievers Tay et al. (2022); Bevilacqua et al. (2022), late-interaction models Khattab and Zaharia (2020), inter alia. While our work draws interesting connections between dense and sparse retrieval, our main focus is on understanding and improving dense models. Second, all three dense models we analyze are bidirectional and were trained in a contrastive fashion. While most dense retrievers indeed satisfy these properties, there are works that suggested other approaches, both in terms of other architectures Muennighoff (2022); Neelakantan et al. (2022); Ni et al. (2022) and other training frameworks Lewis et al. (2020); Izacard et al. (2022b). Last, while our work introduces new ways to interpret and analyze dense retrieval models, we believe our work is the tip of the iceberg, and there is still much work to be done in order to gain a full understanding of these models.

Ethics Statement

Retrieval systems have the potential to mitigate serious problems caused by language models, like factual inaccuracies. However, retrieval failures may lead to undesirable behavior of downstream models, like wrong answers in QA or incorrect generations for other tasks. Also, since retrieval models are based on pretrained language models, they may suffer from similar biases.

Acknowledgements

We thank Ori Yoran, Yoav Levine, Yuval Kirstain, Mor Geva and the anonymous reviewers for their valuable feedback. This project was funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC HOLI 819080), the Blavatnik Fund, the Alon Scholarship, the Yandex Initiative for Machine Learning, Intel Corporation, the Israel Science Foundation (grant No. 448/20), Open Philanthropy, and an Azrieli Foundation Early Career Faculty Fellowship.

References

Appendix A Models: Further Details

DPR Karpukhin et al. (2020) is a dense retriever that was trained on Natural Questions Kwiatkowski et al. (2019). It was initialized from BERT-base Devlin et al. (2019). Thus, we use the public pretrained MLM head of BERT-base to project DPR representations.

BERT

Devlin et al. (2019) We use BERT for dense retrieval, mainly as a baseline for DPR, as DPR was initialized from BERT. This allows us to track where behaviors we observe stem from: pretraining or retrieval fine-tuning. We use both CLS and mean pooling for BERT.

S-MPNet

is a supervised model trained for Sentence Transformers Reimers and Gurevych (2019) using many available datasets for retrieval, sentence similarity, inter alia. It uses cosine similarity, rather than dot product, for relevance scores. It was initialized from MPNet-base Song et al. (2020), and thus we use this model’s MLM head.

Spider

Ram et al. (2022) is an unsupervised dense retriever trained using the recurring span retrieval pretraining task. It was also initialized from BERT-base, and we therefore use the same MLM head for projection as the one used for DPR.

BM25

Robertson and Zaragoza (2009) is a lexical model based on tf-idf. We use two variants of BM25: (1) vanilla BM25, and (2) BM25 over BERT/MPNet tokens (e.g., “Reba” $\rightarrow$ “re ##ba”; BERT and MPNet use essentially the same vocabulary, up to special tokens). We consider this option to understand whether the advantages of BM25 stem from its use of different word units from the transformer models.
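For reference, the sketch below scores documents with a standard BM25 formula over pre-tokenized text, which makes the second variant easy to picture: the only change is that the token lists contain word pieces. The parameter values `k1` and `b`, the IDF variant, and the toy documents are assumptions for illustration and are not taken from the experimental setup.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=0.9, b=0.4):
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(tok for d in docs for tok in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for tok in set(query_tokens):  # repeated query terms count once
            if tok not in tf:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            total += idf * tf[tok] * (k1 + 1.0) / (tf[tok] + length_norm)
        scores.append(total)
    return scores


# Word-piece tokens, as a BERT-style tokenizer might produce them.
docs = [
    ["re", "##ba", "was", "born", "in", "oklahoma"],
    ["the", "court", "has", "nine", "justices"],
]
query = ["where", "was", "re", "##ba", "born", "?"]
scores = bm25_scores(query, docs)
```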

Appendix B Analysis: Further Results

Figure 6 gives an analysis of the top-$k$ tokens in the question projection $Q$ and passage projection $P$.

Appendix C Token Amnesia: Further results

Figure 7 gives further analyses of token amnesia: It contains the results for EntityQuestions, as well as analysis of median ranks in addition to max ranks (complements Figure 5).

Appendix D Lexical Enrichment: Further Results

Table 9 gives the results of our method on the BEIR and MTEB benchmarks for all 19 datasets (complements Table 3). Table 6, Table 7 and Table 8 give the zero-shot results for $k\in\{1,5,100\}$, respectively (complementing Table 3).

Appendix E Dataset Statistics & Licenses

Table 5 details the license and number of test examples for each of the six open-domain datasets used in our work. For the BEIR benchmark, we refer the reader to Thakur et al. (2021) for the number of examples and the license of each of their datasets.

Appendix F Computational Resources

Our method (LE) does not involve training models at all. Our computational resources have been used to evaluate LE on the BEIR benchmark, i.e., computing passage embeddings for each corpus and each model. We used eight Quadro RTX 8000 GPUs. Each experiment took several hours.