Multi-Stage Document Ranking with BERT

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, Jimmy Lin

Introduction

Neural models pre-trained on language modeling tasks such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) have achieved impressive results on NLP tasks ranging from natural language inference to question answering. One such popular model, BERT, has recently been applied to search-related tasks, retrieval-based question answering Yang et al. (2019b), as well as document ranking Yang et al. (2019c); MacAvaney et al. (2019); Yilmaz et al. (2019).

This paper builds on previous initial work (Nogueira and Cho, 2019) to tackle the document ranking problem with a multi-stage ranking architecture. We introduce two BERT variants, called monoBERT and duoBERT. The monoBERT model treats document ranking as a binary classification problem over individual candidate documents, while the duoBERT model adopts a pairwise classification approach that considers pairs of candidate documents. For end-to-end document ranking, we arrange these models as stages in a pipeline where each balances the size of the candidate set against the inherent complexity of the model. This design allows us to obtain the benefits of richer models while controlling the increased inference latencies that come with these richer models.

Our work makes the following contributions: We start by describing monoBERT, a pointwise classification model of document relevance that was introduced in Nogueira and Cho (2019). Second, we propose a novel extension of monoBERT, called duoBERT, that adopts a pairwise classification approach to document relevance. Third, we integrate monoBERT and duoBERT in a multi-stage ranking architecture that allows us to reap the benefits of our richer duoBERT model with only a modest increase in inference latency. The architecture adopts an innovation from the information retrieval (IR) community that to our knowledge has not been explored by NLP researchers. Fourth, perhaps unsurprising, we show that pre-training on the corpus of the target task improves effectiveness over pre-training on out-of-domain corpora.

We evaluate our models on two large-scale document retrieval datasets that are conducive to deep learning experiments: the MS MARCO dataset and the Complex Answer Retrieval (CAR) Task at TREC. On both datasets, our results are either at or comparable to the state of the art. As we show through component-level ablation studies, both monoBERT and duoBERT contribute significantly to overall effectiveness. Additionally, within the framework of multi-stage ranking, we characterize the latency vs. effectiveness tradeoff space of each model.

Background and Related Work

In this paper, we tackle the document ranking problem (also known as ad hoc retrieval), following the widely-accepted standard formulation: Given a user’s information need expressed as a query $q$ and a (potentially large) corpus of documents, the system’s task is to produce a ranking of $k$ documents that maximizes some metric, such as mean average precision (MAP) or mean reciprocal rank (MRR). Throughout this paper, per standard parlance in IR, document is used generically to refer a unit of text being retrieved, when in actuality it may be a passage, a sentence, etc.

The basic idea behind multi-stage ranking is to break document ranking down into a series of pipeline stages. Following an initial retrieval stage, which typically issues a “bag of words” query against an inverted index, each subsequent stage re-ranks the set of candidates passed along from the previous stage until the final output is returned to the user. This basic approach has received much interest in academia Matveeva et al. (2006); Wang et al. (2011); Asadi and Lin (2013); Chen et al. (2017); Mackenzie et al. (2018) as well as industry. Known production deployments include the Bing web search engine Pedersen (2010) as well as Alibaba’s e-commerce search engine Liu et al. (2017).

Multi-stage ranking architectures have evolved to strike a balance between model complexity and search latency by controlling the size of the candidate set at each stage. Increasingly richer models can be made practical by considering successively smaller candidate sets. For certain (easy) queries, stages of the pipeline can be skipped entirely, known as “early exits” Cambazoglu et al. (2010). Viewed in this manner, multi-stage ranking captures the same intuition as progressive refinement in classifier cascades Viola and Jones (2004). For example, an early stage might consider only term statistics of single terms, whereas later stages might consider bigrams, phrases, or even apply lightweight NLP techniques. Given this setup, a number of researchers have proposed techniques based, for example, on boosting for composing these stages in an end-to-end manner Wang et al. (2011); Xu et al. (2012). In our work, we make the connection between BERT-based models and multi-stage ranking, which allows us to trade off the quality of the results with inference latency.

The advent of deep learning has brought tremendous excitement into the information retrieval community. Although machine-learned ranking models have been well studied since the mid-2000s under the banner of “learning to rank”, the paradigm is heavily driven by manual feature engineering Liu (2009); Li (2011); commercial web search engines are known to incorporate thousands of features (or more) in their models. Continuous vector space representations coupled with neural models promise to obviate the need for handcrafted features and have attracted the attention of many researchers. Well-known neural ranking models include DRMM (Guo et al., 2016), DUET (Mitra et al., 2017), KNRM (Xiong et al., 2017), and Co-PACRR (Hui et al., 2018); the literature is too vast for an exhaustive review here, and thus we refer readers to recent overviews Onal et al. (2018); Mitra and Craswell (2019).

Although often glossed over, most neural ranking models today (including all the models referenced above) are actually re-ranking models, in the sense that they operate over the output of a list of candidate documents, typically produced by a “bag of words” query. Thus, document retrieval with neural models today already uses multi-stage ranking, albeit an impoverished form with only a single re-ranking stage. This recognition provides a starting point of our work, from which we build BERT-based multi-stage ranking.

Multi-Stage Ranking with BERT

In our formulation, a multi-stage ranking architecture comprises a number of stages, denoted $H_{0}$ to $H_{N}$ . Except for $H_{0}$ , which retrieves $k_{0}$ candidates from an inverted index, each stage $H_{n}$ receives a ranked list $R_{n-1}$ comprising $k_{n-1}$ candidates from the previous stage. Each stage, in turn, provides a ranked list $R_{n}$ comprising $k_{n}$ candidates to the subsequent stage, with the obvious requirement that $k_{n}\leq k_{n-1}$ . The ranked list generated by the final stage $H_{N}$ is designated for consumption by the (human) searcher.

For expository purposes, we consider stages to receive and produce candidates even though they may in fact be documents, passages, etc. Within this general framework, we instantiate a specific design composed of three stages ( $H_{0}$ , $H_{1}$ , and $H_{2}$ ), as shown in Figure 1.

In our approach, each stage is unconstrained in its implementation other than the input–output specifications outlined above. For example, a pipeline stage is not obligated to consider all candidates provided to it, and in fact, latency introduced by each stage can be controlled by truncating the number of input candidates. Furthermore, each stage can choose to pay attention or ignore scores of the candidates it receives; in the latter case, the ranked list devolves into a set of unranked candidates. In our experiments, we explore the latency–quality tradeoff space that is induced by this design flexibility (see Section 5).

The first stage $H_{0}$ receives as input the user query $q$ and produces top- $k_{0}$ candidates $R_{0}$ . In our implementation, the query is treated as a “bag of words” for ranking documents from the corpus using a standard inverted index based on the BM25 scoring function Robertson et al. (1994). We use the Anserini IR toolkit Yang et al. (2017, 2018),http://anserini.io/ which is built on the popular open-source Lucene search engine.

BM25 is based on exact term matches, and all candidates must contain at least one term from the user’s query. However, since later BERT stages operate in continuous vector spaces, they have the ability to identify relevant candidates that do not have many matching terms. Thus, it is critical in $H_{0}$ to optimize for recall to provide subsequent stages a diverse set of documents to work with. On the other hand, precision is less of a concern because non-relevant documents can be discarded by later stages.

In general, the task of a re-ranking stage $H_{n}$ is to estimate a score $s_{i}$ quantifying how relevant a candidate $d_{i}\in R_{n-1}$ is to a query $q$ . Naturally, we expect that the ranking induced by these scores yields a higher metric (e.g., MAP or MRR) than the scores from the previous stage.

In stage $H_{1}$ , we call monoBERT our pointwise re-ranker, which is a BERT model used as a binary relevance classifier. Using the same notation as Devlin et al. (2019), we feed the query $q$ as sentence A and the text of candidate $d_{i}$ as sentence B. We truncate the query to have at most 64 tokens. We also truncate the candidate text such that the concatenation of query, candidate, and separator tokens have a maximum length of 512 tokens. Given these limits, we observe that none of the queries or documents of the datasets used in our experiments (TREC CAR and MS MARCO) have to be truncated.

Once the segment is passed through the model, we use the $\left[\text{CLS}\right]$ vector as input to a single layer neural network to obtain a probability $s_{i}$ of the candidate $d_{i}$ being relevant to $q$ . We obtain a score $s_{i}$ for each candidate independently and generate a new list of candidates $R_{1}$ by keeping only the top- $k_{1}$ candidates based on these scores.

We train the model for re-ranking using cross-entropy loss:

where $J_{\text{pos}}$ is the set of indexes of the relevant candidates and $J_{\text{neg}}$ is the set of indexes of the non-relevant candidates in $R_{0}$ .

The output $R_{1}$ from the previous stage is used as input to the pairwise re-ranker we call duoBERT. Within the framework of “learning to rank”, duoBERT can be characterized as a “pairwise” approach, while monoBERT can be characterized as a “pointwise” approach Liu (2009). In this pairwise approach, the re-ranker estimates the probability $p_{i,j}$ of the candidate $d_{i}$ being more relevant than $d_{j}$ .

This re-ranker is also a BERT model that takes as input the query as sentence A, candidate $d_{i}$ as sentence B, and candidate $d_{j}$ as sentence C. Similar to the original implementation, each sentence type (A, B, and C) has its own embedding that is summed to the token and positional embeddings. We truncate the query, candidates $d_{i}$ and $d_{j}$ to 62, 223, and 223 tokens, respectively, so the entire sequence will have at most 512 tokens when concatenated with the $[\text{CLS}]$ token and the three separator tokens. Using the above truncation limits, in the datasets used in this work none of the queries are truncated, and less than 1% of the documents are truncated.

We use the $[\text{CLS}]$ vector as input to a single layer neural network to obtain the probability $p_{i,j}$ . Since there are $k_{1}$ candidates, $k_{1}(k_{1}-1)$ probabilities are computed. We then train the model with the following loss:

Note in the equation above that candidates $d_{i}$ and $d_{j}$ are never both relevant or not relevant.

At inference time, we aggregate the pairwise scores $p_{i,j}$ so that each document receives a single score $s_{i}$ . We investigate five different aggregation methods (Sum, Binary, Min, Max, and Sample):

where $J_{i}=\{0\leq j<|R_{1}|,j\neq i\}$ and $m$ is the number of samples drawn without replacement from the set $J_{i}$ .

The Sum method measures the pairwise agreement that candidate $d_{i}$ is more relevant than the rest of the candidates ${\{d_{j}\}}_{j\neq i}$ . The Binary method is inspired by the Condorcet method Montague and Aslam (2002), which is a strong aggregation baseline Cormack et al. (2009). The Min (Max) method measures the relevance of $d_{i}$ only against its strongest (weakest) competitor. The Sample method aims to decrease the high inference costs of pairwise computations via sampling.

The final list of candidates $R_{2}$ is obtained by re-ranking the candidates in $R_{1}$ according to their scores ${s_{i}}$ . In our current design, the output $R_{2}$ is provided for human consumption, and serves as the input to computing the final evaluation metrics (e.g., MAP).

Experimental Setup

A fortunate confluence of events has enabled the multi-stage ranking architecture we propose in this paper. First, of course, is the innovation captured in BERT, as the latest refinement in a long stream of neural models that make heavy use of pre-training. Second, and just as important, is the availability of data. For document retrieval, most IR researchers have not had access to sufficient training data until recently.

As demonstrated by Lin (2019), in a limited data regime, it is not entirely clear that neural techniques actually perform better than well-tuned “classic” IR techniques; subsequent work by Yang et al. (2019a) show that the gains are modest at best. Until recently, research in neural ranking models mostly took advantage of proprietary datasets derived from user behavior logs (which large organizations can gather in abundance). Since these datasets cannot be shared, only a small set of researchers could productively work on neural ranking models and different models could not be easily compared; the combination of both factors hamper rapid progress.

Fortunately, the field has seen the release of two large-scale datasets for powering data-hungry neural models: MS MARCO Bajaj et al. (2018) and TREC CAR Dietz et al. (2017). We take advantage of both datasets to train our models, which we detail below.

The Microsoft MAchine Reading COmprehension dataset (MS MARCO) is a large-scale resource created from approximately half a million anonymized questions sampled from Bing’s search query logs. We focus on the passage ranking task, where given a corpus of 8.8M passages extracted from 3.6M web documents, the system’s goal is to retrieve passages that answer the question. Each passage contains an average of 55 words (or 340 characters), and hence is relatively short—however, in order to maintain consistent terminology throughout this paper, we refer to these basic units of retrieval as “documents.”

The training set (for the passage ranking task) comprises approximately 500k pairs of query and relevant document, and another 400M pairs of query and non-relevant documents. The relevance judgments are provided by humans. The development set contains 6,980 queries, with, on average, one relevant document per query. Thus, a noteworthy property of this dataset is the sparsity of relevance judgments—as opposed to typical TREC test collections built using pooling Voorhees (2002), which have far fewer topics (usually around 50) but many more judgments per topic (typically, hundreds). A blind, held-out evaluation set with 6,837 queries is also available, but without relevance judgments. Evaluation on these queries is provided by the Microsoft organizers upon submission to the online leaderboard. The official metric for this dataset is MRR@10.

Before training our models on the re-ranking task, we apply a two-phase pre-training. In the first phase, the model is pre-trained using Wikipedia (2.5B words) and the Toronto Book corpus (0.8B words) for one million iterations, as described by Devlin et al. (2019). In the second phase, we further pre-train the model on the MS MARCO corpus (0.5B words) for 100k iterations with a maximum sequence length of 512 tokens, batches of size 128, and learning rate of $5\times 10^{-5}$ . This second pre-training phase takes approximately 24 hours on a TPU v3. https://cloud.google.com/tpu/

Training.

We fine-tune both monoBERT and duoBERT using a TPU v3 with a batch size of 128 (128 sequences $\times$ 512 tokens = 16,384 tokens/batch) for 100k iterations, which takes approximately 24 hours. This corresponds to training on 12.8M (100k $\times$ 128) query–document pairs. We could not see any improvement on the dev set when training for another three days, which is equivalent to seeing 50M query–document pairs in total. To avoid biasing our model towards predicting non-relevant labels, which are approximately 1000 times more frequent in the training set, we build each batch by sampling an equal amount of relevant and non-relevant passages.

For both models, we use Adam (Kingma and Ba, 2014) with the initial learning rate set to $3\times 10^{-6}$ , $\beta_{1}=0.9$ , $\beta_{2}=0.999$ , L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. Dropout probability is set to $0.1$ in all layers.

Inference.

In our base configuration, we use top- $k_{0}=1000$ and top- $k_{1}=50$ candidates as input to monoBERT and duoBERT, respectively. Our experiments, however, include ablation settings as well as different parameterizations to characterize the contributions of each component as well as the latency–quality tradeoff space.

2 TREC CAR

Our second dataset is from the Complex Answer Retrieval (CAR) Track at the 2017 Text Retrieval Conference (TREC), whose aim is to explore passage-level retrieval techniques for simple fact and entity-centric needs Dietz et al. (2017). The primary dataset is synthetically constructed by exploiting the hierarchical structure of Wikipedia: “queries” are constructed by concatenating a Wikipedia article title with the title of one of its sections. The relevant documents are the paragraphs within that section. The corpus consists of cleaned paragraphs from English Wikipedia, except for the abstracts, totaling 29M documents, with an average of 60 words (or 380 characters) per document. The released dataset has five predefined folds, and we use the first four as a training set (approximately 3M queries), and the remaining as a validation set (approximately 700k queries). The test set is the same one used to evaluate the submissions to TREC 2017 CAR (2,254 queries).

Although the TREC 2017 CAR organizers provide manual annotations for the test set, only the top five documents retrieved by systems that submitted to the official evaluation have manual annotations. The sparsity of these judgments means that it is difficult to fairly evaluate runs that did not participate in the original evaluation. Hence, in this paper we evaluate using the automatic annotations, which provide a richer set of judgments. Per the official TREC CAR evaluation, we use Mean Average Precision (MAP) as the evaluation metric.

Both monoBERT and duoBERT are trained in the same manner as for MS MARCO, with the same hyperparameter settings. However, there is an important difference. The official pre-trained BERT models https://github.com/google-research/bert are pre-trained on the full Wikipedia, and therefore they have seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC CAR. Thus, to avoid this leak of test data into training, we pre-train the BERT re-ranker only on the half of Wikipedia used by TREC CAR’s training set, which contains 1.1B words.

For fine-tuning, we generate our query–document pairs by retrieving the top ten documents from the entire TREC CAR corpus using BM25. This means that we end up with 30M example pairs (3M queries $\times$ 10 candidates/query) to train our model. We train it for 100k iterations, or 12.8M examples (100k iterations $\times$ 128 pairs/batch). Similar to the MS MARCO experiments, we did not see any gain on the dev set by training the model longer.

Results

Results on the MS MARCO dataset are shown in Table 1. The first row shows the BM25 baseline provided by Microsoft. Our initial application of BERT to the MS MARCO dataset, denoted by the entry monoBERT (Jan 2019), was published in January 2019 (Nogueira and Cho, 2019). On the evaluation data, it surpassed the previous best entry IRNet (submitted just five days earlier) by nearly eight points. This entry implements what we refer to as monoBERT here, albeit with a few minor differences, explained below. We are, based on official leaderboard records, the first to adapt BERT to the MS MARCO dataset, and to our knowledge, our model represents the first application of BERT to any retrieval task. We further note that every subsequent submission on the MS MARCO leaderboard (as of October 2019) exploits BERT in some capacity (evidenced by “BERT” appearing in every submission name). Given the availability of our source code on GitHub, it is likely that many of these entries are derived from or build on monoBERT, or are at least inspired by our innovation.Unfortunately, we cannot know with absolute certainty because most of the submissions are not paired with associated papers and source code.

Our BM25 baseline with Anserini is shown in the first row of the second block of Table 1; in our multi-stage ranking architecture, this is $R_{0}$ , the output of $H_{0}$ . Although both runs purport to implement BM25, Anserini is two points better than the Microsoft baseline. Our recall at 1000 hits is 85.7%, compared to only 81.5% from Microsoft’s implementation. It is a well-known fact in IR that different systems implementing the same scoring function might report very different results Mühleisen et al. (2014); Lin et al. (2016), owing to details such as tokenization, stopword selection, stemming, and parameter tuning. Thus, the differences between Anserini and the Microsoft baseline are not surprising.

By applying the monoBERT stage $H_{1}$ to the top 1000 ranked list from Anserini ( $H_{0}$ with $k_{0}=1000$ ), we observe a gain of 17.5 points. This result is slightly better than the monoBERT entry from January 2019 because that submission re-ranked the Microsoft baseline (a slightly worse $H_{0}$ , in essence). Other minor differences include a refactored codebase to improve reusability and readability.

Adding the duoBERT stage $H_{2}$ with the Sum aggregation method (Equation 3), denoted duoBERT ${}_{\textsc{Sum}}$ , improves over monoBERT alone by 0.5 points on the held-out evaluation set. In this setting, duoBERT considers the top 50 candidates from $H_{1}$ , and thus requires an additional $50\times 49$ BERT inferences to compute the final ranking (the time required for aggregation is negligible). This improvement in MRR, of course, comes at a cost in increased latency, an issue we explore in more detail below. The entry marked duoBERT ${}_{\textsc{Max}}$ shows that the Max aggregation method (Equation 6) performs quite poorly, and in fact makes monoBERT results worse. We find that the Binary method (Equation 4) performs slightly better (0.1 points) than Sum on the development set. Given these results, we abandon the Max aggregation method in subsequent experiments.

Note that official figures from the held-out evaluation set are not available for all conditions because obtaining those values requires formal submission of runs to the MS MARCO organizers. As good experimental practice, in order to avoid too much “unnecessary probing” of the held-out test data, we only submitted what we felt to be the most promising conditions.

Finally, pre-training on the target corpus (monoBERT + duoBERT ${}_{\textsc{Sum}}$ + TCP) improves MRR@10 by another 0.8 points. This result is in line with recent work that shows improvements with target corpus pre-training over out-of-domain corpus pre-training Beltagy et al. (2019); Raffel et al. (2019).

Results for TREC CAR are presented in Table 2, organized in a similar manner as Table 1. We see similar trends on this dataset. Again, Anserini’s implementation of BM25 leads to 2.3 MAP points improvement over another Lucene-based implementation from Kashyapi et al. (2018). It is also 0.5 MAP points higher than the best entry from TREC 2017 CAR MacAvaney et al. (2017). The monoBERT model gives an impressive jump of 19.5 MAP points over the BM25 baseline and duoBERT ${}_{\textsc{Sum}}$ or duoBERT ${}_{\textsc{Binary}}$ provides another improvement of 2.1 points. To our knowledge, this is the best-known result on this dataset. Note that we do not report target corpus pre-training results on TREC CAR because its target corpus is the same as the original BERT pre-training corpus, i.e., English Wikipedia.

In general, we notice that improvements from our BERT models are larger on TREC CAR than they are on MS MARCO. We believe this is primarily due to the evaluation metric: improvements in MRR@10 are much harder to achieve, since only the first correct answer contributes to the score, while better rankings of all relevant documents improve the MAP score. Additionally, MRR@10 is a highly discrete metric (there are only 11 possible values), and these values are arranged such that large gains in effectiveness are only possible in the early ranks (thus increasing the level of task difficulty).

The experimental results presented above capture monoBERT and duoBERT settings that focus on obtaining the best output quality. Our next set of experiments explore different parameterizations of the multi-stage ranking architecture that realizes different quality–latency tradeoffs.

For monoBERT, the number of candidates $k_{0}$ is the control “knob”: latency increases linearly as we consider more candidates, but effectiveness increases as well. This relationship is shown in Figure 2 for MS MARCO on the left and TREC CAR on the right. To aid in comparisons with duoBERT experiments below, the x-axis shows the number of inferences performed per query, which is exactly the same as $k_{0}$ , since each query–candidate pair from $R_{1}$ serves as an input to monoBERT.

As expected, we see diminishing returns with larger $k_{0}$ values on both datasets. For example, compared to $k_{0}=1000$ , on both datasets we can achieve more than half the gain in effectiveness with only around a fifth of the number of inferences. These curves also highlight the inadequacy of BM25 scores alone, since with a deep candidate list $R_{0}$ , monoBERT is considering documents that have quite low BM25 scores.

2 Tradeoffs with duoBERT

Similar to monoBERT, we can control the latency–quality tradeoff of duoBERT by considering different $k_{1}$ values. In these experiments, $k_{0}$ is fixed at 1000, and in Figure 3 we plot changes in effectiveness (MRR@10 for MS MARCO on the left, MAP for TREC CAR on the right) as a function of latency (inferences/query) for different values of $k_{1}$ . We find that actual inference latencies (measured in milliseconds) for duoBERT and monoBERT are comparable, and so the number of inferences per query provides a natural abstract time unit to support meaningful comparisons.

In the figure, each curve represents an aggregation technique and contains six points that correspond to $k_{1}=\{0,10,20,30,40,50\}$ . The leftmost point, $k_{1}=0$ , corresponds to monoBERT only, which allows us to quantify the additive impact of the duoBERT stage on top monoBERT results. The values for the Sample method represent the average of ten trials.

Of the four aggregation methods compared, Binary yields the highest effectiveness on the MS MARCO dataset, albeit by a small margin over Sum. On TREC CAR, Binary and Sum are very close, although Sum appears to be slightly better, especially at lower cutoffs. The Sample method has a lower effectiveness than Binary and Sum for any fixed number of inferences per query. This result shows that the top- $k_{1}$ candidates from monoBERT are a closer approximation of the true relevance ranking than uniformly sampling from a larger candidate set. This is an interesting result: given the choice of sampling from a larger candidate set or exhaustively enumerating all pairs from a smaller candidate set, the latter option always seems to yield better answers.

Considering these results, it seems that a good operating point is $k_{1}=20$ with Binary aggregation on MS MARCO and Sum aggregation on TREC CAR. In both cases, we obtain close to the maximum achievable score, with only a 40% increase in latency compared to monoBERT only (whereas $k_{1}=50$ , Sum or Binary, more than doubles the number of inferences required over monoBERT).

3 Multi-Stage Tradeoffs

Our next set of experiments quantify the tradeoffs when changing both $k_{0}$ and $k_{1}$ ; results are shown in Figure 4. Since inference times are approximately the same between monoBERT and duoBERT, we can quantify latency by the number of inferences.

On both datasets, the most computationally expensive point in the blue curve ( $k_{0}=1000$ and $k_{1}=50$ ) has a much higher effectiveness than the least expensive point in the red curve ( $k_{0}=50$ and $k_{1}=50$ ). This provides an example that analyzing multiple cutoffs jointly can improve our understanding of the tradeoff space.

4 Qualitative Analyses

Finally, we conduct qualitative analyses by sampling retrieved passages from three methods: BM25, monoBERT, and duoBERT ${{}_{\textsc{Sum}}}$ . A few examples are shown in Table 3. From the first two examples, we can see that BM25 tries to maximize unigram matches between queries and passages, and thus often neglects $n$ -grams, while monoBERT learns to assign a high matching score to $n$ -grams. This also shows an example where a high BM25 score—that comes from repeated instances of query terms—can be misleading. Our monoBERT model, at least in this example, does not appear to be fooled.

From the last two samples in Table 3, we can see that duoBERT matches the synonyms between “low” in the query and “reduced” in the passage, while monoBERT fails to distinguish “low” in the query and “elevated” in the passage.

Future Work and Conclusions

While our work is firmly situated in the context of multi-stage ranking architectures, it makes sense to discuss the broader landscape of applying neural models to document ranking. Search-related tasks, almost by definition, need to consider a large corpus, and thus it is impractical to apply inference over all documents for a given query. This simple fact necessitates reliance on standard “bag of words” techniques to reduce the “working set” that is presented to neural models.

Such a design, however, is inelegant, which has led researchers to explore alternatives that are able to directly perform ranking. The most promising approach is formulated as a representational learning problem, where the task is to learn some non-linear transformation of queries and documents (i.e., using a neural network) such that documents relevant to a query have high similarities in terms of a simple metric such as cosine similarity Henderson et al. (2017); Zamani et al. (2018); Ji et al. (2019). This, in essence, transforms neural ranking into approximate nearest-neighbor search once queries and documents have been mapped into the learned representational space.

While this is indeed a promising approach, and has seen production deployment in limited contexts Henderson et al. (2017), this thread of research is better characterized as exploratory. It is unclear whether representational learning is sufficient to boil the complex notion of relevance down to simple similarity computations—and if it isn’t, the complete end-to-end retrieval architecture will need to involve multiple stages anyway. In contrast, multi-stage ranking architectures are mature, well understood, easy to deploy, and proven in production.

Our future work aims to build the stages of the pipeline jointly, in which hyperparameters are automatically tuned for end-to-end performance. Also, explicitly using scoring signals from previous stages of the pipeline in later stages has the potential to increase overall effectiveness as more information is shared among stages. Lastly, current BERT-based models can only handle documents that are a few sentences long (at the most). Models that can handle longer documents without truncation, such as Yilmaz et al. (2019), should be evaluated on datasets such as the MS MARCO document ranking task. Overall, we believe that multi-stage ranking architectures pave the way to practical deployment of complex and computationally-intensive neural models.

Acknowledgments

RN and KC thank support by NVIDIA and CIFAR and were partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Electronics (Improving Deep Learning using Latent Structure). WY and JL thank support by the Natural Sciences and Engineering Research Council (NSERC) of Canada, with additional computational resources provided by Compute Ontario and Compute Canada.