Simple Applications of BERT for Ad Hoc Document Retrieval

Wei Yang, Haotian Zhang, Jimmy Lin

Introduction

The dominant approach to ad hoc document retrieval using neural networks today is to deploy the neural model as a reranker over an initial list of candidate documents retrieved using a standard bag-of-words term-matching technique. Researchers have proposed many neural ranking models Mitra and Craswell (2019), but there has recently been some skepticism about whether they have truly advanced the state of the art Lin (2018), at least in the absence of large amounts of log data only available to a few organizations.

One important recent innovation is the use of neural models that make heavy use of pretraining Peters et al. (2018); Radford et al. (2018), culminating in BERT Devlin et al. (2018), the most popular example of this approach today. Researchers have applied BERT to a broad range of NLP tasks and reported impressive gains. Most relevant to document retrieval, BERTserini Yang et al. (2019) integrates passage retrieval using the open-source Anserini IR toolkit with a BERT-based reader to achieve large gains over the previous state of the art in identifying answer spans from a large Wikipedia corpus.

Given the successes in applying BERT to question answering and the similarities between QA and document retrieval, we naturally wondered: Would it be possible to apply BERT to improve document retrieval as well? In short, the answer is yes. Adapting BERT for document retrieval requires overcoming the challenges associated with long documents, both during training and inference. We present a simple yet effective approach, based on the same BERTserini framework, that applies inference over individual sentences in a document and then combines sentence scores into document scores.

Our approach is evaluated on standard ad hoc retrieval test collections from the TREC Microblog Tracks (2011–2014) and the TREC 2004 Robust Track. We report the highest average precision on these datasets for neural approaches that we are aware of. The contribution of our work is, to our knowledge, the first successful application of BERT to ad hoc document retrieval, yielding state-of-the-art results.

Background and Related Work

In ad hoc document retrieval, the system is given a short query $q$ and the task is to produce the best ranking of documents in a corpus, according to some standard metric such as average precision (AP). Mitra and Craswell (2019) provide a recent overview of many neural ranking models for this task, to which we refer interested readers in lieu of a detailed literature review due to space considerations.

However, there are aspects of the task worth discussing. Researchers have understood for a few years now that relevance matching and semantic matching (for example, paraphrase detection, natural language inference, etc.) are different tasks, despite sharing some characteristics Guo et al. (2016). The first task has a heavier dependence on exact match (i.e., “one-hot”) signals, whereas the second task generally requires models to capture semantics more accurately. Question answering has elements of both, but nevertheless remains a different task from document retrieval. Due to these task differences, neural models for document ranking, for example, DRMM Guo et al. (2016), are quite different architecturally from neural models for capturing similarity; see, for example, the survey of Lan and Xu (2018).

Another salient fact is that documents can be longer than the length of input texts that BERT was designed for. This creates a problem during training because relevance judgments are annotations on documents, not on individual sentences or passages. Typically, within a relevant document, only a few passages are relevant, but such fine-grained annotations are not available in most test collections. Thus, it is unclear how exactly one would fine-tune BERT given (only) existing document-level relevance judgments. In this paper, we sidestep the training challenge completely and present a simple approach to aggregating sentence-level scores during inference.

Searching Social Media Posts

Despite the task mismatch between QA and ad hoc document retrieval, our working hypothesis is that BERT can be fine-tuned to capture relevance matching, as long as we can provide appropriate training data. To begin, we tackled microblog retrieval (searching short social media posts), where document length does not pose an issue. Fortunately, test collections from the TREC Microblog Tracks Lin et al. (2014), from 2011 to 2014, provide data for exactly this task.

As with BERTserini, we adopted a simple architecture that uses the Anserini IR toolkit (http://anserini.io/) for initial retrieval, followed by inference using a BERT model. Building on best practice, query likelihood (QL) with RM3 relevance feedback Abdul-Jaleel et al. (2004) provides the initial ranking to depth 1000. The texts of the retrieved documents (posts) are then fed into a BERT classifier, and the BERT scores are combined with the retrieval scores via linear interpolation. We used the BERT-Base model (uncased, 12-layer, 768-hidden, 12-heads, 110M parameters) described in Devlin et al. (2018). As input, we concatenated the query $Q$ and the document $D$ into a text sequence [[CLS], $Q$, [SEP], $D$, [SEP]], and then padded each text sequence in a mini-batch to $N$ tokens, where $N$ is the maximum length in the batch. Following Nogueira and Cho (2019), BERT is used for binary classification (i.e., relevance) by taking the [CLS] vector as input to a single-layer neural network.
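The input construction and score combination described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, tokens are plain strings rather than WordPiece ids, and the placement of the interpolation weight on the retrieval score is an assumption.

```python
from typing import List

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def build_pair(query_tokens: List[str], doc_tokens: List[str]) -> List[str]:
    """Concatenate query Q and document D as [CLS] Q [SEP] D [SEP]."""
    return [CLS] + query_tokens + [SEP] + doc_tokens + [SEP]


def pad_batch(sequences: List[List[str]]) -> List[List[str]]:
    """Pad each sequence to N tokens, where N is the longest length in the batch."""
    n = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (n - len(seq)) for seq in sequences]


def interpolate(retrieval_score: float, bert_score: float, weight: float) -> float:
    """Linear interpolation; `weight` on the retrieval score is an assumed convention."""
    return weight * retrieval_score + (1.0 - weight) * bert_score


# Toy mini-batch: one query paired with two posts of different lengths.
batch = pad_batch([
    build_pair(["bbc", "cuts"], ["bbc", "announces", "staff", "cuts"]),
    build_pair(["bbc", "cuts"], ["weather", "today"]),
])
```

Padding to the batch maximum rather than to a global maximum keeps mini-batches of short posts compact.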

Test collections from the TREC Microblog Tracks were used for fine-tuning the BERT model, using cross-entropy loss. For evaluation on each year’s dataset, we used the remaining years for fine-tuning, e.g., tuning on 2011–2013 data, testing on 2014 data. From the training data, we sampled 10% for validation. We fine-tuned BERT with a learning rate of $3\times 10^{-6}$ for 10 epochs. The interpolation weight between the BERT scores and the retrieval scores was tuned on the validation data. As training examples, we used only the social media posts that appear in our initial ranking (as opposed to all available relevance judgments). There are a total of 225 topics (50, 60, 60, 55) in the four datasets, which, at a ranking depth of 1000, yields 225,000 examples (unjudged posts are treated as not relevant).
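The leave-one-year-out protocol with a 10% validation sample can be sketched as follows. The function name, the fixed random seed, and sampling the validation set by a simple shuffle are assumptions for illustration; the toy data uses 10 examples per topic instead of 1000 to keep it small.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[int, int]  # (year, example index); a placeholder for real training examples


def leave_one_year_out(
    examples_by_year: Dict[int, List[Example]],
    test_year: int,
    val_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[Example], List[Example], List[Example]]:
    """Return (train, validation, test): fine-tune on all other years, test on `test_year`."""
    pool = [ex for year in sorted(examples_by_year)
            if year != test_year
            for ex in examples_by_year[year]]
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is repeatable
    rng.shuffle(pool)
    n_val = round(len(pool) * val_fraction)
    return pool[n_val:], pool[:n_val], list(examples_by_year[test_year])


# Topic counts per year as reported in the text: 50, 60, 60, 55.
topics = {2011: 50, 2012: 60, 2013: 60, 2014: 55}
per_topic = 10  # the text uses depth 1000; scaled down here
data = {year: [(year, i) for i in range(n * per_topic)] for year, n in topics.items()}

train, val, test = leave_one_year_out(data, test_year=2014)
total_examples_at_depth_1000 = sum(topics.values()) * 1000
```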

Experimental results are shown in Table 1, where we present average precision (AP) and precision at rank 30 (P30), the two official metrics of the evaluation Ounis et al. (2011). The first two blocks of the table are copied from Rao et al. (2019), who compared bag-of-words baselines (QL and RM3) to several popular neural ranking models as well as MP-HCNN, the model they introduced. Results for all the neural models include interpolation with the original document scores. Rao et al. (2019) demonstrated that previous neural models are not suitable for ranking short social media posts, and are no better than the RM3 baseline in many cases. In contrast, MP-HCNN was explicitly designed with characteristics of tweets in mind: it significantly outperforms previous neural ranking models (see original paper for comparisons, not repeated here). We also copied results from Shi et al. (2018), who reported even higher effectiveness than MP-HCNN.
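For reference, the two metrics can be computed per topic as in the following minimal sketch. It assumes binary relevance and a ranking without duplicate document ids; official evaluations typically rely on a standard tool, and the function names here are hypothetical.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents for the topic."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def precision_at_k(ranking: List[str], relevant: Set[str], k: int = 30) -> float:
    """Fraction of the top k positions occupied by relevant documents (P30 when k = 30)."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}  # d9 is relevant but never retrieved
ap = average_precision(ranking, relevant)   # (1/1 + 2/3) / 3
p2 = precision_at_k(ranking, relevant, k=2)
```

Relevant documents that are never retrieved still count in the AP denominator, which is why retrieval depth matters for this metric.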

These results represent, to our knowledge, the most comprehensive summary of search effectiveness measured on the TREC Microblog datasets. Note that for these comparisons we leave aside many non-neural approaches that take advantage of learning-to-rank techniques over manually-engineered features, as we do not believe they form a fair basis of comparison. In general, such approaches also take advantage of non-textual features (e.g., social signals), and these additional signals (naturally) allow them to beat approaches that use only the text of the social media posts (like all the models discussed here).

The final row of Table 1 reports results using our simple BERT-based technique, showing quite substantial and consistent improvements over previous results. Since we have directly copied results from previous papers, we did not conduct significance tests.

Searching Newswire Articles

Results on the microblog test collections confirm our working hypothesis that BERT can be fine-tuned to capture document relevance, at least for short social media posts. In other words, task differences between QA and document retrieval do not appear to hinder BERT’s adaptability. Having demonstrated this, we turn our attention to longer documents. For this, we take advantage of the test collection from the TREC 2004 Robust Track Voorhees (2004), which comprises 250 topics over a newswire corpus. We selected this collection for a couple of reasons: it is the largest newswire collection we know of in terms of training data, and Lin (2018) provides well-tuned baselines that support fair comparisons to recent neural ranking models.

Given the success of BERT on microblogs, one simple idea is to apply inference over each sentence in a candidate document, select the one with the highest score, and then combine that with the original document score (with linear interpolation). One rationale for this approach comes from Zhang et al. (2018b, a), who found that the “best” sentence or paragraph in a document provides a good proxy for document relevance. This is also consistent with a long thread of work in information retrieval that leverages passage retrieval techniques for document ranking (Callan, 1994; Clarke et al., 2000; Liu and Croft, 2002).

Generalizing, we could consider the top $n$ scoring sentences as follows:

$$S_{f} = a\cdot S_{\textrm{doc}} + (1-a)\cdot\sum_{i=1}^{n} w_{i}\cdot S_{i}$$

where $S_{f}$ is the final document score, $S_{\textrm{doc}}$ is the original document score, and $S_{i}$ is the score of the $i$-th top scoring sentence according to BERT. The hyperparameters $a$ and $w_{i}$ can be tuned via cross-validation.
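A minimal sketch of this aggregation, assuming the interpolation takes the form $a\cdot S_{\textrm{doc}} + (1-a)\cdot\sum_{i} w_{i}\cdot S_{i}$ over the top $n$ sentence scores. The function name and argument layout are hypothetical.

```python
from typing import List


def document_score(doc_score: float, sentence_scores: List[float],
                   a: float, weights: List[float]) -> float:
    """Combine the original document score with the top-n BERT sentence scores.

    `weights` holds w_1..w_n; n is taken from its length. If the document has
    fewer than n sentences, only the available scores contribute.
    """
    top = sorted(sentence_scores, reverse=True)[:len(weights)]
    return a * doc_score + (1.0 - a) * sum(w * s for w, s in zip(weights, top))


# Two-sentence condition (n = 2) on a toy document with three sentence scores.
score = document_score(doc_score=2.0, sentence_scores=[0.1, 0.9, 0.5],
                       a=0.5, weights=[1.0, 0.5])
# 0.5 * 2.0 + 0.5 * (1.0 * 0.9 + 0.5 * 0.5)
```

With `weights=[1.0]` this reduces to the single best-sentence variant described earlier.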

Sentence-level inference seems like a reasonable initial attempt at adapting BERT to document retrieval, but what about fine-tuning? As previously discussed, the issue is that we lack sentence-level relevance judgments. Since our efforts represent an initial exploration, we simply sidestep this challenge (for now) and fine-tune on existing sentence-level datasets. Specifically, we used: (1) the microblog data from the previous section and (2) the union of the TrecQA Yao et al. (2013) and WikiQA Yang et al. (2015) datasets. This sets up an interesting contrast: the first dataset captures the document retrieval task but on a different domain, while the second dataset captures a different task but on corpora that are much closer to newswire. It is an empirical question as to which source is more effective.

To support a fair comparison, we adopted the same experimental procedure as Lin (2018). He described two separate data conditions: one based on two-fold cross-validation to compare against “Paper 1” and one based on five-fold cross-validation to compare against “Paper 2”. (Since Lin’s article is critical of neural methods, he anonymized the neural approaches but mentioned that they come from articles published in late 2018 and are representative of the most recent advances in neural approaches to document retrieval.) The exact fold settings are provided online, which ensures a fair comparison (https://github.com/castorini/Anserini/blob/master/docs/experiments-forum2018.md). In our implementation, documents are first cleaned by stripping all tags and then segmenting the text into sentences using NLTK. If the input to BERT is longer than 512 tokens (BERT’s maximum limit), we further split sentences into fixed-size chunks. Across the 250 topics, each document averages 43 sentences, with 27 tokens per sentence.
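The preprocessing steps above can be sketched as follows. To stay self-contained, this sketch substitutes a simple regular-expression splitter for NLTK's sentence tokenizer and counts whitespace-separated tokens rather than WordPiece tokens; both are simplifying assumptions, as is applying the length limit to the sentence alone rather than to the full query-plus-sentence input.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Strip tags, normalize whitespace, and split on sentence-final punctuation.
    (A stand-in for NLTK sentence segmentation.)"""
    stripped = re.sub(r"<[^>]+>", " ", text)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]


def chunk_tokens(tokens: List[str], max_len: int) -> List[List[str]]:
    """Split a token list into consecutive fixed-size chunks of at most `max_len`."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def segment(text: str, max_len: int = 512) -> List[str]:
    """Return sentence-level inputs, chunking any sentence longer than `max_len`."""
    segments = []
    for sentence in split_sentences(text):
        for chunk in chunk_tokens(sentence.split(), max_len):
            segments.append(" ".join(chunk))
    return segments


# A tiny limit (3 tokens) makes the chunking visible on a toy document.
parts = segment("<p>First sentence. Second one is longer!</p>", max_len=3)
```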

In our experiments, we considered up to the top four sentences. For up to three sentences, $a$, $w_{2}$, and $w_{3}$ are tuned via exhaustive grid search with step size $0.1$, with $w_{1}=1$ fixed. In the four-sentence condition, to reduce the search space, we started with the best three-sentence parameters and explored $w_{4}$ with step size $0.1$, along with neighboring regions in $a$, $w_{2}$, and $w_{3}$. We selected the parameters with the highest AP score on the training folds.
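The three-sentence grid search can be sketched as follows. The search range of $[0, 1]$ for each parameter is an assumption made for this illustration, and `evaluate` is a hypothetical callback that would return AP on the training folds for a given setting; a toy objective stands in for it here.

```python
from itertools import product
from typing import Callable, List, Tuple

Params = Tuple[float, Tuple[float, float, float]]  # (a, (w1, w2, w3))


def grid(step: float = 0.1, lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Evenly spaced grid values, rounded to avoid floating-point drift."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 10) for i in range(n + 1)]


def grid_search(evaluate: Callable[[float, Tuple[float, float, float]], float],
                step: float = 0.1) -> Tuple[Params, float]:
    """Exhaustively search a, w2, w3 with w1 fixed at 1; keep the highest-scoring setting."""
    best_params, best_ap = None, float("-inf")
    for a, w2, w3 in product(grid(step), repeat=3):
        weights = (1.0, w2, w3)
        ap = evaluate(a, weights)
        if ap > best_ap:
            best_params, best_ap = (a, weights), ap
    return best_params, best_ap


def toy_evaluate(a: float, w: Tuple[float, float, float]) -> float:
    """Toy objective peaking at a = 0.3, w2 = 0.2, w3 = 0.0 (not real AP)."""
    return -((a - 0.3) ** 2 + (w[1] - 0.2) ** 2 + (w[2] - 0.0) ** 2)


best, _ = grid_search(toy_evaluate)
```

At step size $0.1$ over this assumed range, the search evaluates $11^3 = 1331$ settings per fold, which is small enough to run exhaustively.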

Results of our experiments are shown in Table 2, divided into two blocks: Paper 1 on the top and Paper 2 on the bottom. The effectiveness figures for the two papers are copied directly from Lin (2018); all other results are our own runs. The paper aggregation site “Papers With Code” places Lin’s result as the state of the art on Robust04 as of this writing (https://paperswithcode.com/sota/ad-hoc-information-retrieval-trec-robust). As a point of comparison, in the most recent survey of neural ranking models by Guo et al. (2019), the best AP on Robust04 is in the $0.29$ range, consistent with the above site. Therefore, we are quite confident that we are evaluating against competitive models.

In the results table, “FT” indicates the dataset used for fine-tuning and $n$S indicates inference using the top $n$ scoring sentences of the document. We find that the learned $w_{4}$ value is zero, indicating that additional sentences do not help beyond the top three (at least according to our tuning procedure); thus, 4S results are omitted from the table. Interestingly, we find that fine-tuning BERT on microblog data is more effective than fine-tuning on QA data, suggesting that task (QA vs. relevance matching) is more important than document genre (tweets vs. newswire).

Cognizant of the potential dangers of repeated hypothesis testing, we probed the statistical significance of one five-fold setting, BM25+RM3 vs. “3S: BERT FT (Microblog)”. According to a paired $t$-test, the differences are statistically significant ($p<10^{-7}$).
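A paired $t$-test of this kind compares per-topic AP values for the two systems. The following minimal sketch computes only the test statistic; the per-topic values shown are made up for illustration and are not results from the experiments, and the $p$-value would then be obtained from a $t$ distribution with $n-1$ degrees of freedom.

```python
import math
from typing import List


def paired_t_statistic(x: List[float], y: List[float]) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)), where d holds the per-topic differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)


# Hypothetical per-topic AP for a reranker and a baseline on four topics.
reranked = [0.30, 0.45, 0.50, 0.20]
baseline = [0.25, 0.40, 0.40, 0.10]
t = paired_t_statistic(reranked, baseline)
```

Pairing by topic removes between-topic variance, which is usually much larger than the difference between systems.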

In summary, we see that a well-tuned BM25+RM3 baseline already outperforms neural ranking approaches (which was Lin’s original point). Our simple BERT-based reranker yields further significant improvements.

Conclusions

In this preliminary study, we have adapted BERT for document retrieval in the most obvious manner, via sentence-level inference and simple score aggregation. Results show substantial improvements in ranking both social media posts and newswire documents: to our knowledge, these are the highest AP scores reported on the TREC Microblog and Robust04 datasets for neural approaches (although the literature does report non-neural approaches that are even better, for both tasks). We readily concede that our techniques are quite simple and that there are many obvious next steps. In particular, we simply sidestepped the issue of not having sentence-level relevance judgments, although there are some obvious distant supervision techniques to “project” relevance labels down to the sentence level that should be explored. We are actively pursuing these and other directions.

Acknowledgments