NPRF: A Neural Pseudo Relevance Feedback Framework for Ad-hoc Information Retrieval

Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, Jungang Xu

Introduction

Recent progress in neural information retrieval models (NIRMs) has highlighted promising performance on the ad-hoc search task. State-of-the-art NIRMs, such as DRMM Guo et al. (2016), HiNT Fan et al. (2018), (Conv)-KNRM Xiong et al. (2017); Dai et al. (2018), and (Co)-PACRR Hui et al. (2017, 2018), have successfully implemented insights from traditional IR models using neural building blocks. Meanwhile, existing IR research has already demonstrated the effectiveness of incorporating relevance signals from top-ranked documents through pseudo relevance feedback (PRF) models Buckley and Robertson (2008); Diaz et al. (2016). PRF models expand the query with terms selected from top-ranked documents, thereby boosting ranking performance by reducing the problem of vocabulary mismatch between the original query and documents Rocchio (1971). Existing neural IR models do not have a mechanism for treating expansion terms differently from the original query terms, however, making it non-trivial to combine them with existing PRF approaches. In addition, neural IR models differ in their architectures, making the development of a widely-applicable PRF approach a challenging task.

To bridge this gap, we propose a generic neural pseudo relevance feedback framework, coined NPRF, that enables the use of PRF with existing neural IR models. Given a query and a target document, the top-ranked documents from the initial ranking are consumed by NPRF, which expands the query by interpreting it from different perspectives. Given a target document to evaluate, NPRF produces a final relevance score by considering the target document’s relevance to these top-ranked documents and to the original query.

The proposed NPRF framework can directly incorporate different established neural IR models, which serve as the concrete scorers in evaluating the relevance of a document relative to the top-ranked documents and to the query, without changing their architectures. We instantiate the NPRF framework using two state-of-the-art neural IR models, and we evaluate their performance on two widely-used TREC benchmark datasets for ad-hoc retrieval. Our results confirm that the NPRF framework can substantially improve the performance of both models. Moreover, both neural models perform similarly inside the NPRF framework despite the fact that without NPRF one model performed substantially worse than the other model. The contributions of this work are threefold: 1) the novel NPRF framework; 2) two instantiations of the NPRF framework using two state-of-the-art neural IR models; and 3) the experiments that confirm the effectiveness of the NPRF framework.

The rest of this paper is organized as follows. Section 2 presents the proposed NPRF framework in details. Following that, Section 3 describes the setup of the evaluation, and reports the results. Finally, Section 4 recaps existing literature, before drawing conclusions in Section 5.

Method

In this section, we introduce the proposed neural framework for pseudo relevance feedback (NPRF). Recall that existing unsupervised PRF models Rocchio (1971); Lavrenko and Croft (2001); Ye et al. (2009) issue a query to obtain an initial ranking, identify promising terms from the top- $m$ documents returned, and expand the original query with these terms. Rather than selecting the expanded terms within the top- $m$ documents, NPRF uses these documents directly as expansion queries by considering the interactions between them and a target document. Thus, each document’s ultimate relevance score depends on both its interactions with the original query and its interactions with these feedback documents.

Given a query $q$ , NPRF estimates the relevance of a target document $d$ relative to $q$ as described in the following steps. The architecture is summarized in Figure 1. Akin to the established neural IR models like DRMM Guo et al. (2016), the description is based on a query-document pair, and a ranking can be produced by sorting the documents according to their scores.

Create initial ranking. Given a document corpus, a ranking method $\mathit{rel_{q}}(q,d)$ is applied to individual documents to obtain the top- $m$ documents, denoted as $D_{q}$ for $q$ .

Extract document interactions. To evaluate the relevance of $d$ , each $d_{q}$ in $D_{q}$ is used to expand $q$ , where $d$ is compared against each $d_{q}$ , using a ranking method $\mathit{rel_{d}}(d_{q},d)$ .

Combine document interactions. The relevance scores $\mathit{rel}_{d}(d_{q},d)$ for individual $d_{q}\in D_{q}$ are further weighted by $\mathit{rel_{q}}(q,d_{q})$ , which serves as an estimator for the confidence of the contribution of $d_{q}$ relative to $q$ . The weighted combination of these relevance scores is used to produce a relevance score for $d$ , denoted as $\mathit{rel_{D}}(q,D_{q},d)$ .

While the same ranking model can be used for both $\mathit{rel_{q}}(.,.)$ and $\mathit{rel_{d}}(.,.)$ , we denote them separately in the architecture. In our experiments, the widely-used unsupervised ranking method BM25 Robertson et al. (1995) serves as $\mathit{rel_{q}}(.,.)$ ; meanwhile two state-of-the-art neural IR relevance matching models, namely, DRMM Guo et al. (2016) and K-NRM Xiong et al. (2017), serve as the ranking method $\mathit{rel_{d}}(.,.)$ . However, it is worth noting that in principle $\mathit{rel_{q}}$ and $\mathit{rel_{d}}$ can be replaced with any ranking method, and the above choices mainly aim to demonstrate the effectiveness of the NPRF framework.

2 Model Architecture

The NPRF framework begins with an initial ranking for the input query $q$ determined by $\mathit{rel_{q}}(.,.)$ , which forms $D_{q}$ , the set of the top- $m$ documents $D_{q}$ . The ultimate query-document relevance score $\mathit{rel_{D}}(q,D_{q},d)$ is computed as follows.

Extracting document interactions. Given the target document $d$ and each feedback document $d_{q}\in D_{q}$ , $\mathit{rel_{d}}(.,.)$ is used to evaluate the relevance between $d$ and $d_{q}$ , resulting in $m$ real-valued relevance scores, where each score corresponds to the estimated relevance of $d$ according to one feedback document $d_{q}$ .

As mentioned, two NIRMs are separately used to compute $\mathit{rel}_{d}(d_{q},d)$ in our experiments. Both models take as input the cosine similarities between each pair of terms in $d_{q}$ and $d$ , which are computed using pre-trained word embeddings as explained in Section 3.1. Given that both models consider only unigram matches and do not consider term dependencies, we first summarize $d_{q}$ by retaining only the top- $k$ terms according to their $\mathit{tf}$ - $idf$ scores, which speeds up training by reducing the document size and removing noisy terms. In our pilot experiments, the use of top- $k$ $\mathit{tf}$ - $idf$ document summarization did not influence performance. For different $d_{q}\in D_{q}$ , the same model is used as $\mathit{rel}_{d}(.,.)$ for different pairs of $(d_{q},d)$ by sharing model weights.

Combining document interactions. When determining the relevance of a target document $d$ , there exist two sources of relevance signals to consider: the target document’s relevance relative to the feedback documents $D_{q}$ and its relevance relative to the query $q$ itself. In this step, we combine $\mathit{rel}_{d}(d_{q},d)$ for each $d_{q}\in D_{q}$ into an overall feedback document relevance score $\mathit{rel_{D}}(q,D_{q},d)$ . When combining the relevance scores, the agreement between $q$ and each $d_{q}$ is also important, since $d_{q}$ may differ from $q$ in terms of information needs. The relevance of $d_{q}$ from the initial ranking $\mathit{rel}_{q}(q,d_{q})$ is employed to quantify this agreement and weight each $\mathit{rel}_{d}(d_{q},d)$ accordingly.

When computing such agreements, it is necessary to remove the influence of the absolute ranges of the scores from the initial ranker. For example, ranking scores from a language model Ponte and Croft (1998) and from BM25 Robertson et al. (1995) can differ substantially in their absolute ranges. To mitigate this, we use a smoothed min-max normalization to rescale $\mathit{rel}_{q}(q,d_{q})$ into the range $[0.5,1]$ . The min-max normalization is applied by considering $\mathit{min}(\mathit{rel}_{q}(q,d_{q})|d_{q}\in D_{q})$ and $\mathit{max}(\mathit{rel}_{q}(q,d_{q})|d_{q}\in D_{q})$ . Hereafter, $\mathit{rel}_{q}(q,d_{q})$ is used to denote this relevance score after min-max normalization for brevity. The (normalized) relevance score is smoothed and then weighted by the relevance evaluation of $d_{q}$ , producing a weighted document relevance score $\mathit{rel_{d}}^{\prime}(d_{q},d)$ for each $d_{q}\in D_{q}$ that reflects the relevance of $d_{q}$ relative to $q$ . This computation is described in the following equation.

As the last step, we propose two variants for combining the $\mathit{rel_{d}}^{\prime}(d_{q},d)$ for different $d_{q}$ into a single score $\mathit{rel_{D}}(q,D_{q},d)$ : (i) performing a direct summation and (ii) using a feed forward network with a hyperbolic tangent ( $\mathit{tanh}$ ) non-linear activation. Namely, the first variant simply sums up the scores, whereas the second takes the ranking positions of individual feedback documents into account.

3 Optimization and Training

Each training sample consists of a query $q$ , a set of $m$ feedback documents $D_{q}$ , a relevant target document $d^{+}$ and a non-relevant target document $d^{-}$ according to the ground truth. The Adam optimizer Kingma and Ba (2014) is used with a learning rate $0.001$ and a batch size of 20. Training normally converges within 30 epochs, with weights uniformly initialized. A hinge loss is employed for training as shown below.

Evaluation

Dataset. We evaluate our proposed NPRF framework on two standard test collections, namely, TREC1-3 Harman (1993) and Robust04 Voorhees (2004). TREC1-3 consists of 741,856 documents with 150 queries used in the TREC 1-3 ad-hoc search tasks Harman (1993, 1994, 1995). Robust04 contains 528,155 documents and 249 queries used in the TREC 2004 Robust track Voorhees (2004). We use those two collections to balance between the number of queries and the TREC pooling depth, i.e., 100 on both collections, allowing for sufficient training data. Manual relevance judgments are available on both collections, where both the relevant and non-relevant documents are labeled for each query.

Two versions of queries are included in our experiments: a short keyword query (title query), and a longer description query that restates the corresponding keyword query’s information need in terms of natural language (description query). We evaluate each type of query separately using the metrics Mean Average Precision at 1,000 (MAP), Precision at 20 (P@20) Manning et al. (2008), and NDCG@20 Järvelin and Kekäläinen (2002).

Preprocessing. Stopword removal and Porter’s stemmer are applied Manning et al. (2008). The word embeddings are pre-trained based on a pool of the top 2,000 documents returned by BM25 for individual queries as suggested by Diaz et al. (2016). The implementation of Word2Vechttps://code.google.com/p/word2vec/ from Mikolov et al. (2013) is employed. In particular, we employ CBOW with the dimension set to 300, window size to 10, minimum count to 5, and a subsampling threshold of $10^{-3}$ . The CBOW model is trained for 10 iterations on the target corpus.

Unsupervised ranking models serve as baselines for comparisons. We use the open source Terrier platform’s Macdonald et al. (2012) implementation of these ranking models:

BM25 Robertson et al. (1995), a classical probabilistic model, is employed as an unsupervised baseline. The hyper-parameters b and $k_{1}$ are tuned by grid search. As mentioned in Sec. 2.1, BM25 also generates the initial rankings $D_{q}$ , serving as $rel_{q}(.,.)$ in the NPRF framework.

On top of BM25, we use an adapted version of Rocchio’s query expansion Ye et al. (2009), denoted as BM25+QE. Note that, as demonstrated in the results, BM25+QE’s performance is comparable with the base neural IR models, including DRMM, K-NRM and PACRR. This illustrates the difficulty in making improvements on the TREC benchmarks through the uses of deep learning methods. The hyper-parameters, including the number of feedback documents and the number of expansion terms, are optimized using grid search on training queries.

In addition, QL+RM3, the query likelihood language model with the popular RM3 PRF Lavrenko and Croft (2001), is used as another unsupervised baseline.

Neural IR models are used for $rel_{d}(.,.)$ . As mentioned in Section 2.1, two unigram neural IR models are employed in our experiments:

DRMM. We employ the variant with the best effectiveness on Robust04 according to Guo et al. (2016), namely, DRMMLCH×IDF with the original configuration.

K-NRM. Due to the lack of training data compared with the commercial data used by Xiong et al. (2017), we employ a K-NRM variant with a frozen word embedding layer. To compensate for this substantial reduction in the number of learnable weights, we add an additional fully connected layer to the model. These changes lead to a small but competitive K-NRM variant, as demonstrated in Hui et al. (2018).

We additionally implement PACRR Hui et al. (2017) for the purpose of performing comparisons, but do not use PACRR to compute $rel_{d}(.,.)$ due to the computational costs. In particular, PACRR-firstk is employed where the first $1,000$ terms are used to compute the similarity matrices, and the original configuration from Hui et al. (2017) is used.

NIRM(QE) uses the modified query generated by the query expansion of BM25+QE Ye et al. (2009) as input to the neural IR model. Both DRMM and K-NRM are used to instantiate NIRM(QE).

Variants of the proposed NPRF approach. As indicated in Section 2.2, NPRF includes two variants that differ in the combination of the relevance scores from different $d_{q}\in D_{q}$ : the variant NPRFff uses a feed forward network with a hidden layer with five neurons to compute $rel(d,D_{q})$ , and the other variant NPRFds performs a direct summation of the different relevance scores. For the purposes of comparison, we additionally introduce another variant coined NPRF $\mathbf{{}_{\mathbf{ff}^{\prime}}}$ , where the relevance of $d_{q}$ to $q$ is not considered in the combination by directly setting $rel_{d}^{\prime}(d,d_{q})=rel_{d}(d,d_{q})$ in place of Equation 1, thereafter combining the scores with a fully connected layer as in NPRFff. We combine each of the three NPRF variants with the DRMM and K-NRM models, and report results for all six variants. Our implementation of the NPRF framework is available to enable future comparisonshttps://github.com/ucasir/NPRF.

Akin to Guo et al. (2016); Xiong et al. (2017); Hui et al. (2017), the NIRM baselines and the proposed NPRF are employed to re-rank the search results from BM25. In particular, the top- $10$ documents from the unsupervised baseline are used as the pseudo relevance feedback documents $D_{q}$ as input for NPRF, where each $d_{q}\in D_{q}$ is represented by its top- $20$ terms with the highest $\mathit{tf}$ - $idf$ weights. As illustrated later in Section 3.3, NPRF’s performance is stable over a wide range of settings for both parameters.

Cross-validation. Akin to Hui et al. (2018), owing to the limited number of labeled data, five-fold cross-validation is used to report the results by randomly splitting all queries into five equal partitions. In each fold, three partitions are used for training, one for validation, and one for testing. The model with the best MAP on the validation set is selected. We report the average performance on all test partitions. A two-tailed paired t-test is used to report the statistical significance at 95% confidence interval.

2 Results

Comparison to BM25. We first compare the proposed NPRF models with the unsupervised BM25. The results are summarized in Tables 1 and 2, where the best result in each column is highlighted in bold. From Tables 1 and 2, it can be seen that the proposed NPRF variants obtain significant improvement relative to BM25 on both test collections with both kinds of test queries. Moreover, the results imply that the use of different query types does not affect the effectiveness of NPRF, which consistently outperforms BM25.

Comparison to neural IR models. NPRF is further compared with different neural IR models, as summarized in Tables 3 & 4. It can be seen that NPRF regularly improves on top of the NIRM baselines. For both types of queries, NPRF-DRMM outperforms DRMM and NPRF-KNRM outperforms K-NRM when re-ranking BM25. Remarkably, the proposed NPRF is able to improve the weaker NIRM baseline. For instance, on Robust04, when using the description queries, DRMM and K-NRM obtain highly different results, with MAPs of $0.2630$ and $0.1687$ after re-ranking the initial results from BM25, respectively. When NPRF is used in conjunction with the NIRM models, however, the gap between the two models is closed; that is, MAP= $0.2801$ for NRFFds-DRMM and MAP= $0.2800$ for NRFFds-KNRM (see Table 4). This finding highlights that our proposed NPRF is robust with respect to the use of the two embedded NIRM models. A possible explanation for the poor performance of K-NRM on two TREC collections is the lack of training data, as suggested in Dai et al. (2018). While K-NRM could be improved by introducing weak supervision Dai et al. (2018), we achieve the same goal by incorporating pseudo relevance feedback information without extra training data.

While the six NPRF variants exhibit similar results across both kinds of queries, NPRFds-DRMM in general achieves the best performance on Robust04, and NPRFds-KNRM appears to be the best variant on TREC1-3. In the meantime, NPRFds outperforms NPRFff variants. One difference between the two methods is that NPRFff considers the position of each $d_{q}$ in the $D_{q}$ ranked documents, whereas NPRFds simply sums up the scores regardless of the positions. The fact that NPRFds performs better suggests that the ranking position within the $D_{q}$ documents may not be a useful signal. In the remainder of this paper, we mainly report on the results obtained by NPRFds.

Comparison to query expansion baselines. In Table 5, the proposed NPRF model is compared with three kinds of query expansion baselines, namely, the unsupervised BM25+QE Ye et al. (2009), QL+RM3 Lavrenko and Croft (2001), and DRMM/K-NRM(QE), the neural IR models using expanded queries as input. According to Table 5, the unsupervised BM25+QE baseline appears to achieve better performance in terms of MAP@1k, owing to its use of query expansion to match relevant documents containing the expansion terms from the whole collection. On the other hand, NPRFds, which reranks the top-1000 documents returned by BM25, outperforms the query expansion baselines in terms of early precision, as measured by either NDCG@20 or P@20. These measures on shallow rankings are particularly important for general IR applications where the quality of the top-ranked results is crucial to the user satisfaction. Moreover, our NPRF outperforms NIRM(QE) in most cases, indicating the benefit brought by wrapping up the feedback information in a document-to-document matching framework as in NPRF, as opposed to directly adding unweighted expansion terms to the query. Recall that, it is not straightforward to incorporate these expanded terms within the existing NIRMs’ architectures because the NIRMs do not distinguish between them and the original query terms.

3 Analysis

Parameter sensitivity. Moreover, we analyze factors that may influence NPRF’s performance. We report results on NPRFds using title queries on Robust04 for the sake of brevity, but similar observations also hold for the other NPRF variants, as well as on TREC1-3. Figure 2 illustrates the sensitivity of NPRF relative to two parameters: the number of feedback documents $m$ within $D_{q}$ and the number of terms $k$ that are used to summarize each $d_{q}\in D_{q}$ . Specifically, Figure 2 shows the performance of NPRFds as the number of feedback documents $m$ varies (top), and as the number of top terms $k$ varies (bottom). The effectiveness of NPRF appears to be stable over a wide range of the parameter configurations, where the proposed model consistently outperforms the BM25 baseline.

Case study. A major advantage of the proposed NPRF over existing neural IR models is that it allows for soft-matching query-related terms that are missing from both the query and the target document. Table 6 presents an illustrative example of soft matching in NPRF. From Table 6, it can be seen that there exist query-related terms in the top- $10$ documents returned by BM25 in the initial ranking. However, since those query-related terms are missing in both the query and the target document, they are not considered in the document-to-query matching and, consequently, the target document is ranked 122nd by BM25 despite the facts that it was judged relevant by a human assessor. In contrast, the NPRF framework allows for the soft-matching of terms that are missing in both the query and target document. As a result, the matching signals for the query terms and query-related terms in the target document are enhanced. This leads to enhanced effectiveness with the target document now ranked in the 5th position.

In summary, the evaluation on two standard TREC test collections shows promising results obtained by our proposed NPRF approach, which outperforms state-of-the-art neural IR models in most cases. Overall, NPRF provides effective retrieval performance that is robust with respect to the two embedded neural models used for encoding the document-to-document interactions, the two kinds of queries with varied length, and wide range of parameter configurations.

Related Work

Recently, several neural IR models (NIRMs) have been proposed to apply deep learning techniques in ad-hoc information retrieval. One of the essential ideas from prior work is to model the document-to-query interaction via neural networks, based on a matrix of document-to-query embedding term similarities, incorporating both the “exact matching” of terms appearing in both the document and query and the “soft matching” of different query and document term pairs that are semantically related.

DSSM, one of the earliest NIRMs proposed in Huang et al. (2013), employs a multi-layer neural network to project queries and document into a common semantic space. The cosine similarity between a query and a document (document title) is used to produce a final relevance score for the query-document pair. CDSSM is a convolutional version of DSSM, which uses the convolutional neural network (CNN) and max-pooling strategy to extract semantic matching features at the sentence level Shen et al. (2014). Pang et al. (2016) also employ a CNN to construct the MatchPyramid model, which learns hierarchical matching patterns between local interactions of document-query pair. Guo et al. (2016) argue that both DSSM and CDSSM are representation-focused models, and thus are better suited to capturing semantic matching than relevance matching (i.e., lexical matching), and propose the interaction-focused relevance model named DRMM. DRMM maps the local interactions between a query-document pair into a fixed-length histogram, from which the exact matching signals are distinguished from the other matching signals. These signals are fed into a feed forward network and a term gating network to produce global relevance scores. Similar to DRMM, K-NRM Xiong et al. (2017) builds its model on top of a matrix of local interaction signals, and utilizes multiple Gaussian kernels to obtain multi-level exact/soft matching features that are input into a ranking layer to produce the final ranking score. K-NRM is later improved by Conv-KNRM, which employs CNN filters to capture $n$ -gram representations of queries and documents Dai et al. (2018). DeepRank Pang et al. (2017) models the relevance generation process by identifying query-centric contexts, processing them with a CNN or LSTM, and aggregating them to produce a final relevance score. Building upon DeepRank, Fan et al. (2018) propose to model diverse relevance patterns by a data-driven method to allow relevance signals at different granularities to compete with each other for the final relevance assessment.

Duet Mitra et al. (2017) employs two separate deep neural networks to build a relevance ranking model, in which a local model estimates the relevance score according to exact matches between query and document terms, and a distributed model estimates relevance by learning dense lower-dimensional representations of query and document text. Zamani et al. (2018) extends the Duet model by considering different fields within a document.

Hui et al. (2017) propose the PACRR model based on the idea that an appropriate combination of convolutional kernels and pooling operations can be used to successfully identify both unigram and n-gram query matches. PACRR is later improved upon by Co-PACRR, a context-aware variant that takes the local and global context of matching signals into account through the use of three new components Hui et al. (2018).

Ran et al. (2017) propose a document-based neural relevance model that utilizes complemented medical records to address the mismatch problem in clinical decision support. Nogueira and Cho (2017) propose a reinforcement learning approach to reformulating a task-specific query. Li et al. (2018) propose DAZER, a CNN-based neural model upon interactions between seed words and words in a document for zero-shot document filtering with adversarial learning. Ai et al. (2018) propose to refine document ranking by learning a deep listwise context model.

In summary, most existing neural IR models are based on query-document interaction signals and do not provide a mechanism for incorporating relevance feedback information. This work proposes an approach for incorporating relevance feedback information by embedding neural IR models within a neural pseudo relevance feedback framework, where the models consume feedback information via document-to-document interactions.

Conclusions

In this work we proposed a neural pseudo relevance feedback framework (NPRF) for incorporating relevance feedback information into existing neural IR models (NIRM). The NPRF framework uses feedback documents to better estimate relevance scores by considering individual feedback documents as different interpretations of the user’s information need. On two standard TREC datasets, NPRF significantly improves the performance of two state-of-the-art NIRMs. Furthermore, NPRF was able to improve their performance across two kinds of query tested (namely, short queries and the verbal queries in natural language). Finally, our analysis demonstrated the robustness of the NPRF framework over different parameter configurations.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (61433015/61472391), and the Beijing Natural Science Foundation under Grant No. (4162067/4142050).