Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need

Zhengyi Ma, Zhicheng Dou, Wei Xu, Xinyu Zhang, Hao Jiang, Zhao Cao, Ji-Rong Wen

Introduction

Recent years have witnessed the great success of pre-trained language representation models in natural language processing (NLP) (Devlin et al., 2019; Radford et al., 2018a, b; Clark et al., 2020). Pre-trained on large-scale unlabeled text corpora and fine-tuned on limited supervised data, these models have achieved state-of-the-art performance on many downstream NLP tasks (Sutskever et al., 2014; Socher et al., 2013; Sang and Meulder, 2003). Their success has also attracted increasing attention in the IR community (Nogueira et al., 2019; Chang et al., 2020; Ma et al., 2021a; Yang et al., 2019b; Xiong et al., 2020). For example, many researchers have begun to explore the use of pre-trained language models for ad-hoc retrieval, which is one of the most fundamental tasks in IR. The task aims to return the most relevant documents for a given query based solely on query-document relevance. Studies have shown that fine-tuning a ranking model initialized from an existing pre-trained model on limited relevance judgment data achieves better retrieval effectiveness (Nogueira et al., 2019; Nogueira and Cho, 2019; Gao et al., 2021; Yang et al., 2019b).

Although fine-tuning ranking models on top of pre-trained language models has been shown to be effective, pre-training objectives tailored for IR are far from well explored. Recently, there have been some preliminary studies in this direction (Chang et al., 2020; Ma et al., 2021a). For example, Ma et al. (2021a) proposed to sample word sets from documents as pseudo queries based on the query likelihood, and to use these word sets to simulate query-document relevance matching. Different from existing studies, in this work, we propose to leverage the correlations and supervised signals brought by hyperlinks and anchor texts, and design four novel pre-training objectives to learn the correlations between queries and documents for ad-hoc retrieval.

Hyperlinks are essential for web documents, helping users navigate from one page to another. Humans usually select some reasonable and representative terms as the anchor text to describe and summarize the destination page. We propose to leverage hyperlinks and anchor texts for IR-oriented pre-training, because: (1) Since anchor texts are usually short and descriptive, based on the classical anchor intuition, anchor texts share similar characteristics with web queries, and the anchor-document relations approximate relevance matches between queries and documents (West et al., 2015; Dou et al., 2009; Zhang et al., 2020; Dai and Davison, 2010; Yi and Allan, 2010). For example, as shown in Figure 1, the anchor text “MacBook Pro” is a reasonable query for its own introductory page. (2) Anchor texts are created and filtered by web masters (i.e., humans), rather than generated automatically by a specific model. Thus, they can provide a more accurate and reliable summary of a page, which in turn brings stronger supervised signals for pre-training. Besides, they can reflect users’ information needs, and help to model the matching between user needs and documents. (3) Anchor texts can bring terms that are not in the destination page, while existing methods mostly use the document terms for describing the document. In this way, the model can use more abundant information for capturing semantics and measuring relevance. (4) Hyperlinks widely exist on web pages and are cost-efficient to collect, which can provide large-scale training data for pre-training models. In summary, hyperlinks are appropriate for pre-training tailored for IR, and easy to obtain.

However, directly building anchor-document pairs to simulate query-document relevance matching may hurt the accuracy of neural retrieval models, since hyperlinks contain noise and even spam (Zhang et al., 2020; Dehghani et al., 2017). Besides, the semantics of short anchor texts could be insufficient. For example, as shown in Figure 1, the single term “Apple” is not a suitable query for the page of “apple company”, since “Apple” can also refer to pages about “apple fruit”. However, by considering the whole sentence containing the anchor “Apple”, we could build more informative queries to describe the page, such as “Apple technology”. This indicates that we should leverage the context semantics around the anchor texts for building more accurate anchor-based pre-training data.

Based on the above observation, we propose a pre-training framework HARP, which focuses on designing Pre-training objectives for ad-hoc Retrieval with Anchor texts and Hyperlinks. Inspired by the self-attentive retrieval architecture (Nogueira and Cho, 2019), we propose to first pre-train the language representation model with supervised signals brought by hyperlinks and anchor texts, and then fine-tune the model parameters on downstream ad-hoc retrieval tasks. The major novelty lies in the pre-training stage. In particular, we carefully devise four self-supervised pre-training objectives for capturing the anchor-document relevance in different views: representative query prediction, query disambiguation modeling, representative document prediction, and anchor co-occurrence modeling. Based on the four tasks, we can build a large number of pair-wise query-document training samples from hyperlinks and anchor texts. Then, we pre-train the Transformer model to predict the pairwise preference jointly with the Masked Language Model (MLM) objective. With this pre-training method, HARP can effectively fuse the anchor-document relevance signals, and learn context-aware language representations. Besides, HARP is able to characterize different situations of ad-hoc retrieval during the pre-training process in a general way. Finally, we fine-tune the learned Transformer model on downstream ad-hoc retrieval tasks to evaluate the performance.

We pre-train the HARP model on English Wikipedia, which contains tens of millions of well-formed wiki articles and hyperlinks. At the fine-tuning stage, we use a ranking model with the same architecture as the pre-trained model. We use the parameters of the pre-trained model to initialize the ranking model, and fine-tune the ranking model on two publicly available ad-hoc retrieval datasets, MS MARCO Document Ranking and TREC DL. Experimental results show that HARP achieves state-of-the-art performance compared to a number of competitive methods.

Our contributions are three-fold: (1) We introduce hyperlinks and anchor texts into pre-training for ad-hoc retrieval. By this means, our method can leverage the supervised signal brought by anchor-document relevance, which is more accurate and reliable than the signals used by existing methods based on specific sampling algorithms. (2) We design four self-supervised pre-training objectives, including representative query prediction, query disambiguation modeling, representative document prediction, and anchor co-occurrence modeling, to pre-train the Transformer model. In this way, we are able to simulate query-document matching at the pre-training stage, and capture the relevance matches in different views. (3) We leverage the context semantics around the anchors instead of using the anchor-document relevance directly. This helps to build more accurate pseudo queries, and further enhances the relevance estimation of the pre-trained model.

Related Work

In recent years, pre-trained language models with deep neural networks have dominated across a wide range of NLP tasks (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019a; Zhu et al., 2021). They are first pre-trained on a large-scale unlabeled corpus, and then fine-tuned on downstream tasks with limited data. With its strong ability to aggregate context, the Transformer (Vaswani et al., 2017) has become the mainstream module of these pre-trained models. Some researchers first tried to design generative pre-training language models based on the uni-directional Transformer (Radford et al., 2018a; Yang et al., 2019a; Radford et al., 2018b). To model the bi-directional context, Devlin et al. (2019) pre-trained BERT, a large-scale bi-directional Transformer encoder, to obtain contextual language representations. Following BERT, many pre-training methods have achieved encouraging performance, such as robust optimization (Liu et al., 2019), parameter reduction (Lan et al., 2020), discriminative training (Clark et al., 2020), and knowledge incorporation (Sun et al., 2020; Zhang et al., 2019). Inspired by the powerful capacity of BERT for modeling language representations, the IR community has also explored applying pre-trained models to better measure information relevance. By concatenating the query and document with special tokens and feeding them into BERT, many methods have achieved strong performance by fine-tuning BERT (Nogueira et al., 2019; Nogueira and Cho, 2019; Qiao et al., 2019; Dai and Callan, 2019; Yang et al., 2019b; Wei et al., 2020; Gao et al., 2021; Su et al., 2021).

2.2. Pre-training Objectives for IR

Although fine-tuning the downstream IR tasks with pre-trained models has achieved promising results, designing a suitable pre-training objective for ad-hoc retrieval has not been well explored. There have been several successful pre-training tasks for NLP, such as masked language modeling (Taylor, 1953; Devlin et al., 2019), next sentence prediction (Devlin et al., 2019), permutation language modeling (Yang et al., 2019a) and replaced token detection (Clark et al., 2020). However, they are designed to model the general contextual dependency or sentence coherence, not the relevance between query-document pairs. A good pre-training task should be relevant to the downstream task for better fine-tuning performance (Chang et al., 2020). Some researchers proposed to pre-train on a large-scale corpus with Inverse Cloze Task (ICT) for passage retrieval, where a passage is treated as the document and its inner sentences are treated as queries (Lee et al., 2019; Chang et al., 2020). Chang et al. (2020) also designed Body First Selection and Wiki Link Prediction to capture the inner-page and inter-page semantic relation. Ma et al. (2021a) proposed Representative Words Prediction (ROP) task for pre-training in a pair-wise way. They assumed that the sampled word set with higher query likelihood is a more “representative” query. Then, they train the Transformer encoder to predict pairwise scores between two sampled word sets, and achieve state-of-the-art performance.

Different from the above approaches, we propose using the correlations brought by hyperlinks and anchor texts as the supervised signals for the pre-trained language model. Hyperlinks and anchor texts have been used in various existing IR studies, including ad-hoc retrieval (Zhang et al., 2020), query refinement (Kraft and Zien, 2004), document expansion (Dou et al., 2009), and query suggestion (Dang and Croft, 2010). However, none of them consider using hyperlinks to design the pre-training objectives for IR. Since hyperlinks widely exist in web documents and can bring complementary descriptions of the target documents, we believe they can bring stronger and more reliable supervised signals for pre-training, which further improve downstream ad-hoc retrieval performance.

Methodology

The key idea of our approach is to leverage the hyperlinks and anchor texts for designing better pre-training objectives tailored for ad-hoc retrieval, and further improve the ranking quality of the pre-trained language model. To achieve this, we design a framework HARP. As shown in Figure 2, the framework of HARP can be divided into two stages: (1) pre-training stage and (2) fine-tuning stage. In the first stage, we design four pre-training tasks to build the pseudo query-document pairs from the raw corpus with hyperlinks, then pre-train the Transformer model with the four pre-training objectives jointly with the MLM objective. In the second stage, we use the pre-trained model of the first stage to initialize the ranking model, then fine-tune it on the limited retrieval data for proving the effectiveness of our pre-trained model.

In this section, we first provide an overview of our proposed model HARP in Section 3.1, consisting of two stages of pre-training and fine-tuning. Then, we will give the details of the pre-training stage in Section 3.2, and the fine-tuning stage in Section 3.3.

We briefly introduce the two-stage framework of our proposed HARP as follows.

As shown in Figure 2, in the pre-training stage, we pre-train the Transformer model to learn the query-document relevance based on the hyperlinks and anchor texts. Thus, the input of this stage is the large-scale raw corpus $\mathcal{C}$ containing hyperlinks, and the output is the pre-trained Transformer model $\mathcal{M}$ . To achieve this, we design four pre-training tasks to capture different views of the anchor-document relevance, generate pseudo query-document pairs to simulate the downstream ad-hoc retrieval task, and pre-train the Transformer model toward these four objectives jointly with the MLM objective. After the offline pre-training on the raw corpus, the Transformer model can learn the query-document matching from the pseudo query-document pairs based on hyperlinks. Thus, it can achieve better performance when applied to the fine-tuning stage.

Based on the above assumptions, we formulate the pre-training stage as follows: Suppose that in a large corpus $\mathcal{C}$ (e.g., Wikipedia), we can obtain many textual sentences. We denote one sentence as $S=(w_{1},w_{2},\cdots,w_{n})$ , where $w_{i}$ is the $i$ -th word in $S$ . In sentence $S$ , some words are anchor texts within hyperlinks. We use $A=((a_{1},P_{1}),(a_{2},P_{2}),\cdots,(a_{m},P_{m}))$ to denote the set of anchor texts in sentence $S$ , where $a_{i}$ denotes the $i$ -th anchor text in the sentence, and $P_{i}$ is its destination page. Figure 1 shows an example of anchors in one sentence. For notational simplicity, we treat a multi-word anchor text as one phrase in the word sequence. In our pre-training corpus with anchors, one source sentence can link to one or more different destination pages using different anchor texts. Meanwhile, a destination page can also be linked by several source sentences using different anchor texts. These characteristics of hyperlinks are leveraged in our pre-training tasks to simulate some situations of ad-hoc retrieval. Given the corpus $\mathcal{C}$ , we train a Transformer model $\mathcal{M}$ on it with four pre-training tasks. The output of the pre-training stage is the model $\mathcal{M}$ . Since the pre-training does not depend on any ranking data, the pre-training stage can be done offline to obtain a good language model $\mathcal{M}$ from the large-scale corpus.
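To make the notation concrete, the following is a minimal sketch of one way to hold sentences, anchors, and destination pages in memory. The class names, the helper `pages_by_anchor_text`, and the page identifiers are illustrative assumptions, not part of the original framework.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Anchor:
    text: str  # anchor text a_i; a multi-word anchor is kept as one phrase
    page: str  # identifier of the destination page P_i (hypothetical ids below)


@dataclass
class Sentence:
    words: List[str]  # w_1 ... w_n
    anchors: List[Anchor] = field(default_factory=list)  # the set A


def pages_by_anchor_text(sentences: List[Sentence]) -> Dict[str, Set[str]]:
    """Collect, for every anchor text, the set of pages it links to."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for sentence in sentences:
        for anchor in sentence.anchors:
            index[anchor.text].add(anchor.page)
    return index


# Toy corpus: one sentence links to two pages, and "Apple" links to two pages.
corpus = [
    Sentence(["Apple", "released", "the", "MacBook Pro"],
             [Anchor("Apple", "Apple_Inc."), Anchor("MacBook Pro", "MacBook_Pro")]),
    Sentence(["an", "Apple", "a", "day"], [Anchor("Apple", "Apple_(fruit)")]),
]
index = pages_by_anchor_text(corpus)
ambiguous = sorted(text for text, pages in index.items() if len(pages) > 1)
```

In this toy corpus, the two properties mentioned above both appear: the first sentence carries two anchors, and the anchor text "Apple" points to more than one page.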

3.1.2. Fine-tuning Stage

As shown in Figure 2, in the fine-tuning stage, we use the pre-trained Transformer model to measure the relevance between a query and a document. Fine-tuned on the limited ranking data, our model can learn the data distribution of the specific downstream task and be used for ranking. The formulation of the fine-tuning stage is the same as the ad-hoc retrieval task. Given a query $q$ and a candidate document $d$ , we learn a score function $s(q,d)$ to measure the relevance between $q$ and $d$ . Then, for each candidate document $d$ , we calculate its relevance score and return the documents with the highest scores. Specifically, we concatenate the query and document together, and feed them into the Transformer model. Note that the parameters and embeddings of this Transformer model are initialized by the pre-trained model $\mathcal{M}$ from the first stage. Then, we compute the representation of the [CLS] token at the sequence head and apply a multi-layer perceptron (MLP) function over this representation to generate the relevance score.

3.2. Pre-training based on Hyperlinks

In the pre-training stage, a pre-training task that more closely resembles the downstream task can better improve the fine-tuning performance. As we introduced in Section 1, the relations between anchor texts and documents can approximate the relevance between queries and documents. Thus, we can leverage the supervised signals brought by anchor texts to build reliable pre-training query-document pairs. Trained on these pairs, the model can learn query-document matching in the pre-training stage, and further enhance the downstream retrieval tasks. To achieve this, we design four pre-training tasks based on hyperlinks to construct different loss functions. The architecture of the four pre-training tasks is shown in Figure 3. These four tasks try to learn the correlations of ad-hoc retrieval in different views. Thus, the focus of each task is how to build the query-document pair for pre-training. In the following, we present the four proposed pre-training tasks in detail.

Based on the classic anchor intuition, the relation between anchor texts and the destination page can approximate the query-document relevance (West et al., 2015; Dou et al., 2009; Zhang et al., 2020; Dai and Davison, 2010; Yi and Allan, 2010). Therefore, our first idea is that the anchor texts could be viewed as a more representative query compared to the word set $S^{2}$ directly sampled from the destination page. However, since the anchor texts are usually too short, the semantics they carry could be limited (Zhou et al., 2020a; Leveling and Jones, 2010; Wu et al., 2014). Fortunately, with the contextual information in the anchor’s corresponding sentence $S$ , we can build a more informative pseudo query $S^{1}$ with not only the anchor text, but also the contexts in the sentence. The anchor-based context-aware query $S^{1}$ should be more representative than the query $S^{2}$ comprised of terms sampled from the destination page. We train the model to predict the pair-wise preference of the two queries $S^{1}$ and $S^{2}$ .

Specifically, inspired by the strong ability of BERT (Devlin et al., 2019) to aggregate context and model sequences, we first use BERT to calculate the contextual word representations of the sentence $S$ . For a sentence $S=(w_{1},w_{2},\cdots,w_{n})$ , we get its contextual representation $H=(h_{1},h_{2},\cdots,h_{n})$ , where $h_{t}$ denotes a $d$ -dimensional hidden vector of the $t$ -th sentence token. Assume that for anchor text $a$ in sentence $S$ , the corresponding hidden vector is $h_{a}$ . We calculate the self-attention weight $\alpha^{t}$ of each word $w_{t}$ with respect to the anchor text $a$ as the average weight across $D$ heads:

$$\alpha^{t}=\frac{1}{D}\sum_{i=1}^{D}\alpha_{i}^{t}, \tag{1}$$

where $\alpha_{i}^{t}$ is the attention weight on the $i$ -th head. Typically, a term may appear multiple times within the same sentence. Thus, we add up the attention weights of the same tokens over different positions in the sentence $S$ . Specifically, for each distinct term $w_{k}$ in the vocabulary $V=\{w_{k}\}_{k=1}^{K}$ , we calculate the final weight of distinct token $w_{k}$ as:

Finally, we normalize the distinct weights of all terms in the vocabulary to obtain a distribution $p(w_{k})$ across the terms as:

The term distribution can measure the contextual similarity between the word $w_{k}$ and anchor text $a$ . Thus, we use this distribution for sampling $l$ words from the sentence $S$ to form the query $S^{1}$ . Based on this distribution, the words relevant to the anchor text can be sampled with a higher probability. Thus, we can build a query $S^{1}$ based on the reliable signals of the anchor texts. Following (Ma et al., 2021a; Azzopardi et al., 2007; Ma et al., 2021b), the size $l$ of the pseudo query is calculated through a Poisson distribution as:

Finally, we collect the $l$ sampled words and the anchor text $a$ together, and construct a word set $S^{1}$ of length $l$ +1. Since $S^{1}$ is formed from the anchor text $a$ and its contextual words, there could be high relevance between this word set and destination page of $a$ .
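The construction of $S^{1}$ can be sketched as follows. This is only an illustration under stated assumptions: the per-head attention weights are supplied as plain lists rather than read from BERT, the merged weights are normalized by their sum (the exact normalization is not recoverable from the text here), the query length is a Poisson draw clipped to at least 1 with a hypothetical rate `lam`, the anchor itself is left out of the sampling distribution because it is appended separately, and sampling is done with replacement. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def average_heads(per_head_weights):
    """per_head_weights[i][t]: attention from the anchor to token t on head i."""
    num_heads = len(per_head_weights)
    length = len(per_head_weights[0])
    return [sum(head[t] for head in per_head_weights) / num_heads
            for t in range(length)]


def term_distribution(tokens, weights, exclude=()):
    """Sum the weights of repeated tokens, then normalize (sum-normalization assumed)."""
    totals = defaultdict(float)
    for token, weight in zip(tokens, weights):
        if token not in exclude:
            totals[token] += weight
    z = sum(totals.values())
    return {token: w / z for token, w in totals.items()} if z > 0 else {}


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def build_anchor_query(tokens, per_head_weights, anchor, lam=3.0, rng=None):
    """Return the anchor plus l context words sampled by attention weight."""
    rng = rng or random.Random(0)
    dist = term_distribution(tokens, average_heads(per_head_weights), exclude={anchor})
    if not dist:
        return [anchor]
    length = max(1, sample_poisson(lam, rng))  # assumption: at least one context word
    terms = list(dist)
    sampled = rng.choices(terms, weights=[dist[t] for t in terms], k=length)
    return [anchor] + sampled


tokens = ["Apple", "released", "the", "MacBook Pro", "the"]
heads = [[0.4, 0.2, 0.1, 0.2, 0.1], [0.2, 0.4, 0.1, 0.2, 0.1]]
dist = term_distribution(tokens, average_heads(heads), exclude={"Apple"})
query = build_anchor_query(tokens, heads, "Apple", rng=random.Random(0))
```

The two occurrences of "the" are merged into a single entry before normalization, which mirrors the step of adding up the weights of identical tokens at different positions.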

For constructing the pair-wise loss, we also need to construct the negative query. Since proper hard negatives can help to train a better ranking model (Xiong et al., 2020; Karpukhin et al., 2020), we propose to sample representative words from the destination page $P$ to construct the hard negative pseudo query for the page, rather than randomly sampling words from unrelated pages. For selecting representative words to build the negative query, we first use BERT to generate the contextual representations of $P=(p_{1},p_{2},...)$ as $(h^{P}_{1},h^{P}_{2}...)$ , where $h^{P}_{t}$ is the hidden state of the $t$ -th term in $P$ . Then, we also use the self-attention weights to measure the sampling probability of terms in page $P$ . Unlike the procedure for constructing $S^{1}$ , we calculate the self-attention weight of each term based on the special token [CLS] as:

For the terms in anchor text $a$ , we set their weights to 0. Thus, the terms in the anchor text will not be selected into $S^{2}$ , and the relevance signal of the anchor text in $S^{1}$ will indeed enhance the Transformer model. We then sum the weights of repeated words following Equation (2), normalize the term distribution following Equation (3), and generate the word set $S^{2}$ from passage $P$ . The word set $S^{2}$ generated from passage $P$ will be used as the negative query.
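A minimal sketch of the negative-query construction is given below. It assumes the [CLS]-based attention weights of the page tokens are already available as a list, drops anchor terms (equivalent to setting their weight to 0), and samples with replacement. The function name, the toy page, and the weights are illustrative assumptions.

```python
import random
from collections import defaultdict


def negative_query(page_tokens, cls_weights, anchor_terms, length, rng):
    """Sample a hard negative pseudo query S^2 from the destination page."""
    totals = defaultdict(float)
    for token, weight in zip(page_tokens, cls_weights):
        if token not in anchor_terms:  # anchor terms get zero weight
            totals[token] += weight    # merge repeated terms
    candidates = list(totals)
    # random.choices normalizes the weights internally.
    return rng.choices(candidates, weights=[totals[t] for t in candidates], k=length)


page = ["MacBook Pro", "is", "a", "laptop", "by", "Apple", "laptop"]
weights = [0.30, 0.05, 0.05, 0.20, 0.05, 0.15, 0.20]
s2 = negative_query(page, weights, {"MacBook Pro"}, length=3, rng=random.Random(0))
```

Because every sampled word still comes from the true destination page, the resulting query is topically close to the page but lacks the anchor signal, which is what makes it a hard negative.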

Finally, we formulate the objective of the Representative Query Prediction (RQP) task with a typical pairwise loss, i.e., the hinge loss:

$$\mathcal{L}_{RQP}=\max\left(0,\,1-p(S^{1}|P)+p(S^{2}|P)\right), \tag{6}$$

where $p(S|P)$ is the matching score between the word set $S$ and the page $P$ . We concatenate the word set $S$ and $P$ as a single input sequence, separated by the delimiting token [SEP], and feed it into the Transformer. Then, we calculate the matching score by applying an MLP function over the classification token’s representation as:

$$p(S|P)={\rm MLP}(h^{\rm[CLS]}), \tag{7}$$

where $h^{\rm[CLS]}$ denotes the Transformer representation of the [CLS] token for this input sequence.
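The pairwise objective can be illustrated with a small sketch that takes already-computed matching scores as input. The margin of 1 and the batch averaging are assumptions made for illustration; the scores would come from the MLP over the [CLS] representation in practice.

```python
def hinge_loss(pos_score, neg_score, margin=1.0):
    """Pairwise hinge loss: zero once the positive beats the negative by the margin."""
    return max(0.0, margin - pos_score + neg_score)


def mean_pairwise_loss(score_pairs, margin=1.0):
    """Average the hinge loss over (positive score, negative score) pairs."""
    losses = [hinge_loss(pos, neg, margin) for pos, neg in score_pairs]
    return sum(losses) / len(losses)


# First pair is already separated by more than the margin; the second is not.
pairs = [(2.0, 0.5), (0.4, 0.1)]
batch_loss = mean_pairwise_loss(pairs)
```

Only the second pair contributes to the loss, so gradient updates concentrate on pairs where the anchor-based query is not yet clearly preferred.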

3.2.2. Query Disambiguation Modeling (QDM)

In real-world applications, the queries issued by users are often short and ambiguous (Silverstein et al., 1999; Zhou et al., 2020b, 2021; Ma et al., 2020), such as the query “Apple” (Apple fruit or Apple company?). Thus, building an accurate encoding of the input query is difficult, which further leads to poor retrieval quality for these ambiguous queries. Fortunately, with hyperlinks and anchor texts, we can endow the language representation model with the ability to disambiguate queries in the pre-training stage. We observe that the same anchor text could link to one or more different pages. Under these circumstances, the anchor could be viewed as an ambiguous query, while the context around the anchor text could help to disambiguate the query. We train the model to predict the true destination page with the semantic information brought by the query context, so that it learns the disambiguation ability during pre-training.

where $p(S|P)$ follows the [CLS] score calculation in Equation (7). As illustrated above, the anchor-based contextual word set $S^{1}$ sampled from the sentence $S$ can provide additional semantic information for the pre-trained model. Thus, even if the anchor text also points to the negative page, the model can be trained to learn the fine-grained relevance based on the context information around the anchor text. By leveraging the context, the model learns the ability to disambiguate queries.
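One way to pick the positive and negative pages for this task is sketched below. It assumes the negative page is drawn uniformly from the other pages that the same anchor text links to elsewhere in the corpus, which is an interpretation of the description above rather than a confirmed detail. The function name and page identifiers are hypothetical.

```python
import random


def qdm_pages(anchor_text, true_page, pages_by_anchor, rng):
    """Return (positive page, negative page), or None if the anchor is unambiguous."""
    negatives = sorted(pages_by_anchor[anchor_text] - {true_page})
    if not negatives:
        return None
    return true_page, rng.choice(negatives)


pages_by_anchor = {
    "Apple": {"Apple_Inc.", "Apple_(fruit)"},
    "MacBook Pro": {"MacBook_Pro"},
}
pair = qdm_pages("Apple", "Apple_Inc.", pages_by_anchor, random.Random(0))
skipped = qdm_pages("MacBook Pro", "MacBook_Pro", pages_by_anchor, random.Random(0))
```

The context-aware word set $S^{1}$ built from the source sentence would then be scored against both pages with the same pairwise loss as before.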

3.2.3. Representative Document Prediction (RDP)

Although most queries presented to search engines are between one and three terms long, a gradual increase in the average query length has been observed in recent studies (Kumaran and Carvalho, 2009; Datta and Varma, 2011; Balasubramanian et al., 2010). Even though these queries could convey more sophisticated information needs of users, they also bring more noise to the search engine. A common strategy for dealing with long queries is to let the model distinguish the important terms in the queries, and then focus more on these terms to improve retrieval effectiveness (Bendersky and Croft, 2008; Bendersky et al., 2010; Lease et al., 2009). At the pre-training stage of ad-hoc retrieval, if the model is trained with more samples containing long queries, it will be more robust when fine-tuned for long queries. Besides, the language model should be trained to predict the most representative document for the long query, since the long query could focus on different views. Fortunately, hyperlinks can help to build pre-training samples containing long queries. We observe that more than one anchor text could appear in one sentence. If we treat the sentence as the query, the destination pages could be the relevant documents for the sentence. Besides, if an anchor text is more important in the sentence, its destination page would be more representative for the sentence. Inspired by this, we propose to pre-train the model to predict the relevant document for a sentence containing more than one hyperlink.

Specifically, for sentence $S=(w_{1},w_{2},\cdots,w_{n})$ and its anchor texts set $A=((a_{1},P_{1}),(a_{2},P_{2}),\cdots,(a_{m},P_{m}))$ , we sample two anchors based on the anchor importance of this sentence. We will treat the sentence $S$ as a long query, and the two destination pages as the documents. In this way, the page is deemed as a more representative document if its anchor text is of higher importance. To measure the importance of the anchor text, we use the Transformer encoder to build the context-aware representations of the terms, and calculate the hidden vectors $H=(h_{1},h_{2},\cdots,h_{n})$ . Assume that for anchor text $a$ , its hidden vector is $h_{a}$ . We calculate the self-attention weight of anchor $a$ based on the classification token [CLS] to measure its importance as:

where we average the attention weights across $D$ heads. The token [CLS] is an aggregate of the entire sequence representation, and it can represent the comprehensive understanding of the input sequence over all tokens. Thus, the attention weight $\gamma^{a}$ could measure the contribution of the anchor $a$ to the entire sentence. Then, we merge the repeated anchor texts in one sentence following Equation (2), and normalize the weights to a probability distribution $p(a)$ over all anchor texts following Equation (3) as:

According to the importance likelihood $p(a)$ of anchors, we sample two anchor texts $(a_{1},P_{1})$ and $(a_{2},P_{2})$ from the sentence $S$ . Suppose that $a_{1}$ has a higher importance likelihood than $a_{2}$ according to Equation (11). We treat the sentence $S$ as the long query, $P_{1}$ as the more representative page and $P_{2}$ as the less representative page. We minimize the pair-wise loss $\mathcal{L}_{RDP}$ by:

$$\mathcal{L}_{RDP}=\max\left(0,\,1-p(S|P_{1})+p(S|P_{2})\right),$$

where $p(S|P)$ follows the similar calculation in Equation (7).
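The anchor sampling step for this task can be sketched as follows, assuming the [CLS]-based importance weight of each anchor occurrence is given. Repeated anchors are merged by summing, the weights are sum-normalized (an assumption), and two distinct anchors are drawn without replacement. The function name and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_rdp_pair(anchors, cls_weights, rng):
    """Return (more representative, less representative) (anchor text, page) tuples."""
    merged = defaultdict(float)
    for anchor, weight in zip(anchors, cls_weights):
        merged[anchor] += weight  # merge repeated anchors in the sentence
    if len(merged) < 2:
        raise ValueError("RDP needs a sentence with at least two distinct anchors")
    total = sum(merged.values())
    probs = {anchor: w / total for anchor, w in merged.items()}

    pool = list(probs)
    first = rng.choices(pool, weights=[probs[a] for a in pool], k=1)[0]
    pool.remove(first)  # second draw is without replacement
    second = rng.choices(pool, weights=[probs[a] for a in pool], k=1)[0]

    # The anchor with the higher importance likelihood gives the positive page.
    more, less = sorted((first, second), key=probs.get, reverse=True)
    return more, less


anchors = [("Apple", "Apple_Inc."), ("MacBook Pro", "MacBook_Pro"), ("Apple", "Apple_Inc.")]
cls_weights = [0.2, 0.5, 0.1]
more, less = sample_rdp_pair(anchors, cls_weights, random.Random(0))
```

The whole sentence then serves as the long query, with the page of `more` as the preferred document and the page of `less` as the less representative one.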

3.2.4. Anchor Co-occurrence Modeling (ACM)

Language models try to learn term semantics by modeling the term co-occurrence relation, including the term co-occurrence in one window (Pennington et al., 2014; Mikolov et al., 2013) and in one sequence (Devlin et al., 2019; Peters et al., 2018). As special terms, anchor texts also share co-occurrence relations. Besides, since the destination page can help to provide additional information to understand the anchor texts, we can learn more accurate semantics based on the co-occurrence relation by leveraging these destination pages. Therefore, we propose the Anchor Co-occurrence Modeling (ACM) task to model the similarity between the semantics of the anchors in one sentence. By pre-training with ACM, the model could obtain similar representations for the co-occurring anchor texts in one sentence, which further improves its ability to model semantics.

Suppose that for a sentence $S=(w_{1},w_{2},\cdots,w_{n})$ and its anchor texts set $A=((a_{1},P_{1}),(a_{2},P_{2}),\cdots,(a_{m},P_{m}))$ , the anchors in $A$ all share the co-occurrence characteristics with each other. We sample a pair of anchors randomly as $(a_{1},P_{1})$ and $(a_{2},P_{2})$ . We then sample some important words from the page $P_{1}$ to form a word set $S^{1}$ . Then, we let the model learn the semantic matching between $S^{1}$ and the passage $P_{2}$ , thus incorporating the anchor co-occurrence into the pre-trained model. Specifically, for the destination page $P_{1}$ of anchor $a_{1}$ , we use a Transformer encoder to build contextual representations and use the attention weight of [CLS] to measure the term importance:

3.2.5. Final Training Objective

Besides the pair-wise loss to measure the relevance between pseudo queries and documents, the pre-trained model also needs to build good contextual representations for them. Following (Ma et al., 2021a; Devlin et al., 2019), we also adopt Masked Language Modeling (MLM) as one of our objectives. MLM is a fill-in-the-blank task, which first masks out some tokens from the input, and then trains the model to predict the masked tokens from the remaining tokens. Specifically, the MLM loss is defined as:

$$\mathcal{L}_{MLM}=-\sum_{\hat{x}\in m(X)}\log p\left(\hat{x}\,|\,x_{\backslash m(X)}\right),$$

where $X$ denotes the input sentence, and $m(X)$ and $x_{\backslash m(X)}$ denote the masked tokens and the remaining tokens from $X$ , respectively.
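The masking step can be illustrated with the short sketch below. It shows only the basic fill-in-the-blank setup: each token is independently replaced by a `[MASK]` symbol with probability `mask_prob`, and the original token becomes the prediction target. The masking rate and the absence of BERT's random-replacement variants are simplifying assumptions, and the function name is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Return (masked input, labels); labels are None at unmasked positions."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)    # token belongs to m(X)
            labels.append(token)   # the model must recover it
        else:
            masked.append(token)   # token stays in the remaining context
            labels.append(None)
    return masked, labels


tokens = ["apple", "released", "the", "macbook", "pro"]
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

The MLM loss is then accumulated only over positions whose label is not `None`.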

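Finally, we pre-train the Transformer model $\mathcal{M}$ towards the four proposed objectives jointly with the MLM objective as:

$$\mathcal{L}=\mathcal{L}_{RQP}+\mathcal{L}_{QDM}+\mathcal{L}_{RDP}+\mathcal{L}_{ACM}+\mathcal{L}_{MLM}.$$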

All parameters are optimized by the loss $\mathcal{L}$ , and the whole model is trained in an end-to-end manner.

3.3. Fine-tuning for Document Ranking

In the previous pre-training stage, we pre-train the Transformer model $\mathcal{M}$ to learn the IR matching from the raw corpus based on the hyperlinks and anchor texts. We now incorporate $\mathcal{M}$ into the downstream document ranking task to evaluate the effectiveness of our proposed pre-trained method.

Previous studies have explored utilizing the Transformer to measure sequence pair relevance for ad-hoc document ranking (Nogueira et al., 2019; Nogueira and Cho, 2019; Qiao et al., 2019). For the query $q$ and a candidate document $d$ , we aim to calculate a ranking score $s(q,d)$ to measure the relevance between them based on the pre-trained Transformer. Therefore, in this stage, we first use the same Transformer architecture as the pre-trained model $\mathcal{M}$ , and use the parameters and embeddings of $\mathcal{M}$ to initialize the Transformer model. Then, we add special tokens and concatenate the query and the document as $Y=([{\rm CLS}];q;[{\rm SEP}];d;[{\rm SEP}])$ , where $[;]$ is the concatenation operation. A $[{\rm SEP}]$ token is added at the tail of the query and of the document, while a $[{\rm CLS}]$ token is added at the sequence head as a summary token. Finally, we feed the concatenated sequence into the Transformer, and use the $[{\rm CLS}]$ representation $z^{\rm[CLS]}$ to calculate the final ranking score as:

$$s(q,d)={\rm MLP}(z^{\rm[CLS]}).$$
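The input construction for the ranking model can be sketched as below. The truncation of the document to a length budget and the segment ids (0 for the query part, 1 for the document part) follow common BERT practice and are assumptions here; `max_len` and the function name are hypothetical.

```python
def build_input(query_tokens, doc_tokens, max_len=512):
    """Build Y = [CLS]; q; [SEP]; d; [SEP] and matching segment ids."""
    budget = max_len - len(query_tokens) - 3  # room left for document tokens
    if budget < 0:
        raise ValueError("query does not fit into max_len")
    doc = doc_tokens[:budget]  # truncate the document from the tail
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc + ["[SEP]"]
    segments = [0] * (len(query_tokens) + 2) + [1] * (len(doc) + 1)
    return tokens, segments


tokens, segments = build_input(
    ["apple", "laptop"],
    ["macbook", "pro", "is", "a", "laptop"],
    max_len=8,
)
```

With `max_len=8`, only the first three document tokens survive truncation, and the sequence and segment lists stay aligned position by position.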

To train the model, we use the cross-entropy loss for optimization:

where $N$ is the number of samples in the training set.
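As an illustration of the fine-tuning loss, the sketch below computes a pointwise binary cross-entropy over $N$ scored query-document samples. It assumes the ranking score has been squashed into $(0,1)$ and that labels are binary; the exact form of the loss used in the original experiments may differ, and the function name is hypothetical.

```python
import math


def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean cross-entropy over N (score, label) samples, scores in (0, 1)."""
    total = 0.0
    for score, label in zip(scores, labels):
        score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
        total += -(label * math.log(score) + (1 - label) * math.log(1.0 - score))
    return total / len(scores)


# One relevant and one non-relevant document, both scored at 0.5.
uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])
confident = binary_cross_entropy([0.9, 0.1], [1, 0])
```

A ranker that separates the relevant from the non-relevant document obtains a lower loss than one that scores both at 0.5.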

Experiments

We use English Wikipedia (2021/01/01; https://dumps.wikimedia.org/enwiki/) as the pre-training corpus, since it is publicly available and provides a large-scale collection of documents with hyperlinks for supporting pre-training. Following (Sun et al., 2020; Ma et al., 2021a), we use the public WikiExtractor (https://github.com/attardi/wikiextractor) to process the downloaded Wikipedia dump while preserving the hyperlinks. After removing the articles shorter than 100 words for data cleaning, the corpus comprises 15,492,885 articles. The data for our four proposed tasks are generated from these articles, and the statistics are reported in Table 1. We pre-train the model on one combined set of query-document pairs, where each pair is uniformly sampled from the four pre-training tasks.
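One possible reading of the uniform mixing is sketched below: for every training pair, a task is first chosen uniformly at random, and then one example is drawn from that task's pool. This interpretation, the function name, and the toy pools are assumptions for illustration.

```python
import random


def mix_tasks(task_pools, num_pairs, rng):
    """Draw num_pairs examples, choosing the source task uniformly each time."""
    names = sorted(task_pools)
    mixed = []
    for _ in range(num_pairs):
        task = rng.choice(names)                  # uniform over the four tasks
        example = rng.choice(task_pools[task])    # uniform within the task
        mixed.append((task, example))
    return mixed


pools = {
    "RQP": ["rqp-0", "rqp-1"],
    "QDM": ["qdm-0"],
    "RDP": ["rdp-0", "rdp-1", "rdp-2"],
    "ACM": ["acm-0"],
}
combined = mix_tasks(pools, num_pairs=20, rng=random.Random(0))
```

Choosing the task first keeps a task with fewer generated pairs from being drowned out by a larger one.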

4.1.2. Fine-tuning Datasets

To verify the effectiveness of the proposed pre-training methods, we conduct fine-tuning experiments on two representative ad-hoc retrieval datasets.

MS MARCO Document Ranking (MS MARCO, https://github.com/microsoft/MSMARCO-Document-Ranking) (Nguyen et al., 2016): It is a large-scale benchmark dataset for the document retrieval task. It consists of 3.2 million documents with 367 thousand training queries, 5 thousand development queries, and 5 thousand test queries. Relevance labels are binary (0/1).

TREC 2019 Deep Learning Track (TREC DL, https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html) (Craswell et al., 2020): It replaces the test queries in MS MARCO with a new set that has more comprehensive annotations. Its test set consists of 43 queries, and relevance is graded on a four-level scale (0/1/2/3).

1.3. Evaluation Metrics

Following the official instructions, we use MRR@100 and nDCG@10 to evaluate the top-ranking performance on MS MARCO and TREC DL, respectively. In addition, we report MRR@10 for MS MARCO and nDCG@100 for TREC DL.
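For reference, the two metric families can be computed per query as in the sketch below, then averaged over all queries. The function names are illustrative, and the nDCG variant shown assumes a linear gain with a $\log_2$ rank discount; other gain functions (such as $2^{rel}-1$) are also in common use, and the official evaluation scripts should be preferred for reported numbers.

```python
import math

def mrr_at_k(ranked_labels, k):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, rel in enumerate(ranked_labels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(rels, k):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(ranked_rels, k, all_rels=None):
    """nDCG@k; all_rels lists every judged label for the query (defaults to ranked_rels)."""
    ideal = sorted(all_rels if all_rels is not None else ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

def mean_metric(per_query_values):
    """Average a per-query metric over the query set."""
    return sum(per_query_values) / len(per_query_values)
```

Here `ranked_labels` and `ranked_rels` are the relevance labels of the returned documents in ranked order, so binary MS MARCO labels and graded TREC DL labels can both be passed directly.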

2. Baselines

We evaluate the performance of our approach by comparing it with three groups of highly related and strong baseline methods:

(1) Traditional IR models. QL (Zhai and Lafferty, 2017) is one of the best performing models; it scores a document by the likelihood of the query under the document language model with Dirichlet prior smoothing. BM25 (Robertson and Walker, 1994) is another well-known and effective retrieval method based on the probabilistic retrieval model.

(2) Neural IR models. DRMM (Guo et al., 2016) is a deep relevance matching model which performs histogram pooling on the transition matrix and uses the binned soft-TF as the input to a ranking neural network. DUET (Mitra et al., 2017) uses two separate networks to match queries and documents with local and learned distributed representations, respectively. The two networks are jointly trained as part of a single neural network. KNRM (Xiong et al., 2017) is a neural ranking model which extracts interaction features between query and document terms. Kernel pooling is used to provide soft match signals for ranking. Conv-KNRM (Xiong et al., 2017) is an upgrade of the KNRM model. It adds a convolutional layer to model n-gram soft matches and fuses the contextual information of surrounding words for matching.

(3) Pre-trained Models. BERT (Devlin et al., 2019) is the multi-layer bi-directional Transformer pre-trained with the Masked Language Modeling (MLM) and Next Sentence Prediction tasks. $\bm{{\rm Transformer}_{\rm ICT}}$ (Chang et al., 2020) is the BERT model retrained with the Inverse Cloze Task (ICT) and MLM. It is specifically designed for passage retrieval in QA scenarios, and teaches the model to predict the removed sentence given a context text. $\bm{{\rm Transformer}_{\rm WLP}}$ (Chang et al., 2020) is the BERT model retrained with the Wiki Link Prediction (WLP) task and MLM. It is designed for capturing inter-page semantic relations. PROP (Ma et al., 2021a) is the state-of-the-art pre-trained model tailored for ad-hoc retrieval. It uses the Representative Words Prediction task for learning the matching between the sampled word sets. We experiment with both released models, pre-trained on the Wikipedia and MS MARCO corpora, i.e., $\rm{PROP}_{Wiki}$ and $\rm{PROP}_{MARCO}$, respectively.
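To make two of the baseline scoring functions above concrete, the sketch below shows query likelihood with Dirichlet prior smoothing and the kernel pooling used by KNRM. Both are simplified illustrations written for this section, not the configurations used in the experiments: the function names, the default $\mu=2000$, the kernel width, and the handling of terms unseen in the collection are assumptions.

```python
import math
from collections import Counter

def ql_dirichlet(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood: sum_w log((tf(w,d) + mu * p(w|C)) / (|d| + mu))."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_wc = collection_tf.get(w, 0) / collection_len
        if p_wc == 0.0:
            continue  # term unseen in the collection: skipped in this sketch
        score += math.log((tf[w] + mu * p_wc) / (len(doc) + mu))
    return score

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """KNRM-style soft-match features, one per RBF kernel.

    sim_matrix[i][j] is the similarity between query term i and document term j.
    """
    features = []
    for mu in mus:
        total = 0.0
        for row in sim_matrix:
            k = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(k, 1e-10))  # log-sum over query terms
        features.append(total)
    return features
```

In KNRM the pooled features are fed to a learned ranking layer, and the similarities come from learned word embeddings; both parts are omitted here.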

3. Implementation Details

For our method HARP, we use the same Transformer encoder architecture as $\rm{BERT}_{base}$ in BERT (Devlin et al., 2019). The hidden size is 768, and the number of self-attention heads is 12. For a fair comparison, all of the pre-trained baseline models use the same architecture as our model. We use HuggingFace's Transformers library for the model implementation (Wolf et al., 2020).

3.2. Pre-training Settings

For the construction of the pseudo queries, we set the expectation of interval $\lambda$ as 3, and remove the stopwords using the INQUERY stopwords list following (Ma et al., 2021a). We use the first section to denote the destination page because it is usually the description or summary of a long document (Chang et al., 2020; Nogueira et al., 2019; Dai and Callan, 2019). For the MLM objective, we follow the settings in BERT: we randomly select 15% of the tokens for prediction, and each selected token is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, and (3) the unchanged token 10% of the time. We use the Adam optimizer with a learning rate of 1e-4 for 10 epochs, where the batch size is set as 128. Because training from scratch is costly, we use $\rm{BERT}_{base}$ to initialize our method and the baseline models.
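The MLM corruption rule above can be sketched as follows. This is a simplified illustration: `mask_tokens` is a hypothetical helper, each token is selected independently with probability 0.15 (rather than selecting exactly 15% of positions), and special tokens are assumed to be exempt from masking.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

The MLM loss is then computed only at positions whose label is not `None`.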

3.3. Fine-tuning Settings

In the fine-tuning stage, the parameters learned in the pre-training stage are used to initialize the embedding and self-attention layers of our model. Following previous works, we only test the performance of our model on the re-ranking stage (Ma et al., 2021a; Nogueira and Cho, 2019). To test the performance of our models with candidate document sets of different quality, we re-rank the documents from two candidate sets, i.e., ANCE Top100 and Official Top100. ANCE Top100 is retrieved by the ANCE model proposed by Xiong et al. (2020), and Official Top100 is released by the official MS MARCO and TREC teams. While fine-tuning, we concatenate the title, URL and body of one document as the document content. The batch size is set as 128, and the maximum length of the input sequence is 512. We fine-tune for 2 epochs, with a learning rate of 1e-5 and a warmup portion of 0.1. Our code is available online (https://github.com/zhengyima/anchors).
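The fine-tuning settings above can be illustrated with the following sketch. It is not taken from the released code: the helper names are hypothetical, the fields are assumed to be joined with a single space, and the learning rate is assumed to decay linearly to zero after warmup, since only the peak rate (1e-5) and the warmup portion (0.1) are stated.

```python
def build_document(title, url, body):
    """Concatenate title, URL, and body into one document string (separator assumed)."""
    return " ".join(part for part in (title, url, body) if part)

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_portion=0.1):
    """Linear warmup to peak_lr, then (assumed) linear decay to zero."""
    warmup_steps = int(total_steps * warmup_portion)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining

def rerank(query, candidates, score_fn):
    """Re-rank (doc_id, text) candidates by descending score_fn(query, text)."""
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In the re-ranking setting, `candidates` would be the top 100 documents from either candidate set and `score_fn` the fine-tuned model's $s(q,d)$.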

4. Experimental Results

Since the MS MARCO leaderboard limits the frequency of submissions, we evaluate our method and the baseline methods on the MS MARCO development set. For TREC DL, we evaluate on the test set of 43 queries. The overall performance on the two datasets is reported in Table 2. We can observe that:

(1) Among all models, HARP achieves the best results in terms of all evaluation metrics. HARP improves performance by a large margin over the two strongest baselines, $\rm{PROP}_{Wiki}$ and $\rm{PROP}_{MARCO}$, which also design objectives tailored for IR. Concretely, HARP significantly outperforms $\rm{PROP}_{MARCO}$ by 6.4% in MRR@100 on MS MARCO ANCE Top100. On TREC DL ANCE Top100, HARP outperforms $\rm{PROP}_{MARCO}$ by 1.1% in terms of nDCG@100. The improvement is smaller on TREC DL because its training set uses binary labels while its test set uses graded labels, which introduces a gap between training and evaluation and makes the task harder. Besides, HARP outperforms the best baselines on both the ANCE Top100 set and the Official Top100 set. These results indicate that HARP captures better matching signals under candidate lists of different quality, without being limited by lower candidate quality or confused by harder negatives. Overall, these results show that introducing hyperlinks into pre-training can improve the ranking quality of the pre-trained language model.

(2) All pre-trained methods outperform methods without pre-training, indicating that pre-training followed by fine-tuning helps models measure relevance for downstream ad-hoc retrieval. The traditional IR models QL and BM25 are strong baselines on the two datasets, but they lack the ability to model semantic relevance. Neural IR models use distributed representations to denote the query and document, then apply deep neural networks to measure relevance. Thus, the neural method Conv-KNRM significantly outperforms the traditional methods. The pre-trained methods improve dramatically over the other methods. This indicates that pre-training on a large corpus and then fine-tuning on downstream tasks is better than training a deep neural ranking model from scratch.

(3) Among all pre-trained methods, the ones designing objectives tailored for IR perform better. ${\rm Transformer}_{\rm ICT}$ shows better performance than BERT, confirming that using a pre-training task related to retrieval is helpful for downstream tasks. However, $\bm{{\rm Transformer}_{\rm WLP}}$ performs worse than BERT and $\bm{{\rm Transformer}_{\rm ICT}}$. One possible reason is that the queries directly generated from WLP are noisy, since there could be many links in the passage that contribute little to the passage semantics. $\rm{PROP}_{Wiki}$ and $\rm{PROP}_{MARCO}$ are the state-of-the-art baselines, which design the Representative Words Prediction task tailored for IR. Different from the existing objectives, we design four pre-training tasks based on hyperlinks and anchor texts, which bring more accurate and reliable supervised signals. Hence, HARP achieves significant improvements compared with the existing pre-trained methods.

In addition, to further verify the effectiveness of HARP, we report leaderboard results on the MS MARCO eval set in Table 3. We select some representative methods from the leaderboard as the baselines (Ma et al., 2021a; Gao et al., 2021; Boytsov and Kolter, 2021). Following other recent leaderboard submissions, we further incorporate model ensembling. Our ensemble entry uses an ensemble of models based on BERT, RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020), each fine-tuned on the downstream task. The leaderboard results confirm the effectiveness of our proposed HARP model.

5. Further Analysis

We further analyze the influence of the different pre-training tasks we proposed (Section 4.5.1), and the performance under different scales of fine-tuning data (Section 4.5.2).

Our proposed pre-training approach HARP designs four pre-training objectives to leverage hyperlinks and anchor texts tailored for IR. We remove one of them at a time to analyze its contribution. Note that when none of the pre-training tasks are used, our model degenerates to using BERT for fine-tuning directly. Thus, we also provide the result of BERT for comparison. We report MRR@100 and MRR@10 on the MS MARCO dataset.

From the results in Table 4, we can observe that removing any pre-training task leads to a performance decrease. This indicates that all the pre-training tasks are useful for improving the ranking performance. Specifically, removing RQP causes the largest decline in all metrics, which confirms that the correlations and supervised signals brought by hyperlinks can improve the ranking ability of our model in the pre-training phase. The significant performance degradation caused by removing RDP shows that pre-training with long queries further enhances relevance modeling for ranking. The influence of removing QDM and ACM is relatively smaller. This suggests that considering ambiguous queries and modeling anchor co-occurrence are effective but limited, since QDM has fewer pre-training pairs than the other tasks, and the queries sampled from the neighboring anchors in ACM are noisier than the anchors. Removing MLM shows the slightest performance decrease, which indicates that good representations obtained by MLM may not be sufficient for ad-hoc retrieval tasks. All model variants clearly perform better than BERT, which is not pre-trained with the IR-oriented objectives.

5.2. Low-Resource and Large-scale Settings

Neural ranking models require a considerable amount of training data to learn the representations and matching features. Thus, they are likely to suffer in the low-resource settings of real-world applications, since collecting relevance labels for large-scale queries and documents is costly and time-consuming. This problem can be alleviated by our proposed method, because the pre-training tasks based on hyperlinks and anchor texts can better measure the matching features and resemble the downstream retrieval tasks. To verify this, we simulate data-sparse scenarios by using different numbers of queries. For low-resource settings, we randomly pick 5/10/15/20 queries and fine-tune our model. Besides, we also pick 50k/100k/150k/200k queries to evaluate the performance under different large-scale settings. We report MRR@100 to evaluate the performance. We find:

(1) As shown in Figure 4(a), under few-shot settings, HARP achieves better results than the other models, showing that it adapts well to a small amount of supervised data. This is consistent with our expectation that tailoring pre-training objectives for IR provides a solid basis for fine-tuning, which alleviates the influence of the data sparsity problem for ranking to some extent.

(2) As shown in Figure 4(b), under large-scale settings, HARP is consistently better than the baselines in all cases. This further supports the effectiveness of introducing hyperlinks and anchor texts when designing pre-training objectives for IR.

(3) In the large-scale settings, the performance of HARP improves steadily as more queries are used for training. This implies that HARP is able to make better use of fine-tuning data based on the better understanding of IR learned in the pre-training stage.

Conclusion

In this work, we propose a novel pre-training framework HARP tailored for ad-hoc retrieval. Different from existing pre-training objectives tailored for IR, we propose to leverage the supervised signals brought by hyperlinks and anchor texts. We devise four pre-training tasks based on hyperlinks, and capture the anchor-document correlations from different views. We pre-train the Transformer model with the pair-wise losses built from the four pre-training tasks, jointly with the MLM objective. To evaluate the performance of the pre-trained model, we fine-tune the model on the downstream document ranking tasks. Experimental results on two large-scale, representative, open-access datasets confirm the effectiveness of our model on document ranking.