Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation

Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, Allan Hanbury

Introduction

The same principles that applied to traditional IR systems to achieve low query latency also apply to novel neural ranking models: We need to transfer as much computation and data transformation to the indexing phase as possible to require fewer resources at query time (Mackenzie et al., 2020; Manning et al., 2008). For the most effective BERT-based (Devlin et al., 2019) neural ranking models, which we refer to as BERT ${}_{\textbf{CAT}}$ , this transfer is simply not possible, as the concatenation of query and passage requires all Transformer layers to be evaluated at query time to produce a ranking score (Nogueira and Cho, 2019).

To overcome this architecture restriction, the neural-IR community proposed new architectures by deliberately choosing to trade off effectiveness for higher efficiency. Among these low query latency approaches are: TK (Hofstätter et al., 2020b) with shallow Transformers and separate query and document contextualization; ColBERT (Khattab and Zaharia, 2020) with late-interactions of BERT term representations; PreTT (MacAvaney et al., 2020a) with a combination of query-independent and query-dependent Transformer layers; and a BERT-CLS dot product scoring model which we refer to as BERT ${}_{\textbf{DOT}}$ , also known in the literature as Tower-BERT (Chang et al., 2020), BERT-Siamese (Xiong et al., 2020), or TwinBERT (Lu et al., 2020). (Yes, we see the irony: https://xkcd.com/927/) Each approach has unique characteristics that make it suitable for production-level query latency, which we discuss in Section 2.

An increasingly common way to improve smaller or more efficient models is to train them, as students, to imitate the behavior of larger or ensemble teacher models via Knowledge Distillation (KD) (Hinton et al., 2015). This is typically applied to the same architecture with fewer layers and dimensions (Jiao et al., 2019; Sanh et al., 2019) via the output or layer-wise activations (Sun et al., 2019). KD has been applied in the ranking task for the same architecture with fewer layers (Li et al., 2020; Gao et al., 2020b; Chen et al., 2020) and in constrained sub-tasks, such as keyword-list matching (Lu et al., 2020).

In this work we propose a model-agnostic training procedure using cross-architecture knowledge distillation from $\operatorname*{BERT_{\text{CAT}}}$ with the goal of improving the effectiveness of efficient passage ranking models without compromising their query latency benefits.

A unique challenge for knowledge distillation in the ranking task is the possible range of scores, i.e. a ranking model outputs a single unbounded decimal value and the final result solely depends on the relative ordering of the scores for the candidate documents per query. We make the crucial observation, depicted in Figure 1, that different architectures during their training gravitate towards unique range patterns in their output scores. The $\operatorname*{BERT_{\text{CAT}}}$ model exhibits positive relevant-document scores, whereas on average the non-relevant documents are below zero. The $\operatorname*{TK}$ model solely produces negative averages, and the $\operatorname*{BERT_{\text{DOT}}}$ and $\operatorname*{ColBERT}$ models, due to their dot product scoring, show high output scores. This leads us to our main research question:

RQ1: How can we apply knowledge distillation in retrieval across architecture types?

To optimally support the training of cross-architecture knowledge distillation, we allow our models to converge to a free scoring range, as long as the margin is similar to that of the teacher. We make use of the common triple (q, relevant doc, non-relevant doc) training regime, by distilling knowledge via the margin of the two scoring pairs. We train the students to learn the same margin as their teachers, which leaves the models to find the most comfortable or natural range for their architecture. We optimize the student margin to the teacher margin with a Mean Squared Error loss (Margin-MSE). We confirm our strategy with an ablation study of different knowledge distillation losses and show the Margin-MSE loss to be the most effective.

Thanks to the rapid advancements and openness of the Natural Language Processing community, we have a number of pre-trained BERT-style language models to choose from to create different variants of the $\operatorname*{BERT_{\text{CAT}}}$ architecture to study, allowing us to answer:

RQ2: How effective is the distillation with a single teacher model in comparison to an ensemble of teachers?

We train three different $\operatorname*{BERT_{\text{CAT}}}$ versions as teacher models with different initializations: BERT-Base (Devlin et al., 2019), BERT-Large with whole word masking (Devlin et al., 2019), and ALBERT-large (Lan et al., 2019). To understand the behavior that the different language models bring to the $\operatorname*{BERT_{\text{CAT}}}$ architecture, we compare their training score margin distributions and find that the models offer variability suited for an ensemble.

We created the teacher ensemble by averaging each of the three scores per query-document pair. We conduct the knowledge distillation with a single teacher and a teacher ensemble. The knowledge distillation has a general positive effect on all retrieval effectiveness metrics of our student models. In most cases the teacher ensemble further improves the student models’ effectiveness in the re-ranking scenario above the already improved single teacher training.
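The score averaging described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the dictionary layout keyed by (query id, passage id) and the name `ensemble_scores` are assumptions made for this example.

```python
from statistics import mean


def ensemble_scores(teacher_scores):
    """Average the scores of several teachers for every query-passage pair.

    teacher_scores: list of dicts, one per teacher, each mapping a
    (query_id, passage_id) pair to that teacher's score (hypothetical layout).
    """
    pairs = teacher_scores[0].keys()
    return {pair: mean(t[pair] for t in teacher_scores) for pair in pairs}


# Toy scores from three teachers for two query-passage pairs.
teachers = [
    {("q1", "p1"): 3.0, ("q1", "p2"): -1.0},
    {("q1", "p1"): 6.0, ("q1", "p2"): -2.0},
    {("q1", "p1"): 9.0, ("q1", "p2"): -3.0},
]
ensemble = ensemble_scores(teachers)
```

Because the ensemble is a plain mean per pair, it can be computed once from stored teacher scores and needs no additional model.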

The dual-encoder $\operatorname*{BERT_{\text{DOT}}}$ model can be used for full collection indexing and retrieval with a nearest neighbor vector search approach, so we study:

RQ3: How effective is our distillation for dense nearest neighbor retrieval?

We observe similar trends in terms of effectiveness per teacher strategy, with increased effectiveness of $\operatorname*{BERT_{\text{DOT}}}$ models for a single teacher and again a higher increase for the ensemble of teachers. Even though we do not add dense retrieval specific training methods, such as index-based passage sampling (Xiong et al., 2020) or in-batch negatives (Lin et al., 2020), we observe very competitive results compared to those much more costly training approaches.

To put the improved models in the perspective of the efficiency-effectiveness trade-off, we investigated the following question:

RQ4: By how much does effective knowledge distillation shift the balance in the efficiency-effectiveness trade-off?

We show how the knowledge distilled efficient architectures outperform the $\operatorname*{BERT_{\text{CAT}}}$ baselines on several metrics. There is no longer a compromise in utilizing $\operatorname*{PreTT}$ or $\operatorname*{ColBERT}$ and the effectiveness gap, i.e. the difference between the most effective and the other models, of $\operatorname*{BERT_{\text{DOT}}}$ and $\operatorname*{TK}$ is significantly smaller.

The contributions of this work are as follows:

We propose a cross-architecture knowledge distillation procedure with a Margin-MSE loss for a range of neural retrieval architectures

We conduct a comprehensive study of the effects of cross-architecture knowledge distillation in the ranking scenario

We publish our source code as well as ready-to-use teacher training files for the community at: https://github.com/sebastian-hofstaetter/neural-ranking-kd

Retrieval Models

We study the effects of knowledge distillation on a wide range of recently introduced Transformer- & BERT-based ranking models. We describe their architectures in detail below and summarize them in Table 1.

The common way of utilizing the BERT pre-trained Transformer model in a re-ranking scenario (Nogueira and Cho, 2019; MacAvaney et al., 2019a; Yilmaz et al., 2019) is by concatenating query and passage input sequences. We refer to this base architecture as BERT ${}_{\text{CAT}}$ . In the BERT ${}_{\text{CAT}}$ ranking model, the query ${q}_{1:m}$ and passage ${p}_{1:n}$ sequences are concatenated with special tokens (using the ; operator) and the CLS token representation computed by BERT (selected with the index $1$) is scored with a single linear layer $W_{s}$ :
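Written out from the description above, the scoring function takes the following form. This is a reconstruction from the prose, assuming that the subscript $1$ selects the first (CLS) output position and that SEP denotes the separator token between query and passage:

```latex
\operatorname{BERT_{CAT}}(q_{1:m}, p_{1:n}) = \operatorname{BERT}([\text{CLS}; q_{1:m}; \text{SEP}; p_{1:n}])_{1} * W_{s}
```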

We utilize BERT ${}_{\text{CAT}}$ as our teacher architecture, as it represents the current state of the art in terms of effectiveness. However, it requires substantial compute at query time and increases the query latency by seconds (Hofstätter and Hanbury, 2019; Xiong et al., 2020). Simply using smaller BERT variants does not change the design flaw of having to compute every representation at query time.

2. BERT ${}_{\textbf{DOT}}$ Dot Product Scoring

In contrast to BERT ${}_{\text{CAT}}$ , which requires a full online computation, the BERT ${}_{\text{DOT}}$ model only matches a single CLS vector of the query with a single CLS vector of a passage (Xiong et al., 2020; Luan et al., 2020; Lu et al., 2020). The BERT ${}_{\text{DOT}}$ model uses two independent $\operatorname*{BERT}$ computations as follows:

which allows us to pre-compute every contextualized passage representation $\hat{p}$ . After this, the model computes the final scores as the dot product $\cdot$ of $\hat{q}$ and $\hat{p}$ :
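A minimal sketch of this two-phase scoring follows. It only illustrates the split between indexing time and query time: `toy_encode` is a hypothetical stand-in for the two BERT encoders (it counts a few vocabulary words), and all function names are assumptions made for this example.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_passage_index(passages, encode_passage):
    """Indexing phase: encode every passage once and keep the vector."""
    return {pid: encode_passage(text) for pid, text in passages.items()}


def rank(query, index, encode_query):
    """Query phase: one encoder call for the query, then only dot products."""
    q_vec = encode_query(query)
    scored = [(pid, dot(q_vec, p_vec)) for pid, p_vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_encode(text):
    """Hypothetical stand-in for a BERT CLS vector: counts of three words."""
    vocab = ["neural", "ranking", "cooking"]
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


index = build_passage_index(
    {"p1": "neural ranking models", "p2": "cooking pasta"}, toy_encode
)
ranking = rank("neural ranking", index, toy_encode)
```

The passage vectors never depend on the query, which is what makes pre-computation and nearest neighbor search possible.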

BERT ${}_{\text{DOT}}$ , with its bottleneck of comparing single vectors, compresses information much more strongly than BERT ${}_{\text{CAT}}$ , which brings large query time improvements at the cost of lower effectiveness, as can be seen in Table 1.

3. ColBERT

The $\operatorname*{ColBERT}$ model (Khattab and Zaharia, 2020) is similar in nature to $\operatorname*{BERT_{\text{DOT}}}$ , in that it delays the interactions between query and document until after the BERT computation. $\operatorname*{ColBERT}$ uses every query and document representation:

where the $\operatorname{rep}(MASK)$ method repeats the MASK token a number of times, set by a hyperparameter. Khattab and Zaharia (2020) introduced this query augmentation method to increase the computational capacity of the BERT model for short queries. We independently confirmed that adding these MASK tokens improves the effectiveness of $\operatorname*{ColBERT}$ . The interactions in the $\operatorname*{ColBERT}$ model are aggregated with a max-pooling per query term and sum of query-term scores as follows:
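The aggregation described above (a maximum over passage terms for each query term, followed by a sum over query terms) can be sketched as follows. The function names are assumptions for this example, and the term vectors are toy values rather than BERT outputs.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, passage_vecs):
    """Sum over query terms of the best-matching passage term score.

    Computes len(query_vecs) * len(passage_vecs) dot products in total.
    """
    total = 0.0
    for q in query_vecs:
        # Max-pooling over all passage terms for this query term.
        total += max(dot(q, p) for p in passage_vecs)
    return total


# Two query term vectors and three passage term vectors (toy values).
query_terms = [[1.0, 0.0], [0.0, 1.0]]
passage_terms = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3]]
score = late_interaction_score(query_terms, passage_terms)
```

In this toy case the first query term matches best with the first passage term (0.5) and the second with the second passage term (0.9), so the score is their sum.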

The aggregation only requires $n*m$ dot product computations, making it roughly as efficient as $\operatorname*{BERT_{\text{DOT}}}$ ; however, the storage cost of pre-computing passage representations is much higher and depends on the total number of terms in the collection. Khattab and Zaharia (2020) proposed to compress the dimensions of the representation vectors by reducing the output features of $W_{s}$ . We omitted this compression, as storage space is not the focus of our study and to better compare results across different models.

4. PreTT

The $\operatorname*{PreTT}$ architecture (MacAvaney et al., 2020a) is conceptually between $\operatorname*{BERT_{\text{CAT}}}$ and $\operatorname*{ColBERT}$ , as it allows $b$ BERT layers to be computed separately for query and passage:

Then $\operatorname*{PreTT}$ concatenates the sequences with a SEP separator token and computes the remaining layers, for a total of $\hat{b}$ BERT layers. Finally, the CLS token output is pooled with a single linear layer $W_{s}$ :

Concurrently to $\operatorname*{PreTT}$ , DC-BERT (Zhang et al., 2020) and EARL (Gao et al., 2020a) have been proposed with very similar approaches to split Transformer layers. We selected $\operatorname*{PreTT}$ simply as a representative of this group of models. Similar to $\operatorname*{ColBERT}$ , we omitted the optional compression of representations for better comparability.

5. Transformer-Kernel

The Transformer-Kernel (TK) model (Hofstätter et al., 2020b) is not based on BERT pre-training, but rather uses shallow Transformers. TK independently contextualizes query ${q}_{1:m}$ and passage ${p}_{1:n}$ based on pre-trained word embeddings, where the intensity of the contextualization (Transformers as $\operatorname*{TF}$ ) is set by a gate $\alpha$ :

The sequences $\hat{q}_{1:m}$ and $\hat{p}_{1:n}$ interact in a match-matrix with a cosine similarity per term pair and each similarity is activated by a set of Gaussian kernels (Xiong et al., 2017):

Kernel-pooling is a soft-histogram, which counts the number of occurrences of similarity ranges. Each kernel $k$ focuses on a fixed range with center $\mu_{k}$ and width of $\sigma$ .

These kernel activations are then summed, first by the passage term dimension $j$ , log-activated, and then the query dimension is summed, resulting in a single score per kernel. The final score is calculated by a weighted sum using $W_{s}$ :
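The kernel-pooling steps described above can be sketched as follows. This is an illustration under stated assumptions, not the TK implementation: the Gaussian form $\exp(-(x-\mu_k)^2 / (2\sigma^2))$, the natural logarithm, and the small clamp that avoids $\log(0)$ are choices made for this example, and the term vectors stand in for contextualized representations.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def kernel_pooling_score(query_vecs, passage_vecs, mus, sigma, weights):
    """Soft-histogram scoring: one value per kernel, then a weighted sum."""
    per_kernel = []
    for mu in mus:
        kernel_total = 0.0
        for q in query_vecs:
            # Sum Gaussian activations over the passage term dimension.
            soft_count = sum(
                math.exp(-((cosine(q, p) - mu) ** 2) / (2 * sigma ** 2))
                for p in passage_vecs
            )
            # Log-activation, clamped to avoid log(0) (assumption).
            kernel_total += math.log(max(soft_count, 1e-10))
        per_kernel.append(kernel_total)
    # Final score: weighted sum over kernels.
    return sum(w * k for w, k in zip(weights, per_kernel))


# One query term that exactly matches two passage terms, one kernel at mu=1.
score = kernel_pooling_score(
    [[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], mus=[1.0], sigma=0.1, weights=[1.0]
)
```

With a single kernel centered at 1.0 and two exact matches, the soft count is 2 and the score is $\log 2$, which shows how a kernel counts occurrences within its similarity range.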

6. Comparison

In Table 1 we summarize our evaluated models. We compare the efficiency and effectiveness trade-off in the leftmost section, followed by a general overview of the model capabilities in the rightmost section. We measure the query latency for 1 query and 1000 documents with cached document representations where applicable and report the peak GPU memory requirement for the inference of the validation set. We summarize our observations of the different model characteristics:

The query latency of $\operatorname*{BERT_{\text{CAT}}}$ is prohibitive for efficient production use (except for head queries that can be fully pre-computed).

$\operatorname*{BERT_{\text{DOT}}}$ is the most efficient BERT-based model with regard to storage and query latency, at the cost of lower effectiveness compared to ColBERT and PreTT.

PreTT highly depends on the choice of the concatenation-layer hyperparameter, which we set to 3 to be between $\operatorname*{BERT_{\text{CAT}}}$ and $\operatorname*{ColBERT}$ .

$\operatorname*{ColBERT}$ is especially suited for small collections, as it requires a large passage cache.

$\operatorname*{TK}$ is less effective overall; however, it is much cheaper to run than the other models.

The most suitable neural ranking model ultimately depends on the exact scenario. To allow people to make the choice, we evaluated all presented models. We use $\operatorname*{BERT_{\text{CAT}}}$ as our teacher architecture and the other presented architectures as students.

Cross-Architecture Knowledge Distillation

The established approach to training deep neural ranking models is mainly based on large-scale annotated data. Here, the MSMARCO collection is becoming the de-facto standard. The MSMARCO collection only contains binary annotations for fewer than two positive examples per query, and no explicit annotations for non-relevant passages. The approach proposed by Bajaj et al. (2016) is to utilize randomly selected passages retrieved from the top 1000 candidates of a traditional retrieval system as negative examples. This approach works reasonably well, but accidentally picking relevant passages is possible.

Neural retrieval models are commonly trained on triples of binary relevance assignments of one relevant and one non-relevant passage. However, they are used in a setting that requires a much more nuanced view of relevance when they re-rank a thousand possibly relevant passages. The $\operatorname*{BERT_{\text{CAT}}}$ architecture shows the strongest generalization capabilities, which other architectures do not possess.

Following our observation of distinct scoring ranges of different model architectures in Figure 1, we propose to utilize a knowledge distillation loss by only optimizing the margin between the scores of the relevant and the non-relevant sample passage per query. We call our proposed approach Margin Mean Squared Error (Margin-MSE). We train ranking models on batches containing triples of queries $Q$ , relevant passages $P^{+}$ , and non-relevant passages $P^{-}$ . We utilize the output margin of the teacher model $M_{t}$ as label to optimize the weights of the student model $M_{s}$ :
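Written out, the loss described above compares the student margin with the teacher margin. The formula below is a reconstruction from the prose; the symbol $\mathcal{L}$ for the loss is an assumption of this example:

```latex
\mathcal{L}(Q, P^{+}, P^{-}) = \operatorname{MSE}\big(M_{s}(Q, P^{+}) - M_{s}(Q, P^{-}),\; M_{t}(Q, P^{+}) - M_{t}(Q, P^{-})\big)
```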

MSE is the Mean Squared Error loss function, calculating the mean of the squared differences between the scores $S$ and the targets $T$ over the batch size:
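A minimal sketch of the Margin-MSE computation on plain Python lists follows. It is an illustration only, with function and argument names chosen for this example; a real training setup would compute the same quantity on framework tensors so that gradients flow to the student.

```python
def mse(scores, targets):
    """Mean of the squared differences between scores and targets."""
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)


def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """MSE between student margins and teacher margins over a batch.

    Each argument is a list with one score per training triple.
    """
    student_margin = [p - n for p, n in zip(student_pos, student_neg)]
    teacher_margin = [p - n for p, n in zip(teacher_pos, teacher_neg)]
    return mse(student_margin, teacher_margin)


# The student scores live in a different (negative) range than the teacher
# scores, but the margins agree (3 and 1), so the loss is zero.
loss_same_margin = margin_mse(
    student_pos=[-1.0, -2.0], student_neg=[-4.0, -3.0],
    teacher_pos=[5.0, 2.0], teacher_neg=[2.0, 1.0],
)

# Changing the second teacher margin from 1 to 3 gives (0 + 2**2) / 2.
loss_different_margin = margin_mse(
    student_pos=[-1.0, -2.0], student_neg=[-4.0, -3.0],
    teacher_pos=[5.0, 4.0], teacher_neg=[2.0, 1.0],
)
```

The first call shows the property the loss is designed for: only the difference between the two scores is constrained, so the student remains free to settle in its own scoring range.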

The Margin-MSE loss discards the original binary relevance information, in contrast to other knowledge distillation approaches (Li et al., 2020), as the margin of the teacher can potentially be negative, which would indicate a reverse ordering from the original training data. We observe that the teacher models have a very high pairwise ranking accuracy during training of over 98%; therefore, we view it as redundant to add the binary information in the ranking loss. (We do not analyze this statistic further in this paper, as we did not see a correlation or interesting difference between models on this pairwise training accuracy metric.)

In Figure 2 we show the staged process of our knowledge distillation. For simplicity and ease of re-use, we utilize the same training triples for every step. The process begins with training a $\operatorname*{BERT_{\text{CAT}}}$ teacher model on the collection labels with a RankNet loss (Burges, 2010). After the teacher training is finished, we use the teacher model again to infer all scores for the training data, without updating its weights. This allows us to store the teacher scores once, for an efficient experimentation and sharing workflow. Finally, we train our student model of a different architecture, by using the teacher scores as labels with our proposed Margin-MSE loss.

Experiment design

For our neural re-ranking training and inference we use PyTorch (Paszke et al., 2017) and the HuggingFace Transformer library (Wolf et al., 2019). For the first stage indexing and retrieval we use Anserini (Yang et al., 2017).

We use the MSMARCO-Passage (Bajaj et al., 2016) collection with the sparsely-judged MSMARCO-DEV query set of 49,000 queries as well as the densely-judged query set of 43 queries derived from TREC-DL’19 (Craswell et al., 2019). For TREC graded relevance labels we use a binarization point of 2 for MRR and MAP. MSMARCO is based on sampled Bing queries and contains 8.8 million passages with a proposed training set of 40 million sampled triples. We evaluate our teachers on the full training set, so as not to limit future work in terms of the number of triples available. We cap the query length at $30$ tokens and the passage length at $200$ tokens.
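To make the binarization step concrete, the following sketch computes MRR at a cutoff from graded labels. It assumes that a binarization point of 2 means that labels of 2 or higher count as relevant; the function names and data layout are hypothetical and do not reflect any particular evaluation tool.

```python
def reciprocal_rank(ranked_ids, graded_labels, cutoff=10, binarization_point=2):
    """1 / rank of the first passage whose label reaches the threshold."""
    for rank, pid in enumerate(ranked_ids[:cutoff], start=1):
        if graded_labels.get(pid, 0) >= binarization_point:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, qrels, cutoff=10, binarization_point=2):
    """Average reciprocal rank over all queries in `rankings`."""
    total = sum(
        reciprocal_rank(ranked, qrels.get(qid, {}), cutoff, binarization_point)
        for qid, ranked in rankings.items()
    )
    return total / len(rankings)


rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e"]}
qrels = {"q1": {"a": 1, "b": 3}, "q2": {"e": 1}}

mrr_at_2 = mean_reciprocal_rank(rankings, qrels, binarization_point=2)
mrr_at_1 = mean_reciprocal_rank(rankings, qrels, binarization_point=1)
```

With a threshold of 2, passage "a" (label 1) no longer counts, so the first query scores 1/2 and the second 0; with a threshold of 1, the same rankings score 1 and 1/2. The choice of binarization point therefore changes the reported value for identical system output.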

2. Training Configuration

We use the Adam (Kingma and Ba, 2014) optimizer with a learning rate of $7*10^{-6}$ for all BERT layers, regardless of the number of layers trained. TK is the only model trained on a higher rate of $10^{-5}$ . We employ early stopping, based on the best nDCG@10 value of the validation set. We use a training batch size of 32.

3. Model Parameters

All student language models use a 6-layer DistilBERT (Sanh et al., 2019) as their initialization starting point. We chose DistilBERT over BERT-Base, as it has been shown to provide a close lower bound on the results at half the runtime (Sanh et al., 2019; MacAvaney et al., 2020a). For our ColBERT implementation we repeat the query MASK augmentation 8 times, regardless of the amount of padding in a batch, in contrast to Khattab and Zaharia (2020). For PreTT we decided to concatenate sequences after 3 layers of the 6-layer DistilBERT, as we want to evaluate it as a mid-choice between ColBERT and $\operatorname*{BERT_{\text{CAT}}}$ . For TK we use the standard 2-layer configuration with 300-dimensional embeddings. For the traditional BM25 we use the tuned parameters from the Anserini documentation.

Results

We now discuss our research questions, starting with the study of our proposed Margin-MSE loss function; followed by an analysis of different teacher model results and their impact on the knowledge distillation; and finally examining what the knowledge distillation improvement means for the efficiency-effectiveness trade-off.

We validate our approach presented in Section 3 and answer our research question RQ1 (How can we apply knowledge distillation in retrieval across architecture types?) by comparing Margin-MSE with different knowledge distillation losses using the same training data. We compare our approach with a pointwise MSE loss, defined as follows:

This is a standard approach already used by Vakili Tahami et al. (2020) and Li et al. (2020). Additionally, we utilize a weighted RankNet loss, where we weight the samples in a batch according to the teacher margin:
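Both comparison losses can be sketched on plain lists as follows. The details are assumptions made for this illustration: the pointwise loss is taken as the sum of one MSE term for the relevant and one for the non-relevant passage, RankNet is taken as $\log(1 + e^{-(s^{+} - s^{-})})$, and the weight is the absolute teacher margin, averaged over the batch.

```python
import math


def mse(scores, targets):
    """Mean of the squared differences between scores and targets."""
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)


def pointwise_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Match the teacher scores directly (assumed: sum of two MSE terms)."""
    return mse(student_pos, teacher_pos) + mse(student_neg, teacher_neg)


def ranknet(margin):
    """log(1 + exp(-margin)), written in a numerically stable form."""
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_ranknet(student_pos, student_neg, teacher_pos, teacher_neg):
    """RankNet on the student margin, weighted by the teacher margin size."""
    losses = [
        ranknet(sp - sn) * abs(tp - tn)
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(losses) / len(losses)


pointwise = pointwise_mse([1.0], [0.0], [2.0], [0.0])
weighted = weighted_ranknet([0.0], [0.0], [3.0], [1.0])
```

Unlike a margin-based objective, the pointwise variant ties the student to the absolute scoring range of the teacher, while the weighted RankNet variant uses the teacher only to decide how much each triple matters.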

We show the results of our ablation study in Table 2 for three distinct ranking architectures that significantly differ from the $\operatorname*{BERT_{\text{CAT}}}$ teacher model. We use a single (BERT-Base ${}_{\text{CAT}}$ ) teacher model for this study. For each of the three architectures the Margin-MSE loss outperforms the pointwise MSE and weighted RankNet losses on all metrics. However, we also note that applying knowledge distillation in general improves each model’s result over the respective original baseline. Our aim in proposing to use the Margin-MSE loss was to create a simple yet effective solution that does not require changes to the model architectures or major adaptations to the training procedure.

2. Knowledge Distillation Results

Utilizing our proposed Margin-MSE loss in connection with our trained teacher models, we follow the procedure laid out in Section 3 to train our knowledge-distilled student models. Table 3 first shows our baselines, then in the second section the results of our teacher models, and in the third section our student architectures. Each student has a baseline result without teacher training (depicted by –) and a single teacher T1 as well as the teacher ensemble denoted with T2. With these results we can now answer:

RQ2: How effective is the distillation with a single teacher model in comparison to an ensemble of teachers?

We selected BERT-Base ${}_{\text{CAT}}$ as our single teacher model, as it is a commonly used instance in neural ranking models. The ensemble of different larger $\operatorname*{BERT_{\text{CAT}}}$ models shows strong and consistent improvements on all MSMARCO DEV metrics and MAP@1000 of TREC-DL’19. When we compare our teacher model results with the best re-ranking entry (Yan et al., 2020) of TREC-DL’19, we see that our teachers, especially the ensemble, outperform the TREC results and represent the state of the art in terms of effectiveness.

Overall, we observe that either a single teacher or an ensemble of teachers improves the model results over their respective original baselines. The ensemble T2 improves over T1 for all models on the sparse MSMARCO-DEV labels with many queries. Only on the TREC-DL’19 query set does T2 fail to improve over T1 for TK and PreTT. The only outlier in our results is BERT-Base ${}_{\text{DOT}}$ trained on T1, where there is no improvement over the baseline; T2, however, does show a substantial improvement. This leads us to the conclusion that utilizing an ensemble of teachers is overall preferred to a single teacher model.

Furthermore, when we compare the BERT type for the $\operatorname*{BERT_{\text{CAT}}}$ architecture, we see that DistilBERT ${}_{\text{CAT}}$ -T2 outperforms any single teacher model with twice and four times the layers on almost all metrics. For the $\operatorname*{BERT_{\text{DOT}}}$ architecture we also compared BERT-Base and DistilBERT, both as students, and here BERT-Base has a slight advantage trained on T2. However, its T1 results are inconsistent, where almost no improvement is observable, whereas DistilBERT ${}_{\text{DOT}}$ exhibits consistent gains first for T1 and then another step for T2.

Our T2 training improves both instances of the $\operatorname*{BERT_{\text{DOT}}}$ architecture over the ANCE-trained (Xiong et al., 2020) $\operatorname*{BERT_{\text{DOT}}}$ model when evaluated in the re-ranking setting.

To also compare the $\operatorname*{BERT_{\text{DOT}}}$ model in the full collection vector retrieval setting we set out to answer:

RQ3: How effective is our distillation for dense nearest neighbor retrieval?

The difference from the previous results in Table 3 is that now we only use the score of a nearest neighbor search of all indexed passages, without re-ranking BM25. Because we no longer re-rank first-stage results, the pipeline overall becomes more efficient and less complex; however, in dense vector space retrieval the chance of false positives becomes greater and the results become less interpretable. The $\operatorname*{ColBERT}$ architecture also includes the possibility to conduct a dense retrieval, however at the expense of increasing the storage requirements from 2GB of plain text to a 2TB index, which stopped us from conducting extensive experiments with $\operatorname*{ColBERT}$ .

We show nearest neighbor retrieval results of our $\operatorname*{BERT_{\text{DOT}}}$ models (using both BERT-Base and DistilBERT encoders) and baselines for dense retrieval in Table 4. Training with a teacher ensemble is again more effective than training with a single teacher, which is still more effective than training the $\operatorname*{BERT_{\text{DOT}}}$ alone without teachers. Interestingly, DistilBERT outperforms BERT-Base across the board with half the Transformer layers. As we let the models train as long as they improved on the early stopping set, this suggests that for the retrieval task we may not need more model capacity, whereas more capacity is a sure bet to improve results on the $\operatorname*{BERT_{\text{CAT}}}$ architecture.

Our dense retrieval results are competitive with related methods, even though those methods train specifically for the dense retrieval task: our approach, while not specific to dense retrieval training, is competitive with the more costly and complex approaches ANCE and TCT-ColBERT. On MSMARCO DEV MRR@10 we are at a slight disadvantage; however, we outperform the models that also published TREC-DL’19 results. RocketQA, the current state-of-the-art dense retrieval result on MSMARCO DEV, requires a batch size of 4,000 and enormous computational resources, which are hardly comparable to our technique, which only requires a batch size of 32 and can be trained on a single GPU.

3. Closing the Efficiency-Effectiveness Gap

We round off our results with a thorough look at the effects of knowledge distillation on the relation between effectiveness and efficiency in the re-ranking scenario. We measure the median query latency under the conditions that we have our cached document representation in memory, contextualize a single query, and compute the respective model’s interaction pattern for 1 query and 1000 documents in a single batch on a TITAN RTX GPU with 24GB of memory. The large GPU memory allows us to also compute the same batch size for $\operatorname*{BERT_{\text{CAT}}}$ , which for inference requires 16GB of total reserved GPU memory in the BERT-Base case. We measure the latency of the neural model in PyTorch inference mode (without accounting for pre-processing or disk access times, as those are highly dependent on the use of optimized inference libraries) to answer:

RQ4: By how much does effective knowledge distillation shift the balance in the efficiency-effectiveness trade-off?

In Figures 3 and 4, we plot the median query latency on the log-scaled x-axis versus the effectiveness on the y-axis. The teacher trained models are indicated with T1 and T2. The latency for different teachers does not change, as we do not change the architecture, only the weights of the models. The T1 teacher model $\operatorname*{BERT_{\text{CAT}}}$ is indicated with the red square. The TREC-DL’19 results in Figure 3 show how DistilBERT ${}_{\text{CAT}}$ , PreTT, and ColBERT not only close the gap to BERT-Base ${}_{\text{CAT}}$ , but improve on the single instance BERT-Base ${}_{\text{CAT}}$ results. The $\operatorname*{BERT_{\text{DOT}}}$ and TK models, while not reaching the effectiveness of the other models, are also improved over their baselines and are more efficient in terms of total runtime (TK) and index space ( $\operatorname*{BERT_{\text{DOT}}}$ ). The MSMARCO DEV results in Figure 4 differ from Figure 3 in DistilBERT ${}_{\text{CAT}}$ and PreTT outperforming BERT-Base ${}_{\text{CAT}}$ as well as the evaluated $\operatorname*{BERT_{\text{DOT}}}$ variants under-performing overall in comparison to TK and ColBERT.

Even though in this work we measure the inference time on a GPU, we believe that the most efficient models (namely TK, ColBERT, and $\operatorname*{BERT_{\text{DOT}}}$ ) allow for production CPU inference, assuming the document collection has been pre-computed on GPUs. Furthermore, in a cascading search pipeline, one can hide most of the remaining computation complexity of the query contextualization during earlier stages.
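The median latency measurement used in this section can be sketched as follows. This is a generic timing harness under stated assumptions, not the benchmark code of this work: `score_batch` is a hypothetical callable that scores one query against its candidate batch, and the number of warm-up runs is an arbitrary choice for this example.

```python
import statistics
import time


def median_latency_ms(score_batch, batches, warmup=3):
    """Median wall-clock time in milliseconds of score_batch over batches.

    score_batch: hypothetical callable that scores one query against its
    candidate passages (for example 1 query and 1000 documents).
    """
    # Warm-up runs are executed but not timed.
    for batch in batches[:warmup]:
        score_batch(batch)

    timings = []
    for batch in batches:
        start = time.perf_counter()
        score_batch(batch)
        # Note: GPU kernels run asynchronously, so a real GPU measurement
        # has to wait for the device before reading the clock (in PyTorch,
        # torch.cuda.synchronize()).
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


calls = []
latency = median_latency_ms(lambda batch: calls.append(batch), [1, 2, 3, 4], warmup=2)
```

Reporting the median rather than the mean keeps occasional slow runs from dominating the result.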

Teacher analysis

Finally, we analyse the distribution of our teacher score margins, to validate the intuition of using a teacher ensemble and we look at per-query nDCG changes for two models between teacher-trained instances and the baseline.

To validate the use of an ensemble of teachers for RQ2, we analyze the output score margin distribution of our teacher models in Figure 5, to see if they bring diversity to the ensemble mix. This is the margin used in the Margin-MSE loss. We observe that the same $\operatorname*{BERT_{\text{CAT}}}$ architecture, differing only in the BERT language model used, shows three distinct score patterns. We view this as a good sign for the applicability of an ensemble of teachers, indicating that the different teachers have different viewpoints to offer. To ensemble our teacher models we computed a mean of their scores per example used for the knowledge distillation, to not introduce more complexity in the process.

An interesting quirk of our Margin-MSE definition is the possibility to reverse orderings if the margin between a pair is negative. In Figure 5 we can see the reversal of the ordering of pairs in the distribution for the $< 0$ margin. It happens rarely and if a swap occurs the score difference is small. We investigated this issue by qualitatively analyzing a few dozen cases and found that the teacher models are most of the time correct in their determination to reverse or equalize the margin. Because it only affects a few percent of the training data we retained those samples as well to not change the training data.

2. Per-Query Teacher Impact Analysis

In addition to the aggregated results presented in Table 3, we now take a closer look at the impact of T1 and T2 teachers in a per-query analysis for ColBERT and DistilBERT ${}_{\text{DOT}}$ in Figure 6. We plot the differences in nDCG@10 per query on the TREC-DL’19 set between the original training results and the T1 and T2 training respectively. A positive change means the T1/T2 trained model does better on this particular query. We sorted the queries by the T2 changes for both plots, and plotted the corresponding query results for T1 at the same position. Overall, the T1 & T2 training for both models improves roughly 60 % of queries and decreases results on 33 %, with the rest of the queries unchanged. Interestingly, the average change in each direction between the T1 and T2 training shows that T2 results become more extreme, as they improve more on average (DistilBERT ${}_{\text{DOT}}$ from T1 $+10$ % to T2 $+13$ %; ColBERT from T1 $+6$ % to T2 $+9$ %), but also decrease more strongly on average (DistilBERT ${}_{\text{DOT}}$ from T1 $-6.8$ % to T2 $-7.2$ %; ColBERT from T1 $-4.3$ % to T2 $-7.8$ %). As we saw in Table 3, the aggregated results still put T2 in front of T1 overall. However, we caution that these stronger decreases show a small limitation of our knowledge distillation approach.
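The per-query comparison described above can be sketched as follows. The sketch assumes that per-query nDCG@10 values are already available as dictionaries and that changes are measured as absolute differences; both the data layout and the name `per_query_changes` are assumptions made for this example.

```python
def per_query_changes(baseline, distilled):
    """Summarize per-query metric changes between two training variants.

    baseline, distilled: dicts mapping query id to a metric value such as
    nDCG@10 (hypothetical layout). Changes are absolute differences.
    """
    deltas = {qid: distilled[qid] - baseline[qid] for qid in baseline}
    gains = [d for d in deltas.values() if d > 0]
    losses = [d for d in deltas.values() if d < 0]
    n = len(deltas)
    return {
        "improved_share": len(gains) / n,
        "decreased_share": len(losses) / n,
        "mean_gain": sum(gains) / len(gains) if gains else 0.0,
        "mean_loss": sum(losses) / len(losses) if losses else 0.0,
        # Queries sorted by change, as used for plotting.
        "sorted_deltas": sorted(deltas.items(), key=lambda item: item[1]),
    }


baseline = {"q1": 0.5, "q2": 0.5, "q3": 0.5, "q4": 0.5}
distilled = {"q1": 0.75, "q2": 0.25, "q3": 0.5, "q4": 1.0}
summary = per_query_changes(baseline, distilled)
```

Separating the average gain from the average loss is what reveals the effect discussed above: an aggregate metric can improve while the queries that get worse do so by a larger amount.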

Related Work

Recent studies have investigated different approaches for improving the efficiency of relevance models. Ji et al. (2019) demonstrate that approximations of interaction-based neural ranking algorithms using kernels with locality-sensitive hashing accelerate the query-document interaction computation. In order to reduce the query processing latency, Mackenzie et al. (2020) propose a static index pruning method when augmenting the inverted index with precomputed re-weighted terms (Dai and Callan, 2020). Several approaches aim to improve the efficiency of transformer models with windowed self-attention (Hofstätter et al., 2020a), using locality-sensitive hashing (Kitaev et al., 2020), replacing the self-attention with a local windowed and global attention (Beltagy et al., 2020) or by combining an efficient transformer-kernel model with a conformer layer (Mitra et al., 2020).

Adapted training procedures

In order to tackle the challenge of a small annotated training set, Dehghani et al. (2017) propose weak supervision controlled by full supervision to train a confident model. Subsequently, they demonstrate the success of a semi-supervised student-teacher approach for an information retrieval task using weakly labelled data, where the teacher has access to the high quality labels (Dehghani et al., 2018). Examining different weak supervision sources, MacAvaney et al. (2019b) show the beneficial use of headline-content pairs as pseudo-relevance judgements for weak supervision. Considering the success of weak supervision strategies for IR, Khattab and Zaharia (2020) train ColBERT (Khattab and Zaharia, 2020) for OpenQA with guided supervision by iteratively using ColBERT to extract positive and negative samples as training data. Similarly, Xiong et al. (2020) construct negative samples from the approximate nearest neighbours to the positive sample during training and apply this adapted training procedure for dense retrieval training. Cohen et al. (2019) demonstrate that the sampling policy for negative samples plays an important role in the stability of the training and the overall performance with respect to IR metrics. MacAvaney et al. (2020b) adapt the training procedure for answer ranking by reordering the training samples, shifting samples that are estimated to be easy to the beginning.

Knowledge distillation

Large pretrained language models advanced the state-of-the-art in natural language processing and information retrieval, but the performance gains come with high computational cost. There are numerous advances in distilling these models to smaller models aiming for little effectiveness loss.

Creating smaller variants of the general-purpose BERT model, Jiao et al. (2019) distill TinyBERT and Sanh et al. (2019) create DistilBERT, demonstrating how to distill BERT while maintaining the model's accuracy for a variety of natural language understanding tasks.

In the IR setting, Tang and Wang (2018) distill sequential recommendation models for recommender systems with one teacher model. Vakili Tahami et al. (2020) study the impact of knowledge distillation on BERT-based retrieval chatbots. Gao et al. (2020b) and Chen et al. (2020) distilled different sizes of the same $\operatorname*{BERT_{\text{CAT}}}$ architecture and the TinyBERT library (Jiao et al., 2019). As part of the PARADE document ranking model, Li et al. (2020) showed a similar $\operatorname*{BERT_{\text{CAT}}}$ to $\operatorname*{BERT_{\text{CAT}}}$ same-architecture knowledge distillation for different layer and dimension hyperparameters. A shortcoming of these distillation approaches is that they are only applicable to the same architecture, which restricts the retrieval model to full online inference of the $\operatorname*{BERT_{\text{CAT}}}$ model. Lu et al. (2020) utilized knowledge distillation from $\operatorname*{BERT_{\text{CAT}}}$ to $\operatorname*{BERT_{\text{DOT}}}$ in the setting of keyword matching to select ads for sponsored search. They were the first to show that a knowledge transfer from $\operatorname*{BERT_{\text{CAT}}}$ to $\operatorname*{BERT_{\text{DOT}}}$ is possible, albeit in a more restricted setting of keyword list matching in comparison to our full-text ranking setting.

Conclusion

We proposed to use cross-architecture knowledge distillation to improve the effectiveness of efficient, low query latency neural passage ranking models, taught by the state-of-the-art full interaction $\operatorname*{BERT_{\text{CAT}}}$ model. Following our observation that different architectures converge to different scoring ranges, we proposed to optimize not the raw scores, but rather the margin between a pair of relevant and non-relevant passages with a Margin-MSE loss. We showed that this method outperforms a simple pointwise MSE loss. Furthermore, we compared the performance of a single teacher model with an ensemble of large $\operatorname*{BERT_{\text{CAT}}}$ models and found that in most cases using an ensemble of teachers is beneficial in the passage retrieval task. Trained with a teacher ensemble, single instances of efficient models even outperform their single instance teacher models, which have many more parameters and more interaction capacity. We observed a drastic shift in the effectiveness-efficiency trade-off of our evaluated models towards more effectiveness for efficient models. In addition to re-ranking models, we showed that our general distillation method produces competitive effectiveness compared to specialized training techniques for the dual-encoder $\operatorname*{BERT_{\text{DOT}}}$ model in the nearest neighbor retrieval setting. We published our teacher training files, so the community can use them without significant changes to their setups. For future work we plan to combine our knowledge distillation approach with other neural ranking training adaptations, such as curriculum learning or dynamic index sampling for end-to-end neural retrieval.
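To make the difference between optimizing margins and optimizing raw scores concrete, the following is a minimal pure-Python sketch rather than our training code. The function names, the pointwise formulation (an average over the positive and negative scores), and all score values are illustrative assumptions; an actual training setup would compute the same quantities on tensors in a deep learning framework.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of scores, one entry per (query, relevant passage,
    non-relevant passage) training triple.
    """
    lengths = {len(student_pos), len(student_neg), len(teacher_pos), len(teacher_neg)}
    if len(lengths) != 1 or not student_pos:
        raise ValueError("all score lists must be non-empty and of equal length")
    squared_errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(squared_errors) / len(squared_errors)


def pointwise_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """One plausible pointwise baseline: MSE on the raw scores themselves."""
    student = list(student_pos) + list(student_neg)
    teacher = list(teacher_pos) + list(teacher_neg)
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Hypothetical scores: the student lives in a different scoring range than the
# teacher (shifted by 100) but separates each passage pair by the same margin.
teacher_pos, teacher_neg = [8.0, 6.0], [2.0, 5.0]
student_pos, student_neg = [108.0, 106.0], [102.0, 105.0]

margin_loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
pointwise_loss = pointwise_mse(student_pos, student_neg, teacher_pos, teacher_neg)
```

In this constructed case the margin-based loss is zero because the student already reproduces the teacher's margins, while the pointwise loss is large and would push the student towards the teacher's scoring range even though the resulting ranking would not change.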