Multi-task Retrieval for Knowledge-Intensive Tasks

Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh

Introduction

Knowledge-intensive tasks is the common designation for a class of real-world NLP problems which, because of their nature, require large amounts of knowledge about the world (Petroni et al., 2020). For example, open-domain question answering requires producing answers to general factoid questions; fact checking involves determining the veracity of claims based on a database of trusted evidence. Practical solutions to these tasks usually involve an efficient retrieval component that, given an input query, selects a limited subset of relevant information from a large knowledge source. Sophisticated downstream models then consider the input only in the context of the retrieved information, and perform the final task. While large pre-trained neural models have been shown to incorporate real-world knowledge in their parameters and thus may skip retrieval (Petroni et al., 2019), they still have limited capacity and suffer from a lack of explainability.

The standard retrieval component in many systems (e.g., Thorne et al., 2018; Wang et al., 2018; Chen et al., 2017) has long relied on term-matching methods, such as tf-idf or BM25 (Robertson and Zaragoza, 2009). These methods rely on efficient algorithms and usually perform reasonably well regardless of the problem. In contrast, recent neural retrieval models, such as ICT (Lee et al., 2019), DPR (Karpukhin et al., 2020) and RAG (Lewis et al., 2020b) achieve better results by learning directly from task-specific training data and going beyond simple keyword matching. While task specialisation results in improved task performance, researchers have observed that a retriever trained for one specific domain will typically achieve low out-of-domain performance, and even lower performance on entirely different tasks (Petroni et al., 2020). This has two implications. First, unlike tf-idf or BM25, neural retrieval models are unsuitable for low data regimes such as few- and zero-shot settings. Second, task-specific retrievers complicate practical applications where multiple knowledge-intensive tasks may need to be performed using the same supporting database or over the same input text. It may not be practical to deploy multiple separate specialised models due to computational performance or memory concerns.

We ask the following question in this work: can we develop a universal neural retriever? Namely, we target a retriever that can perform well on a wide variety of problems, without task-specific fine-tuning, but, if additional in-domain labelled data is available, it can be further fine-tuned to improve the performance. We perform a large experimental study to attempt to build such a universal retrieval model. We find that, by jointly training on an extensive selection of retrieval tasks, we obtain a model which is not only more robust than previous approaches, but also can lead to better performance on the downstream knowledge-intensive tasks when plugged into an existing system. Our approach combines the benefits from IR-based models with those of task-specific neural retrievers – namely, good performance when no (or not enough) training data is available and high task performance due to its ability to learn highly specialised representations.

Our contributions can be summarised as follows.

We propose a single general-purpose “universal” retrieval model, able to perform comparably to or better than specialised retriever approaches in both zero-shot (leave-one-out) and few-shot retrieval. We investigate several model variants, shedding light on which aspects of the architecture affect its performance.

We show that our model’s gains in terms of retrieval directly translate into performance gains for a variety of downstream knowledge-intensive tasks.

We will share the implementation as well as our best model. This is in the form of a readily available BERT checkpoint which, as we will show, can be used by NLP practitioners as a strong out-of-the-box retrieval system, but which can also undergo further in-domain training for even higher performance.

Background

In this section, we first give an overview of retrieval methods based on sparse and dense representations. We then discuss a wide range of knowledge-intensive NLP tasks, where retrieval plays a crucial role in solving the problems.

Given a large collection of unstructured text passages, information retrieval (IR) can be broadly defined as finding a small set of passages that satisfies an information need, often presented in the form of a short-text query (Manning et al., 2008). Traditional IR methods, such as tf-idf and BM25 (Robertson and Zaragoza, 2009), match keywords efficiently with an inverted index. Such methods can be seen as representing queries and passages as high-dimensional, sparse vectors, where each dimension corresponds to a term in the vocabulary and the weight indicates its importance.
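To make the term-matching view concrete, the following is a minimal sketch of BM25 scoring over a handful of passages. It assumes whitespace tokenisation, the commonly used defaults `k1=1.5` and `b=0.75`, and a smoothed idf variant; none of these choices are specified in this paper, and a practical system would use an inverted index rather than scoring every passage.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, passages: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every passage against the query with a textbook BM25 formula.

    Assumptions (not from the paper): lower-cased whitespace tokens,
    k1/b defaults, and idf = log(1 + (N - n + 0.5) / (n + 0.5)).
    """
    docs = [p.lower().split() for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of passages in which each term occurs at least once.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # only exact keyword matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Because only exact term overlap contributes, a passage containing "cats" receives no credit for the query term "cat", which is the kind of limitation dense retrievers aim to overcome.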

In contrast to tf-idf and BM25, dense retrieval methods encode text as a latent semantic vector of a fixed, much smaller dimensionality. Whether a passage is relevant to a given query is determined by the distance between their vectors (Deerwester et al., 1990). Although dense representations do not encode tokens explicitly and can potentially map paraphrases made of completely different tokens to nearby vectors, the performance of early dense retrieval methods was often inferior to term-matching approaches, except when large amounts of labelled data were available (Yih et al., 2011; Gao et al., 2011; Huang et al., 2013). Thanks to the success of large pre-trained models (Devlin et al., 2019; Liu et al., 2019b), however, recent dense retrieval methods have been shown to outperform their sparse counterparts when fine-tuned on a small set of in-domain labelled data (Karpukhin et al., 2020; Lewis et al., 2020b; Xiong et al., 2020). Efficient indexing and search of dense vectors are made possible by maximum inner product search (MIPS) algorithms (e.g., Shrivastava and Li, 2014; Guo et al., 2016), as well as tools like FAISS (Johnson et al., 2019).

Our work is built upon the Dense Passage Retriever (DPR) architecture of Karpukhin et al. (2020), which was initially proposed for the task of open-domain question answering. DPR is a neural bi-encoder model which embeds queries with an encoder $\boldsymbol{f}\,(\cdot)$ and passages with a separate encoder $\boldsymbol{g}\,(\cdot)$ . Given an input query $x$ and a target passage $y$ , the model scores the pair with a similarity function $\operatorname{sim}\left(x,y\right)$ , defined as the inner product of the embeddings of its arguments, $\boldsymbol{f}(x)\cdot\boldsymbol{g}(y)$ . Given a query at inference time, calculating its similarity with every possible passage would be prohibitive for large knowledge sources. Therefore, DPR makes use of the FAISS library (Johnson et al., 2019) to perform fast approximate nearest neighbour search in sub-linear time.
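As an illustration of the scoring rule, the sketch below computes the inner-product similarity between already-computed embeddings and ranks passages by exhaustive search. The function names `sim` and `top_k` are illustrative only, and the brute-force scan is a stand-in for the approximate FAISS search used in practice; it is linear in the number of passages and would not scale to a full knowledge source.

```python
from typing import List, Sequence, Tuple


def sim(query_vec: Sequence[float], passage_vec: Sequence[float]) -> float:
    """Inner product between a query embedding f(x) and a passage embedding g(y)."""
    return sum(q * p for q, p in zip(query_vec, passage_vec))


def top_k(query_vec: Sequence[float],
          passage_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages.

    Exhaustive search over all passages; ties are broken by passage index.
    """
    scored = [(i, sim(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

Since passage embeddings do not depend on the query, they can be computed once and indexed offline; only the query needs to be encoded at inference time.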

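Training is based on a contrastive loss. Given a query $x$ , a relevant passage $y$ , and a set of $n$ irrelevant passages $y^{-}_{i}$ , we train the model by optimising the following negative log likelihood:

$$\mathcal{L}=-\log\frac{\exp\left(\operatorname{sim}\left(x,y\right)\right)}{\exp\left(\operatorname{sim}\left(x,y\right)\right)+\sum_{i=1}^{n}\exp\left(\operatorname{sim}\left(x,y^{-}_{i}\right)\right)}$$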

As the set of irrelevant passages, we use the relevant passages for other queries within the same batch, as well as a specially selected “hard” confounder. This is a passage which has high lexical overlap with the query (high BM25 score), but is not among the set of relevant passages for the given data point. Karpukhin et al. (2020) have shown that the inclusion of such “hard” confounders leads to substantially improved training results. This training process is illustrated in Figure 1.
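The objective with in-batch negatives can be sketched as follows, assuming the standard softmax form of the contrastive negative log likelihood used by DPR. The helper names are illustrative, similarity scores are taken as given, and the simplification that each query sees only its own hard confounder (rather than those of the whole batch) is an assumption of this sketch.

```python
import math
from typing import List, Sequence


def contrastive_nll(pos_score: float, neg_scores: Sequence[float]) -> float:
    """-log( exp(s+) / (exp(s+) + sum_i exp(s_i^-)) ), via a stable log-sum-exp."""
    scores = [pos_score, *neg_scores]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


def batch_loss(score_matrix: Sequence[Sequence[float]],
               hard_scores: Sequence[float]) -> float:
    """Mean loss over a batch.

    score_matrix[i][j] is sim(query_i, relevant passage of query_j), so the
    diagonal holds the positives and the off-diagonal entries act as in-batch
    negatives. hard_scores[i] is sim(query_i, its BM25 hard confounder).
    """
    losses: List[float] = []
    for i, row in enumerate(score_matrix):
        in_batch_negatives = [s for j, s in enumerate(row) if j != i]
        losses.append(contrastive_nll(row[i], [*in_batch_negatives, hard_scores[i]]))
    return sum(losses) / len(losses)
```

Reusing the other queries' positives as negatives means a batch of size B provides B - 1 negatives per query at no extra encoding cost.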

2.2 Knowledge-intensive Tasks

For the training and evaluation of all models in the paper we make use of KILT, a benchmark and library of datasets (Petroni et al., 2020). KILT consists of a selection of datasets spanning five varied classes of knowledge-intensive tasks (i.e., question answering, slot filling, fact checking, dialogue, entity linking), with the aim to cover many different ways of seeking knowledge. Input queries can vary wildly from one task to the other, and include classic examples of open-domain retrieval tasks such as natural language questions and claims to be verified, as well as more unusual examples like conversation fragments and long chunks of annotated text. Crucially, all datasets distributed in KILT have been re-aligned such that they are all grounded in the same snapshot of Wikipedia, which the authors distribute. The knowledge required to answer any of the queries in the library of tasks can thus be found within the same unified knowledge source.

To illustrate the variety of ways in which the input queries for different tasks can be formulated, we provide a few simple examples in Table 1. In spite of the differences between query formulations, all these tasks share one crucial aspect: they all require a retriever to fetch the relevant passages from the knowledge source, in order to support the final downstream task.

Methods

Using task-specific models to tackle our collection of retrieval tasks would involve completely separate models, one per dataset. As illustrated in Figure 2, this would lead to a proliferation of models and data, down to separate indexed copies of the knowledge source itself (Wikipedia). This setup will form one of our baselines.

Multi-task training has been successfully used to allow models to leverage cross-task data, as well as to provide a regularisation effect leading to better generalisation ability (Liu et al., 2019a). We apply this concept to neural retrievers, with the aim of improving performance by jointly leveraging multiple different retrieval datasets.

Our base setup is illustrated in Figure 3(b) and involves using a shared passage encoder — so that a single index of encoded passages can be used — as well as a query encoder that is shared across all tasks. In essence, in this setup a single DPR model is used to perform all retrieval tasks.

Due to the complexity of training and evaluating retrieval models (which involves training the retriever, embedding all of Wikipedia, and building an index), our main set of experiments is all based on this configuration, which was found to work well in preliminary experiments. However, in order to report on the performance of alternative architectures, we also investigate the following additional variants in a restricted experimental setting, limited to a few tasks:

Task-specific query encoder. A different query encoder is used for each family of tasks, e.g. all question answering tasks use the same query encoder, but fact checking uses a different one. This is meant to allow for potentially different needs in processing queries, given the fundamentally diverse nature of the tasks at hand. This configuration is illustrated in Figure 3(a).

Task markers. This approach is similar to our base setup, where a single model performs all tasks. Additionally, we introduce specialised tokens which are inserted at the beginning of each query. Their aim is to help the model distinguish between the different tasks, by marking them. We use one task marker for each of the five task classes of KILT, such that all question answering tasks share the same marker.
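The task-marker variant can be pictured with the short sketch below. The marker strings and the `TASK_MARKERS` mapping are hypothetical: the text only states that one specialised token per KILT task class is inserted at the beginning of each query, not what the tokens look like or how they are added to the vocabulary.

```python
# Hypothetical marker tokens, one per KILT task class (names are assumptions).
TASK_MARKERS = {
    "question_answering": "[QA]",
    "slot_filling": "[SF]",
    "fact_checking": "[FC]",
    "dialogue": "[DIAL]",
    "entity_linking": "[EL]",
}


def mark_query(query: str, task_class: str) -> str:
    """Prepend the marker of the query's task class to the raw query text."""
    return f"{TASK_MARKERS[task_class]} {query}"
```

All datasets within a class share a marker, so two question answering datasets would both be prefixed with the same token before being passed to the single shared query encoder.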

3.2 Adversarial confounder selection

We saw in § 2.1 how “hard” confounder passages are collected using a BM25 baseline, following the standard approach in DPR. However, any other retriever can be used to select such confounders, including the very retriever being trained, leading to an iterative, self-adversarial training procedure. Concretely, this amounts to the following steps: (1) a first version of the retriever is trained with BM25 confounders; (2) new confounders are selected with the trained model, by retrieving high-ranking passages which are not among the set of relevant ones; (3) a second version of the model is trained using the additional new confounders.

Intuitively, this approach should lead to higher quality confounders than those selected by BM25 on the basis of simple keyword matching. Both our own experience and the relevant literature (Khattab et al., 2020) indicate that this adversarial approach works well for question answering.

As a way of further pushing the performance of the model, we experiment with this adversarial confounder selection on two datasets, Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). We selected these two datasets since, out of all of the tasks we are considering, they have an easy way of checking whether a certain passage is relevant or not for a given query – namely, by checking whether the answer is present in the passage. This enabled us to automatically build sets of confounders, ensuring relevant passages would be excluded. Strictly speaking, assuming a passage to be irrelevant because of the absence of the answer span is not formally correct. However, experiments show a good correlation between this simple check and the overall model quality.
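Step (2) of the procedure, combined with the answer-presence check, might look like the sketch below. It assumes the trained retriever has already returned passages in ranked order and that relevance is approximated by case-insensitive substring matching; the function names and the `max_confounders` parameter are illustrative rather than taken from the released implementation.

```python
from typing import Iterable, List, Sequence


def contains_answer(passage: str, answers: Iterable[str]) -> bool:
    """Approximate relevance check: does any answer string occur in the passage?"""
    text = passage.lower()
    return any(answer.lower() in text for answer in answers)


def select_confounders(ranked_passages: Sequence[str],
                       answers: Sequence[str],
                       max_confounders: int = 1) -> List[str]:
    """Pick the highest-ranked passages that do not contain any answer."""
    picked: List[str] = []
    for passage in ranked_passages:
        if len(picked) >= max_confounders:
            break
        if not contains_answer(passage, answers):
            picked.append(passage)
    return picked
```

The selected passages would then be added to the training examples as extra negatives before the retriever is trained again.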

Experiments

For our experiments we select the eight KILT datasets listed in Table 2, which cover all five task classes and include a training split, a validation split, and a held-out test split.

Preprocessing

Starting from the raw KILT data, we split each Wikipedia article into disjoint 100-token chunks which form our basic retrieval units, following the approach of Wang et al. (2019) and Karpukhin et al. (2020). To maintain the same language introduced in §3, we will simply call these chunks passages.
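The chunking step can be sketched as below. Whitespace splitting is an assumption made for brevity; the text does not specify which tokeniser defines the 100-token boundary, and `split_into_passages` is an illustrative name.

```python
from typing import List


def split_into_passages(article_text: str, passage_len: int = 100) -> List[str]:
    """Split an article into disjoint, consecutive chunks of at most passage_len tokens."""
    tokens = article_text.split()
    return [
        " ".join(tokens[start:start + passage_len])
        for start in range(0, len(tokens), passage_len)
    ]
```

The last chunk of an article is simply shorter when the article length is not a multiple of the passage length.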

This preprocessing results in a knowledge source of 36 million passages. In order to harmonise all datasets to the same knowledge source, KILT used a mapping strategy based on the BLEU metric to map relevant passages in the original versions of its datasets to passages in its own shared knowledge source (Petroni et al., 2020). Entries included in the KILT training sets which have a mapping BLEU score below 0.5 are likely to be noise, and we exclude them from training.

Multi-tasking

Training is performed on the union of all training sets. Since two of the training sets are of different orders of magnitude, we use a simple downsampling strategy to bring them to the same order of magnitude as the others. Preliminary experiments with more complex sampling methods, like resampling all datasets so that each epoch would see an equal number of samples from each, found that they had no measurable effect compared to this simpler approach.
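A minimal sketch of such a downsampling strategy is given below. The cap `max_size`, the fixed random seed, and the function names are all assumptions of this sketch; the text does not state the target size or the sampling details.

```python
import random
from typing import Dict, List, Sequence, Tuple


def downsample(examples: Sequence, max_size: int, seed: int = 0) -> List:
    """Keep a dataset as-is if it is small enough, else draw a random subset."""
    if len(examples) <= max_size:
        return list(examples)
    return random.Random(seed).sample(list(examples), max_size)


def build_training_union(datasets: Dict[str, Sequence],
                         max_size: int,
                         seed: int = 0) -> List[Tuple[str, object]]:
    """Union of all training sets, with oversized ones capped at max_size."""
    union: List[Tuple[str, object]] = []
    for name in sorted(datasets):
        union.extend((name, ex) for ex in downsample(datasets[name], max_size, seed))
    return union
```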

Encoders

Our query and passage encoders are initialised as two distinct BERT base uncased encoders (Devlin et al., 2019), trained separately. As a pooling mechanism we find it effective to simply take the [CLS] token representation at the topmost layer.

Training

We train our models for up to 80 epochs. To select the best checkpoint, we perform full evaluations of the validation set retrieval performance at regular intervals. We use the Adam optimiser (Kingma and Ba, 2015) with a learning rate of $2\cdot 10^{-5}$ with warmup and a linear decay schedule, and a dropout rate of $0.1$ . The batch size is set to $128$ samples, and in preliminary experiments we found no benefit in increasing this further. We use an additional “hard” confounder per batch, selected based on BM25 score as in Karpukhin et al. (2020).
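The learning rate schedule can be sketched as a function of the training step. The peak value of $2\cdot 10^{-5}$ comes from the text; the linear shape of the warmup and the number of warmup steps are assumptions, since only "warmup" and "linear decay" are stated.

```python
def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  peak_lr: float = 2e-5) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

For instance, with 1,000 total steps and 100 warmup steps, the rate rises linearly until step 100 and then falls linearly to zero at step 1,000.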

Downstream evaluation

When evaluating our retriever within a larger architecture to perform a knowledge-intensive task, we replicate the DPR + BART setup of Petroni et al. (2020). This uses DPR to retrieve and prepend the top 3 passages to the query, which is then processed by a task-specific fine-tuned BART model to generate the final answer for the end task.
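The way retrieved passages are fed to the reader can be illustrated with the sketch below. The separator string and the function name are hypothetical; the text only states that the top 3 passages are prepended to the query before it is passed to the fine-tuned BART model.

```python
from typing import Sequence


def build_reader_input(query: str, ranked_passages: Sequence[str],
                       k: int = 3, sep: str = " [SEP] ") -> str:
    """Prepend the k highest-ranked passages to the query (separator is an assumption)."""
    return sep.join([*ranked_passages[:k], query])
```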

4.2 Universal retrieval

The results of the evaluations reported in Petroni et al. (2020) show that retrievers trained for question answering have poor performance outside of their domain. We would like to understand if it is possible to design a single model which can accurately satisfy the information needs of a wide variety of knowledge-intensive tasks. In short: Can a neural retriever be universal?

We perform a comprehensive evaluation of several models on the eight tasks of Table 2. The setups we evaluate include eight task-specific models (one trained on each of the eight datasets), for which we measure both in-domain and out-of-domain performance, and a BM25 baseline. Additionally, we include a multi-task trained model – as described in §3.1 – with the hope that it can learn to perform all tasks satisfactorily. This amounts to 10 models evaluated on eight tasks each, for a total of 80 evaluations.

To measure retrieval performance, we adopt the main metric used for the KILT benchmark, $R$ -precision. This is calculated as $r/R$ , where $R$ is the total number of relevant passages for a given query, and $r$ is the number of relevant passages returned among the top- $R$ retrieval results. For the case of $R=1$ this is therefore equivalent to precision@1. Table 3 shows retrieval performance on the validation data, with the best performance on a given dataset marked in bold, and the second best performance underlined.

While the KILT evaluation focuses on retrieval at the level of Wikipedia pages (thereby marking as “hits” any results that lie within the correct page), we are also interested in performing an evaluation at a more fine-grained level. We therefore also evaluate our models at the passage level, using a modified version of the official KILT evaluation scripts. These are shown as the second number in each column.
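The metric and the two evaluation granularities can be sketched as follows. This is not the official KILT evaluation script: the helper names are illustrative, and collapsing a passage ranking into a page ranking by keeping the first occurrence of each page is an assumption about how page-level hits are derived.

```python
from typing import Dict, Iterable, List, Sequence


def r_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """r / R: fraction of the R relevant items found among the top-R results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant item is required")
    top = ranked_ids[:len(relevant)]
    return len(relevant.intersection(top)) / len(relevant)


def to_page_ranking(ranked_passage_ids: Sequence[str],
                    passage_to_page: Dict[str, str]) -> List[str]:
    """Collapse a passage ranking into a ranking of distinct pages, preserving order."""
    pages: List[str] = []
    seen = set()
    for passage_id in ranked_passage_ids:
        page = passage_to_page[passage_id]
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages
```

A retriever that returns the wrong passage from the right page scores 1.0 at the page level but 0.0 at the passage level for that query, which is why the two numbers can diverge.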

We straight away notice that the task-specific models tend to achieve high performance on their respective tasks, often taking one of the top two spots. Interestingly, we also note that these neural retrievers consistently outperform the BM25 baseline, showing that the result which Karpukhin et al. (2020) achieved for open-domain question answering also holds for other knowledge-intensive tasks.

The results reveal a strong performance for the multi-task model, confirming the hypothesis that a single model can be successfully trained to perform a wide variety of retrieval tasks. With the exception of one dataset, the shared model achieves the best retrieval performance or is within a few percentage points of the top score. We note that the one exception is the Zero-shot RE task (Levy et al., 2017), a trivial task in which the query will always contain the title of the page to be retrieved. Indeed, the model specific to this task manages to achieve a near-perfect score.

Another task which stands out for being markedly different in formulation is AIDA-YAGO 2 (Hoffart et al., 2011). As shown in Table 3, models that were not trained on this specific task perform very poorly on it. Entity linking is a task that is normally better performed by models which are explicitly designed for it (Cao et al., 2020). We nevertheless include it to showcase the ability of neural retrievers to adapt to it, and note how well the multi-task retriever performs on it in spite of its unusual nature.

4.3 Downstream performance

We saw that our proposed approach achieves strong performance across a variety of retrieval tasks. However, our interest in neural retrievers stems from their use as components within larger systems, to perform tasks such as question answering. Our next experimental question is therefore: Can a universal retriever lead to better downstream performance in knowledge-intensive tasks?

We perform a downstream evaluation of our approach used in conjunction with BART (Lewis et al., 2020a) as the generative component or classifier, adopting the same setup as Petroni et al. (2020). Results are reported in Table 4, with bold and underline marking the best and second best scores respectively.

The DPR + BART line refers to a setup similar to our own, but with the simpler retriever of Karpukhin et al. (2020). Therefore, comparing its performance to ours gives us a clear indication of the contribution of multi-task training on the overall performance on knowledge-intensive tasks. Our proposed model achieves significantly better performance than this baseline in AY2, zsRE and HoPo; while for the other tasks, the discrepancy is always below two points. This fact is reflected in the last column too, showing that on average multi-task training leads to better downstream performance. The model also compares favourably to RAG (Lewis et al., 2020b), a more advanced system in which the query encoder is fine-tuned on the end task.

4.4 Zero- and few-shot performance

Task-specific neural retrievers can achieve higher performance than IR-based methods, but they are not suitable for cases where no training data (or not enough) is available. In those cases, tf-idf and BM25 are the better choice. To evaluate the performance of a multi-task retriever as a suitable replacement for them in this scenario, we run a series of experiments in the low data regimes (few-shot and zero-shot).

We start by training a set of multi-task retrievers (using the base setup) in the leave-one-out setting for each of the datasets, in order to see how a neural retriever will perform when trained on all domains except for the one it is to be evaluated on. The results of these zero-shot experiments are reported in the second line of Table 5 (again, text here is in bold for the best overall performance, and underlined for second best). They show that, even in the zero-shot setting, the multi-task neural retriever achieves performance that is competitive with BM25, with retrieval being 10 points higher at the page level and 5 points lower at the passage level on average.

The advantage of neural retrievers over BM25 lies in their ability to improve with training. We therefore look at few-shot training for each task, and create two smaller copies for each of the original training sets with a random sample of 128 and 1,024 examples respectively. In order to evaluate the suitability of a multi-task trained retriever as a starting checkpoint for few-shot training, we take the various leave-one-out models and fine-tune them on our few-shot training sets. To check whether multi-task pre-training is effective, we also compare these to DPR models (which are just initialised with BERT weights) fine-tuned on the same data.

The bottom two sections of Table 5 report the results. The most dramatic gains from fine-tuning are seen for AY2, an “outlier” task whose formulation differs from that of the other tasks, and which seems to benefit the most from seeing in-domain data. The zsRE performance does not seem to improve from fine-tuning on the smaller dataset, but sees a very big jump when switching to the larger dataset. As a reminder, in this trivial task the title of the page to be retrieved always appears at the start of the query. It is therefore not surprising that models specifically fine-tuned on it can achieve near-perfect scores, as long as enough training data is provided.

In spite of the fine-tuning, we note that both DPR and the multi-task model fail to improve on their performance for T-REx, suggesting that large amounts of training data are required to learn this task. Nevertheless, the multi-task model proves itself more robust, and achieves the top performance on it.

Finally, we note that for 2 out of 8 tasks, namely zsRE and WoW, DPR achieves lower page-level retrieval scores than the multi-task model, but performs better at the passage level. This shows that fine-grained and coarse-grained retrieval performance are not always perfectly correlated.

Overall, the experiments show strong results for the multi-task model, with the average zero-shot performance being competitive with BM25, and the average few-shot performance being markedly better than the alternatives. The discrepancy in performance between a vanilla DPR model and the leave-one-out multi-task model is especially noticeable when using the smaller of the two datasets, in which case average performance for the latter is more than double that of vanilla DPR.

4.5 Model variants

In this set of experiments we compare our base multi-task model with the two variants described in § 3.1. Due to the high memory consumption of the “task-specific encoders” variant (requiring one full query encoder per task family, in addition to the passage encoder), it was only possible to perform these evaluations in a restricted setting of three datasets. The results in Table 6 do not reveal a clear winner, suggesting that the base architecture might be the better choice due to its simplicity and generally good performance. Not included in this table, due to very poor performance in preliminary experiments, are two further variants: a base model with a single encoder for both queries and passages, and a base model trained from scratch without BERT pre-training.

4.6 Adversarial confounder selection

Finally, we evaluate the adversarial confounder selection method described in § 3.2. This involves augmenting our regular training sets with additional confounders for TriviaQA and Natural Questions, selected using our top multi-task trained model. A new multi-task model is then trained from scratch on this augmented data. Its performance is reported in Table 7, showing an overall improvement across multiple tasks. While this approach is demonstrated here on our multi-task model, it is in fact orthogonal to it, and could be applied to any other neural retrievers trained with a contrastive loss.

Related work

The approach most closely related to ours is DPR (Karpukhin et al., 2020), upon which we built all our retrieval systems. This model is covered in detail in § 2.1, in addition to the historical context. Another closely related approach is the Retrieval-Augmented Generation (RAG) model of Lewis et al. (2020b). In its base configuration it augments DPR with a generative reader, and it trains the query encoder end-to-end (differing from traditional retriever-reader architectures which treat the two steps as disjoint). A natural extension of the work we have presented would be to combine RAG with our joint learning approach, to study whether it can lead to further gains in performance or robustness.

A number of promising techniques to boost retrieval performance have been proposed recently. These are orthogonal to our work, and as such they could be combined with it. Amongst these, pre-training methods form one class. Inverse Cloze Task (Lee et al., 2019) and its extensions (Chang et al., 2020) are self-supervised pre-training methods designed for retrieval in open-domain question answering. Whether such specific pre-training is beneficial to tasks other than question answering remains an open question. CERT (Fang et al., 2020) is an alternative pre-training approach, inspired by some recent advances in computer vision. While to our knowledge this has not been applied to retrieval problems, we believe it might be promising due to its focus on sentence-level semantics (as opposed to the more standard masked language modelling pre-training, which focuses on the token-level).

Another class of orthogonal improvements to dense retrieval involves models which embed passages into multiple fixed-size vectors. Of these, ColBERT (Khattab and Zaharia, 2020) and ME-BERT (Luan et al., 2020) are two representative examples. One further approach is ColBERT-QA (Khattab et al., 2020), which additionally uses a data augmentation strategy closely related to our own approach described in § 3.2.

Finally two entity linkers, GENRE (Cao et al., 2020) and BLINK (Wu et al., 2020), are worth mentioning. Being trained specifically for entity linking, these models will generally outperform retrieval-based approaches on that task. While they are not comparable to retrieval models and will not generally be applicable to information retrieval tasks, we mention them here to provide readers with a fuller context of the existing literature.

Conclusions

We have conducted a large-scale experimental study on knowledge-intensive tasks, and how retrieval models that tackle them seek the required information from knowledge bases such as Wikipedia.

The study started with the question of whether the way in which information is embedded for retrieval purposes is universal. Section 4.2 provided evidence that to a large extent it is, with a single “universal” retriever, trained jointly on 8 datasets, often performing comparably to task-specific models.

Armed with this knowledge, in Section 4.3 we plugged our single model in a larger pipeline, in order to see its contribution to the downstream performance on a wide range of knowledge-intensive tasks. This led to an overall improvement in downstream performance, setting new top results for a number of tasks in the KILT benchmark.

Next, in Section 4.4, we evaluated the model’s performance in the zero-shot and few-shot settings. By evaluating on a wide range of tasks, we were able to show that our proposed approach performs comparably to BM25 in the zero-shot setting, and quickly overtakes it even with minimal in-domain training.

In Section 4.5 we evaluated a number of more complex variants of the model involving task specialisation, but failed to see clear performance improvements. Finally, in Section 4.6 we saw how a simple iterative approach to data augmentation can lead to better performance.

In the coming months we will provide a pretrained snapshot of our best-performing model, in the form of a BERT checkpoint. As shown, this model will be useful in zero-shot and few-shot settings as a better performing alternative to both IR-based approaches such as BM25 and task-specific models. The multi-task training approach demonstrated here can also be useful in industry settings where several retrieval operations may need to be performed on the same piece of content (e.g. fact checking and hate speech detection), and the deployment of multiple task-specific models might not be possible due to space or computational performance concerns.