KILT: a Benchmark for Knowledge Intensive Language Tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel

Introduction

There has been substantial progress on natural language processing tasks where the inputs are short textual contexts such as sentences, paragraphs, or perhaps a handful of documents. Critically, we have seen the emergence of general-purpose architectures and pre-trained models that can be applied to a wide range of such tasks Devlin et al. (2019). However, for many real-world problems, processing at this local level is insufficient. For example, in open-domain question answering Chen et al. (2017) models need to find answers within a large corpus of text. Fact checking a claim Thorne et al. (2018a) requires models to find evidence, often on the web. In knowledgeable open dialogue Dinan et al. (2019), models need access to knowledge from large corpora to sustain informed conversations.

In general, solving knowledge-intensive tasks requires, even for humans, access to a large body of information. As in Information Retrieval (IR), this involves satisfying an information need by leveraging large collections of text Manning et al. (2008). However, while IR focuses on finding relevant material (usually documents), the tasks we consider focus on more fine-grained behavior, such as producing specific answers to queries. For such knowledge-intensive tasks, general infrastructure and architectures across tasks have yet to emerge, and fundamental research questions remain open. For example, while it was long assumed that non-parametric and explicit memory accessed through retrieval is strictly required for competitive results Chen et al. (2017), recent large pre-trained sequence-to-sequence models such as T5 Raffel et al. (2019a) and BART Lewis et al. (2019) store all knowledge in their parameters while performing remarkably well Petroni et al. (2019). Likewise, while the classical approach of information extraction for populating a Knowledge Base (KB, Riedel et al., 2013; Surdeanu and Ji, 2014) seems out of fashion, recent results show that such approaches remain contenders Fan et al. (2019a); Xiong et al. (2019).

While there are numerous datasets for knowledge-intensive tasks (e.g. Thorne et al., 2018a; Dinan et al., 2019; Kwiatkowski et al., 2019, to name just a few), it is difficult to answer the above questions generally across them. Each dataset comes in a different format, is pre-processed with different assumptions, and requires different loaders, evaluations, and analysis tools. Critically, they all use different knowledge sources, from different versions of Wikipedia to entirely different corpora. This makes task-to-task comparisons difficult and substantially increases computational overhead. For example, one cannot easily assess whether the same knowledge representation can be re-used if each dataset is tied to a different source. Moreover, if one decides to work with different sources across different tasks, many approaches require re-indexing and re-encoding large numbers of documents. If a language model is pre-trained on one snapshot of Wikipedia to capture its knowledge, tasks that use other snapshots might require re-training.

To facilitate research on models that must access specific information in a knowledge source, we introduce KILT, a benchmark and library for Knowledge Intensive Language Tasks. KILT aims to lower the entry barrier for such research by formulating several knowledge-intensive NLP tasks with respect to a common interface and the same unified knowledge source: a single Wikipedia snapshot. The KILT benchmark consists of eleven datasets spanning five distinct tasks, and includes the test set for all datasets considered. (A brand new portion of the Natural Questions (NQ) dataset, originally held out, is used as the KILT test set for NQ.) An important aim of KILT is to cover many different ways of seeking knowledge. For this reason, we select tasks that provide a variety of ways to formulate both the input query (e.g., a claim to verify, a text chunk to annotate, a structured query, a natural question or a conversation) and the expected output (e.g., discrete, extractive, or abstractive). Moreover, while some tasks are factoid in nature (e.g., slot filling), others require using background knowledge to answer more complex questions (e.g., ELI5) or to sustain a conversation (e.g., Wizard of Wikipedia). The format of the KILT benchmark is model-agnostic, so any system capable of producing a textual output given a textual input is eligible to participate. KILT is an in-KB resource Petroni et al. (2015), i.e., the evidence required to answer each of the ~3.2M instances in KILT is present somewhere in the knowledge source. Hence there are no unanswerable instances in KILT. Although recognizing unanswerable instances is important, we believe the in-KB setting already poses a hard challenge to current state-of-the-art techniques, and thus leave unanswerable instances as future work.

KILT enables researchers to develop general-purpose models and evaluate them across multiple domains, testing hypotheses around task-agnostic memory and knowledge representations without indexing different large-scale textual corpora or writing new IO routines. Furthermore, the KILT library provides general building blocks to ease research on knowledge intensive NLP. We provide various state-of-the-art information retrieval systems (both neural and non-neural) coupled with different models that read text in the knowledge source and make predictions for different tasks.

We evaluate several state-of-the-art models that represent diverse approaches to knowledge-intensive NLP, and find that a hybrid approach combining a neural retriever with a pretrained sequence-to-sequence model outperforms most task-specific solutions when trained end-to-end. We additionally evaluate whether systems can provide evidence for their predictions. With this aim, we augment every instance in KILT with provenance information in the form of textual spans in specific Wikipedia pages to corroborate the output. We additionally perform an annotation campaign via Amazon Mechanical Turk to increase the provenance coverage. Lastly, in addition to evaluating downstream performance with popular metrics we formulate novel KILT variants for those that award points only if systems find provenance Wikipedia pages for the output given the input. The poor absolute performance of our baselines for those metrics indicates the need for focused research on systems able to explain their decisions.

In summary, we contribute:

a publicly-available benchmark of knowledge-intensive tasks aligned to a single Wikipedia snapshot, to spur the development of general-purpose models and enable their comparison;

an open-source library to facilitate the development of new architectures for knowledge-intensive tasks;

a provenance indication for all instances in KILT, made more comprehensive with an annotation campaign, which allows jointly assessing output accuracy and the ability to provide supporting evidence in the knowledge source;

a comparative evaluation of various modeling approaches, showing promising results for general baselines across all tasks.

Knowledge Source

A main feature of the KILT benchmark is the use of a unified knowledge source that contains all information necessary for all tasks. Defining a unified knowledge source is a challenging problem — although all tasks use Wikipedia, they consider different snapshots. As Wikipedia pages are constantly modified, added, and removed, the knowledge can differ drastically from snapshot to snapshot. Concretely, the KILT knowledge source is based on the 2019/08/01 Wikipedia snapshot and contains 5.9M articles. We describe how each dataset is represented in KILT, and our mapping strategy for aligning data to our chosen snapshot. Additional details are in the appendix.

The main challenge in defining a unified knowledge source is ensuring the knowledge for all task examples is available. We assume tasks provide an input (e.g. a question in question answering, or a conversation in dialogue) needed to produce an output (e.g. an answer or a subsequent utterance). In addition, tasks provide provenance, defined as a set of textual spans in Wikipedia that contain evidence for producing an output given a specific input. These provenance spans can be of any size, ranging from single tokens or entities, through short answers, sentences and paragraphs, to whole articles. The idea of our mapping strategy is to identify provenance spans in the KILT knowledge source: if we find all the provenance spans for an input-output pair, the knowledge needed to produce the output is available in our snapshot.

Concretely, the mapping strategy operates as follows (scripts for the mapping algorithm are available on GitHub). First, we try to match Wikipedia pages in each dataset to our snapshot, relying on Wikipedia URL redirections for pages that changed title. Second, we look for the provenance span in the matched page. We scan the whole page and return the span with the highest BLEU score Papineni et al. (2002) with respect to the given provenance span, returning the shortest span in case of a tie. Third, we replace the original provenance in a task’s input-output pair with the span from the KILT knowledge source, and we report the BLEU score between the two. Finally, we remove from the dev and test sets all outputs for which the BLEU score is lower than a threshold (we use 0.5) for at least one provenance span. This is meant to ensure high-quality mappings in the evaluation sets, and discards on average 18% of test and dev data (for all tasks except entity linking). We keep all input-output pairs in the train sets (see Figure 5 in the appendix for more details).
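The second and fourth steps above can be sketched as follows. This is a minimal illustration, not the released mapping scripts: `simple_bleu` is a simplified, unsmoothed sentence-level BLEU written here for self-containment, and the choice to scan only token windows whose length is within `slack` tokens of the original span is an assumption, as are all function names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (for the orders the candidate is long enough to have) times a
    brevity penalty. No smoothing; an assumption, not the paper's code."""
    if not candidate or not reference:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:  # candidate shorter than n tokens
            break
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / len(log_precisions))


def best_span(page_tokens, provenance_tokens, slack=2):
    """Return (score, start, end) of the best-matching window in the page.
    Lengths are tried in increasing order with a strict comparison, so a
    tie is resolved in favour of the shortest span."""
    target = len(provenance_tokens)
    best = (0.0, 0, 0)
    for length in range(max(1, target - slack), target + slack + 1):
        for start in range(len(page_tokens) - length + 1):
            window = page_tokens[start:start + length]
            score = simple_bleu(window, provenance_tokens)
            if score > best[0]:
                best = (score, start, start + length)
    return best


def keep_for_eval(span_scores, threshold=0.5):
    """Dev/test outputs are kept only if every provenance span maps well."""
    return all(score >= threshold for score in span_scores)


page = "the kilt benchmark uses a single wikipedia snapshot from 2019".split()
provenance = "a single wikipedia snapshot".split()
score, start, end = best_span(page, provenance)
print(score, page[start:end])
```

A real implementation would likely use an established BLEU implementation and a more careful tokenizer; the sketch only shows the shape of the search and the threshold filter.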

Tasks

We consider five tasks that use Wikipedia as a knowledge source for KILT: fact checking, open domain question answering, slot filling, entity linking, and dialogue. The diversity of these tasks challenges models to represent knowledge flexibly. Some tasks require a discrete prediction (e.g., an entity), others, such as extractive question answering, can copy the output directly from a Wikipedia page, while still other tasks must synthesize multiple pieces of knowledge in an abstractive way to produce an output. KILT also provides a variety of ways to seek knowledge, from a claim to verify to a text chunk to annotate, from a structured or natural question to a conversation (see Table 1 for details). We are able to include the test set for all datasets in KILT, either because the test set is public, or because we were able to obtain the test set from the authors of the original dataset. These test sets are not publicly released, but are used for the KILT challenge on EvalAI Yadav et al. (2019), where participants can upload their models’ predictions and be listed on the public leaderboard (available at https://evalai.cloudcv.org/web/challenges/challenge-page/689).

To facilitate experimentation, we define a consistent interface for all datasets in the KILT Benchmark. Each dataset is represented in JSON Lines format, where each record contains three fields: id, input, output. The input is a natural language string and the output is a non-empty list of equally-valid outputs (e.g. if multiple answers to a question are valid in a question answering dataset). Each output is a string and is accompanied by a non-empty list of complementary provenance spans (all should be used to acquire the knowledge needed to provide a valid output). Figure 1 displays an example for all considered tasks (Figure 3 in the appendix contains further details on the common interface).
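As an illustration of this interface, the sketch below parses one JSON Lines record and checks the constraints stated above. The three top-level fields (id, input, output) come from the text; the inner key names `answer`, `provenance` and `wikipedia_id`, as well as the example values, are assumptions made for the example.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_valid_record(record):
    """Check the constraints described in the text:
    three fields, string input, non-empty list of outputs,
    each with a string answer and a non-empty provenance list."""
    if not all(key in record for key in ("id", "input", "output")):
        return False
    if not isinstance(record["input"], str):
        return False
    outputs = record["output"]
    if not isinstance(outputs, list) or not outputs:
        return False
    for out in outputs:
        # "answer" and "provenance" are assumed key names.
        if not isinstance(out.get("answer"), str):
            return False
        provenance = out.get("provenance")
        if not isinstance(provenance, list) or not provenance:
            return False
    return True


# Hypothetical record; identifiers are placeholders, not real data.
line = json.dumps({
    "id": "example-0",
    "input": "Albert Einstein [SEP] educated_at",
    "output": [
        {"answer": "ETH Zurich", "provenance": [{"wikipedia_id": "0"}]},
    ],
})

records = parse_jsonl(line + "\n")
print(is_valid_record(records[0]))
```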

1 Fact Checking

Fact checking verifies a claim against a collection of evidence. It requires deep knowledge about the claim and reasoning over multiple documents. We consider the claim as input and the classification label as output. Each label is accompanied by a set of provenance spans that corroborate the classification label. We model multiple equally-valid provenance sets per label.

FEVER Thorne et al. (2018a) is a large dataset for claim veracity that requires retrieving sentence-level evidence to determine whether a claim is supported or refuted. Additional details are in the appendix.

2 Entity Linking

Entity Linking (EL) assigns a unique Wikipedia page to entities mentioned in text. Each KILT record for EL has text in the input (max 256 tokens) where a single entity mention is tagged with two special tokens (i.e., [START_ENT] and [END_ENT]; see Figure 1 for an example). The output is the title of the Wikipedia page for the entity mention plus provenance pointing to the entire page (through a unique identifier). Since Wikipedia associates unambiguous titles to entities (it uses explicit text in titles to disambiguate), finding the correct output is enough to link entity mention and Wikipedia page. The provenance mimics the canonical approach to EL, that is to produce an identifier for each mention Wu et al. (2019). To map the provenance (whole Wikipedia page), we simply match Wikipedia pages specified in various datasets to the KILT knowledge source. We consider three popular EL datasets in KILT, two of which do not contain a train set but should be assessed in a zero-shot fashion. Note that, in addition to the AY2 train set, the whole knowledge source can be used as training data by exploiting hyperlinks. To facilitate experimentation, we release such data in KILT format (9M train instances), following the splits of Wu et al. (2019).
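A minimal sketch of how an EL input could be built from a text and a mention given by character offsets. The two special tokens come from the text; the helper name and the choice to surround them with single spaces are assumptions, and truncation to 256 tokens is not shown.

```python
START_ENT, END_ENT = "[START_ENT]", "[END_ENT]"


def tag_mention(text, start, end):
    """Wrap the mention text[start:end] in the two special tokens.
    Spacing around the tokens is an assumption of this sketch."""
    if not (0 <= start < end <= len(text)):
        raise ValueError("mention offsets out of range")
    return f"{text[:start]}{START_ENT} {text[start:end]} {END_ENT}{text[end:]}"


text = "He played for Galatasaray last season."
mention = "Galatasaray"
start = text.index(mention)
tagged = tag_mention(text, start, start + len(mention))
print(tagged)
```

The expected output for such an input would be the title of the linked page, for instance a title carrying disambiguation text in parentheses.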

AIDA CoNLL-YAGO Hoffart et al. (2011b) supplements the CoNLL 2003 dataset Sang and De Meulder (2003) with Wikipedia URL annotations for all entities using the YAGO2 system Hoffart et al. (2011a). The original data is split into three parts: train, testa, testb. Following Hoffart et al. (2011b) we consider testa as dev and testb as test.

WNED-WIKI Guo and Barbosa (2018) is a dataset automatically created by sampling documents from the 2013/06/06 Wikipedia dump, and balancing the difficulty of linking each mention (using a baseline as proxy). We randomly split the dataset into dev and test.

WNED-CWEB Guo and Barbosa (2018) is a dataset created with the same strategy as WNED-WIKI, but sampling from the ClueWeb 2012 corpora (http://lemurproject.org/clueweb12) annotated with the FACC1 system. Similarly, we randomly split into dev and test.

3 Slot Filling

The goal of Slot Filling (SF) is to collect information on certain relations (or slots) of entities (e.g., subject entity Albert Einstein and relation educated_at) from large collections of natural language texts. A potential application is structured Knowledge Base Population (KBP; Surdeanu and Ji, 2014). SF requires (1) disambiguation of the input entity and (2) acquiring relational knowledge for that entity. For KILT, we model the input as a structured string subject entity [SEP] relation, and the output as a list of equally-valid object entities, each one accompanied by provenance where the subject-relation-object fact manifests.
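The structured input string can be built and split back as in the sketch below. The `[SEP]` separator is from the text; the helper names and the single space on each side of the separator are assumptions.

```python
SEP = "[SEP]"


def make_sf_input(subject, relation):
    """Build the structured query 'subject entity [SEP] relation'."""
    return f"{subject} {SEP} {relation}"


def parse_sf_input(query):
    """Split a structured query back into (subject, relation)."""
    subject, separator, relation = query.partition(f" {SEP} ")
    if not separator:
        raise ValueError("query does not contain the separator")
    return subject, relation


query = make_sf_input("Albert Einstein", "educated_at")
print(query)
print(parse_sf_input(query))
```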

Zero Shot RE Levy et al. (2017) is a dataset designed to translate relation extraction into a reading comprehension problem. We consider the open-domain version of this dataset and align the input/output with the KILT interface. Additional details are in the appendix.

T-REx Elsahar et al. (2018) provides a large-scale collection of facts aligned to sentences in Wikipedia abstracts through distant supervision. We consider each sentence as provenance and formulate the input as above (details in the appendix).

4 Open Domain Question Answering

Open domain Question Answering Chen et al. (2017) is the task of producing the correct answer for a question, without a predefined location for the answer. Standard tasks such as SQuAD Rajpurkar et al. (2016) provide an evidence document, but in open domain tasks, models must reason over an entire knowledge source (such as Wikipedia). We consider the question as input and the answer as output with dataset-specific provenance.

Natural Questions Kwiatkowski et al. (2019) is a corpus of real questions issued to the Google search engine. Each question comes with an accompanying Wikipedia page with an annotated long answer (a paragraph) and a short answer (one or more entities). We consider the open version of the dataset and use both long and short answer spans as provenance. We collaborated with the authors of Natural Questions to access a held-out, unpublished portion of the original dataset to form a new test set for KILT. By construction each QA pair is associated with a single Wikipedia page, although other pages might contain enough evidence to answer the question. To increase the provenance coverage we perform an Amazon Mechanical Turk campaign for the dev and test sets and increase the average number of provenance pages per question from 1 to 1.57 (details in section 4).

HotpotQA Yang et al. (2018) requires multi-hop reasoning over multiple Wikipedia pages to answer each question. For each question-answer pair, a set of supporting sentences is provided, and we consider these as provenance. We focus on the fullwiki setting, where systems are required to retrieve and reason over the whole of Wikipedia.

TriviaQA Joshi et al. (2017) is a collection of question-answer-evidence triples. Evidence documents are automatically gathered from Wikipedia or the Web. We consider only the Wikipedia case. We use the answer span as provenance and consider the full version of the dev and test set.

ELI5 Fan et al. (2019b) (https://yjernite.github.io/lfqa.html) is a collection of question-answer-evidence triples where the questions are complex, and the answers are long, explanatory, and free-form. For dev and test, we collect annotations using Amazon Mechanical Turk, asking evaluators to select which supporting documents from Wikipedia can be used to answer the question. We treat these as gold provenance annotations for evaluation (details in section 4).

5 Dialogue

Chitchat dialogue is the task of developing an engaging chatbot that can discuss a wide array of topics with a user, which often relies on topical, factual knowledge. For example, it would be difficult to have a conversation about “greyhounds” without any information about that dog breed. We consider the conversation history as input and the next utterance as output.

Wizard of Wikipedia Dinan et al. (2019) is a large dataset of conversations grounded in knowledge retrieved from Wikipedia. One speaker in the conversation must ground their utterances in a specific knowledge sentence, chosen from a Wikipedia page. The chosen sentence forms the provenance for KILT.

Provenance Annotation Campaign

We perform an Amazon Mechanical Turk campaign on the NQ and ELI5 datasets for the dev and test splits. While for NQ our aim is to increase the provenance coverage (i.e., we already have a provenance page for each QA pair), for ELI5 we want to collect provenance information from scratch. For each question we ask annotators to indicate whether four pre-determined passages contain enough evidence to answer the question and additionally to highlight a salient span in them. We select the passages to annotate using our baseline retrieval models, namely Tf-idf, DPR, RAG and BLINK + flair (details in the Appendix); for Tf-idf and BLINK + flair we consider the first passage in the retrieved page. We only consider passages with some token overlap with the gold answers (at least 10%).

For NQ, we additionally include gold passages among those to annotate, with the twofold objective of controlling the quality of the annotation process and filtering out questions that cannot be answered given the KILT Wikipedia snapshot. We present passages to the annotator in random order to exclude biases. If no passage is selected by an annotator, we ask them to provide either another one from Wikipedia or an explanation. We collect three annotations for each passage, and insert the passage as new provenance for the question if at least two annotators found enough evidence to answer in it. The average inter-annotator agreement is 0.3 and 0.1 Cohen’s kappa for NQ and ELI5 respectively. Note that ELI5 questions are in general more complex than NQ ones: the required answer is not an extracted span from a page but a free-form explanation that cannot always be grounded in Wikipedia.
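The acceptance rule (a passage becomes new provenance if at least two of its three annotators found enough evidence) can be sketched as follows. The data layout, a mapping from a passage identifier to a list of boolean judgements, and the function names are assumptions made for illustration.

```python
def accept_passage(votes, min_yes=2):
    """A passage is accepted if at least `min_yes` annotators said it
    contains enough evidence to answer the question."""
    return sum(bool(vote) for vote in votes) >= min_yes


def new_provenance(annotations, min_yes=2):
    """Return the passage ids that pass the vote, in input order.
    `annotations` maps a (hypothetical) passage id to its judgements."""
    return [pid for pid, votes in annotations.items()
            if accept_passage(votes, min_yes)]


# Hypothetical judgements for one question: three annotators per passage.
annotations = {
    "passage-a": [True, True, False],
    "passage-b": [False, True, False],
    "passage-c": [True, True, True],
}
print(new_provenance(annotations))
```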

To make the ELI5 data more robust, we compute the overlap between provenance passages and answers for each instance using ROUGE-L and manually annotate instances with low overlap (ROUGE-L < 0.15). Overall, we were able to collect provenance information for 1507 dev instances (3000 annotated) and 600 test instances (2000 annotated) for ELI5, with an average of 1.18 Wikipedia pages as provenance per instance. For NQ, we filter out on average 8% of data (258 dev and 110 test instances) and include on average 1.57 Wikipedia pages as provenance per instance. Additional details are in the Appendix (Table 6).
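The low-overlap check can be illustrated with a small ROUGE-L computation based on the longest common subsequence (LCS). This sketch assumes lowercased whitespace tokenization and the balanced F-measure variant of ROUGE-L; the exact ROUGE-L configuration used for the campaign is not specified in the text, so treat both choices, and the function names, as assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate, reference):
    """Balanced F-measure over LCS precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def needs_manual_check(passage, answer, threshold=0.15):
    """Flag instances whose passage/answer overlap is below the threshold."""
    return rouge_l_f1(passage, answer) < threshold


print(rouge_l_f1("the cat sat on the mat", "the cat lay on a mat"))
print(needs_manual_check("completely unrelated words", "the sky is blue"))
```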

Evaluation Metrics

Various tasks in the KILT Benchmark need to be evaluated differently, which can make task-wide comparison challenging. Further, there are multiple aspects of each system that we want to assess, namely (1) downstream results, (2) performance in retrieving relevant evidence to corroborate a prediction and (3) a combination of the two. We report different metrics to capture these aspects (evaluation scripts are available on GitHub).

We consider different metrics to capture the uniqueness of the different tasks in KILT and mimic the typical way to assess performance for each dataset. We use Accuracy for tasks that require a discrete output (e.g., an entity); Exact Match (EM) for tasks with extractive (i.e., Natural Questions, TriviaQA) or short abstractive output format (i.e., HotpotQA); finally, for tasks with long abstractive output format, we use ROUGE-L Lin (2004) for ELI5 and F1-score for Wizard of Wikipedia. For EM and F1-score we follow standard post-processing to lowercase, strip articles, punctuation, and duplicate whitespace from gold and predicted output Rajpurkar et al. (2016). Note that Accuracy is equivalent to strict exact match, without post-processing. We report additional metrics for some datasets in the appendix (Tables 7-17).
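The post-processing and the two string-matching metrics can be sketched as below, in the style of the SQuAD evaluation of Rajpurkar et al. (2016). This is an illustration rather than the KILT evaluation script; taking the maximum over the equally-valid gold outputs is an assumption consistent with the interface described earlier.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def token_f1(prediction, gold):
    """Token-level F1 between normalized prediction and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_score(prediction, gold_answers):
    """Best token-level F1 over all equally-valid gold answers."""
    return max(token_f1(prediction, gold) for gold in gold_answers)


print(exact_match("The Eiffel Tower!", ["eiffel tower"]))
print(f1_score("a cat sat", ["the cat"]))
```

Accuracy, by contrast, would compare the raw strings without calling `normalize_answer`.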

Retrieval.

We adopt a page-level formulation and measure the ability of a model to provide a set of Wikipedia pages as evidence for a prediction (our evaluation scripts also allow evaluating retrieval performance at a more fine-grained level, e.g., paragraph). For most datasets in KILT a single page is enough to provide complete evidence, with the exception of FEVER (~12% of which requires more than one page) and HotpotQA (two pages are always required). We consider the following retrieval metrics in KILT:

R-precision, calculated as $\frac{r}{R}$, where $R$ is the number of Wikipedia pages inside each provenance set and $r$ is the number of relevant pages among the top-$R$ retrieved pages. For most of the datasets $R=1$ and this formulation is equivalent to Precision@1. Concretely, R-precision is 1 if all Wikipedia pages in a provenance set are ranked at the top. We report the maximum value among all provenance sets for any given input.

Recall@k, calculated as $\frac{w}{n}$, where $n$ is the number of distinct provenance sets for a given input and $w$ is the number of complete provenance sets among the top-$k$ retrieved pages. For datasets that require more than one page of evidence (e.g., FEVER and HotpotQA), we use the lowest ranked page in each provenance set to determine its position and remove the other pages in the set from the rank. For both metrics, we report the mean over all test datapoints.
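Both metrics can be sketched for a single input as follows. The R-precision function follows the definition above. The Recall@k function is a simplification: it counts a provenance set as complete when all of its pages appear in the top k, and it does not implement the rank adjustment described for multi-page sets. The function names and the assumption that the ranked list contains distinct page ids are ours.

```python
def r_precision(ranked_pages, provenance_sets):
    """Maximum over provenance sets of r / R, where R is the size of the
    set and r the number of its pages among the top-R retrieved pages."""
    best = 0.0
    for pages in provenance_sets:
        gold = set(pages)
        if not gold:
            continue
        top = ranked_pages[:len(gold)]
        best = max(best, sum(1 for page in top if page in gold) / len(gold))
    return best


def recall_at_k(ranked_pages, provenance_sets, k):
    """w / n, where n is the number of provenance sets and w the number
    of sets fully contained in the top-k pages (simplified: no rank
    adjustment for multi-page sets)."""
    if not provenance_sets:
        return 0.0
    top = set(ranked_pages[:k])
    complete = sum(1 for pages in provenance_sets if set(pages) <= top)
    return complete / len(provenance_sets)


ranked = ["A", "B", "C", "D"]          # hypothetical page ids, best first
gold_sets = [["B", "A"], ["D"]]        # two equally-valid provenance sets
print(r_precision(ranked, gold_sets))
print(recall_at_k(ranked, gold_sets, k=2))
```

Dataset-level numbers would then be the mean of these per-input values over all test datapoints.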

KILT scores.

We propose a KILT version for downstream metrics that, inspired by the FEVER-score Thorne et al. (2018a), takes into account the provenance supporting the output. For each datapoint, we only award Accuracy, EM, ROUGE-L, and F1 points to KILT-AC, KILT-EM, KILT-RL and KILT-F1 respectively, if the R-precision is 1. This is equivalent to awarding points if the system finds (and ranks at the top) a complete set of provenance Wikipedia pages for at least one ground truth output given the input. We choose this metric to emphasize that systems must be able to explain their output with proper evidence, not simply answer.
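The gating rule can be written in a few lines. In this sketch each datapoint is a pair of its downstream points (Accuracy, EM, ROUGE-L or F1) and its R-precision; the pair layout and the function name are assumptions.

```python
def kilt_score(datapoints):
    """Mean downstream score where a datapoint's points are awarded
    only if its R-precision is 1, and count as 0 otherwise."""
    if not datapoints:
        return 0.0
    awarded = [points if r_prec == 1 else 0.0 for points, r_prec in datapoints]
    return sum(awarded) / len(datapoints)


# Hypothetical per-datapoint (downstream points, R-precision) pairs.
datapoints = [(1.0, 1.0), (1.0, 0.5), (0.0, 1.0)]
print(kilt_score(datapoints))
```

In the example, the second datapoint has a correct output but incomplete provenance, so it contributes nothing to the KILT score even though it would count towards the plain downstream metric.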

Baselines

The KILT tasks provide a dual challenge of retrieving information and conditioning upon that to create an output. Various directions could be applied to these. For example, the Wikipedia knowledge could be represented explicitly, as natural language or in a structured form, or represented implicitly, as knowledge stored in model parameters. Models could be discriminative, extractive, where a specific span is selected as output, or generative, where the model writes an output. We consider retrieval, task-specific, and general baselines for KILT (see Table 2). Additional details are in the appendix.

Results

We summarize the main results in three tables: downstream performance in Table 3, retrieval in Table 4 and KILT scores in Table 5. Additional results, as well as comparisons with numbers reported in recent work, can be found in the appendix. It is possible to obtain the performance of a system on the KILT test sets by uploading its predictions to our EvalAI challenge.

When considering downstream performance (Table 3), although pre-trained sequence-to-sequence models can embed knowledge implicitly in their parameters to some extent Petroni et al. (2019); Roberts et al. (2020), they clearly lag behind models with explicit knowledge access in almost all datasets. The BART+DPR baseline, which incorporates an explicit retrieval step in addition to the generative pretraining, works well. It outperforms some of the task-specific solutions and gets close to others. Performance is even stronger when the retriever and reader components are trained end-to-end, as in the case of RAG. We find this a promising direction for knowledge-intensive tasks.

By formulating Entity Linking within KILT, we can evaluate the ability of seq2seq models at this task. They perform surprisingly well, even without any explicit access to knowledge (i.e., BART and T5). These solutions are able to link entity mentions by either leaving them untouched (if they match the correct Wikipedia title), completely altering mention text (e.g., “European Cup” → “UEFA Champions League”), or adding disambiguation tokens (e.g., “Galatasaray” → “Galatasaray S.K. (football)”). We report an example in Figure 4.

When considering retrieval alone (Table 4) there is no clear winner: entity-centric tasks (Entity Linking and Slot Filling) clearly benefit from entity-based retrieval, while DPR works better for NQ, FEVER and ELI5, which require more fine-grained passage supervision. We believe that combining all these ingredients (i.e., dense representations, fine-grained supervision, entity awareness) will be necessary for general task-agnostic memories. Moreover, jointly training a single DPR model on all KILT training data (Multi-task DPR) led to strong performance gains on all datasets compared with the original model (DPR), which considers only NQ and TQA as training data Karpukhin et al. (2020). This suggests synergies between KILT datasets that are beneficial in terms of model performance.

Finally, the KILT scores formulation allows us to systematically assess the performance for output and provenance jointly (Table 5). We do not report results for BART and T5 since answers are generated solely from the input with no explicit retrieval and there is no straightforward way to access provenance for each prediction. The relative performance of the other baselines with respect to KILT scores is consistent with downstream results. However, the generally low absolute numbers leave substantial room for improvement for systems able to provide the correct output but also successfully justify their decision.

Discussion

There are custom solutions that can easily simplify the slot filling task. For instance, subject entities can be used for lookups by title in Wikipedia to retrieve knowledge (this heuristic will always work for zsRE), and structured human-curated resources (such as Wikidata, https://www.wikidata.org) could be used to get all answers right. Nevertheless, we are interested in testing if a general model can extract attributes about specific entities from a large body of text.

The provenance to justify each system prediction can come from anywhere, including a different system, and this is difficult to detect. Moreover our provenance might not be exhaustive—given the redundancy of information in Wikipedia there could be other pages with the knowledge needed to solve a KILT instance. We conduct an annotation campaign to mitigate the problem.

Related Work

Several natural language benchmarks have been introduced to track and support NLP progress, including natural language understanding Wang et al. (2018, 2019), multitask question answering McCann et al. (2018), reading comprehension Dua et al. (2019), question understanding Wolfson et al. (2020), and dialogue Shuster et al. (2019). We focus on multi-domain tasks that need to seek knowledge in a large body of documents to produce an output. Although there exist several tasks and resources that define large-scale external knowledge sources—including the TAC-KBP challenges McNamee and Dang (2009); Ji et al. (2010); Surdeanu (2013); Surdeanu and Ji (2014), ARC Clark et al. (2018), TriviaQA-web Joshi et al. (2017), Quasar-T Dhingra et al. (2017), WebQuestions Berant et al. (2013) and ComplexWebQuestions Talmor and Berant (2018)—in KILT we exclusively consider publicly available Wikipedia-based datasets in order to merge and unify the knowledge source.

Conclusion

We introduce KILT, a benchmark for assessing models that need to condition on specific knowledge in a defined snapshot of Wikipedia to solve tasks spanning five domains. The goal is to catalyze and facilitate research towards general and explainable models equipped with task-agnostic representations of knowledge. Our experiments show promising results for a general solution combining dense retrieval and seq2seq generation, although there is substantial room for improvement. In particular, we find that the provenance performance of current models is generally low.

Acknowledgment

The authors would like to thank the team behind Natural Questions (https://ai.google.com/research/NaturalQuestions) for the held-out data that defines our NQ test set; the FEVER (https://fever.ai), HotpotQA (https://hotpotqa.github.io) and TriviaQA (https://nlp.cs.washington.edu/triviaqa) teams for sharing official test data for the KILT leaderboard; Luke Zettlemoyer and Scott Wen-tau Yih for helpful discussions; and Rishabh Jain for help in setting up the EvalAI challenge.

References

Appendix A

We represent the KILT knowledge source as a collection of JSON records, one per Wikipedia page. Each record is assigned: (i) a unique Wikipedia id; (ii) a unique Wikipedia title; (iii) a text field containing a list of strings, one for each paragraph, bulleted list item, and section header (for which we preserve the hierarchical structure); (iv) a list of anchor elements, one for each hyperlink in the original page text, with a span reference in the text field and the linked page; (v) a list of categories; (vi) a URL redirecting to the original HTML for the page, with the timestamp of the last page revision before the considered snapshot.
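The record layout can be pictured with the sketch below. All key names (`wikipedia_id`, `wikipedia_title`, `text`, `anchors`, `categories`, `history`, and the anchor keys `paragraph_id`, `start`, `end`) and all values are hypothetical stand-ins for the six components listed above; the text does not give the actual schema.

```python
def anchor_surface(record, anchor):
    """Resolve an anchor's span reference to the hyperlinked text.
    Assumes character offsets into one entry of the text list."""
    paragraph = record["text"][anchor["paragraph_id"]]
    return paragraph[anchor["start"]:anchor["end"]]


paragraph = "Example page links to another page."
start = paragraph.index("another page")

# Hypothetical record; ids, titles and the URL field are placeholders.
record = {
    "wikipedia_id": "0",
    "wikipedia_title": "Example page",
    "text": ["Example page", paragraph],
    "anchors": [{
        "paragraph_id": 1,
        "start": start,
        "end": start + len("another page"),
        "wikipedia_title": "Another page",
    }],
    "categories": ["Examples"],
    "history": {"url": "", "timestamp": ""},
}

for anchor in record["anchors"]:
    print(anchor_surface(record, anchor), "->", anchor["wikipedia_title"])
```

Anchors of this kind are what make it possible to use hyperlinks in the knowledge source as entity linking training data.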

Datasets Mapping Details

In FEVER, often multiple pieces of knowledge must be combined to produce an output. For example, 30% of claims have more than one equally-valid provenance and 16% require the combination of multiple evidence spans. The second iteration (FEVER2.0, Thorne et al., 2019) introduces a collection of adversarial instances. For KILT, we merge the two versions of FEVER into a single resource and consider only supported and refuted claims. We exclude all claims classified as not having enough information, since these instances have no evidence to assess the claim and cannot be mapped to the KILT knowledge source; we therefore cannot assess whether such a label is still appropriate given our snapshot. Moreover, we design KILT as an in-KB resource where each instance can be answered and corroborated by information in the knowledge source.

In the Zero Shot RE dataset a set of crowd-sourced template questions is defined for each relation (for example, What is Albert Einstein’s alma mater?). Each datapoint reports a Wikipedia sentence expressing the fact, which we take as provenance. Some examples in the dataset are negative, obtained by matching a valid question with a random sentence that likely does not contain the answer. To consider an open-domain version of this dataset and align the input/output with the KILT interface, we reformatted this dataset as follows: (i) exclude negative pairs, since we consider the whole knowledge source (as opposed to a single sentence) as text and all questions can therefore be answered; (ii) group template questions by the subject-relation pair, and create a single datapoint for each (input as above); (iii) randomly split the set of relations, in line with the original dataset, into three disjoint sets: train (with 84 relations), dev (12 relations) and test (24 relations), so that systems are tested on relations never seen during training; (iv) use the subject entity as the query against Wikipedia titles for the first step of the mapping strategy; and (v) include all template questions in a meta field.
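Steps (ii) and (iii) can be sketched as follows. This is a minimal illustration under assumptions: the function names, the fixed random seed, and the triple-based input format are hypothetical and are not taken from the original processing code.

```python
import random
from collections import defaultdict


def split_relations(relations, sizes=(84, 12, 24), seed=0):
    """Randomly partition relation names into disjoint train/dev/test sets."""
    unique = sorted(set(relations))
    if len(unique) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} relations, got {len(unique)}")
    random.Random(seed).shuffle(unique)
    n_train, n_dev, _ = sizes
    return {
        "train": set(unique[:n_train]),
        "dev": set(unique[n_train:n_train + n_dev]),
        "test": set(unique[n_train + n_dev:]),
    }


def group_templates(triples):
    """Group (subject, relation, question) triples by subject-relation pair."""
    grouped = defaultdict(list)
    for subject, relation, question in triples:
        grouped[(subject, relation)].append(question)
    return dict(grouped)


# Hypothetical relation names standing in for the 120 relations of the dataset.
splits = split_relations([f"rel{i}" for i in range(120)])
print({name: len(rels) for name, rels in splits.items()})
```

Splitting at the relation level, rather than at the datapoint level, is what guarantees that dev and test relations are unseen during training.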

For T-REx, we filter out facts with more than 20 provenances and relations with fewer than 1000 facts, and merge all the facts for the same subject-relation pair (i.e., for 1-N and M-N relations there could be multiple valid answers), resulting in 113 relations and 2.3M facts. We include object aliases as equally valid answers and report in a meta field subject aliases as well as all surface mentions for the subject, relation and object. We randomly select 5k facts for both dev and test set.
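The filtering and merging steps can be sketched as below. The input format (dicts with `subject`, `relation`, `object` and `provenances` keys) is an assumption, as is the order of operations: here relation counts are computed after the provenance filter, which the text does not specify.

```python
from collections import Counter, defaultdict


def filter_and_merge(facts, max_provenances=20, min_facts_per_relation=1000):
    """Filter facts, then merge them by (subject, relation) pair."""
    # Drop facts with too many provenances.
    kept = [f for f in facts if len(f["provenances"]) <= max_provenances]
    # Count surviving facts per relation (assumed order of operations).
    counts = Counter(f["relation"] for f in kept)
    merged = defaultdict(lambda: {"answers": [], "provenances": []})
    for fact in kept:
        if counts[fact["relation"]] < min_facts_per_relation:
            continue  # drop relations with too few facts
        entry = merged[(fact["subject"], fact["relation"])]
        if fact["object"] not in entry["answers"]:
            entry["answers"].append(fact["object"])
        entry["provenances"].extend(fact["provenances"])
    return dict(merged)


# Toy data with small thresholds so the behavior is easy to follow.
toy = [
    {"subject": "A", "relation": "r1", "object": "x", "provenances": ["p1"]},
    {"subject": "A", "relation": "r1", "object": "y", "provenances": ["p2"]},
    {"subject": "B", "relation": "r1", "object": "z", "provenances": ["p"] * 3},
    {"subject": "C", "relation": "r2", "object": "w", "provenances": ["p3"]},
]
result = filter_and_merge(toy, max_provenances=2, min_facts_per_relation=2)
print(result)
```

Merging by subject-relation pair is what allows a single datapoint to carry several valid answers for 1-N and M-N relations.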

To define an open version of the Natural Questions dataset we follow Lee et al. (2019) and (1) keep only questions with short answers and (2) discard all answers with more than five tokens.
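A minimal sketch of these two filters is shown below. Two assumptions are made for illustration: tokens are obtained by whitespace splitting, and a question whose answers are all discarded is dropped as well. The `question` and `short_answers` keys are hypothetical.

```python
def to_open_nq(examples, max_answer_tokens=5):
    """Keep questions that retain at least one short answer of limited length."""
    filtered = []
    for example in examples:
        answers = [
            answer
            for answer in example.get("short_answers", [])
            if len(answer.split()) <= max_answer_tokens  # whitespace tokens (assumption)
        ]
        if answers:
            filtered.append({"question": example["question"], "answers": answers})
    return filtered


# Toy examples invented for this sketch.
toy_nq = [
    {"question": "q1", "short_answers": ["Paris", "the city of light in northern France"]},
    {"question": "q2", "short_answers": []},
    {"question": "q3", "short_answers": ["one two three four five six"]},
]
open_nq = to_open_nq(toy_nq)
print(open_nq)
```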

To find answers in TriviaQA, the original work used distant supervision: (1) find Wikipedia entities in the question with the TAGME entity linker Ferragina and Scaiella (2011); (2) search for the answer (and all Wikipedia aliases) in the corresponding page; (3) if the answer is found, add the page to the evidence documents. Therefore, the documents are not guaranteed to contain evidence for the question-answer pair (but the authors estimate that they do 79.7% of the time).
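The three distant-supervision steps can be sketched as follows. The entity linker and the page lookup are passed in as plain callables, so nothing here reflects the actual TAGME interface; case-insensitive substring matching is also an assumption.

```python
def distant_evidence(question, answer_aliases, link_entities, get_page_text):
    """Return titles of linked pages whose text mentions the answer or an alias."""
    evidence = []
    for title in link_entities(question):            # step (1): entities in the question
        page_text = get_page_text(title).lower()
        if any(alias.lower() in page_text for alias in answer_aliases):  # step (2)
            evidence.append(title)                    # step (3): keep the page
    return evidence


# Stand-ins for the entity linker and the Wikipedia dump, invented for this sketch.
pages = {
    "France": "France is a country whose capital is Paris.",
    "Capital city": "A capital city is the seat of government.",
}
found = distant_evidence(
    "What is the capital of France?",
    ["Paris", "City of Light"],
    link_entities=lambda q: ["France", "Capital city"],
    get_page_text=lambda title: pages[title],
)
print(found)
```

Because a page is kept as soon as the answer string occurs in it, the match says nothing about whether the page actually supports the answer, which is why the evidence is not guaranteed.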

In ELI5, evidence documents are automatically gathered, and we focus on the case where evidence documents are extracted from Wikipedia. However, as the original work first collected question-answer pairs from the subreddit Explain Like I’m Five, the documents are not guaranteed to contain evidence.

For Wizard of Wikipedia we discard cases where the dataset does not contain provenance. Moreover, we consider a full open-domain setting where no topic is provided for the conversation and the model must search over all of Wikipedia for knowledge at each dialogue turn (rather than the provided knowledge candidates for each turn in the original dataset). We use the unseen split for dev and test.

Performance Impact Of The Mapping Strategy

We want to assess whether the performance we obtain after mapping each dataset to a unified Wikipedia snapshot is in line with what is reported in previous work. Thorne and Vlachos (2020) report a 2-way accuracy of 79.09 for the FEVER dev set when considering purely claims as input to a RoBERTa-based classifier Liu et al. (2019). Our dev set also includes the adversarial examples of FEVER 2.0; nevertheless, the performance of BART is in line (80.67 dev, 78.93 test). Karpukhin et al. (2020) report 41.5 EM on the open-domain version of the NQ dev set (reported as test results in Karpukhin et al. (2020)). With our setting, DPR achieves on-par performance on the dev set, with 42.58 EM (50.43 F1-score). Results on our brand new NQ test set are 3 to 4 points lower for EM and F1-score than dev results. We don’t evaluate multi-hop specific baselines on KILT, but the current best F1-score for HotpotQA is 75.43 according to the official leaderboard (https://hotpotqa.github.io), which is quite far from what is achieved by our general solutions. BLINK results are in line with what is reported in the GitHub repository (https://github.com/facebookresearch/BLINK) for all three entity linking datasets. The Transformer MemNet of Dinan et al. (2019) achieves an F1-score of 14.3 on the original version of the WoW dataset and 11.5 in our setting, probably because in KILT we consider a harder open-domain setting.

Retrieval Baselines

The ability to retrieve relevant documents from Wikipedia given an input is an important aspect we assess in KILT. A system should select only the relevant knowledge needed for the task, without redundant or excess information. A way to surface such knowledge is using a dedicated retrieval system. We consider three off-the-shelf retrievers and investigate drastically different retrieval paradigms: (i) Tf-idf with the DrQA Document Retriever Chen et al. (2017), a traditional page-level sparse vector space retrieval model; (ii) DPR Karpukhin et al. (2020), a modern passage-level retrieval solution using dense representations; (iii) a combination of BLINK Wu et al. (2019) and flair Akbik et al. (2019), a retrieval solution that ranks pages according to entities in the input.

The DrQA Document Retriever combines bigram hashing and TF-IDF matching to return relevant Wikipedia pages given an input. DPR splits each Wikipedia page into disjoint 100-word passages (22,220,793 passages in the KILT knowledge source; following Karpukhin et al. (2020), we don’t consider Wikipedia bulleted lists in the text) and encodes passages and inputs with a BERT-based bi-encoder to perform dense Maximum Inner Product Search. The BLINK entity linking system uses a BERT-based bi-encoder to encode each Wikipedia page as well as each input, where a single entity mention is tagged. Final results are refined with a BERT-based cross-encoder. To use BLINK for retrieval, we look for entity mentions in each input with flair, then use BLINK to return a ranked list of Wikipedia pages for each entity mention. When multiple entities are identified in the input, we merge results and sort by score. The input string might not contain tags. For all systems, we use the index created on the KILT knowledge source.
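The passage splitting, inner-product search and result merging described above can be sketched in plain Python. This is an illustration under assumptions, not the DPR or BLINK implementation: words are whitespace tokens, the search is brute force over toy vectors rather than an index over BERT embeddings, and a page returned for several mentions keeps its highest score.

```python
def split_passages(text, size=100):
    """Split a page into disjoint passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def max_inner_product_search(query_vec, passage_vecs, k=5):
    """Return indices of the k passages with the largest inner product."""
    scored = [
        (sum(q * p for q, p in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(passage_vecs)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest score first
    return [idx for _, idx in scored[:k]]


def merge_mention_rankings(rankings):
    """Merge per-mention (page, score) lists and sort pages by score."""
    best = {}
    for ranking in rankings:
        for page, score in ranking:
            if page not in best or score > best[page]:
                best[page] = score  # keep the highest score per page (assumption)
    return sorted(best.items(), key=lambda item: -item[1])


passages = split_passages(" ".join(str(i) for i in range(250)))
top = max_inner_product_search([1.0, 0.0], [[0.1, 5.0], [2.0, 0.0], [1.0, 1.0]], k=2)
merged = merge_mention_rankings([[("A", 0.9), ("B", 0.2)], [("B", 0.5), ("C", 0.7)]])
print(len(passages), top, merged)
```

In practice the brute-force scan would be replaced by an index over the precomputed passage vectors, since the knowledge source contains over 22 million passages.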

We also experiment with multi-tasking, by jointly training a single DPR model on all KILT training data. In particular, the Multi-task variant of DPR is a single dense passage retriever, trained jointly on the union of TQA, NQ, HoPo, FEV, zsRE, AY2, T-REx and WoW. To prevent large datasets, such as T-REx, from having an outsized effect, we resample all datasets uniformly, such that every training epoch contains 150k samples from each task. Batches are formed from a single dataset at a time, iterating through the various datasets in a round-robin fashion.
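The resampling and batching scheme can be sketched as a generator. The function name, the batch size and the decision to sample with replacement when a dataset has fewer examples than the per-task quota are assumptions made for this illustration.

```python
import random


def epoch_batches(datasets, samples_per_task=150_000, batch_size=32, seed=0):
    """Yield (task, batch) pairs: equal samples per task, round-robin over tasks."""
    rng = random.Random(seed)
    per_task = {}
    for name, examples in datasets.items():
        if len(examples) >= samples_per_task:
            chosen = rng.sample(examples, samples_per_task)     # downsample large tasks
        else:
            chosen = rng.choices(examples, k=samples_per_task)  # upsample small tasks (assumption)
        per_task[name] = [
            chosen[i:i + batch_size] for i in range(0, samples_per_task, batch_size)
        ]
    # Every task has the same number of batches, so round-robin is a simple interleave.
    n_batches = len(next(iter(per_task.values()))) if per_task else 0
    for step in range(n_batches):
        for name in datasets:
            yield name, per_task[name][step]  # each batch comes from one dataset only


# Toy datasets with a tiny quota so the schedule is easy to inspect.
toy_data = {"a": list(range(10)), "b": list(range(3))}
schedule = list(epoch_batches(toy_data, samples_per_task=4, batch_size=2))
print([task for task, _ in schedule])
```

With a fixed quota per task, the number of gradient steps each task receives per epoch is independent of its original size.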

Task-specific Baselines

Approaches to the KILT Benchmark should be able to generalize to many different tasks, as developing model architectures that can represent knowledge generally is a valuable direction. However, several tasks may benefit from dedicated architectures designed for them.

For fact checking, we consider NSMN Nie et al. (2019), the highest scoring system from the FEVER shared task Thorne et al. (2018b). We use the public model (available at https://github.com/easonnie/combine-FEVER-NSMN) pre-trained on FEVER, and consider not enough information predictions as false. Moreover, we develop a fact checking baseline that combines a BERT-base classifier with passages returned from DPR, where the claim and retrieved passage are input. The classifier is trained to label the claim-passage pair as supported or refuted, with an additional neutral class for negative-sampled unrelated passages. Unrelated passages are sampled from two sources: (1) DPR-retrieved passages from pages that are not in the list of pages in the instance’s provenance and (2) passages sampled uniformly at random from pages in the instance’s provenance. At inference, we classify the first sentence of the Wikipedia pages retrieved by the top-100 DPR passages against the claim. Using pages labelled as supported or refuted, we label the claim through majority voting. For claim provenance, we re-rank passages by probability according to this label.
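The inference-time voting and re-ranking can be sketched as follows. The prediction format (a page id plus a probability per class) and the tie-breaking rule in favour of "supported" are assumptions; the text does not specify how ties or an empty vote are handled.

```python
from collections import Counter


def vote_and_rerank(predictions):
    """Label a claim by majority vote, then re-rank pages by the winning label."""
    votes = Counter()
    for pred in predictions:
        top_label = max(pred["probs"], key=pred["probs"].get)
        if top_label in ("supported", "refuted"):  # neutral pages do not vote
            votes[top_label] += 1
    if not votes:
        return None, []  # no page supports or refutes the claim (assumed behavior)
    # Ties are broken in favour of "supported" (assumption).
    label = "supported" if votes["supported"] >= votes["refuted"] else "refuted"
    ranked = sorted(predictions, key=lambda p: p["probs"][label], reverse=True)
    return label, [p["page"] for p in ranked]


# Toy classifier outputs invented for this sketch.
toy_preds = [
    {"page": "P1", "probs": {"supported": 0.7, "refuted": 0.2, "neutral": 0.1}},
    {"page": "P2", "probs": {"supported": 0.1, "refuted": 0.8, "neutral": 0.1}},
    {"page": "P3", "probs": {"supported": 0.6, "refuted": 0.1, "neutral": 0.3}},
    {"page": "P4", "probs": {"supported": 0.2, "refuted": 0.1, "neutral": 0.7}},
]
claim_label, provenance = vote_and_rerank(toy_preds)
print(claim_label, provenance)
```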

For Open Domain QA and Slot Filling, we use DPR combined with the pre-trained BERT-based extractive reading comprehension model of Karpukhin et al. (2020). We use the model pre-trained on TriviaQA for HotpotQA and the model pre-trained on Natural Questions for Zero Shot RE. We reduce the slot filling problem to question answering by using the specified template questions. We consider a single random template question per subject-relation pair during inference.

For Dialogue, we consider the Generative Transformer MemNet Dinan et al. (2019) that encodes the dialogue history and knowledge to generate the next utterance. We use the pre-trained version available in ParlAI Miller et al. (2017). Finally, to test the performance of combining BART and DPR on FEVER, we develop a classifier that uses these (full description in the appendix).

General Baselines

A main motivation of the KILT Benchmark is to enable a unified approach towards a wide range of knowledge-intensive tasks. We analyze existing general architectures that can be used as a baseline for multiple tasks in KILT.

Large pre-trained sequence-to-sequence models such as BART Lewis et al. (2019) and T5 Raffel et al. (2019a) implicitly store a surprising amount of knowledge in their parameters Petroni et al. (2019). We treat all KILT tasks as generative, relying on the knowledge accumulated by the model while pre-training, with no retrieval (similarly to Roberts et al. (2020)). We fine-tune pre-trained variants on all KILT tasks, using fairseq Ott et al. (2019) for BART and Hugging Face’s Transformers Wolf et al. (2019) for T5.

A natural way to boost performance is to incorporate an explicit knowledge mechanism. For our BART+DPR baseline, we follow Petroni et al. (2020) to retrieve and prepend the top-3 passages from DPR for each input sample and use context-enhanced training data to fine-tune a BART model. We use the DPR rank when reporting provenance for all except entity linking tasks. For entity linking, we report the Wikipedia id of the page whose title exactly matches the predicted string.
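The input construction and the entity linking provenance lookup can be sketched as below. The separator string, the passage format and the function names are assumptions for illustration; the text only states that the top-3 DPR passages are prepended to the input and that predicted strings are matched exactly against page titles.

```python
def build_context_input(query, ranked_passages, k=3, sep=" [SEP] "):
    """Prepend the top-k retrieved passages to the input string."""
    top = [passage["text"] for passage in ranked_passages[:k]]
    return sep.join(top + [query])  # separator token is an assumption


def entity_link_provenance(predicted_title, title_to_id):
    """Return the Wikipedia id whose title exactly matches the prediction, if any."""
    return title_to_id.get(predicted_title)


# Toy retrieval output and title index invented for this sketch.
retrieved = [
    {"text": "passage one", "wikipedia_id": "1"},
    {"text": "passage two", "wikipedia_id": "2"},
    {"text": "passage three", "wikipedia_id": "3"},
    {"text": "passage four", "wikipedia_id": "4"},
]
model_input = build_context_input("who wrote X?", retrieved)
page_id = entity_link_provenance("Albert Einstein", {"Albert Einstein": "0"})
print(model_input, page_id)
```

Since the generator itself is unchanged, the same fine-tuning setup can be reused with these context-enhanced inputs.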

Recently, state-of-the-art results on a wide range of NLP tasks have been achieved by combining a trainable retrieval step with language modeling or generation Guu et al. (2020); Lewis et al. (2020a). We experiment with fine-tuning RAG Lewis et al. (2020b) on KILT tasks, establishing a strong baseline on all of them. RAG combines a DPR retriever with a BART generator. However, unlike our previous baseline, RAG back-propagates to the retriever’s input encoder, learning to adapt the input embedding to retrieve more relevant results. At every generation step we retrieve the top-5 passages and use them as provenance.