Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, Nat McAleese

cs.CL cs.LG

Introduction

Generative language models (LMs) are increasingly useful for answering questions about the world, steadily improving (Roberts et al., 2020; Cheng et al., 2021; Nakano et al., 2021) on question-answering benchmarks (Kwiatkowski et al., 2019; Pang et al., 2021; Lin et al., 2021), and serving generated samples to users via APIs (Brockman et al., 2020; Cohere, 2021). By default, however, LMs generate ungrounded claims that users must choose either to blindly accept or to verify themselves. In this work we train models that help the user or data rater evaluate responses by generating claims alongside supporting evidence. This evidence takes the form of a verbatim quote extracted from a longer source retrieved by Google Search or any suitable information retrieval system. We call this task “self-supported question-answering” (SQA), and intend it as a sub-task that can be embedded into other generative language modelling tasks such as open-ended dialogue or debate (Rae et al., 2021; Irving et al., 2018; Askell et al., 2021; Komeili et al., 2021; Thoppilan et al., 2022).

Crucially, citing external sources inline decreases the effort required on the part of human annotators. By extracting specific supporting quotes from the document rather than linking to entire web pages, we allow faster and more specific appraisal of supportedness. This also affords end-users a qualitatively different level of trust in model samples, compared to systems which simply return an unsupported answer. This consideration has also motivated recently released, partly concurrent work (Nakano et al., 2021) in which a finetuned version of GPT-3 cites sources.

One could view self-supporting answers as a specific type of explanation, putting our work alongside other work in explainable AI (Ras et al., 2020) that aims to provide natural-language explanations of QA model responses (Lamm et al., 2020; Latcinnik and Berant, 2020; Narang et al., 2020). Our goals are aligned to the extent that both explanations and supporting evidence are ways to increase trust in model outputs. However, while our training objective incentivises the model’s answer to agree with the evidence it provides, our method makes no attempt to guarantee that the evidence faithfully (Jacovi and Goldberg, 2020) describes the reason that the model generated the claim. We view work in that direction as complementary.

We cast SQA as a (conditional) language modelling problem, generating both free-form answers and verbatim quotes of supporting evidence as a single string with evidence “inlined”. We term this approach “Inline Evidence”. Whilst more specialized architectures exist for extracting spans from documents (Karpukhin et al., 2020; Joshi et al., 2020; Keskar et al., 2019), we show that span extraction with generative models works well and enables taking advantage of the powerful Large Language Models (LLMs) developed in recent years (Brown et al., 2020; Rae et al., 2021; Smith et al., 2022; Lieber et al., 2021; Zeng et al., 2021). In order to ensure the quotes are “verbatim” with a generative approach, we introduce a special syntax for the language model to use when quoting from documents and constrain the outputs of the model to be exact quotes from the retrieved documents when in this mode.
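To make the verbatim-quote constraint concrete, here is a minimal sketch of one way such a constraint could be enforced during decoding. It is not the paper's implementation: the function name `allowed_next_tokens`, the whitespace tokenization, and the single-document setting are all assumptions made for illustration. The idea is that, while in quoting mode, the only permitted next tokens are those that keep the quote emitted so far a contiguous span of the source document.

```python
from typing import List, Set


def allowed_next_tokens(doc_tokens: List[str], quote_prefix: List[str]) -> Set[str]:
    """Return tokens t such that quote_prefix + [t] is a contiguous span of doc_tokens.

    Hypothetical helper: a real system would operate on the model's own
    subword vocabulary and mask logits outside this set while quoting.
    """
    n = len(quote_prefix)
    allowed: Set[str] = set()
    # Find every occurrence of the prefix that is followed by at least one token.
    for start in range(len(doc_tokens) - n):
        if doc_tokens[start:start + n] == quote_prefix:
            allowed.add(doc_tokens[start + n])
    return allowed


# Toy document, split on whitespace purely for illustration.
doc = "the cat sat on the mat".split()
after_the = allowed_next_tokens(doc, ["the"])
after_the_cat = allowed_next_tokens(doc, ["the", "cat"])
```

An empty result means the quote cannot be extended further, so in this sketch the decoder would have to close the quote at that point.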

To measure the quality of the generated answers on the SQA task, we ask human raters to assess whether the answers are plausible and whether they are supported by the accompanying quote evidence. The first metric, “plausible”, assesses whether the answer is a reasonable on-topic response to the question, as if it were occurring in a conversation. The second metric, “supported”, indicates whether the provided evidence is sufficient to verify the validity of the answer. Producing SQA responses that are both plausible and supported is a nontrivial exercise in aligning the language model to human preferences.

In this work, we describe an Inline Evidence system, named GopherCite, which we developed by finetuning the 280B parameter Gopher language model (Rae et al., 2021) using a combination of supervised learning and Reinforcement Learning from Human Preferences (RLHP), as in Ziegler et al. (2019). Given an input query, the system retrieves relevant documents using Google Search and presents the language model with a large context drawn from multiple documents. Our system trusts these sources: we do not explicitly mitigate untrustworthy sources in this version of our work, and we forward documents to the model no matter where they come from. The language model, in turn, synthesizes an SQA response, with the evidence drawn as a verbatim quote from one of these articles. During reinforcement learning, GopherCite optimizes the score from a “reward model”, which is trained to predict human pairwise preferences between two candidate responses and, through an auxiliary classification loss, whether the response is plausible and whether it is supported.
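The reward model objective described above can be illustrated with a small sketch. The exact loss is not given in this section, so everything here is an assumption: a logistic (Bradley-Terry style) loss on the score difference for the pairwise preference, binary cross-entropy for the two auxiliary labels, and a hypothetical mixing weight `aux_weight`.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_preference_loss(r_preferred: float, r_rejected: float) -> float:
    """-log sigmoid(r_preferred - r_rejected): small when the preferred response scores higher."""
    return softplus(r_rejected - r_preferred)


def binary_cross_entropy(logit: float, label: bool) -> float:
    """-log sigmoid(logit) if label is True, else -log(1 - sigmoid(logit))."""
    return softplus(-logit) if label else softplus(logit)


def reward_model_loss(r_preferred: float, r_rejected: float,
                      plausible_logit: float, plausible: bool,
                      supported_logit: float, supported: bool,
                      aux_weight: float = 1.0) -> float:
    """Preference loss plus weighted auxiliary losses (assumed combination)."""
    aux = (binary_cross_entropy(plausible_logit, plausible)
           + binary_cross_entropy(supported_logit, supported))
    return pairwise_preference_loss(r_preferred, r_rejected) + aux_weight * aux
```

In this sketch the scalar scores and logits stand in for the outputs of a model head; how the actual reward model produces them is outside what this section describes.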

Retrieving sources using a search engine (Nakano et al., 2021; Thoppilan et al., 2022; Lazaridou et al., 2022) that is kept up-to-date – and supplying them to the language model in a nonparametric fashion – can enable improved temporal generalization over a purely parametric model (Liska et al., 2022; Lewis et al., 2021a; Borgeaud et al., 2021). It also enables the system to attempt questions implying the present date, like “which country got the most medals in the last winter olympics?”.

In our experiments, we show that GopherCite produces high quality (plausible and supported) answers 80% of the time when prompted with fact-seeking questions drawn from a filtered subset of the NaturalQuestions dataset and 67% of the time when prompted with explanation-seeking questions drawn from a filtered subset of the ELI5 (“Explain like I’m five”) dataset (Fan et al., 2019). Furthermore, we can improve the reliability of the system dramatically by selecting a minority of questions to decline to answer (El-Yaniv et al., 2010).

We develop a reward model-based mechanism for abstaining from answering a configurable proportion of test-time questions. Performance is measured in this setting by plotting the trade-off between question coverage (the proportion of questions attempted) and the quality of responses when attempting. When declining to answer less than a third of questions in these datasets, the response quality measured amongst those questions the system attempts climbs from 80% to 90% on the filtered NaturalQuestions subset, exceeding the level of performance humans obtain when answering every question. On the filtered ELI5 subset, performance improves from 67% to 80%.
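The coverage/quality trade-off can be sketched as follows. This is an illustrative reading rather than the paper's procedure: it assumes each question has a single reward model score for its best response and a binary human label for whether that response is both plausible and supported, and that the system abstains whenever the score falls below a threshold. The function name and the toy numbers are hypothetical.

```python
from typing import List, Optional, Tuple


def coverage_and_quality(scores: List[float], is_good: List[bool],
                         threshold: float) -> Tuple[float, Optional[float]]:
    """Attempt only questions whose reward score is >= threshold.

    Returns (coverage, quality): the fraction of questions attempted and
    the fraction of attempted answers labelled good (None if none attempted).
    """
    attempted = [good for score, good in zip(scores, is_good) if score >= threshold]
    coverage = len(attempted) / len(scores)
    quality = sum(attempted) / len(attempted) if attempted else None
    return coverage, quality


# Toy data: four questions with made-up reward scores and rater labels.
scores = [0.9, 0.8, 0.4, 0.2]
is_good = [True, True, False, True]

answer_all = coverage_and_quality(scores, is_good, threshold=0.0)
answer_half = coverage_and_quality(scores, is_good, threshold=0.5)
```

Sweeping the threshold over the observed scores traces out the full curve; raising it lowers coverage and, if the reward model is informative, raises quality among attempted questions.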

Despite these benefits, optimizing for answers that can be supported by documents on the internet is not sufficient to ensure that model responses are true. We show this via evaluation on the adversarial TruthfulQA (Lin et al., 2021) dataset, along with some qualitative highlights. Whilst often helpful, our models are able to select misleading evidence even from authoritative corpora, pointing to a need for enhancement in future work. In particular, we need to tackle source trustworthiness, ensure answers are given with more careful qualification, and investigate whether more subtle alignment approaches such as debate can provide reward signals which ensure that quotes are not misleading.

As we developed GopherCite, closely related work was released, including an updated LaMDA model (Thoppilan et al., 2022) and the WebGPT system (Nakano et al., 2021). LaMDA also focuses on factual grounding, but supports answers by simply showing a URL rather than pointing the user to an easily verified quote as we do in GopherCite. Similar to our work, WebGPT uses RLHP to train question-answering models which refer to sources from the internet. WebGPT learns to interact multiple times with a search engine when gathering evidence to be passed to the question-answering model, critically deciding which queries to issue to a search engine rather than simply forwarding the user query as we do. In our work, instead of curating a collection of brief snippets from multiple search engine interactions, we condition GopherCite on a large context containing thousands of tokens of uncurated information from multiple pages, focusing GopherCite on reading comprehension, and we specifically investigate how well the model supports individual claims. We view the richer interaction with a search engine developed in LaMDA and WebGPT as an exciting, complementary direction to the focus of our work. Further similarities and differences to WebGPT, LaMDA, and other recent work in the community are detailed in subsection 2.10. Interestingly, we concur with many of their empirical results, such as the relative performance of reinforcement learning and supervised finetuning in the reranking regime, and the ability to obtain models competitive with human performance.

Methods

Our models generate an answer with supporting evidence “inlined” into a single string (hence “Inline Evidence”), treating the task of producing supported claims as (conditional) language modelling. Answer and evidence use the following template, where template tokens are black and output placeholders are violet: