Teaching language models to support answers with verified quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, Nat McAleese
Introduction
Generative language models (LMs) are increasingly useful for answering questions about the world, steadily improving (Roberts et al., 2020; Cheng et al., 2021; Nakano et al., 2021) on question-answering benchmarks (Kwiatkowski et al., 2019; Pang et al., 2021; Lin et al., 2021), and serving generated samples to users via APIs (Brockman et al., 2020; Cohere, 2021). By default, however, LMs generate ungrounded claims that users must choose either to blindly accept or to verify themselves. In this work we train models that help the user or data rater evaluate responses by generating claims alongside supporting evidence. This evidence takes the form of a verbatim quote extracted from a longer source retrieved by Google Search or any suitable information retrieval system. We call this task “self-supported question-answering” (SQA), and intend it as a sub-task that can be embedded into other generative language modelling tasks such as open-ended dialogue or debate (Rae et al., 2021; Irving et al., 2018; Askell et al., 2021; Komeili et al., 2021; Thoppilan et al., 2022).
Crucially, citing external sources inline decreases the effort required on the part of human annotators. By extracting specific supporting quotes from the document rather than linking to entire web pages, we allow faster and more specific appraisal of supportedness. This also affords end-users a qualitatively different level of trust in model samples, compared to systems which simply return an unsupported answer. This consideration has also motivated recently released, partly concurrent work (Nakano et al., 2021) in which a finetuned version of GPT-3 cites sources.
One could view self-supporting answers as a specific type of explanation, putting our work alongside other work in explainable AI (Ras et al., 2020) that aims to provide natural-language explanations of QA model responses (Lamm et al., 2020; Latcinnik and Berant, 2020; Narang et al., 2020). Our goals are aligned to the extent that both explanations and supporting evidence are ways to increase trust in model outputs. However, while our training objective incentivises the model’s answer to agree with the evidence it provides, our method makes no attempt to guarantee that the evidence faithfully (Jacovi and Goldberg, 2020) describes the reason that the model generated the claim. We view work in that direction as complementary.
We cast SQA as a (conditional) language modelling problem, generating both free-form answers and verbatim quotes of supporting evidence as a single string with evidence “inlined”. We term this approach “Inline Evidence”. Whilst more specialized architectures exist for extracting spans from documents (Karpukhin et al., 2020; Joshi et al., 2020; Keskar et al., 2019), we show that span extraction with generative models works well and enables taking advantage of the powerful Large Language Models (LLMs) developed in recent years (Brown et al., 2020; Rae et al., 2021; Smith et al., 2022; Lieber et al., 2021; Zeng et al., 2021). In order to ensure the quotes are “verbatim” with a generative approach, we introduce a special syntax for the language model to use when quoting from documents and constrain the outputs of the model to be exact quotes from the retrieved documents when in this mode.
To measure the quality of the generated answers on the task of Self-Supported question-answering (SQA), we ask human raters to assess whether the answers are plausible and whether they are supported by the accompanying quote evidence. The first metric, “plausible”, assesses if the answer is a reasonable on-topic response to the question as if it were occurring in a conversation. The second metric, “supported”, is introduced to indicate whether the provided evidence is sufficient to verify the validity of the answer. Producing SQA responses that are both plausible and supported is a nontrivial exercise in aligning the language model to human preferences.
In this work, we describe an Inline Evidence system – named GopherCite – which we developed by finetuning the 280B parameter Gopher language model (Rae et al., 2021) using a combination of supervised learning and Reinforcement Learning from Human Preferences (RLHP), as in (Ziegler et al., 2019). Given an input query, the system retrieves relevant documents using Google Search and presents the language model a large context drawn from multiple documents. Whilst our system trusts these sources, we do not explicitly mitigate untrustworthy sources in this version of our work and forward documents to the model no matter where they come from. The language model, in turn, synthesizes a SQA response, with the evidence drawn as a verbatim quote from one of these articles. During reinforcement learning, GopherCite optimizes the score from a “reward model” which predicts human pairwise preferences between two candidate responses as well as an auxiliary classification loss as to whether the response is plausible and whether it is supported.
Retrieving sources using a search engine (Nakano et al., 2021; Thoppilan et al., 2022; Lazaridou et al., 2022) that is kept up-to-date – and supplying them to the language model in a nonparametric fashion – can enable improved temporal generalization over a purely parametric model (Liska et al., 2022; Lewis et al., 2021a; Borgeaud et al., 2021). It also enables the system to attempt questions implying the present date, like “which country got the most medals in the last winter olympics?”.
In our experiments, we show that GopherCite produces high quality (plausible and supported) answers 80% of the time when prompted with fact-seeking questions drawn from a filtered subset of the NaturalQuestions dataset and 67% of the time when prompted with explanation-seeking questions drawn from a filtered subset of the ELI5 (“Explain like I’m five”) dataset (Fan et al., 2019). Furthermore, we can improve the reliability of the system dramatically by selecting a minority of questions to decline to answer (El-Yaniv et al., 2010).
We develop a reward model-based mechanism for abstaining from answering a configurable proportion of test-time questions. Performance is measured in this setting by plotting the trade-off between question coverage (the proportion of questions attempted) and the quality of responses when attempting. When declining to answer less than a third of questions in these datasets, the response quality measured amongst those questions the system attempts climbs from 80% to 90% on the filtered NaturalQuestions subset, exceeding the level of performance humans obtain when answering every question. On the filtered ELI5 subset, performance improves from 67% to 80%.
Despite these benefits, optimizing for answers that can be supported by documents on the internet is not sufficient to ensure that model responses are true. We show this via evaluation on the adversarial TruthfulQA (Lin et al., 2021) dataset, along with some qualitative highlights. Whilst often helpful, our models are able to select misleading evidence even from authoritative corpora pointing to a need for enhancement in future work. In particular, we need to tackle source trustworthiness, ensure answers are given with more careful qualification, and investigate whether more subtle alignment approaches such as debate can provide reward signals which ensure that quotes are not misleading.
As we developed GopherCite, closely related work was released, including an updated LaMDA model (Thoppilan et al., 2022) and the WebGPT system (Nakano et al., 2021). LaMDA also focuses on factual grounding, but supports answers by simply showing a URL rather than pointing the user to an easily verified quote as we do in GopherCite. Similar to our work, WebGPT uses RLHP to train question-answering models which refer to sources from the internet. WebGPT learns to interact multiple times with a search engine when gathering evidence to be passed to the question-answering model, critically deciding which queries to issue to a search engine rather than simply forwarding the user query as we do. In our work, instead of curating a collection of brief snippets from multiple search engine interactions, we condition GopherCite on a large context containing thousands of tokens of uncurated information from multiple pages, focusing GopherCite on reading comprehension, and we specifically investigate how well the model supports individual claims. We view the richer interaction with a search engine developed in LaMDA and WebGPT as an exciting, complementary direction to the focus of our work. Further similarities to and differences from WebGPT, LaMDA, and other recent work in the community are detailed in subsection 2.10. Interestingly, we concur with many of their empirical results, such as the relative performance of reinforcement learning and supervised finetuning in the reranking regime, and the ability to obtain models competitive with human performance.
Methods
Our models generate an answer with supporting evidence “inlined” into a single string (hence “Inline Evidence”), treating the task of producing supported claims as (conditional) language modelling. Answer and evidence use the following template, where template tokens are black and output placeholders are violet:
For example, the answer from Figure 1 about the Scooby-Doo series would be expressed as: %%(Scooby-Doo)%[This Saturday-morning cartoon series featured teenagers Fred Jones, Daphne Blake, Velma Dinkley, and Shaggy Rogers, and their talking Great Dane named Scooby-Doo.]%.
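As we use left-to-right language models, our syntax amounts to the autoregressive factorization:
$$P(\text{answer}, \text{evidence} \mid \text{question}, c) = P(\text{answer} \mid \text{question}, c)\, P(\text{evidence} \mid \text{answer}, \text{question}, c)$$
Above, $c$ is the set of context documents retrieved from Google Search or provided by the user, described in subsection 2.3.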
We benefit from these additional properties of this syntax:
Parsing and constrained sampling We can parse expressions emitted by the model post-hoc, or constrain them to be valid online during sampling (Appendix J). Constrained sampling ensures that the model quotes are verbatim from the claimed source. Post-hoc parsing is useful for splitting up the sample to render the claim and evidence separately either in a UI or in a downstream system that may wish to use the claim and evidence separately.
Scoring answers in isolation Because answers occur first in the autoregressive ordering, we can assign likelihood to them without considering evidence.
Conditional evidence generation We can treat conditional evidence generation as the continuation of a prefix in which a claim is given.
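The post-hoc parsing and verbatim-quote check described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the exact delimiters (`%<claim>%(title)%[quote]%`), the `SQAResponse` type, and both function names are assumptions made for the example, and the check runs after sampling rather than constraining the sampler itself.

```python
import re
from typing import Dict, NamedTuple, Optional

# ASSUMPTION: the inline evidence string looks like
#   %<claim>%(document title)%[verbatim quote]%
# The delimiters are illustrative; adjust them to the syntax actually in use.
INLINE_EVIDENCE = re.compile(
    r"^%<(?P<claim>.+?)>%\((?P<title>.+?)\)%\[(?P<quote>.+?)\]%$",
    re.DOTALL,
)


class SQAResponse(NamedTuple):
    """Hypothetical container for a parsed self-supported answer."""
    claim: str
    title: str
    quote: str


def parse_inline_evidence(sample: str) -> Optional[SQAResponse]:
    """Split a model sample into claim, title and quote, or return None if malformed."""
    match = INLINE_EVIDENCE.match(sample.strip())
    if match is None:
        return None
    return SQAResponse(match["claim"], match["title"], match["quote"])


def quote_is_verbatim(response: SQAResponse, documents: Dict[str, str]) -> bool:
    """Check that the quote is an exact substring of the document it claims to cite."""
    source = documents.get(response.title)
    return source is not None and response.quote in source


docs = {"France": "Paris is the capital of France. It lies on the Seine."}
parsed = parse_inline_evidence("%<Paris.>%(France)%[Paris is the capital of France.]%")
```

Separating the claim from the evidence in this way is what allows a UI or a downstream system to render or score the two parts independently.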
To be clear with the terminology introduced, we view Self-Supported Question Answering (SQA) as the task of producing a supported answer and Inline Evidence as one way to approach the SQA task.
2.2 Pretrained language models
All models used in this paper are finetuned from the weights of a Gopher-family language model from Rae et al. (2021). We focus on the most capable 280B parameter Gopher model, and we consider the 1.4B and 7B parameter variants in an ablation study. We reuse Gopher’s SentencePiece (Kudo and Richardson, 2018) tokenizer with a vocabulary size of 32,000 subwords. For reference, this tokenizer compresses natural language strings into shorter sequences than raw byte tokenization does (Rae et al. (2021), Table A2).
2.3 Conditioning and retrieval
Our system requires a method for finding sources relevant to a question (information retrieval). Many Question-Answering papers have developed “deep learning”-based retrieval systems with KNN lookups (Guu et al., 2020; Lewis et al., 2021a; Borgeaud et al., 2021). Instead, we follow Lazaridou et al. (2022); Nakano et al. (2021); Thoppilan et al. (2022); Komeili et al. (2021) in calling out to production search engines to find relevant sources, leveraging their access to the entire web, convenience of use, and frequent updates. In particular, we simply forward the input question to Google Search, and show as much context as possible from the resulting documents to the language model.
At inference time, we retrieve the top K documents from Google Search, and then perform N sampling passes, iterating over the documents in round-robin order. Each pass shows the language model as much context as possible from a single document. We then re-rank all the samples when choosing one to return. Figure 2 depicts this process.
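The inference loop described above can be sketched roughly as below. The `generate` and `score` callables are hypothetical stand-ins for the language model sampler and the reranker (for example a reward model), and `num_samples` is an illustrative parameter name; none of these names come from the paper.

```python
from typing import Callable, List, Sequence


def sample_and_rerank(
    question: str,
    documents: Sequence[str],
    num_samples: int,
    generate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> str:
    """Draw samples while cycling over documents, then return the best-scoring one."""
    if not documents or num_samples < 1:
        raise ValueError("need at least one document and one sampling pass")
    candidates: List[str] = []
    for i in range(num_samples):
        # Round-robin: pass i conditions on document i mod len(documents).
        document = documents[i % len(documents)]
        candidates.append(generate(question, document))
    # Re-rank all samples and keep the one with the highest score.
    return max(candidates, key=lambda response: score(question, response))


# Toy stand-ins so the sketch runs end to end (purely illustrative).
toy_generate = lambda question, document: document.upper()
toy_score = lambda question, response: float(len(response))
best = sample_and_rerank("q", ["a", "bbb", "cc"], 5, toy_generate, toy_score)
```

Showing one document per pass keeps each prompt focused on a single source, while the final rerank lets evidence from any of the retrieved documents win.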
For details of how we combine a question and retrieved documents into prompts during training, see Appendix A and Appendix H.
2.4 High-level training pipeline
Our approach to finetuning follows Christiano et al. (2017); Ziegler et al. (2019); Stiennon et al. (2020). The entire project iterated over the steps below until the desired performance was reached (illustrated in Figure 3).
Collect data from our best current models, and have it rated by humans. We present model outputs as comparisons for the human labellers that assess the quality of individual answers, as well as preference judgements between answers (subsection 2.6). These serve as data for supervised fine-tuning and reward model training, respectively. On the first iteration, we bootstrap with few-shot prompting of the base Gopher model (subsection 2.5).
Train a supervised finetuning (SFT) model: We fine-tune a pretrained Gopher model on the examples rated positively by the labellers (subsection 2.7). The purpose of the supervised finetuning stage is to teach the model to produce verbatim quotes using our syntax, and to provide a baseline level of Self-Supported Question-Answering ability.
Train a reward model (RM): Reranking model outputs and reinforcement learning both require a scalar "overall quality" label associated with each output. We use a reward model trained on a dataset of comparisons between two answers to a single question using the approach in Christiano et al. (2017) (subsection 2.8).
Optimize a reinforcement learning (RL) policy against a reward model: The RL finetuning stage tunes the model’s quoting behaviour to human preferences (subsection 2.8).
Each iteration of this loop adds data to a continuously growing training set. A full loop of this training scheme was performed four times for short-answer extractive QA data, using the training sets of Natural Questions (Kwiatkowski et al., 2019), SQuAD (Rajpurkar et al., 2016), and TriviaQA (Joshi et al., 2017), and then two further times to extend the system to non-extractive, longer-form question answering on the ELI5 dataset (Fan et al., 2019).
2.5 Bootstrapping via prompting
The supervised model requires labelled input-output examples where the desired outputs make use of “inline evidence” syntax. No such dataset of sufficiently high quality exists, so we created a small training set of about 5000 high-quality examples, with questions drawn from the ELI5 (Fan et al., 2019) and Natural Questions (Kwiatkowski et al., 2019) datasets and articles retrieved via Google Search. (Natural Questions contains fields that enable us to form inline evidence targets with a template, but preliminary experiments using this dataset as a source of inline evidence targets for supervised learning resulted in poor models with low diversity.) In this training set, only the questions from the canonical datasets were used, whilst the target answers were sampled from Gopher (the 280B parameter variant described in Rae et al. (2021)).
Collecting human demonstrations is a standard but expensive way to create a supervised dataset, and is the approach taken by related work (Nakano et al. (2021), Thoppilan et al. (2022)). We instead “prompted” Gopher with a few in-context examples (Rae et al. (2021)) to generate tens of thousands of candidate answers with inline evidence. We then asked human contractors which samples were high quality according to desiderata discussed later in the paper, and kept only the high quality samples. This approach has only recently become viable due to the development of capable language models (Brown et al. (2020); Rae et al. (2021)), and has also been used for the creation of a language dataset in recent work (Liu et al., 2022).
Appendix H has our prompt templates and further details. For a more thorough study on prompting language models with search engine results to increase factuality, see Lazaridou et al. (2022).
2.6 Collection of human ratings
For primary data collection (for both training and evaluation) we present a question and two candidate answers, each split into a “claim” section and a “supporting evidence” section (see blue and grey boxes in Figure 8). We ask raters to check whether either answer is a Plausible response to the question, and whether it is Supported by the accompanying quote evidence. We then ask the participant to decide which answer they prefer (with ties allowed), based on these and other secondary criteria. Below, we define these two terms as we instructed raters to mark them.
Is the answer a plausible reply to the question? “The answer should be a reasonable reply to the question if you were having a conversation. If the answer is off-topic, incoherent, or it’s not clear if it makes sense as a reply to the question, it is not plausible.”
Is the answer supported by the accompanying evidence? “The evidence must be sufficient to convince you that the whole answer is true. If you happen to know the answer is false, or if you need any extra information to be convinced, the answer is not supported. If the evidence is not pertinent to the question or the answer, it cannot support the answer. You can determine if the evidence is relevant by looking at its content, as well as the document title.”
We make use of a “super rater” model, in which paid contractors on a publicly available platform are first assessed for their agreement with ourselves (the researchers) when completing this task. Raters who meet a high enough bar for agreement on Preferred (85% of responses in a quality assurance set) are kept in a “super rater” pool which we iteratively grew over the course of the project. Our training data came exclusively from this set of raters. Appendix C has our full instructions, rater UI, and more details on our data collection process.
2.7 Supervised finetuning
The next stage in our training pipeline is Supervised Fine-Tuning (SFT) to teach the model to use inline evidence syntax. We finetune Gopher only on the bootstrapped samples determined by raters to be both Plausible and Supported. When predicting these Rated-Good targets, we condition the model with a prompt including the question and documents retrieved by Google Search. The prompts used during SFT are shown in Table 12.
During supervised training, we decide uniformly at random how many documents to show the model inside a context of 4096 subword tokens, and how much of the context to dedicate to each document. Once we have picked a token budget for a given document, we truncate the document to that budget by randomly choosing a region surrounding the short snippet returned by Google Search (with a variable amount of additional context before the snippet, after the snippet, or both). Consequently, during inference we can show one document or many documents, either brief or up to 4096 subword tokens in length. We also ensure that the document a target’s quote came from is included in the documents fed to the model, and that if this document is truncated, the quoted passage still occurs in the truncated context.
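One way to realise the truncation step is sketched below. It assumes the document is already tokenized and that the snippet's token span is known; the function name, its arguments, and the choice of a uniformly random window start are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Sequence


def truncate_around_snippet(
    doc_tokens: Sequence[int],
    snippet_start: int,
    snippet_end: int,
    budget: int,
    rng: random.Random,
) -> List[int]:
    """Return at most `budget` tokens that fully contain doc_tokens[snippet_start:snippet_end]."""
    if not 0 <= snippet_start <= snippet_end <= len(doc_tokens):
        raise ValueError("snippet span must lie inside the document")
    if snippet_end - snippet_start > budget:
        raise ValueError("snippet does not fit in the token budget")
    window = min(budget, len(doc_tokens))
    # Earliest and latest window starts that still cover the whole snippet.
    earliest = max(0, snippet_end - window)
    latest = min(snippet_start, len(doc_tokens) - window)
    # ASSUMPTION: pick the start uniformly, so the extra context may fall
    # before the snippet, after it, or on both sides.
    start = rng.randint(earliest, latest)
    return list(doc_tokens[start:start + window])


tokens = list(range(100))
window = truncate_around_snippet(tokens, 40, 45, 20, random.Random(0))
```

The same routine can be applied to the quoted passage instead of the search snippet when a training target must remain inside the truncated context.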
We ran supervised finetuning for just 60 SGD steps with batch size 128. After this amount of training the largest Gopher model produces perfect verbatim quotes on held out data around 75% of the time, even without constrained sampling. We perform as few steps of supervised finetuning as possible in order to keep the inline-evidence-producing SFT model as close as possible to the raw language model, retaining its entropy and general capabilities. We therefore chose an aggressive early stopping criterion, ending training when validation set perplexity stopped improving on either our ELI5 Rated-Good development set or a NaturalQuestions development set formed from gold-truth targets (described in subsection 3.2 as “Gold + GoldEvidence”). This resulted in finetuning for only 60 steps, which was slightly less than a single epoch.
2.8 Reinforcement learning from human preferences
We follow the “Reinforcement Learning from Human Preferences” pipeline of Christiano et al. (2017), with a few small differences tailored to our setup explained below. Note that whilst we mirror and reference this work’s training setup in particular, reinforcement learning from human preferences has been developed for over a decade at time of writing, e.g. (Wirth et al., 2016; Schoenauer et al., 2014; Akrour et al., 2011) and a nice review in Wirth et al. (2017).
Following Christiano et al. (2017) we collect human assessments of model samples and train a “Reward Model” (RM) to predict human judgment of a binary pairwise preference outcome, using the standard cross-entropy loss. The reward model is a classifier predicting this binary variable indicating which example in a pair was preferred, given a question and Self-Supported Question-Answering (SQA) response string. Note that the RM does not receive the full document context as input, only the piece of evidence selected by a model, in order to maintain parity of interface between human users and the RM. In the event of a tie, this classification objective becomes maximum entropy, which is a slight variation on formula (1) in Christiano et al. (2017). The training set containing these human labels had 33,242 rated SQA response pairs, with questions drawn from Natural Questions, ELI5, and a few additional datasets in smaller number (Table 13). Appendix C contains further details on this training data.
We warm-start the RM from the pretrained 7B language model from the Gopher family (Rae et al., 2021) and add an extra final linear layer to predict the reward. The RM also predicts the binary Supported&Plausible judgements of the individual SQA responses as an auxiliary loss (that is, the auxiliary loss predicts only a single binary variable indicating whether or not a response is both supported and plausible). The final loss is the average of the pairwise preference prediction loss and the auxiliary prediction loss. We early-stop according to the preference prediction accuracy on a held-out validation subset of ELI5 which was rated by the researchers.
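The reward model objective described above can be sketched for a single comparison as follows. The preference term is the pairwise cross-entropy of Christiano et al. (2017) with a soft target of 0.5 for ties. How the auxiliary term is computed is an assumption here: the sketch uses a separate Supported&Plausible logit per response and averages the binary cross-entropy over the two responses. All function and argument names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(reward_a: float, reward_b: float, target_a: float) -> float:
    """Cross-entropy against P(a preferred) = sigmoid(reward_a - reward_b).

    target_a is 1.0 if a was preferred, 0.0 if b was, and 0.5 for a tie.
    """
    diff = reward_a - reward_b
    # -log sigmoid(diff) = softplus(-diff); -log sigmoid(-diff) = softplus(diff)
    return target_a * softplus(-diff) + (1.0 - target_a) * softplus(diff)


def binary_cross_entropy(logit: float, label: float) -> float:
    """Binary cross-entropy for a single logit and a 0/1 label."""
    return label * softplus(-logit) + (1.0 - label) * softplus(logit)


def reward_model_loss(reward_a, reward_b, target_a, sp_logit_a, sp_logit_b, sp_label_a, sp_label_b):
    """Average of the preference loss and the auxiliary Supported&Plausible loss."""
    pref = preference_loss(reward_a, reward_b, target_a)
    # ASSUMPTION: the auxiliary loss is averaged over the two responses in the pair.
    aux = 0.5 * (binary_cross_entropy(sp_logit_a, sp_label_a)
                 + binary_cross_entropy(sp_logit_b, sp_label_b))
    return 0.5 * (pref + aux)


tie_loss = preference_loss(0.0, 0.0, 0.5)
```

With a tie target of 0.5 the preference term is minimised when both responses receive equal reward, which is the maximum-entropy behaviour mentioned in the text.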
Using Reward Models for Reranking
We use reward model scores to rerank candidate responses. At inference time we draw N samples and select the one with maximal reward. We call such models “SFT + top@N” or “RL + top@N” (depending on the underlying generator).
This approach is similar to what is described as “Sample and Rank” in (Thoppilan et al., 2022).
Training against Reward Models with RL Fine-tuning.
We use RL to maximize the expected reward of the language model’s samples. We train the LM with synchronous advantage actor-critic (A2C; Mnih et al. (2016)). We follow the same training setup as in Perez et al. (2022), which we summarize again for completeness. We warm-start by initializing from the SFT model of subsection 2.7. To prevent RL from collapsing to a single, high-reward generation, we add a loss term that penalizes the KL divergence between the policy’s distribution over next tokens and that of its initialization (Jaques et al., 2017; Schmitt et al., 2018; Jaques et al., 2019; Ziegler et al., 2019). The final loss is a linear combination of the KL penalty and the A2C loss, each with its own weight. We vary the KL penalty strength, using decreasing values of the KL weight and thereby sacrificing diversity for expected reward. See Appendix F for further details.
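The shape of the combined objective can be illustrated with a small sketch for a single next-token distribution. The direction of the KL term (policy relative to initialization), the weight names, and the function names are assumptions for illustration; the A2C loss is taken as a precomputed scalar.

```python
import math
from typing import Sequence


def kl_divergence(policy_probs: Sequence[float], init_probs: Sequence[float]) -> float:
    """KL(policy || init) between two next-token distributions over the same vocabulary."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(policy_probs, init_probs)
        if p > 0.0  # terms with zero policy probability contribute nothing
    )


def rl_loss(
    a2c_loss: float,
    policy_probs: Sequence[float],
    init_probs: Sequence[float],
    kl_weight: float,
    a2c_weight: float,
) -> float:
    """Linear combination of the KL penalty and the A2C loss.

    ASSUMPTION: the two weights are passed in explicitly; lowering kl_weight
    trades diversity for expected reward.
    """
    return kl_weight * kl_divergence(policy_probs, init_probs) + a2c_weight * a2c_loss


same = kl_divergence([0.5, 0.5], [0.5, 0.5])
drifted = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

A policy that has not moved from its initialization pays no penalty, while one that has concentrated its mass on a few tokens pays a penalty that grows with the drift.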
2.9 Declining to answer
We investigate enabling the system to decline to answer a subset of input questions, e.g. returning the string “I don’t know” instead of a low-quality answer. We found that a global threshold on the reward model score worked well, falling back to “I don’t know” if the score falls below the threshold.
This setup could be described as “selective prediction” (also known as prediction with a reject option) (El-Yaniv et al., 2010; Geifman and El-Yaniv, 2017, 2019; Kamath et al., 2020). We study the selective prediction ability of our reward models compared to the agents’ likelihood in subsection 3.3.
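A global threshold of this kind takes only a few lines. In the sketch below the candidate responses are assumed to arrive already paired with their reward model scores; the function name and argument layout are illustrative.

```python
from typing import List, Tuple


def answer_or_abstain(
    scored_candidates: List[Tuple[str, float]],
    threshold: float,
    fallback: str = "I don't know",
) -> str:
    """Return the highest-scoring response, or the fallback if its score is below the threshold."""
    if not scored_candidates:
        return fallback
    best_response, best_score = max(scored_candidates, key=lambda pair: pair[1])
    return best_response if best_score >= threshold else fallback


candidates = [("answer A", 0.2), ("answer B", 0.9)]
confident = answer_or_abstain(candidates, threshold=0.5)
declined = answer_or_abstain(candidates, threshold=0.95)
```

Raising the threshold lowers coverage (fewer questions attempted) in exchange for higher quality among the attempted answers.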
2.10 Similarities and differences compared to recent work
Three closely related pieces of work have recently been released (Lazaridou et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022). We outline similarities and differences below.
From the user’s perspective: LaMDA (Thoppilan et al., 2022) shows just a URL as supporting evidence, putting the burden of fact verification on the user. GopherCite provides exact and succinct quotes supporting the claim. WebGPT links claims to quotes, and allows the model to link multiple supported claims into an answer that is assessed by raters. In contrast to that work, we specifically study the rate at which individual claims are supported.
Training data: Both WebGPT and LaMDA are trained from human demonstrations. In GopherCite we bootstrap from data generated by a few-shot prompted language model. Similarly to LaMDA and WebGPT, we draw many samples and use a reranking network to pick the model’s final response. In the LaMDA case, the system is fully supervised. In our case, the classifier used for reranking is a reward model predicting pairwise preference. Similarly to WebGPT, we apply Reinforcement Learning from Human Preferences to improve the quality of our supervised system. Lazaridou et al. (2022) do not do any finetuning and rely only on prompting.
Learning to query: LaMDA and WebGPT train agents to learn to query a search engine, and can query multiple times for a given input. We simply forward the user’s question to a search engine and condition upon the results, as in Lazaridou et al. (2022).
Information retrieval: LaMDA uses very short fragments returned by the query as the model conditioning (just Google snippets of 1-2 sentences, or knowledge graph relations). WebGPT forms its final response by conditioning a language model with a brief, well-curated context of multiple quotes. GopherCite conditions on much longer documents: it is trained on contexts of up to 4096 tokens and can draw upon contexts at least this long at inference time. Lazaridou et al. (2022) only condition a language model with a brief snippet extracted from the search results via a simple TFIDF baseline.
Abstention: We train GopherCite to always directly answer a question (see the “Informative” rate in Table 4, per the definition from Lin et al. (2021)). However, we can configure the frequency with which GopherCite declines to answer by setting the threshold on an acceptable score under the reward model. By contrast, WebGPT includes demonstrations of answers that dodge the question, allowing a kind of incremental abstention at the model’s discretion.
Results
Our primary evaluations for Self-Supported Question Answering (SQA) are conducted by asking paid contractors to assess model samples. We chose to evaluate this way due to the lack of ground truth targets for SQA across question-answering datasets. We evaluate using questions drawn from the Natural Questions (Kwiatkowski et al. (2019)) and ELI5 (Fan et al. (2019)) datasets. To keep the cost of human evaluation manageable, and avoid train-test overlap (Lewis et al., 2021b), we use subsets of test questions from these standard question-answering benchmarks. We filter the datasets to small subsets in the ways described below and refer to them as ‘filtered’ NQ/ELI5.
NaturalQuestionsFiltered: We filtered the NaturalQuestions (Kwiatkowski et al. (2019)) validation set to obtain a list of questions that are true holdouts, avoiding the train-test overlap described in Lewis et al. (2021b). Specifically, we filtered out validation set questions for which the question, answer, or ground truth Wikipedia document was contained in the training set, and required the question to have non-empty “short-answer” and “long-answer” fields. This left us with 307 questions. As the raters were allowed to skip questions, our human evaluation runs did not result in ratings of samples from every model for every question, even though we showed every sample to three raters. To enable apples-to-apples comparison, we report numbers on the set of questions for which every model of interest had a sample rated, arriving at 115 overlapping questions.
ELI5Filtered (Explain Like I’m Five): We wanted to have a human baseline that could be reasonably compared to GopherCite’s SQA responses (i.e. containing answer and evidence). We therefore filtered out questions where the top-rated Reddit answer did not contain a URL link. We also filtered out questions where the top search results linked to reddit.com/r/eli5, in order to avoid confounding good model performance with repeating a human answer. Additionally, we filtered out questions where the top Reddit answer was either extremely long or trivially short compared to the distribution of lengths in our model answers (we kept questions where the length of human answers fell between the 5th percentile and 95th percentile of model answer lengths). We selected 150 questions at random from this set and report the results for an overlapping subset of 121 for which we obtained ratings for all the ablations. This filtering strategy affects the difficulty of the dataset. Restricting to answers that contain references and are of limited length biases the questions towards being better posed and more likely to be answerable in a supported manner. However, it also makes the answers higher quality than the average ELI5 answer, increasing the competitiveness of the human baseline.
Our best models produce high quality supporting evidence for their factual claims. On short-answer questions drawn from the NaturalQuestionsFiltered dataset, our best model produces plausible and supported claims 80% of the time. On explanation-seeking questions from the ELI5Filtered dataset, the model produces plausible and supported claims 67% of the time. See Table 1.
Learning from human preferences improves GopherCite decisively over purely supervised baselines. Both reranking with a reward model, as well as reinforcement learning, significantly improve scores achieved by the models on both evaluation datasets, compared to purely supervised models trained on our Rated-Good samples. See Table 1 and Table 2.
Declining to answer substantially improves these numbers by answering only selected questions whilst still attempting a large majority. We use thresholds on reward model scores under which the model abstains from answering and emits the string “I don’t know”. This traces out a frontier of accuracy-if-attempted versus coverage, and allows us to reach >90% performance when attempting 70% of questions on NaturalQuestionsFiltered and >80% when attempting 70% of questions on ELI5Filtered. See Figure 4. This shows that our reward models provide a successful abstention mechanism and allow us to assess the system’s confidence in its own answers.
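The frontier of accuracy-if-attempted versus coverage can be computed from per-question scores and human ratings along the following lines. This is a sketch with hypothetical names; it sweeps the threshold down through the observed scores and does not treat tied scores specially.

```python
from typing import List, Sequence, Tuple


def coverage_accuracy_curve(
    scores: Sequence[float],
    is_good: Sequence[bool],
) -> List[Tuple[float, float]]:
    """Return (coverage, accuracy-if-attempted) points as the threshold is lowered.

    scores[i] is the reward model score of the response to question i, and
    is_good[i] says whether raters judged that response supported and plausible.
    """
    ranked = sorted(zip(scores, is_good), key=lambda pair: -pair[0])
    curve = []
    num_good = 0
    for attempted, (_, good) in enumerate(ranked, start=1):
        num_good += good
        curve.append((attempted / len(ranked), num_good / attempted))
    return curve


curve = coverage_accuracy_curve([0.9, 0.8, 0.3, 0.1], [True, True, False, True])
```

Each point answers the question "if the system only attempts its most confident fraction of questions, how often are those attempts good?".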
Our models show no improvements in truthfulness per the definition from TruthfulQA. Although the produced answers achieve high supported and plausible scores, they are rarely scored as truthful when compared against the “correct” answers in the TruthfulQA dataset of Lin et al. (2021). This is because the concept of an answer being “Supported” does not distinguish well between what is written down in some document (e.g. possibly talking about fictional worlds) and what is true in an absolute sense (subsection 3.7).
3.2 Human evaluation of response quality and preference to baselines
We jointly assess a model’s answer and accompanying inline evidence with a human evaluation of whether they are Supported and Plausible as defined in subsection 2.6. As a shorthand, we refer to this property of a response being both “supported” and “plausible” as “S&P”.
Whilst Self-Supported Question Answering – as we formalize it – is not directly attempted in the deep learning or NLP literature, we hand-craft baselines in various ways. In Table 1 we report the percentage of questions for which human raters assess the model’s response to be S&P.
For the NaturalQuestionsFiltered dataset, we compare to the gold answers and supporting evidence paragraphs, as well as other engineered baselines.
Gold + GoldEvidence. The claim is the “short answer” collected by the NaturalQuestion dataset annotators. The “supporting evidence” is the “long answer” (typically a paragraph, and always a span from a relevant Wikipedia article) that contains the information required to answer the question as determined by an annotator.
Gold + Random-Sentence. The “supporting evidence” is formed by choosing a random sentence from the dataset-provided Wikipedia document. This is a sense check but has nonzero performance due to some documents being very short.
Gold + First-Sentence. The supporting evidence is chosen to be the first sentence in the Wikipedia document containing the ground truth answer. Another sense check, but its nontrivial performance demonstrates how easy many questions from Natural Questions are. Similar baselines were surprisingly strong in past work on summarization (Ziegler et al., 2019).
Gold + TFIDF-Sentence. The supporting evidence is taken to be the closest sentence, in TFIDF vector-space, to the bag-of-words formed by concatenating the question and the ground truth answer. (Footnote 5: We also experimented with comparing to the question only and the answer only, and found perhaps unsurprisingly that querying evidence sentences using both question and answer performed the best. The inverse document frequency was estimated using the (entire) Wikipedia document alone, rather than some larger corpus.)
FiD-DPR. The output of FiD (Izacard and Grave, 2020) is used as the answer, and one of the 100-word retrieved passages (“chunks”) used to condition the model is shown to raters as the “evidence”. In particular, we pick the highest-ranked retrieved chunk which contains the answer as a substring. Due to the extractive nature of Natural Questions, one such chunk always exists on a model trained thoroughly on the Natural Questions training set (which is true of FiD). This answer-generating baseline is a state-of-the-art question-answering model at time of writing. For brevity, we limited baselines drawn from the question-answering literature to just this model, though it may also be worthwhile to consider comparing to a less standard objective such as that of Perez et al. (2019) in future work.
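As an illustration of the Gold + TFIDF-Sentence baseline described above, the following is a minimal sketch, not the authors' implementation. It assumes a simple lowercase word tokenizer, a smoothed IDF formula, and cosine similarity; the function names `tokenize` and `tfidf_best_sentence` are hypothetical. The inverse document frequency is estimated from the sentences of the single document only, as the footnote describes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_best_sentence(sentences, question, answer):
    """Return the sentence closest to (question + answer) in TF-IDF space."""
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    # Document frequency is counted over the sentences of this one document.
    df = Counter(term for bag in bags for term in bag)
    # Smoothed IDF (an assumption; the paper does not give the exact formula).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(bag):
        return {t: tf * idf[t] for t, tf in bag.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    query = vec(Counter(tokenize(question + " " + answer)))
    return max(sentences, key=lambda s: cosine(query, vec(Counter(tokenize(s)))))
```

Querying with the question and answer concatenated mirrors the configuration the footnote reports as working best.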
Whilst our supervised finetuned baseline models do not outperform the strongest of these baselines (FiD-DPR) when sampled from naively, reranking and reinforcement learning substantially improve GopherCite, going beyond the baselines and approaching the S&P quality of the ground truth data (Table 1(a)). We perform an ablation study on the number of candidates to rerank and base model size in subsection 3.5.
For the ELI5 dataset, there are no trusted “gold” answers with accompanying evidence. We therefore handcraft the following baselines:
Prompted Gopher with ROUGE evidence. The answer is produced by a few-shot prompted Gopher model, where the prompt contains truncated search results as conditioning for each question, and a claim without evidence as an answer (similar to Lazaridou et al. (2022)). The “supporting evidence” is formed by finding the closest-matching sequence of sentences (where is the number of sentences in the answer) in terms of ROUGE-L score against the answer. Such samples are drawn 8 times (for different top-8 search results) and the sample with the second-highest ROUGE-L match score is used, as this setup proved experimentally to achieve the highest human ratings on our development set. (Footnote 6: We note it may be surprising to the reader that we use one information retrieval baseline (TFIDF) for Natural Questions and another (ROUGE-L) for ELI5. We used the ROUGE score for evidence selection on ELI5 due to incidental software development convenience.)
Prompted Gopher with generated evidence. The answer is produced by a few-shot prompted Gopher model, where the prompt contains truncated search results as conditioning for each question, and the answer with evidence represented in our inline evidence syntax. The samples for new questions are then decoded using constrained sampling (Appendix J).
We find in Table 1 that humans determine our best model responses to be high-quality 80% of the time on our NaturalQuestionsFiltered validation subset, much more frequently than when using strong evidence baselines. The model’s responses are deemed high-quality 67% of the time on our ELI5Filtered test subset. Note that we use max-reward sampling @64 for NaturalQuestionsFiltered and @16 for ELI5Filtered; this is because these levels proved best according to the ablation study (Figure 6).
Preference versus human answers
Here we assess the quality of a model’s answers in terms of pairwise preference versus human baselines. When reporting these numbers, we split a tie between the model response and the human response, counting the example as half a point for each, as in prior work (Nakano et al., 2021), rather than e.g. filtering out ties. However, the reported pairwise preference numbers are not comparable to those of Nakano et al. (2021), due to the disparity in the question subset discussed in subsection 3.1 and the fact that distinct raters participated in different human evaluation protocols in that work and our own.
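The tie-splitting convention can be made concrete with a short sketch. This is an illustration only; the function name `preference_rate` and the string labels for outcomes are hypothetical.

```python
def preference_rate(outcomes):
    """Fraction of comparisons won by the model, counting each tie as half a point.

    `outcomes` is a sequence of "model", "human" or "tie" (assumed labels).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one comparison")
    points = {"model": 1.0, "tie": 0.5, "human": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)
```

Under this convention the model and human preference rates always sum to one, which would not hold if ties were filtered out and the rates were then reported against the full set.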
For NaturalQuestionsFiltered we compare against the Gold + GoldEvidence human baseline. Table 2 shows that the answer and evidence from our best SFT with Reranking model on NaturalQuestionsFiltered are preferred to golden answer and evidence 49.5% of the time (i.e. NQ gold answers are preferred 50.5%). Note that we use a different document corpus than that used by the gold-truth targets (the whole web rather than Wikipedia), and there is a time mismatch as NaturalQuestions uses Wikipedia from 2018.
For ELI5 we compare against the top-rated Reddit answers, filtered to just those answers which contain URL references (subsection 3.1). We describe exactly how the baseline and model are formatted into a single response when shown to humans in the Supplementary material subsection C.3. To ensure comparability in style between the model and human written answers, we flatten the (answer, evidence) model output into a single answer, using one of several templates that combine claims and evidence, chosen at random (e.g. {claim}\n\nAccording to the page "{title}":\n{quote}\n\n {url}).
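A minimal sketch of this flattening step is given below. Only the first template is taken from the text; the second is a hypothetical stand-in, since the full template set is not reproduced here, and the function name `flatten_response` is also an assumption.

```python
import random

TEMPLATES = [
    # Example template quoted in the text.
    '{claim}\n\nAccording to the page "{title}":\n{quote}\n\n {url}',
    # Hypothetical second template, included only to illustrate random choice.
    '{claim}\n\nSource: "{title}" ({url})\n{quote}',
]


def flatten_response(claim, title, quote, url, rng=random):
    """Merge a claim and its evidence into one free-text answer."""
    template = rng.choice(TEMPLATES)
    return template.format(claim=claim, title=title, quote=quote, url=url)
```

Passing a seeded `random.Random` instance as `rng` makes the template choice reproducible.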
Table 2 shows that when compared to top Reddit answers that contain URL references, the answers produced by our RL w/ Reranking model are preferred 42.9% of the time.
The preferences expressed by raters in this evaluation setting are often based on the answer’s structure rather than its content. One rater commented: “It was sometimes difficult to decide which answer more credibly answered the question, if they both seemed to provide the same or very similar information but expressed differently.” The model’s claims are also shorter on average than Reddit answers, despite the length-filtering.
In summary, when assessing model samples on the entirety of our test sets – when the model and baselines are required to attempt every question – they outperform our baselines in terms of S&P, but fall slightly short of human ground truth responses in terms of S&P scores and pairwise rater preference.
3 Declining to answer
We demonstrate that we can score produced answers to perform selective question answering (El-Yaniv et al., 2010; Geifman and El-Yaniv, 2017, 2019; Kamath et al., 2020). The system can select a subset of questions to decline to answer and substantially improve performance on the questions it does attempt. This results in a configurable system in which coverage – the percentage of questions attempted – can be traded off against the quality of responses when the system does attempt to answer.
We experiment with three scoring techniques for deciding which questions to answer and which questions to decline, given a candidate answer sampled from the system:
A global threshold on the reward model’s score.
A global threshold on the SFT generator’s likelihood for the generated sample.
A global threshold on the RL policy’s likelihood for the generated sample.
Figure 4 shows the resulting trade-off. Declining to answer some percentage of questions using the reward model results in higher Supported&Plausible human ratings on the resulting subset of attempted questions, and the reward model improves over the two likelihood baselines. The downward-sloping shape of the curve confirms that the reward model is successful at selective prediction: the smaller the proportion of questions attempted, the higher the quality of the answers amongst the questions that are attempted. We include ablations for further scoring approaches in Appendix G.
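The coverage versus accuracy-if-attempted frontier produced by a global score threshold can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the paper; the function names are hypothetical, and `scores` may be any of the three scoring signals listed above (reward model score, SFT likelihood, or RL policy likelihood).

```python
def coverage_accuracy_frontier(scores, is_sp):
    """Sweep a global threshold over `scores`, highest first.

    Returns a list of (coverage, accuracy_if_attempted) pairs, where `is_sp`
    holds the human Supported&Plausible judgement (truthy or falsy) for each
    answer. Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, is_sp), key=lambda p: p[0], reverse=True)
    frontier = []
    good = 0
    for attempted, (_, label) in enumerate(ranked, start=1):
        good += int(bool(label))
        frontier.append((attempted / len(ranked), good / attempted))
    return frontier


def answer_or_decline(answer, score, threshold):
    """Emit the answer only if its score clears the global threshold."""
    return answer if score >= threshold else "I don't know"
```

Each point on the frontier corresponds to one choice of threshold, so an operating point such as 70% coverage is selected simply by reading off the matching score.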
With our best performing decline-to-answer strategy of declining below a fixed RM score we can substantially improve answer quality, outperforming the S&P score of the human baseline which attempts every question in the case of NaturalQuestionsFiltered. We leave to future work comparing the selective prediction of models to selective prediction by humans themselves.
4 Qualitative examples
Table 3 shows examples of questions from the NQ and ELI5 datasets alongside the model’s outputs: claims and supporting evidence, and ratings according to the Supported and Plausible scheme. In this table the samples are curated to illustrate success and failure modes; see Appendix B for a larger set of non-curated examples, and Appendix I to examine samples alongside annotators’ assessed ratings on the entirety of our test sets.
5 Ablation of RL and SFT w/ Reranking
We investigated how model performance varied with the number of samples considered in choosing the top-reward answer. Increasing the number of samples poses a trade-off, as it is likely to improve system performance but also increases inference time. We also compare supervised finetuning (SFT) with reranking against reinforcement learning (RL) with reranking.
Reranking with a reward model dramatically improves performance over SFT, but we see diminishing returns in the number of samples, similar to the observation in Cobbe et al. (2021).
Reinforcement learning dramatically improves performance over naive SFT or RL agent decoding with a single sample.
In the reranking regime, RL is outperformed by SFT, as observed in Nakano et al. (2021). We offer hypotheses as to why this is the case.
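Reranking with a reward model (max-reward sampling @n) amounts to drawing n candidates and keeping the one with the highest reward. The sketch below assumes hypothetical callables `sample_fn` (the SFT or RL generator) and `reward_fn` (the reward model); neither signature is taken from the paper.

```python
def rerank(question, sample_fn, reward_fn, n):
    """Draw `n` candidate answers and return the one the reward model scores highest."""
    candidates = [sample_fn(question) for _ in range(n)]
    return max(candidates, key=lambda c: reward_fn(question, c))
```

Inference cost grows linearly in `n`, which is the trade-off against answer quality discussed above; if the generator returns near-identical samples, increasing `n` yields little benefit.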
Figure 5 shows that without reranking RL outperforms SFT on both datasets. However, the benefit is less clear when combining the generator models with reranking. In the case of NaturalQuestionsFiltered (6(a)), SFT + top@64 achieves higher S&P rates than RL + top@64. For ELI5 however, RL outperforms SFT consistently for all numbers of candidates.
Figure 6 breaks down S&P into separate Supported and Plausible curves vs. the number of samples used for reranking. For NaturalQuestionsFiltered where many answers are extractive and often simply give a single named entity, the Plausible rate is around 90%. The S&P score in this regime is upper bounded by the Supported rate. ELI5 requires longer, non-extractive and explanatory answers, and plausible rates are lower and decline as the number of candidate samples grows for the RL model.
We hypothesise that the under-performance of RL on NaturalQuestions may be caused by the following factors.
Over the course of RL finetuning we lose the diversity of sampled answers. This diminishes the benefit of drawing a large number of samples to pick the best one from.
During RL finetuning, we use a higher proportion of the ELI5 questions (Appendix F), as we want to prioritize non-extractive question answering ability. This biases the RL pass to be tuned toward ELI5 more so than toward NaturalQuestions.
6 Ablation of model scale
We investigate how model performance varies with the number of model parameters, and the number of candidates used for reranking. Figure 7 shows that scaling the Supervised Fine-tuning generator brings clear improvements in both the Supported&Plausible scores as well as the Preference judgements. Across the board, our strongest model is the largest 280B member of the Gopher family.
7 Misalignment between “Supported” and “True”
When evaluated on the TruthfulQA benchmark (Lin et al., 2021), GopherCite achieves high Supported&Plausible results but does not score well in the Truthful&Informative objective defined for the dataset: “A claim counts as true if it describes the literal truth about the real world”, and informative if it “provid[es] information that reduces uncertainty raised by the question” (emphasis our own, see numbers in Table 4).
We evaluated the same set of model answers in two ways. First, using our standard QA evaluation app, where the raters assessed whether (answer, evidence) pairs were Supported and Plausible (subsection 2.6). Second, via a separate evaluation app that assessed the answer in isolation (without evidence) with Truthful and Informative scores as defined by Lin et al. (2021). To avoid putting the responsibility of external research on the raters, the candidate answers were presented alongside the suggested correct and incorrect answers taken from the TruthfulQA dataset. The results (Table 4) demonstrate that a high score on our metrics is compatible with a low score on the TruthfulQA metrics.
Qualitative examples in Table 5 illustrate the misalignment between these metrics. The literally false claims are, in a way, "supported", because the evidence is speaking metaphorically, satirically, or refers to a fictional world. Although our instructions to raters refer to truth, the training data did not deal with such edge cases, meaning this type of error did not surface in data quality assessments. The comments provided by the raters alongside these ratings suggest that they were aware of the nuance and could see the claims were not ’true’, so more attention on this point in further work could reduce the disparity. More broadly, better coverage of edge cases could potentially be achieved using adversarial data generation techniques such as red teaming (Perez et al., 2022).
Another, deeper, problem is that the SQA format is adversarial to such examples: if there is no document found by Search which states that Red Bull cannot cause wings to grow, it is very difficult to produce a true response with supporting evidence, as there is nothing to quote from (cf. the “percentage of the brain” case, where there exist articles debunking the common misconception). In contrast, it is possible to provide answers which, although not true, are close enough to the intuitive meaning of “supported” that a rater could justify labelling them such. Thus, the SQA setting incentivises incorrect interpretation of the instructions. This underscores the importance of viewing evidence quoting as just one component of a truthful agent – this problem could be alleviated in a richer setting, e.g. where a model is permitted to make arguments grounded in common sense.
Discussion
We view inline evidence as one tool meant to be used alongside other techniques to achieve truthful LM outputs. Some limitations of using it in isolation have been discussed in subsection 3.7. In this section, we detail further limitations of evidence quoting if it were used on its own, and suggest how enriching the setting can help.
Errors in the supporting document corpus. Our implementation uses webpages returned by Google Search, which can include unreliable sources. A complete approach must account for fallible sources. But it is not feasible to simply implement a trusted allowlist over sources: even relatively high-quality corpora like Wikipedia can contain errors or be biased (Hube, 2017; Martin, 2018), and no curated allowlist would cover all claims of interest. As determining the reliability of a source is itself a challenging task, augmenting the setup with a way to help the human make good judgements using techniques like amplification (Christiano et al., 2018), recursive reward modelling (Leike et al., 2018), or debate (Irving et al., 2018) may offer a promising way forward.
Explicit reasoning. GopherCite only uses a single quote to support an answer. Some claims require multiple pieces of evidence, and/or an argument for why the claim follows from evidence. This is an exciting area for followup work.
Misleading or cherry-picked quotations. Inline evidence does not rule out claims supported by cherry-picked evidence. For example, citing a study could seem convincing if one does not know that several other studies could not replicate its results. An adversarial agent which selects evidence against the claim (as in debate; Irving et al. (2018)) could help detect instances of cherry-picking, by using additional quotes to falsify misleading quotes.
Contentious claims. A special case of cherry-picked evidence is presenting one view as if it was true, in cases where no accepted societal consensus exists. A related failure mode is presenting a consensus or majority opinion as if it was fact (Weidinger et al., 2021). (Footnote 7: Standard approaches to improving dataset quality can exacerbate this by collapsing a diversity of opinions to the majority vote (Aroyo and Welty, 2015).) While adversarial agents could alleviate this to some extent by pointing out that the claim is contentious, adequately addressing this challenge will likely require dedicated sociotechnical research.
Not every claim can be supported with a concise quotation. Some facts may not be supportable by brief quotations, even if they follow from information in the corpus, if the claim itself does not appear. One example is negative evidence: naively supporting "No US President was ever an AI researcher" would require enumerating the list of occupations of all US presidents. Another is statistical claims, like "less than 30% of Wikipedia articles about plants contain the word ‘foam’". While negative evidence can be addressed with Debate—the claim is supported if the adversary fails to provide evidence to the contrary—statistical claims require stronger protocols.
Conclusion
Language models often produce hallucinated facts, and are trustworthy only once the answers are independently verified. Our work addresses this challenge by moving from free-form question answering to self-supported question answering, thus enabling the model itself to assist human users and raters in verifying its outputs. We break the task into two pieces, one mechanical and one human: special syntax that can be automatically parsed to ensure that a quote is verbatim from a source, and human preferences to determine whether the quote supports the claimed answer. Reward modelling using these human ratings shows dramatic improvement when used for reranking responses and as a target for reinforcement learning. Moreover, reward modeling provides a natural mechanism to abstain from answering when we lack confidence in an answer. Overall the GopherCite system is able to provide samples with high quality evidence, or abstain. These successes notwithstanding, our inline evidence mechanism is just one tool towards trustworthy language agents, and significant research will be required to address its limitations and combine it with other tools.
Acknowledgements
The authors wish to thank Sebastian Borgeaud, Trevor Cai, Vlad Firoiu, Saffron Huang, Timo Ewalds, George van den Driesche, Roman Ring, Arthur Mensch, Jordan Hoffmann, Laurent Sifre, and Jean-Baptiste Lespiau for their contributions to DeepMind’s language modelling software ecosystem, and particularly Doug Fritz for developing a frontend framework with which our human evaluation apps were built. Katie Millican built the text scraper we used to preprocess all documents. Thanks to Phoebe Thacker for early work setting up our human evaluation platform, and Boxi Wu for additional program management support. Additionally, we thank Jonathan Uesato, Nando de Freitas, and Oriol Vinyals for valuable feedback on the paper and Angeliki Lazaridou, Elena Gribovskaya, Jack Rae, Charlie Nash, Ethan Perez, Oriol Vinyals, Aaron van den Oord, Simon Osindero, Marc Deisenroth, Felix Hill, Ali Eslami, Iason Gabriel, Laura Weidinger, John Mellor, and Lisa Anne Hendricks for insightful discussions. We also wish to thank Arielle Bier, Max Barnett, Emma Yousif, and Aliya Ahmad for improving the writing in our public communications.
Author Contributions
GopherCite’s training scheme was designed and developed by Jacob Menick, Maja Trebacz, Vladimir Mikulik, Nat McAleese, and Geoffrey Irving.
The “Inline Evidence” approach was proposed by Vladimir Mikulik and Nat McAleese.
The evaluations were designed by Jacob Menick, Maja Trebacz, Nat McAleese, Vladimir Mikulik, and Martin Chadwick.
The evaluations were executed and analysed by Maja Trebacz, Jacob Menick, Vladimir Mikulik, and Nat McAleese.
The execution of generator agent training was performed by Maja Trebacz, Jacob Menick, Nat McAleese, and Francis Song.
The human evaluation web apps were designed and built by Vladimir Mikulik, Maja Trebacz, Nat McAleese, John Aslanides, and Martin Chadwick.
Human data quality monitoring and improvements were led by Vladimir Mikulik and Martin Chadwick, with Jacob Menick, Maja Trebacz, and Nat McAleese contributing golden labels for quality control.
Human participant ethics standards were upheld by Lucy Campbell-Gillingham and Martin Chadwick.
Reward model training was designed and executed by Vladimir Mikulik, Maja Trebacz, and John Aslanides.
The RL environment was created by Vladimir Mikulik, John Aslanides, and Francis Song, with search engine integration by Maja Trebacz.
The RL training infrastructure was developed by Francis Song, John Aslanides, Nat McAleese, Mia Glaese, Vladimir Mikulik, and Jacob Menick.
Broader large-scale language model finetuning infrastructure was developed by Mia Glaese, Nat McAleese, John Aslanides, Jacob Menick, and Francis Song.
The constrained sampling implementation was prototyped by Geoffrey Irving and developed by Vladimir Mikulik and Jacob Menick.
The project was managed by Susannah Young.
The paper was written by Jacob Menick, Maja Trebacz, Vladimir Mikulik, Nat McAleese, Geoffrey Irving, and Martin Chadwick.
Nat McAleese and Geoffrey Irving supervised the project, and Jacob Menick was accountable for its outcome.
References
Appendix
Appendix A Retrieval and Truncation Details
Given a question, we obtain documents that are likely to contain an answer to the question. We use the question directly as a query to the Google Search API, with additional keywords to restrict the sites. For a large portion of the NaturalQuestions data, we restrict the site to Wikipedia only by appending site:wikipedia.org to the query. For ELI5 we ensure that the results do not contain Reddit answers themselves by appending -site:reddit.com.
We retrieve top- search results (with and obtain the web data in text-only format using the custom HTML scraper from Rae et al. (2021).
Document lengths vary and often exceed the language model’s maximum token memory of 2048. Especially in the case of few-shot prompting, when presenting multiple documents at once, we need to heavily restrict the number of tokens spent on the document content. Hence, we truncate the documents using the snippets of web content returned by the Search API along with the URLs. We match the snippet position inside the scraped document using fuzzy matching from the fuzzywuzzy library (https://pypi.org/project/fuzzywuzzy/). Using the found indices, we truncate the document to a fragment of length max_tokens that contains the relevant search snippet. We discard any documents where the match ratio of the snippet to the document is below a threshold of 0.75 (sometimes the snippet comes from a structured part of the site that was removed during scraping, or the site has gone out of date). We also ensure that the truncated fragment starts from the beginning of a sentence or paragraph. At train time we choose this start position at random to increase the variety of the inputs. At inference time, we allow a maximum of 500 characters before the start of the snippet fragment, and look for the first sentence start in that range.
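The inference-time truncation can be sketched as below. This is a simplified illustration under several assumptions: it uses the standard-library `difflib.SequenceMatcher` in place of fuzzywuzzy, measures lengths in characters rather than tokens, and detects sentence starts with a simple punctuation regex. The function names and the `max_chars` parameter are hypothetical.

```python
import re
from difflib import SequenceMatcher


def locate_snippet(document, snippet, min_ratio=0.75):
    """Return the offset where `snippet` approximately starts, or None if the match is poor."""
    matcher = SequenceMatcher(None, document, snippet, autojunk=False)
    match = matcher.find_longest_match(0, len(document), 0, len(snippet))
    if match.size == 0:
        return None
    # Align the snippet with the document around the longest shared substring.
    start = max(0, match.a - match.b)
    window = document[start:start + len(snippet)]
    ratio = SequenceMatcher(None, window, snippet, autojunk=False).ratio()
    return start if ratio >= min_ratio else None


def truncate_around_snippet(document, snippet, max_chars, max_lead=500):
    """Return a fragment of at most `max_chars` characters containing the snippet."""
    snippet_start = locate_snippet(document, snippet)
    if snippet_start is None:
        return None  # Document is discarded: the snippet could not be matched.
    lead_start = max(0, snippet_start - max_lead)
    if lead_start == 0:
        start = 0  # The beginning of the document counts as a sentence start.
    else:
        # First sentence boundary within the allowed lead-in range.
        boundary = re.search(r"[.!?]\s+", document[lead_start:snippet_start])
        start = lead_start + boundary.end() if boundary else snippet_start
    # Fall back to the snippet start if the lead-in would push the snippet out.
    if snippet_start + len(snippet) - start > max_chars:
        start = snippet_start
    return document[start:start + max_chars]
```

A train-time variant would pick the start position at random among the valid sentence starts rather than taking the first one.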
Appendix B More Examples of Model Success and Failure
Appendix C Human data collection details
We designed a rating interface that presents claims and supporting evidence in a way similar to how they would be displayed by a dialogue agent. The claim is shown in a blue box representing a chat message, and the evidence is shown in a grey call-out box presenting the page title and the quoted fragment.
We do not show the URL of the page, as we encourage raters to assess the provided evidence in isolation, rather than in the context of internet research. In order to be rated as ’supported’, the quoted evidence should be sufficient to validate the correctness of the claim. The raters are instructed to only use information from this app.
Our raters consist of research participants using a crowd-sourcing platform. We restrict the participant pool to participants located in the UK, with English as their first language and a minimum education level of an undergraduate degree.
To ensure high quality of the ratings, we used the following two strategies:
Super rater filtering: We used a simple quality assurance screening experiment to select the raters that understood the task and had high agreement with ourselves (the researchers). Raters who met a high enough bar for agreement on Supported, Plausible, and Preferred were kept in a “super raters” pool which we incrementally grew over the course of the project. Our super rater filtering threshold was set to 85% agreement with our own ratings (excluding ties) for both the supported and plausible judgments, and 85% agreement with overall preferences. In the first round of super rater sourcing, we took a set of 20 comparisons on Natural Questions train questions rated by the researchers. A set of 100 crowdsourced raters provided their own ratings for the same set of 20 comparisons, of whom 24 met the requirements to be added to the super rater pool. We ran a further three such sourcing experiments in total, and collected a final pool of 113 super raters. All members of the super rater pool were repeatedly asked to take part in further data collection experiments to provide both training and evaluation data across all of the experiments (excepting those experiments which used a wider crowdsourced pool of raters).
Attention checks: Each time any rater took part in a new experiment, we provided them with clear instructions and task examples. Following this, we introduced a short comprehension and attention check to ensure that the raters had a minimal understanding of the task requirements. We handcrafted four examples where the correct evaluation choices should be easy if a rater has correctly understood the task. For each example, the rater provided an answer, after which the ’correct’ answers were revealed with some associated justification. This pre-task component fulfilled two roles: first, to provide further training to the raters before starting the main experiment; second, to screen out raters who answered too many of these easy questions incorrectly. Specifically, the data from participants who did not correctly answer at least 3/4 of these screening questions was discarded. The pass rate for this screening component was around 85% (with some variability between the specific tasks).
We used the super raters for collecting all of our training data and for Natural Questions validation (for consistency with earlier evaluation in the project). For ELI5 and TruthfulQA evaluations, we opened the study to a wider pool of new raters, ensuring no overlap with the super rater pool, and used attention checks to filter for rating quality.
In order to provide an additional degree of robustness in our human evaluations, we had every example rated by multiple independent raters, and in each case took the majority vote answer. When judgments were tied, the label with the smaller index was returned (following https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html), i.e. when there was a tie in the Supported&Plausible binary judgement, the smaller-valued label was returned. When running the evaluation with the super rater pool, each example was scored by 3 independent raters. When running the experiments with the wider pool of raters, we had each example scored by 6 independent raters, to allow for the fact that some data would later be filtered out by failed attention checks. This approach ensured at least a reasonable level of item repetition even given this attrition.
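The aggregation rule can be sketched in a few lines. This is a standard-library illustration that mimics the smallest-value tie-breaking of `scipy.stats.mode` rather than calling it; the function name `majority_vote` and the encoding of judgements as 0 and 1 are assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; on a tie, return the smallest tied label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)
```

With 6 raters per example, ties are possible whenever an even number of ratings survives the attention-check filtering, which is where the tie-breaking rule matters.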
C.2 Rating instructions
The complete wording of instructions is shown in Table 8.
C.3 ELI5 vs Reddit evaluation
In order to compare model-generated answers with the ELI5 human answers in as fair a way as possible, we take the following approach:
Select a subset of ELI5 human answers which directly cite supporting evidence, making these answers more directly comparable to the model samples, which always use supporting evidence.
Adapt the model samples to have a freetext form, merging the claim and evidence inline. This is to make them more similar to the subset of ELI5 answers.
Use an adapted version of the comparison app that shows only a single text box without a separate evidence box (Figure 9). We still ask for ‘plausible’ ratings, but no longer ask for ‘supported’ ratings, as this no longer makes sense in the context of the free-text. We still ask for overall preferences. The complete wording of instructions is shown in Table 9.
We select only those questions for which top-rated answers include URLs to non-ELI5 reddit citations. We found that answers referencing other ELI5 answers often did not provide self-contained answers, so this filter made the human baseline more comparable to the model outputs.
We filter out answers which were subsequently edited, as these frequently refer to other comments in the reddit thread.
We filter the answers to have a length between 316 and 1313 characters, which are the 5th and 95th percentiles of the combined claim, quote and title length of our model answers on the ELI5 train set.
The citation formatting is standardized in such a way as to directly match the format used in the model samples - this increases the comparability of the two types of answer.
To make the style of the model answers similar to Reddit posts, we combine the claim and evidence into a single string. We use one of the following templates, drawn at random:
C.4 TruthfulQA evaluation
To evaluate the generated model samples on the TruthfulQA dataset using the truthful and informative definitions from Lin et al. (2021), we took the following approach:
Evaluated only the claim parts of the model output. This was to match the length and form of the expected dataset answers. Moreover, we wanted to assess whether teaching the model to support its answer with quotes actually results in the claims themselves being more truthful.
Designed an adapted version of the app interface (see Figure 10). The app displays the trusted correct answers and incorrect answers in the interface. This is to help the raters identify misconceptions without the need for additional research.
Wrote instructions following the definitions in Lin et al. (2021). The complete wording of instructions is shown in Table 10.
Showed the raters a tutorial with four examples, and included four attention checks consisting of simple rating questions. Filtered out the data from the raters that did not pass the attention checks.
Appendix D SFT training and evaluation details
To finetune the 280B parameter Gopher model, we train for 60 steps with Adafactor, batch size 128, and learning rate . To fit the model in TPU memory, we shard the model over 128 TPU v3 cores, rematerialize activations every 3 transformer blocks, freeze the embedding layers and train in low precision using bfloat16 and stochastic rounding (Gupta et al., 2015).
D.2 Training data
As training data, we use answers that were rated as both Plausible and Supported during the human evaluations. The questions come from the train splits of the QA datasets. We stopped training after 60 steps. During training, the model saw 5151 unique (question, answer) pairs. The distribution between the datasets is presented in Table 11.
Page: {title_i} {source_i} } %
For 1/3 of the data, use just a single document in the context, the same document that was used in the answer target, enforcing that the target quote is present inside the prompt.
For 2/3 of the training data, use multiple documents in the context, with the number of documents drawn at random between 1 and 5. Enforce that the target document and quote are present in the prompt. The rest of the documents should be other top Google Search results for the given question. The order of the documents in the prompt is shuffled.
Truncate the documents so that the total token length of the prompt does not exceed MAX_MEMORY_LEN - MAX_SAMPLE_LEN = 3840. The token length allowance is split at random between the documents included in the prompt (chosen by firstly drawing for documents and then setting ).
Truncate each of the documents to its share of the token allowance in a way that ensures that the truncated fragment contains the snippet of interest. For Google Search documents, we ensure the presence of the snippet returned by an internal search API. For target sources, we ensure the presence of the quote. Additionally, we choose the start of the truncation to be the start of a sentence. The choice is randomised, but preserves the requirements on length and presence of the snippet.
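The budget split and truncation steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_budget`, `truncate_window`) are hypothetical, documents are represented as plain token lists, and the random split shown here (sorted random cut points) is an assumption standing in for the paper's own scheme.

```python
import random

def split_budget(num_docs, total_budget, rng):
    """Split total_budget tokens among num_docs documents at random.

    Assumption: random cut points over the budget; the paper's exact
    sampling scheme is not reproduced here.
    """
    cuts = sorted(rng.randint(0, total_budget) for _ in range(num_docs - 1))
    edges = [0] + cuts + [total_budget]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

def truncate_window(tokens, snip_start, snip_end, budget, rng, sentence_starts=()):
    """Return at most `budget` tokens that contain tokens[snip_start:snip_end].

    If any sentence start (a token index) is a valid window start, one of
    those is chosen at random; otherwise any valid start is used.
    """
    length = min(budget, len(tokens))
    lo = max(0, snip_end - length)              # earliest start that still covers the snippet end
    hi = min(snip_start, len(tokens) - length)  # latest start that still covers the snippet start
    if lo > hi:
        raise ValueError("snippet does not fit in the token budget")
    candidates = [s for s in sentence_starts if lo <= s <= hi]
    start = rng.choice(candidates) if candidates else rng.randint(lo, hi)
    return tokens[start:start + length]
```

A budget of 0 can be drawn for a document under this split, in which case `truncate_window` raises; a real pipeline would need a minimum allowance per document.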
During the training of generators and the inference phase, we use a templated prompt from Table 12 that presents a document (or multiple documents), followed by the question and answer cue. The target response should then follow the syntax described in subsection 2.1.
Appendix E RM training and evaluation details
We trained multiple generations of 1.4B and 7B reward models. These were initialized from the corresponding pretrained Gopher models (Rae et al., 2021).
We use a total batch size of 256 for both model sizes, with four-way model parallelism in the case of the 7B parameter RM (Shoeybi et al., 2019). We swept over a few learning rate schedules with linear warmup and cosine anneal, varying the peak learning rate, cosine cycle length and number of warmup steps.
We validate the reward models by assessing performance on a smaller mixture of the datasets described above, taken from the validation splits. The ratings come from both researchers and raters. We select the best RM by observing the validation accuracy of predicting the rating preference, as well as by plotting the receiver operating characteristic (ROC) curves of the supported&plausible predictions on the validation dataset.
E.2 Training data
The majority of training data for the reward model comes from the human ratings we collected: comparisons on the train set questions from the 4 popular QA datasets. The exact counts of comparisons used are presented in Table 13.
We additionally augment the RM training set with a portion of fabricated comparisons transformed from the supported and refuted claims of the fact checking dataset FEVER (Thorne et al., 2018). Including data transformed from FEVER aims to provide an additional out-of-distribution mode of question answering that is non-extractive, and to make the reward model better at verifying the supportiveness of the evidence. The FEVER dataset is not designed for the question answering task. Instead it contains claims generated by altering sentences extracted from Wikipedia. Human labelers classified them as Supported, Refuted or NotEnough and marked associated evidence. To transform such claims into examples of questions with comparisons of answers, we use the following techniques:
Type A: Generate questions by direct templating operations on claims (e.g. ’{claim}?’, ’Is it true that {claim}?’, ’Is it correct to say that {claim}?’, ’{claim}. Do you agree?’). The examples compare an affirmative answer like ’Yes’, ’This is correct’, ’It is true’ combined with a supporting quote, and a negative answer combined with the same quote. If the original claim was supported then the affirmative answer is marked as preferred, supported and plausible. Otherwise the negative one is.
Type B: Transform claims into questions using few-shot Gopher. For example a claim Roman Atwood is a content creator. would be transformed into Who is Roman Atwood?. As a comparison we use one answer being a FEVER claim (with supporting quote) and a direct negation of the claim produced via templating (e.g. ’It is not true that {claim}’). If the original claim was supported then the answer containing the claim is marked as preferred, supported and plausible. Otherwise the negated claim is marked as preferred.
Type A2: Same as type A but the examples compare yes/no type answer (with supporting quote) and same answer (with the fake quote generated from random sentences).
Type B2: Same as type B but the examples compare one claim with a supporting quote to the same claim with a fake quote generated from random sentences.
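As an illustration of the Type A transformation, the following minimal sketch builds one comparison from a FEVER-style claim. The question templates and affirmative answers are those quoted above; the negative answer wordings, the function name `make_type_a` and the output dictionary layout are assumptions made for this example.

```python
import random

QUESTION_TEMPLATES = [
    "{claim}?",
    "Is it true that {claim}?",
    "Is it correct to say that {claim}?",
    "{claim}. Do you agree?",
]
AFFIRMATIVE = ["Yes", "This is correct", "It is true"]
# Assumed wordings: the text only lists the affirmative answers.
NEGATIVE = ["No", "This is incorrect", "It is not true"]

def make_type_a(claim, quote, supported, rng):
    """Build a Type A comparison: affirmative vs. negative answer, same quote."""
    text = claim.rstrip(". ")  # drop the trailing period before templating
    question = rng.choice(QUESTION_TEMPLATES).format(claim=text)
    affirmative = {"claim": rng.choice(AFFIRMATIVE), "quote": quote}
    negative = {"claim": rng.choice(NEGATIVE), "quote": quote}
    return {
        "question": question,
        "answers": [affirmative, negative],
        # Index of the preferred answer: affirmative if FEVER marked the claim supported.
        "preferred": 0 if supported else 1,
    }
```

Type A2 could reuse the same function with the second answer's quote swapped for a fake quote built from random sentences.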
We verify the generation process by rating 50 comparisons ourselves and measure that the agreement with the automatically assigned preference judgements is on the level of 87%.
Appendix F RL training and evaluation details
During reinforcement learning, we use the same prompt template as used during supervised finetuning, shown in Table 12.
We use the same training setup as Perez et al. (2022). We train the 280B A2C policy using Adafactor (Shazeer and Stern, 2018), a learning rate of , an effective batch size of , and L2 norm gradient clipping to a max norm of . To reduce memory usage, we freeze the first 60% of the weights (48/80 transformer layers) to the pretrained values (note the difference from the setup in Perez et al. (2022), where 80% of the layers are frozen), share parameters between policy and value functions, and train with reduced precision using bfloat16 and stochastic rounding (Gupta et al., 2015). The value function predicts the final reward (without discounting) at each token. We implement the value function as an MLP with two hidden layers of size 2048, which takes as input the final transformer representation at each timestep. We shard the networks across 128 TPU v3 machines.
We additionally introduce a bad syntax penalty that is subtracted from the value function of the sample if it fails any of the mechanistically checked criteria, i.e. if it falls into one of the following error cases:
Malformed quote: if the produced example cannot be parsed according to the syntax described in subsection 2.1, or if the quote contains any of the reserved syntax.
Wrong title: if the title used in the syntax is not one of the document titles from the prompt.
Wrong quote: if the quote does not match the full source verbatim (up to lowercasing).
Empty claim: if there is no claim provided.
Empty quote: if there is no quote provided.
Short quote: if the quote is shorter than min_quote_length.
This was required even with constrained sampling due to the implementation being a prototype, and not feature complete. We train for a total of 520 steps, with a total batch size of 64 (32 batch size per core). We compare the values of bad syntax penalty of and (selecting ), and A2C teacher KL weight of and (selecting ).
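The error cases listed above can be checked mechanically. The sketch below is illustrative only and is not the authors' code: it assumes the syntax has the form `%<claim>%(title)%[quote]%` (consistent with the delimiters listed in Appendix J), measures quote length in characters, and uses a hypothetical default for `min_quote_length`. The function name `syntax_error` and the returned error labels are likewise made up for this example.

```python
import re

# Assumed form of the inline evidence syntax: %<claim>%(title)%[quote]%
PATTERN = re.compile(r"%<(.*?)>%\((.*?)\)%\[(.*?)\]%", re.DOTALL)
RESERVED = ("%<", ">%", "%(", ")%", "%[", "]%")

def syntax_error(sample, documents, min_quote_length=5):
    """Return the name of the first error case hit by `sample`, or None.

    `documents` maps each document title in the prompt to its full text.
    `min_quote_length` is a hypothetical default, counted in characters.
    """
    match = PATTERN.fullmatch(sample.strip())
    if match is None:
        return "malformed_quote"
    claim, title, quote = (group.strip() for group in match.groups())
    if any(marker in quote for marker in RESERVED):
        return "malformed_quote"
    if not claim:
        return "empty_claim"
    if not quote:
        return "empty_quote"
    if title not in documents:
        return "wrong_title"
    if quote.lower() not in documents[title].lower():  # verbatim up to lowercasing
        return "wrong_quote"
    if len(quote) < min_quote_length:
        return "short_quote"
    return None
```

A penalty would then be applied whenever `syntax_error` returns something other than `None`.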
F.2 Training data
During the 520 steps of training, i.e. 16640 episodes, the model saw 24371 unique questions. The prompts either contained a single document or a random number of documents, up to 5. The proportion of datasets used was 1:4 between the NaturalQuestions and ELI5 train splits.
Appendix G Decline to answer ablations
Appendix H Prompt templates for bootstrapping
In order to bootstrap the question answering ability, we use few-shot prompting with example answers. The final prompt is formed by taking a preamble with examples like in Tables 14 and 15 and appending “Question: {question}\nAnswer:”.
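A minimal sketch of this prompt assembly follows. The helper name `build_bootstrap_prompt` is hypothetical, and the preamble is assumed to be a single string that already contains the few-shot examples.

```python
def build_bootstrap_prompt(preamble, question):
    """Append the question and answer cue to a few-shot preamble."""
    return preamble + "Question: {question}\nAnswer:".format(question=question)
```

The model's continuation after the final "Answer:" cue is then taken as the sampled response.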
For the NaturalQuestions dataset, we use few-shot examples with targets directly written in our desired syntax. We draw the examples at random from a set of 5 hand-written examples. We present an example prompt in Table 14.
As the ELI5 responses are longer and less extractive, we experimentally found that it is better to split the answering process into two parts. We first use few-shot prompting to generate the claim and then a mechanistic process to get evidence from the conditioned document. In Table 15 we include the complete prompt with examples used to elicit responses for the ELI5 questions.
The few-shot examples in the prompt teach Gopher to produce decent supporting evidence, but it was difficult to use this mechanism to make quotes verbatim, especially when they were longer. We therefore resorted to constrained sampling during prompted generation, as described in Appendix J.
Appendix I Released model samples from ELI5 And NatQs test sets
Full samples on our NaturalQuestionsFiltered and ELI5Filtered test sets, along with the ratings assessed by annotators, can be found at these URLs: https://dpmd.ai/GopherCite-NaturalQuestions, https://dpmd.ai/GopherCite-ELI5.
Appendix J Constrained sampling details
We mentioned in subsection 2.1 that Inline Evidence Syntax enables us to enforce verbatim quotes with constrained sampling. The approach taken in our constrained sampling implementation is to mask out logits in the model’s output layer (online, during sampling) that would result in sampling tokens that do not occur as a contiguous subsequence within the documents in the model’s context.
Because this masking does not need to apply whilst the model is emitting free-form text in the claim part of its response, we construct a simple finite state machine which masks certain logits if it is in the quote state, and otherwise allows any token. The states transition when the model emits special tokens.
To be explicit, the system has the following states.
Within claim. Saw %<. Any token is allowed.
Ended claim. Saw >%. The claim has ended. Must begin document title.
Within document title. Saw %(. Now within document title. Must exactly quote the title of one of the documents in the conditioning context.
Ended document title. Saw )%. Must begin a quote.
Within quote. Saw %[. Within a quote. Now the only allowed tokens are those that begin a new quote (the token exists within the documents in the conditioning context), continue the quote, or end the quote.
Ended quote. Saw ]%. Now any token is allowed. A new instance of the syntax can be entered by emitting %<.
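The quote-state masking can be sketched as follows. This is a simplified illustration rather than the prototype described here: documents are token lists, each delimiter is assumed to be emitted as a single special token, only the quote state is constrained (title matching is omitted), and the names `allowed_tokens`, `allowed_quote_tokens` and `step` are hypothetical.

```python
# Assumption: every delimiter is one special token that switches the state.
TRANSITIONS = {
    "%<": "claim", ">%": "ended_claim",
    "%(": "title", ")%": "ended_title",
    "%[": "quote", "]%": "ended_quote",
}

def allowed_quote_tokens(quote_so_far, documents, end_token="]%"):
    """Tokens that keep the quote a contiguous subsequence of some document."""
    allowed = set()
    n = len(quote_so_far)
    for doc in documents:
        for i in range(len(doc) - n):
            if doc[i:i + n] == quote_so_far:
                allowed.add(doc[i + n])  # token following this occurrence
    if quote_so_far:
        allowed.add(end_token)  # a non-empty quote may be closed
    return allowed

def allowed_tokens(state, quote_so_far, documents):
    """Return the allowed token set in the quote state, or None for 'any token'."""
    if state == "quote":
        return allowed_quote_tokens(quote_so_far, documents)
    return None

def step(state, quote_so_far, token):
    """Advance the state machine by one emitted token."""
    if token in TRANSITIONS:
        return TRANSITIONS[token], []
    if state == "quote":
        return state, quote_so_far + [token]
    return state, quote_so_far
```

During sampling, the logits of every token outside the returned set would be masked out before the next token is drawn. The linear scan over documents is chosen for clarity; a practical implementation would likely precompute an index of token positions.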
Appendix K Examples of GopherCite answering questions about the Introduction
Here we demonstrate a strength of feeding GopherCite long, uncurated contexts during training by showing that it can answer a few simple questions about this paper’s introduction: see Figure 12. The Introduction section, after preprocessing to remove whitespace, consists of 1774 subword tokens.
For each of these questions the researchers cherry-picked the best answer out of 16 samples from the SFT model.