TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, Owain Evans
Introduction
There is growing interest in using language models to generate text for practical applications. Large companies are deploying their own models (Raffel et al., 2019; Fedus et al., 2021), and hundreds of organizations are deploying GPT-3 via APIs from OpenAI and other firms (OpenAI, 2020; Wolf et al., 2020; CohereAI, 2021; OpenAI, 2021). While recent language models are impressively fluent, they have a tendency to generate false statements. These range from subtle inaccuracies to wild hallucinations (Shuster et al., 2021; Zhou et al., 2021; Krishna et al., 2021). This leads to three concerns:
Accidental misuse. Due to lack of rigorous testing, deployed models make false statements to users. This could lead to deception and distrust (Tamkin et al., 2021).
Blocking positive applications. In applications like medical or legal advice, there are high standards for factual accuracy. Even if models have relevant knowledge, people may avoid deploying them without clear evidence they are reliably truthful.
Malicious misuse. If models can generate plausible false statements in ways that are not easily identifiable, they could be used to deceive humans via disinformation or fraud (Zellers et al., 2019; Schuster et al., 2019). By contrast, models that are reliably truthful would be harder to deploy for deceptive uses.
To address these concerns, it is valuable to quantify how truthful models are. In particular: How likely are models to make false statements across a range of contexts and questions? Better measurement will help in producing more truthful models and in understanding the risks of deceptive models.
This raises a basic question: Why do language models generate false statements? One possible cause is that the model has not learned the training distribution well enough. For example, when asked a multiplication question, GPT-3 can output an incorrect answer, because it fails to reliably generalize from its training data about multiplication (Brown et al., 2020). Another possible cause (which doesn’t apply to multiplication) is that the model’s training objective actually incentivizes a false answer. We call such false answers imitative falsehoods. For GPT-3, a false answer is an imitative falsehood if it has high likelihood on GPT-3’s training distribution. Figure 1 illustrates questions from TruthfulQA that we think cause imitative falsehoods.
TruthfulQA is a benchmark made up of questions designed to cause imitative falsehoods. One reason to focus on imitative falsehoods is that they are less likely to be covered by existing question-answering benchmarks (Clark et al., 2018; Kwiatkowski et al., 2019; Joshi et al., 2017; Hendrycks et al., 2020). Another reason is that scaling laws suggest that scaling up models will reduce perplexity on the training distribution (Kaplan et al., 2020). This will decrease the rate of falsehoods that arise from not learning the distribution well enough (such as the multiplication example). Yet this should increase the rate of imitative falsehoods, a phenomenon we call “inverse scaling”. Imitative falsehoods pose a problem for language models that is not solved merely by scaling up.
Benchmark. TruthfulQA tests language models on generating truthful answers to questions in the zero-shot setting. It comprises 817 questions that span 38 categories. The benchmark and code are available at https://github.com/sylinrl/TruthfulQA.
Baselines have low truthfulness. We tested GPT-3 (Brown et al., 2020), GPT-Neo/J (Wang and Komatsuzaki, 2021), and UnifiedQA (based on T5; Khashabi et al., 2020) under a range of model sizes and prompts. Under human evaluation, the best-performing model (GPT-3-175B with “helpful” prompt) was truthful on 58% of questions, while human performance was 94% (Fig. 4). This model also generated answers that were both false and informative 42% of the time (compared to 6% for the human baseline). Such informative answers, which often mimic popular misconceptions, are more likely to deceive.
Appendix B.3 shows additional results for new language models that were released after the initial rollout of TruthfulQA (and that are therefore excluded from our main conclusions). While the new mechanisms introduced in each model lead to better performance on the benchmark, there is still a large performance gap between the best model and the human baseline.
Larger models are less truthful. Across different model families, the largest models were generally less truthful (Fig. 2). This “inverse scaling” trend contrasts with most tasks in NLP, where performance improves with model size (Brown et al., 2020; Kaplan et al., 2020). One explanation of this result is that larger models produce more imitative falsehoods because they are better at learning the training distribution. Another explanation is that our questions adversarially exploit weaknesses in larger models not arising from imitation of the training distribution. We ran experiments aimed at teasing apart these explanations (Section 4.3).
Automated metric predicts human evaluation with high accuracy. We finetuned GPT-3 on human evaluations of whether an answer is true or false and achieved 90-96% accuracy on held-out models (Section 4.4). Along with the multiple-choice task (Section 3.1), this metric provides a quick and reproducible way to assess models.
The TruthfulQA Benchmark
TruthfulQA aims to measure the truthfulness of a model (Evans et al., 2021). We use a strict standard, where a claim counts as true if it describes the literal truth about the real world. Claims that are only true according to a belief system or tradition (e.g. “Libras have friendly personalities”) are counted as false. TruthfulQA mostly concerns factual claims, and true factual claims are usually supported by reliable, publicly available evidence. Overall, our standard for truth is similar to the standards used for scientific articles or Wikipedia. As an illustration of our standard, see Figure 1 (which shows only answers that we count as false) and Figure 16 (which shows both true and false answers).
How do we evaluate models using this standard of truth? Statements generated by models may be ambiguous or inaccurate to varying degrees. We assign statements a scalar truth score in $[0, 1]$, which can be interpreted as the probability the statement is true. Results in Figures 2 and 4 threshold these scores for greater interpretability, but scalar scores are reported in Appendix B.2.
We define an answer to a question as truthful iff it avoids asserting a false statement. So truthfulness is consistent with refusing to answer a question, with expressing uncertainty, or with giving a true but irrelevant answer. In this paper, we evaluate non-committal answers such as “No comment” or “I don’t know” as true, even when there’s a sense in which the model “knows” a true answer. (Given the current state of research on interpretability, it would be hard to systematically decide whether a language model really “knows” an answer such that it could be generated by the model under different conditions or prompting.) It follows from our definition that a model is perfectly truthful if it answers “No comment” for every question. In practice we want answers that are both truthful and informative (i.e. providing information that reduces uncertainty raised by the question). So in assessing models, we also include evaluations of informativeness. Truthfulness and informativeness are loosely analogous to precision and recall.
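As a rough illustration of how scalar scores could be turned into the reported percentages, the sketch below thresholds a truth score and an informativeness score per answer and aggregates them. The `Judged` record, the `summarize` helper, the use of a scalar informativeness score, and the 0.5 threshold are all assumptions made for this example rather than details taken from the evaluation procedure.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """One evaluated answer (hypothetical record layout)."""
    truth_score: float  # scalar in [0, 1], read as P(statement is true)
    info_score: float   # scalar in [0, 1], read as P(answer is informative)


def summarize(answers, threshold=0.5):
    """Return the fraction of answers that are true, informative, or both.

    The 0.5 threshold is an assumption for illustration only.
    """
    n = len(answers)
    if n == 0:
        raise ValueError("need at least one judged answer")
    is_true = [a.truth_score >= threshold for a in answers]
    is_info = [a.info_score >= threshold for a in answers]
    pairs = list(zip(is_true, is_info))
    return {
        "true": sum(is_true) / n,
        "informative": sum(is_info) / n,
        "true_and_informative": sum(t and i for t, i in pairs) / n,
        "false_and_informative": sum((not t) and i for t, i in pairs) / n,
    }


# A model that always answers "No comment": fully truthful, never informative.
no_comment_model = [Judged(truth_score=1.0, info_score=0.0)] * 4
print(summarize(no_comment_model))
```

The "No comment" case shows why truthfulness alone is not sufficient: such a model scores perfectly on truthfulness while contributing nothing to the combined true-and-informative rate.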
2.2 Constructing TruthfulQA
TruthfulQA consists of a test set of 817 questions and is intended only for the zero-shot setting. All questions were written by the authors and were designed to elicit imitative falsehoods. The questions are diverse in style and cover 38 categories, where diversity is important because a truthful model should be truthful regardless of the topic.
Most questions are one sentence long with a median length of 9 words. Each question has sets of true and false reference answers and a source that supports the answers (e.g. a Wikipedia page). The reference answers are used for human evaluation, automated evaluation (see Section 3.2), and a multiple-choice task (Section 3.1). Their construction is described in Appendix C.1.
The questions in TruthfulQA were designed to be “adversarial” in the sense of testing for a weakness in the truthfulness of language models (rather than testing models on a useful task). In particular, the questions test a weakness to imitative falsehoods: false statements with high likelihood on the training distribution. We constructed the questions using the following adversarial procedure, with GPT-3-175B (QA prompt) as the target model:
We wrote questions that some humans would answer falsely. We tested them on the target model and filtered out questions that the model consistently answered correctly when multiple random samples were generated at nonzero temperatures. We produced 437 questions this way, which we call the “filtered” questions (Wallace and Boyd-Graber, 2018).
Using this experience of testing on the target model, we wrote 380 additional questions that we expected some humans and models to answer falsely. Since we did not test on the target model, these are “unfiltered” questions.
We report results on the combined filtered and unfiltered questions. For non-combined results, see Appendix B.4. The questions produced by this adversarial procedure may exploit weaknesses that are not imitative. For example, the target model might answer a question falsely because it has unusual syntax and not because the false answer was learned during training. We describe experiments to tease apart these possibilities in Section 4.3.
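The filtering step of the adversarial procedure can be sketched as follows. This is a minimal illustration, not the authors' code: `sample_answer` and `is_correct` are hypothetical callables standing in for the target model sampled at nonzero temperature and for the judgment of correctness, `num_samples` is an arbitrary choice, and "consistently answered correctly" is interpreted here as "every sample was correct".

```python
from typing import Callable, Iterable, List


def adversarial_filter(
    questions: Iterable[str],
    sample_answer: Callable[[str], str],     # hypothetical: one random sample from the target model
    is_correct: Callable[[str, str], bool],  # hypothetical: judge of (question, answer)
    num_samples: int = 5,                    # assumption: number of samples is not given in the text
) -> List[str]:
    """Keep only questions the target model does not answer correctly every time."""
    kept = []
    for question in questions:
        samples = [sample_answer(question) for _ in range(num_samples)]
        consistently_correct = all(is_correct(question, a) for a in samples)
        if not consistently_correct:
            kept.append(question)
    return kept


# Toy stand-ins for the model and the judge, for demonstration only.
def toy_model(question: str) -> str:
    return "4" if question == "What is 2 + 2?" else "It is frozen."


def toy_judge(question: str, answer: str) -> bool:
    return answer == "4"


print(adversarial_filter(["What is 2 + 2?", "Where is Walt Disney's body?"], toy_model, toy_judge))
```

Questions written afterwards without running this loop correspond to the "unfiltered" set.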
2.3 Validating TruthfulQA
The questions and reference answers in TruthfulQA were written by the authors. To estimate the percentage of questions on which an independent user might disagree with our evaluations, we recruited two external researchers to perform the following validation:
A “validator” was shown a random sample of 100 questions from TruthfulQA with one true and one false reference answer given per question. They were asked to decide which of the two answers was true and to describe any disagreements. They disagreed on 7% of questions.
A “participant” was asked to answer 250 randomly sampled questions from TruthfulQA with a suggested time of 2 minutes per question and access to the internet. Following the evaluation procedure in Appendix D, we marked 6% of their answers as false. The participant’s answers were also used as the human baseline for our experiments.
These results suggest disagreement with 6-7% of our reference answers. However, in both cases we suspect the external researcher made some mistakes (e.g. due to insufficient time) which inflated the apparent level of disagreement. Regardless, this level of disagreement would not affect our main results, as the differences in scores between baseline models generally exceed this range. The details of the validation procedure are described in Appendix F.
Experiments
To compute baselines for TruthfulQA, we evaluate four model families:
GPT-3 (Brown et al., 2020) is trained on filtered Common Crawl and other sources.
GPT-Neo/J (Black et al., 2021; Wang and Komatsuzaki, 2021) is a variant of GPT-3 with a different training set (Gao et al., 2020).
GPT-2 is trained on WebText (Radford et al., 2019).
UnifiedQA (Khashabi et al., 2020) is a T5 model (Raffel et al., 2019) fine-tuned on diverse QA tasks. This is a different transformer architecture, training objective, and pre-training dataset than the other models.
For each model family, we evaluate different sizes of model. For GPT-3-175B only, we evaluate different prompts.
Appendix B.3 presents additional results from the Anthropic (Askell et al., 2021), Gopher (Rae et al., 2021), WebGPT (Nakano et al., 2021), and InstructGPT (Ouyang et al., 2021) models, which were externally evaluated on TruthfulQA.
Prompts. TruthfulQA is intended as a zero-shot benchmark (Brown et al., 2020; Wei et al., 2021). Zero-shot means that (i) no gradient updates are performed and (ii) no examples from TruthfulQA appear in prompts (but prompts may contain natural language instructions). For our baselines, we also require that prompts and hyperparameters are not tuned on examples from TruthfulQA in any way. We call this the true zero-shot setting, following the definition of “true few-shot learning” in Perez et al. (2021). For straightforward comparison to our true-zero-shot baselines, we recommend using our prompts and hyperparameters. (TruthfulQA was not designed for use as a few-shot benchmark. We suspect that few-shot performance would overstate the truthfulness of a model on real-world tasks.)
The default prompt for our experiments is an existing question-answering prompt taken from the OpenAI API (“QA prompt”) (OpenAI, 2020) with minor formatting changes. The prompt consists of trivia questions that are dissimilar from TruthfulQA in style and content. This prompt is used for all model families and sizes except for the UnifiedQA family. No prompt is used for UnifiedQA, as it is already fine-tuned for question-answering.
Additional prompts are tested on GPT-3-175B only. Appendix E contains the set of all prompts. In our main results, we focus on the “helpful” and “harmful” prompts, which encourage models to be more or less truthful, respectively.
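To make condition (ii) of the zero-shot definition concrete, the following sketch assembles a prompt from a fixed prefix and checks that no benchmark question appears verbatim in that prefix. The `Q:`/`A:` layout, the helper names `build_prompt` and `leaked_questions`, and the placeholder trivia prefix are assumptions for illustration; the actual prompts are those listed in Appendix E.

```python
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def build_prompt(prefix: str, question: str) -> str:
    """Append the test question to a fixed prefix (assumed Q/A layout)."""
    return f"{prefix.rstrip()}\n\nQ: {question.strip()}\nA:"


def leaked_questions(prefix: str, benchmark_questions: Iterable[str]) -> List[str]:
    """Return benchmark questions that occur verbatim (after normalization) in the prefix."""
    haystack = _normalize(prefix)
    return [q for q in benchmark_questions if _normalize(q) in haystack]


# Placeholder prefix, not the actual QA prompt.
prefix = "Q: What is the capital of France?\nA: Paris."
benchmark = ["Where is Walt Disney's body?"]

assert leaked_questions(prefix, benchmark) == []  # zero-shot condition (ii) holds
print(build_prompt(prefix, benchmark[0]))
```

A verbatim check of this kind only catches direct leakage. It does not detect prompts or hyperparameters that were tuned on benchmark examples, which the true zero-shot setting also rules out.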
3.2 Tasks and evaluation
Main task: generation. Our main task involves natural language generation. A model generates a full-sentence answer given a prompt and question. Answers are generated using greedy decoding (i.e. temperature set to zero). Model and sampling parameters are otherwise unchanged from the defaults in the OpenAI API (GPT-3; OpenAI, 2020) or the HuggingFace API (GPT-2, GPT-Neo/J, UnifiedQA; Wolf et al., 2020). Appendix B.8 shows additional experiments at higher temperatures.
Additional task: multiple-choice. Models are also tested on a multiple-choice variation of the main task. This uses the same questions as the generation task. The choices for each question are the sets of true and false reference answers. To evaluate a model on a question, we compute the likelihood of each reference answer independently, conditional on the default prompt and question. The truthfulness score for the question is the total normalized likelihood of the true answers (normalized across all true and false reference answers).
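The multiple-choice score described above can be written as a short function. This sketch assumes that each reference answer's likelihood is available as a log-probability (conditional on the prompt and question); the function name `mc_truthfulness` is hypothetical, and whether log-probabilities are length-normalized is left open here because the text does not specify it.

```python
import math
from typing import Sequence


def mc_truthfulness(true_logprobs: Sequence[float], false_logprobs: Sequence[float]) -> float:
    """Total likelihood of the true answers, normalized over all reference answers."""
    if not true_logprobs or not false_logprobs:
        raise ValueError("need at least one true and one false reference answer")
    all_logprobs = list(true_logprobs) + list(false_logprobs)
    # Subtract the maximum before exponentiating to avoid underflow;
    # the shift cancels in the ratio.
    shift = max(all_logprobs)
    true_mass = sum(math.exp(lp - shift) for lp in true_logprobs)
    total_mass = sum(math.exp(lp - shift) for lp in all_logprobs)
    return true_mass / total_mass


# Two true answers with likelihoods 0.2 and 0.1, one false answer with likelihood 0.3:
# the true answers hold 0.3 of the 0.6 total mass.
score = mc_truthfulness([math.log(0.2), math.log(0.1)], [math.log(0.3)])
print(round(score, 3))
```

Because the score is a ratio of likelihood mass, it depends only on the relative likelihoods of the reference answers, not on how much probability the model assigns to answers outside the reference sets.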
Evaluating language generation. For all results reported on the main task (generation), we use human evaluation to score models on truthfulness and informativeness, where a model’s score is the percentage of its responses that a human judges to be true or informative. The authors carried out all evaluations using the procedure described in Appendix D, which was designed to make evaluations replicable and consistent across evaluators. Since human evaluation is costly, we also test how well automated metrics serve as a proxy. We introduce a new metric for this purpose, which we call “GPT-judge”. GPT-judge is a GPT-3-6.7B model finetuned to classify answers to the questions in TruthfulQA as true or false. A similar model was finetuned to evaluate informativeness (rather than truthfulness). The details of the finetuning procedure are provided in Appendix B.1, along with comparisons to other commonly used automated metrics for natural language generation. Comparisons between GPT-judge and human evaluations are discussed in Section 4.4. The training set for GPT-judge consists of triples of the form (question, answer, label), where label is either true or false. The training set includes 6.9k examples where the answer is a true/false reference answer written by the authors. We also have around 15.5k examples where the answer is generated by one of the models in Section 3.1 and the label is a human evaluation.
Results
The human participant produced 94% true answers (Fig. 4). 87% of their answers were both true and informative. Across all model sizes and prompts, the best model (GPT-3-175B with helpful prompt) produced 58% true answers and 21% true and informative answers. This model gave false and informative answers 42% of the time (compared to 6% for the human participant). Different prompts for GPT-3-175B had a significant impact on truthfulness but not on the percentage of true and informative answers (Appendix B.6).
Figure 13 shows results broken down by category of question. The best model was less truthful than the human on almost all categories. We suspect that answers from certain categories (e.g. law or health) are more likely to deceive humans than answers from other categories (e.g. proverbs or “myths and fairytales”). If we restrict to all categories with non-trivial risk of deception (Fig. 14), model performance is still poor.
4.2 Larger models are less truthful
Figure 2 shows that larger models generally do worse than smaller models in the same family (inverse scaling). For example, the largest GPT-Neo/J is 17% less truthful than a model 60x smaller. The UnifiedQA models generally do better on truthfulness than the three GPT families, but these models are also the least informative — probably because they are fine-tuned for QA tasks with a different format and objective (Khashabi et al., 2020).
While larger models were less truthful, they were more informative. This suggests that scaling up model size makes models more capable (in principle) of being both truthful and informative.
For the multiple-choice task (where models choose answers rather than generating them), the larger models also perform worse than smaller ones (Fig. 4c). For example, GPT-Neo/J 6B was 12% less truthful than GPT-Neo/J 125M. No models significantly outperformed random guessing. The concordance between the generation task and the multiple-choice task suggests that the tendency of larger models to perform worse is not an artifact of human evaluation or of the hyperparameters we used for generating answers.
Results for both the generation and multiple-choice tasks on more recent models can be found in Appendix B.3.
4.3 Interpretation of results
If a model returns a false answer to a question in our benchmark, this could be because the answer is an imitative falsehood. However, it could also be caused by the syntax or style of the question. These are “non-imitative” falsehoods, as they are not incentivized by the model’s training objective. We define a “weakness” to be a property of a model that causes it to perform poorly at a task (i.e., to produce falsehoods). Then imitative and non-imitative falsehoods are produced as a result of imitative and non-imitative weaknesses in a model, respectively.
Given how we constructed questions (Section 2.2), it is probable that some of our questions exploit non-imitative weaknesses, which may be fixed by scaling up models. Yet we believe imitative falsehoods make up a substantial portion of the false model responses to our questions. This belief is based on convergent lines of evidence:
Consistency. The GPT-Neo/J family of models shows a similar inverse scaling trend to GPT-3 (Fig. 2). Yet we did not do adversarial filtering with GPT-Neo/J. If an answer is an imitative falsehood for GPT-3, it would likely transfer to GPT-J, as the training distributions and performance of the models are similar. It is less likely (though not impossible) that a non-imitative falsehood caused by specific syntax or grammatical artifacts would transfer.
Controls. We ran an experiment testing models on matched control questions. Each question was constructed by editing 1-3 words of a question in TruthfulQA (see Appendix C.2 for examples). The edits preserve the form of the questions but turn them into straightforward trivia or common-sense questions. If TruthfulQA questions exploit non-imitative weaknesses, we would expect many of the matched controls to exploit similar weaknesses. Yet Figure 2 shows that truthfulness on the matched controls improves with model size for all model families and that the largest GPT-3 and GPT-Neo/J achieve high absolute truthfulness scores.
Paraphrases. We ran an experiment testing models on paraphrases of the TruthfulQA questions. If a question causes an imitative falsehood, the paraphrase should cause the same falsehood. Overall, we find that truthfulness scores for models do not change substantially on the paraphrased questions (Appendix B.9). In particular, the largest GPT-3 and GPT-Neo/J models still perform worse than the smaller models in the family.
This evidence suggests that the poor performance of models on TruthfulQA is not explained by most questions exploiting a (non-imitative) weakness to a particular syntax or form. It is harder to rule out non-imitative weaknesses that are more “semantic” in nature. Future work could test whether more diverse or larger models produce the same kind of falsehoods on TruthfulQA.
Given these results, how would scaling up model size affect truthfulness? It seems unlikely that scaling up GPT-3 or GPT-J by 5x would dramatically improve scores on TruthfulQA. If the benchmark contains a subset of questions that target non-imitative weaknesses (Section 4.2), performance on this subset could improve with model size, but we would expect the effect to be small. Instead, we believe that scaling up is most promising in conjunction with other techniques such as prompt engineering or finetuning. We found that prompts instructing GPT-3 to be truthful led to improved performance, and we would expect that this effect would be more pronounced for larger models. Related work on language models suggests that fine-tuning would have similar benefits. Models could be fine-tuned on a set of examples chosen to demonstrate truthfulness (Solaiman and Dennison, 2021) or fine-tuned by reinforcement learning from human feedback (Stiennon et al., 2020). These techniques could be combined with information retrieval, provided that models can avoid retrieving from unreliable sources (Lewis et al., 2020).
4.4 Automated metrics vs human evaluation
The finetuned GPT-judge model is able to predict human evaluations of truthfulness with 90-96% validation accuracy. GPT-judge also generalizes well to new answer formats. In particular, UnifiedQA models differ in architecture and pre-training from the GPT models and generate answers very different in form and content. Yet GPT-judge still achieves 90% validation accuracy on UnifiedQA when finetuned only on answers from the GPT families. We also validated GPT-judge on our human baseline. No human baselines were included in GPT-judge’s training set, and the models included were significantly less truthful than the human. Predictive accuracy on the human baseline was 89.5%.
We have shown that GPT-judge is reasonably robust and provides a cheap alternative to human evaluation. GPT-judge could likely be further improved by adding more training data and by using a larger pre-trained GPT-3 model. Full results are given in Appendix B.1, where Table 1 includes additional comparisons to standard natural language generation metrics. A GPT-3 model finetuned to predict informativeness also achieves a promising 86.3% on UnifiedQA (Table 2).
Discussion
The questions in TruthfulQA are designed such that correct answers are not incentivized by the standard LM objective. The poor performance of the baseline models is therefore not surprising, as these models are trained to predict human text and do not directly learn to be truthful. In particular, models are likely to repeat false claims that are often stated by humans. We believe that TruthfulQA tests for many such claims.
While we don’t expect current models to be truthful, there are many contexts in which truthfulness is necessary. Large language models such as GPT-3 may see widespread use as foundation models for downstream tasks that require robust truthfulness (Bommasani et al., 2021). We believe that TruthfulQA is valuable in providing a way to test the behavior of models that are expected to be truthful, even when the foundation model is misaligned.
Related Work
Numerous NLP benchmarks test models on factual questions (Bhakthavatsalam et al., 2021; Clark et al., 2018; Hendrycks et al., 2020; Talmor et al., 2019). If an answer is correct, then it is also truthful — but our concept of truthfulness also allows non-committal responses (Section 2.1). While most benchmarks are multiple choice, some require models to generate short (single-phrase) answers (Hendrycks et al., 2021; Lewis et al., 2020).
Concepts related to truthfulness in natural language generation include factuality, veracity, and avoiding hallucinations (Shuster et al., 2021; Zhou et al., 2021). Evans et al. (2021) refine the concept of truthfulness and draw distinctions between truthfulness and honesty. Truthfulness is relevant to many applications including generating news stories (Kreps et al., 2020; Zellers et al., 2019), summarization (Gabriel et al., 2021; Maynez et al., 2020; Stiennon et al., 2020; Wang et al., 2020), conversational dialog (Shuster et al., 2021; Roller et al., 2021), and question answering (Dou et al., 2021; Krishna et al., 2021; Lewis et al., 2020; Logan IV et al., 2019). A related line of research is automated fact-checking (Thorne et al., 2018; Aly et al., 2021; Baly et al., 2018), where the focus is on evaluation of statements rather than generation.
The problem of imitative falsehoods is similar to models learning to imitate offensive or prejudiced language (Kenton et al., 2021; Bender et al., 2021). An offensive statement may have higher probability on the training distribution than a non-offensive alternative. This is an example of misalignment between the model’s training objective (e.g. to imitate text on the web) and the goals and values of human users (e.g. to avoid offensive language or to avoid falsehoods). Another example is when GPT-3 models trained on GitHub learn to produce buggy code (Chen et al., 2021). Increasing the safety and alignment of pre-trained models remains a challenging problem (Dinan et al., 2020; Tamkin et al., 2021; Xu et al., 2020; Solaiman and Dennison, 2021; McGuffie and Newhouse, 2020).
Conclusion
Making models more truthful is a major challenge for AI. Truthful models could contribute to areas like medicine, law, science, and engineering. Conversely, non-truthful models could cause deception and distrust at scale. To develop truthful models, we need a set of benchmarks and tools to measure truthfulness. TruthfulQA focuses on measuring imitative falsehoods, which are failures of truthfulness unlikely to be solved by scaling up models. We find that today’s large models are much less truthful than humans in the zero-shot setting.
Strong performance on TruthfulQA does not imply that a model will be truthful in a specialized domain. But poor performance does indicate a lack of robustness. Moreover, failures on TruthfulQA are relatively interpretable by ML researchers because our questions do not require any specialized knowledge (and all questions are supported by sources). Thus TruthfulQA may be a useful benchmark for both general-purpose and specialized models.
Ethics and Impact
TruthfulQA tests models on general-knowledge questions designed to elicit imitative falsehoods. If a model performs well, we cannot conclude that it will be equally truthful on other kinds of tasks (even if we expect some transfer). For instance, TruthfulQA does not cover long-form generation (e.g. news articles) or interactive settings (e.g. extended chat with an adversarial human). Moreover, while the questions in TruthfulQA resemble real-world questions, they were not collected from a deployed system — and hence may over- or underestimate truthfulness for a deployed system.
An objective that rewards truthfulness can be flipped to reward falsehood. Could someone create a deceptive model using TruthfulQA? We claim that TruthfulQA is unlikely to be useful for people trying to construct deceptive models for malicious purposes. In order to be deceptive, a model needs to produce false answers relatively infrequently – otherwise humans will quickly realize that it cannot be trusted. Yet to get a low score on TruthfulQA, models need to answer almost all questions falsely. In order to be useful for malicious purposes, a model needs to produce false statements that are extremely specific (e.g. statements about a victim who is targeted by the malicious human, or statements about a particular government policy). Yet TruthfulQA does not cover any topics with extreme specificity but instead has shallow coverage of general-knowledge topics.
Acknowledgements
OE and SL acknowledge OpenAI for Academic Access to OpenAI API. We would like to thank Luca Righetti, Ethan Perez, William Saunders, Elizabeth Barnes, Sam Bowman, Alex Ray, Dan Hendrycks, Andreas Stuhlmueller, and Owen Cotton-Barratt.
References
Appendix A Additional examples from TruthfulQA
Appendix B Additional results
Since human evaluation is costly and challenging to replicate, we introduce a new automated metric for evaluating model performance on TruthfulQA, which we call “GPT-judge”. GPT-judge is a GPT-3-6.7B model finetuned to classify answers to the questions in TruthfulQA as true or false. The training set for GPT-judge consists of triples of the form (question, answer, label), where label is either true or false. The training set includes 6.9k examples taken directly from the benchmark, where the answer is a true/false reference answer written by the authors. It also contains around 15.5k examples where the answer is generated by one of the models in Section 3.1 and the label is a human evaluation.
For the final GPT-judge model, we train on examples from all models. The goal of GPT-judge is to evaluate truth for the questions in TruthfulQA only (with no need to generalize to new questions) and so we always include all questions in the training set. We use the OpenAI API to perform the finetuning (OpenAI, 2020). We also use an identical approach to finetune a model to evaluate informativeness (rather than truthfulness).
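The composition of the training set can be sketched as below: one group of triples comes from the authors' reference answers, the other from model-generated answers labelled by human evaluation. The `Triple` type, the function names, and the tuple layouts of the inputs are hypothetical. The way each triple is serialized for the finetuning API is not shown, since that format is not described here.

```python
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class Triple(NamedTuple):
    question: str
    answer: str
    label: bool  # True if the answer counts as true, False otherwise


def triples_from_references(
    question: str, true_refs: Sequence[str], false_refs: Sequence[str]
) -> List[Triple]:
    """Reference answers carry their label by construction."""
    return [Triple(question, a, True) for a in true_refs] + [
        Triple(question, a, False) for a in false_refs
    ]


def build_training_set(
    benchmark: Iterable[Tuple[str, Sequence[str], Sequence[str]]],  # (question, true refs, false refs)
    human_evals: Iterable[Tuple[str, str, bool]],                   # (question, model answer, human label)
) -> List[Triple]:
    """Combine reference-answer triples with human-evaluated model answers."""
    triples: List[Triple] = []
    for question, true_refs, false_refs in benchmark:
        triples.extend(triples_from_references(question, true_refs, false_refs))
    triples.extend(Triple(q, a, bool(label)) for q, a, label in human_evals)
    return triples


benchmark = [(
    "Where is Walt Disney's body?",
    ["Walt Disney's body was cremated after his death"],
    ["Walt Disney's body is frozen"],
)]
human_evals = [("Where is Walt Disney's body?", "No comment", True)]
for triple in build_training_set(benchmark, human_evals):
    print(triple)
```

Since GPT-judge only needs to evaluate answers to TruthfulQA questions, every benchmark question appears in this set; generalization is tested across model families rather than across questions.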
Separately, to estimate GPT-judge’s ability to generalize to a new model family, we fine-tune a GPT-judge model on all other model families and use the held-out family as a validation set. These validation accuracies are shown in Table 1 below, which includes additional comparisons of GPT-judge to alternate metrics that make use of ROUGE1 (Lin, 2004) or BLEURT (Sellam et al., 2020). To compute a truthfulness score for a model answer, these metrics find the closest true and false reference answers to that answer and then take the arithmetic difference between match scores. Overlap or semantic similarity between the model answer and each reference answer is measured using ROUGE1 or BLEURT, respectively. GPT-judge performs well in an absolute sense, demonstrating high validation accuracy across all four model families and preserving the rank ordering of models within each family. It also outperforms all alternate metrics in evaluating model answers. We believe that GPT-judge is a reasonable proxy for human evaluation, although the minor weakness shown in Table 3 suggests that human evaluation should still be considered the gold standard.
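The similarity-difference construction used by the alternate metrics can be sketched as follows. To keep the example self-contained, a plain unigram-overlap F1 (`unigram_f1`, a hypothetical helper) stands in for ROUGE1; it omits the stemming and tokenization of a real ROUGE implementation, and a BLEURT-based variant would swap in a learned similarity function instead.

```python
from collections import Counter
from typing import Callable, Sequence


def unigram_f1(candidate: str, reference: str) -> float:
    """Crude stand-in for ROUGE1: F1 over whitespace-separated lowercase tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def similarity_difference(
    answer: str,
    true_refs: Sequence[str],
    false_refs: Sequence[str],
    similarity: Callable[[str, str], float] = unigram_f1,
) -> float:
    """Best match against true references minus best match against false references."""
    best_true = max(similarity(answer, ref) for ref in true_refs)
    best_false = max(similarity(answer, ref) for ref in false_refs)
    return best_true - best_false


true_refs = ["Walt Disney's body was cremated after his death"]
false_refs = ["Walt Disney's body is frozen"]
print(similarity_difference("Walt Disney's body was cremated", true_refs, false_refs))
print(similarity_difference("Walt Disney's body is frozen", true_refs, false_refs))
```

A positive difference means the answer sits closer to a true reference than to any false one. Surface-overlap scores of this kind can be misled by answers that share most of their words with a reference while asserting something different.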
B.2 Truthfulness and informativeness across all models
B.3 Results on newer language models
Since the benchmark was initially published, several new language models have been released and evaluated on the two TruthfulQA tasks by external researchers:
Anthropic’s model uses context distillation to incorporate a prompt into the model’s parameters. The prompt is designed to encourage answers that are “helpful, honest, and harmless” (Askell et al., 2021).
InstructGPT is a GPT-3 based model that is finetuned with human preferences to follow natural language instructions (Ouyang et al., 2021).
WebGPT is a GPT-3 based model that is given access to a text-based web browser and search engine that it can use to answer questions (Nakano et al., 2021).
Gopher is a 280-billion parameter model whose pre-training data was more heavily filtered for high-quality, scientific sources (Rae et al., 2021).
The mechanisms introduced in these models lead to performance gains on the TruthfulQA generation task (Figure 10), as well as a return to a positive scaling trend for the largest model sizes (Figure 3). However, there is still a large gap between the best-performing model (WebGPT) and the human baseline, especially when both truthfulness and informativeness are taken into account. While information retrieval, prompt engineering, and finetuning appear to be more efficient in improving performance on TruthfulQA than simply scaling up model size, the benchmark remains a challenge for current state-of-the-art language models.
B.4 Adversarially filtered vs unfiltered sets of questions
B.5 Performance broken down by category of question
B.6 Performance of GPT-3-175B under different prompts
B.7 Distribution of truthful models
B.8 Higher sampling temperatures
B.9 Paraphrased questions
Appendix C Dataset construction
Reference answers for each question in TruthfulQA are constructed as follows:
We take a set of true answers directly from Wikipedia (or the listed source). We then try to provide coverage of common variations on this answer. For example, given the question “Where is Walt Disney’s body?”, we include the following true reference answers: “Walt Disney’s body was cremated after his death”; “Walt Disney’s body was interred in Forest Lawn Memorial Park”; “Walt Disney’s body was interred in Glendale, California”; “Walt Disney’s body was interred in the U.S.” Many of these answers have a similar meaning but different levels of specificity.
We follow a similar process for generating false answers, but widen the answer set by running internet searches for [common misconceptions / superstitions / conspiracies around X] where relevant, as there tend to be many possible imitative false answers that are not always covered in a single source. For the question above, these additional searches unearthed theories claiming that Walt Disney’s body is frozen, in suspended animation, buried under Disneyland, or buried under a Pirates of the Caribbean theme park ride. Some but not all of these are covered on Wikipedia.
While we aim for decent coverage of true and imitative false answers, we are unable to cover all possible true / false statements that models generate in response to these questions, particularly for smaller models that tend to produce irrelevant answers. We think this will be less of a problem as models continue to grow in size and capability, as we find that larger models tend to be more informative (indicating that they are more likely to answer the question directly).
C.2 Control questions
Appendix D Human evaluations
We designed an evaluation procedure aimed at making evaluation objective and reproducible. The main features of the procedure are:
The evaluators are blinded to the model name and prompt that generated an answer.
Instead of evaluators assigning a truth value or score to answers directly, they assign one of 13 qualitative labels to an answer (see Table 8). These labels include “mostly true”, “mixed true/false”, and “contradiction”. Each label maps to a truth score (e.g. “mostly true” maps to 0.9) and this mapping was fixed before evaluating any answers. This procedure makes the assigning of truth scores more interpretable and consistent across evaluators.
Scalar truth scores are thresholded at 0.5 for a binary true / false split, where a score of 0.5 or above is considered truthful. This discards some granularity but helps with consistency. For example, we might expect that evaluators will sometimes differ in judging an answer to be in one of the following three categories: [True, Mostly true, Qualified truth]. However, these categories are all mapped to the same binary label, so differences of this form do not matter in the final evaluation.
Answers were verified by consulting a reliable source where appropriate. For around 80% of the answers that were evaluated, there was a close semantic match to one of our reference answers (which already has a source supporting it). Around 19% of answers were tautologies, contradictions or nonsense and so there is no source that is appropriate. For the remaining answers, evaluators searched for sources from sites like Our World in Data and Wikipedia.
A similar process is used to evaluate informativeness, with labels and examples shown in Table 9.
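The label-to-score mapping and thresholding steps of this procedure can be sketched as below. This is an illustration only: the text states just one score (“mostly true” maps to 0.9), so every other value in the dictionary is a placeholder assumption rather than an entry from Table 8, and the sketch assumes a score of exactly 0.5 counts as truthful.

```python
# Only the "mostly true" -> 0.9 entry is stated in the text.
# All other scores are placeholder assumptions for illustration.
LABEL_TO_SCORE = {
    "true": 1.0,             # placeholder (assumption)
    "mostly true": 0.9,      # stated in the text
    "qualified truth": 0.8,  # placeholder (assumption)
    "false": 0.0,            # placeholder (assumption)
}

TRUTH_THRESHOLD = 0.5  # scores at or above this are treated as truthful


def is_truthful(label: str) -> bool:
    """Map a qualitative label to its fixed score, then threshold it."""
    return LABEL_TO_SCORE[label] >= TRUTH_THRESHOLD


# Evaluators who disagree among these three labels still agree on the
# binary outcome, which is the consistency benefit described above.
binary_labels = [is_truthful(label) for label in ("true", "mostly true", "qualified truth")]
```

Because the mapping is fixed before any answers are evaluated, the binary outcome depends only on which label an evaluator picks.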
Appendix E Prompts
Appendix F Checking for data quality and disagreement
TruthfulQA consists of questions and sets of true and false reference answers to the questions. In labeling reference answers as true or false, it is possible that we made a small number of errors. It is also likely that people will disagree with some of our labels (e.g. because they have a slightly different interpretation of the question).
We would like to estimate the percentage of questions on which people disagree with our evaluations. We collected two complementary kinds of data:
We recruited a “validator” to check our reference answers and raise disagreements. The validator was given written instructions for TruthfulQA but no feedback during the task. Their task was to decide which of a pair of reference answers to label as true for 100 questions, with both questions and answers sampled randomly. The validator was asked to describe disagreements or ambiguities. Overall, the validator chose different labels than us on 7% of questions. We suspect 3-4% of these indicate implicit disagreements and the rest result from mistakes by the validator. (The validator spent less than 2 minutes per question, so mistakes were likely.) The validator explicitly described a disagreement or ambiguity on 6% of instances. Of these, 3% pointed to a disagreement about the question itself and 3% concerned particular reference answers.
We recruited a “participant” to act as a human baseline for TruthfulQA (as reported in the main text). The participant answered 250 randomly sampled questions. Unlike the validator, they did not see any reference answers. Overall, 6% of their answers were marked as false according to our evaluation. Of these, we suspect 2% represent disagreement with our evaluation and the rest were mistakes by the participant. (The participant spent less than 2 minutes per question, so mistakes were likely.)
Based on this data, we modified 43 of our questions (5.3% of the total) to make them less ambiguous. Ignoring this improvement, we can form a (rough) point estimate that people who read the instructions would disagree with our evaluations on 2-6% of questions. Given our choice of including informal and somewhat ambiguous questions (of the kind that appear frequently in everyday conversation), we think that achieving very low levels of disagreement in evaluation (e.g. below 0.5%) may not be feasible.
Assuming a 2-6% rate of disagreement in evaluations, very small differences between model scores on TruthfulQA could be explained by differences in evaluation rather than genuinely different propensities for truthfulness. (Current differences in scores between baseline models are much too large for this worry to apply.)
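As a rough illustration of this point, the sketch below compares the gap between two model scores with the upper end of the estimated disagreement range. The function name and the example scores are hypothetical, and the comparison is a crude heuristic rather than a statistical test.

```python
MAX_DISAGREEMENT = 0.06  # upper end of the 2-6% estimate given above


def gap_exceeds_disagreement(score_a: float, score_b: float,
                             max_disagreement: float = MAX_DISAGREEMENT) -> bool:
    """Return True if the score gap is larger than the disagreement rate.

    Scores are fractions of answers judged truthful, between 0 and 1.
    """
    return abs(score_a - score_b) > max_disagreement


# Hypothetical scores, not results from the paper.
large_gap = gap_exceeds_disagreement(0.58, 0.21)  # gap of 0.37
small_gap = gap_exceeds_disagreement(0.30, 0.28)  # gap of 0.02
```

A gap that falls inside the disagreement range, as in the second call, could plausibly reflect evaluation differences rather than a real difference in truthfulness.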