Generating Benchmarks for Factuality Evaluation of Language Models
Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham
Introduction
Despite rapid improvements in their capabilities, large Language Models (LMs) still tend to generate factually inaccurate or erroneous text Lin et al. (2022); Maynez et al. (2020); Huang et al. (2020). Such phenomena can pose a significant hurdle to deploying LMs in important or sensitive settings, motivating the development of methods for evaluating LM factuality in open-ended generation.
Methods for directly evaluating an LM’s propensity towards factual generation were recently proposed by Lee et al. (2022) and Min et al. (2023). These methods suggest sampling generations from a model, applying an automatic pipeline for fact verification, and then assigning a score corresponding to the percentage of factually correct generated statements. In task-specific domains, such as long-form question answering, evaluation is usually done by assessing the relevance of a sampled generation against a reference text Lin (2004); Fabbri et al. (2022). However, the sampling approach may introduce bias: by scoring the accuracy of facts that an LM tends to generate in an open-ended setting, high-likelihood facts are over-represented, while the “long-tail” of rare facts is under-represented.
Currently, there are no metrics suited to measuring LM factuality with respect to a controlled set of facts in a generation setting. A common proxy is measuring LM perplexity; this was widely adopted to evaluate retrieval-augmented LMs Khandelwal et al. (2020); Borgeaud et al. (2022); Ram et al. (2023); Shi et al. (2023). However, perplexity is affected by many linguistic phenomena, and so cannot be directly linked to factuality.
This paper introduces a novel framework for testing a model’s tendency to generate factual information from a given factual corpus: Factual Assessment via Corpus TransfORmation (FACTOR). The key idea is automatically perturbing factual statements taken from the corpus to create a constant number of similar but false variations for each true statement (Figure 1). We employed InstructGPT Ouyang et al. (2022) to generate the false variations for each true statement. The LM’s FACTOR accuracy on our benchmark is defined as the percentage of examples for which it assigns higher likelihood to the factual completion than to any of the false variations.
We applied FACTOR to the Wikipedia and News domains, as well as to a diverse collection of domain-specific question-answer pairs (e.g., medicine, technology, law), constructing new benchmarks dubbed Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We used these datasets to evaluate a large suite of LMs from the OPT Zhang et al. (2022), GPT-2 Radford et al. (2019), and GPT-Neo Black et al. (2021) families, ranging from 110M to 66B parameters. We show in §5.1 that, as expected, FACTOR scores increase with model size. However, even the largest models we evaluated achieved scores of only % for Wiki-FACTOR, % for News-FACTOR, and % for Expert-FACTOR, indicating that these benchmarks are challenging even for large LMs. In §5.2 we show that consistent FACTOR score improvements can be achieved by augmenting the LMs with the simple retrieval component used by Ram et al. (2023). This directly demonstrates that retrieval augmentation improves factuality in the LM setting; FACTOR is thus a natural candidate for evaluating retrieval-augmented LMs.
We further show that FACTOR accuracy and LM perplexity are correlated but can sometimes induce different orderings between LMs (§5.3). This highlights that FACTOR and perplexity capture different aspects of the LMs’ performance (see Figure 2). In §6, we report findings of a manual annotation effort over generated completions, which reinforces FACTOR accuracy as predictive of factuality in open-ended generation.
Related Work
The subject of factuality evaluation has been extensively studied in downstream tasks such as summarization, fact-verification and dialog Honovich et al. (2022); Huang et al. (2021); Chen et al. (2021); Tam et al. (2023). These works typically focus on factual consistency, evaluating whether a generated text is supported by a reference text or context (e.g., source document and generated summary).
Another popular approach suggests probing LMs’ internal factual knowledge by using slot filling tasks, e.g., “Barack Obama was born in [MASK]” (Petroni et al., 2019, 2021; Roberts et al., 2020; Jiang et al., 2020; Elazar et al., 2021; Li et al., 2022; Zhong et al., 2021; Peng et al., 2022; Mallen et al., 2023). These works test LMs in a simplified, synthetic setting.
FACTOR differs from the above methods as it aims at evaluating factuality in a natural open-ended text generation setting. In such a setting, the context may be needed to reason over the evaluated factual statement, while the factual statement may not be evident in the context (unlike summarization).
Recent works proposed scoring the factuality of free-form LM generation samples Min et al. (2023); Lee et al. (2022). However, these approaches lack control over the evaluated facts and are biased towards common facts generated by the LM.
The FACTOR Evaluation Approach
Contrastive evaluation, in which a model is tested to discern between similar positive and negative examples, is widely used in various tasks Sennrich (2017); Burlot and Yvon (2017); Glockner et al. (2018); Kaushik et al. (2020). For factuality evaluation, negative examples are obtained by perturbing factual claims. This is done through human annotation, or through rule-based or model-based heuristics Schuster et al. (2021); Liu et al. (2022); Gupta et al. (2022). Following recent works on benchmark generation Perez et al. (2023), we employed InstructGPT to generate non-factual claims, as described in the following section.
This section outlines our proposed approach: Factual Assessment via Corpus TransfORmation, or FACTOR. Given a corpus, we define a multi-choice task where each example is comprised of a multi-sentence prefix, a single factual next sentence completion, and three non-factual alternative completions (Figure 1). In §3.1 we present several properties required of a FACTOR benchmark, and describe the error verticals along which we generate non-factual alternatives. We then explain our FACTOR dataset creation pipeline, which automatically generates a FACTOR benchmark from a given corpus (§3.2). Finally, we apply this pipeline to two corpora, Wikipedia and news, and to a long-form question answering dataset, creating Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We verify the quality of these datasets through manual annotations against the required properties (§3.3).
We describe the FACTOR multi-choice factual evaluation task. Each example of our task contains a prefix text $t$, along with four possible full sentence completions, of which only one is factually correct. We choose the original completion (i.e., the continuation of $t$ in the corpus) as the factually correct one. The correct completion is denoted as $c^{+}$, and the non-factual completions as $\mathcal{C}^{-} = \{c^{-}_{1}, c^{-}_{2}, c^{-}_{3}\}$. We evaluate models by measuring the percentage of examples where they assign the highest mean log-probability to $c^{+}$. Formally, a model is correct on a given example if:
$$c^{+} = \operatorname*{argmax}_{c \in \{c^{+}\} \cup \mathcal{C}^{-}} \frac{\log p(c \mid t)}{|c|}$$
where $|c|$ is the length of completion $c$ in tokens. We refer to the percentage of correct examples as the FACTOR accuracy.
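The scoring rule above (highest length-normalised log-probability wins) can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of each completion given the prefix have already been obtained from some LM; the function names and the numbers are hypothetical, and treating ties as incorrect is an assumption not stated in the text.

```python
from typing import List, Sequence, Tuple


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised log-probability of a single completion."""
    return sum(token_logprobs) / len(token_logprobs)


def is_correct(factual: Sequence[float],
               non_factual: Sequence[Sequence[float]]) -> bool:
    """True if the factual completion scores strictly higher than every alternative.

    Strict comparison (ties count as errors) is an assumption of this sketch.
    """
    target = mean_logprob(factual)
    return all(target > mean_logprob(alt) for alt in non_factual)


def factor_accuracy(examples: List[Tuple[Sequence[float], Sequence[Sequence[float]]]]) -> float:
    """Percentage of examples on which the factual completion is preferred."""
    correct = sum(is_correct(factual, alts) for factual, alts in examples)
    return 100.0 * correct / len(examples)


# Hypothetical per-token log-probabilities, not real model outputs.
examples = [
    # factual mean -1.5 beats alternatives with means -2.0, -3.0, -2.0
    ([-1.0, -2.0], [[-2.0, -2.0], [-3.0], [-1.5, -2.5, -2.0]]),
    # factual mean -3.0 loses to the first alternative with mean -1.0
    ([-4.0, -2.0], [[-1.0, -1.0], [-5.0], [-6.0, -6.0]]),
]
accuracy = factor_accuracy(examples)
```

Normalising by token count keeps longer completions from being penalised merely for their length.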
We require each of the “incorrect” completions to satisfy the following properties:
Non-factuality: the completion contains a false claim;
Fluency: the completion is grammatical;
Similarity to the factual completion: the completion has a small edit-distance from the factual completion.
The second and third properties make it harder to distinguish between the factual and non-factual completions for reasons other than their factual correctness, such as fluency or style. Furthermore, it is desirable that the non-factual completions be logical and self-consistent, to make them more difficult to eliminate. For example, modifying “They got married in 2010 and divorced in 2017” by changing 2017 to 2009, results in a non-factual completion which can be discarded by knowing the temporal relation between marriage and divorce.
Non-factual completions in a FACTOR dataset should cover diverse factuality error types. To do so, we adopt the error typology introduced in FRANK Pagnoni et al. (2021). While they introduced their error typology to categorize factual inconsistencies of generated summaries w.r.t. the source document, we instead leverage this typology to vary the type of factual inconsistencies that hold between non-factual completions and the prefix and factual completion. We focus on the five error types from two error categories: semantic frame and discourse (examples in Table 2):
Predicate error: The completion contains a predicate that is inconsistent with the prefix or the factual completion.
Entity error: The subject or object of a predicate is inconsistent with the prefix or the factual completion.
Circumstance error: The completion contains information describing the circumstance of a predicate (e.g., location, time, manner) that is inconsistent with the prefix or the factual completion.
Coreference error: The completion is inconsistent with a pronoun/reference in the prefix or the factual completion, referring to a wrong or non-existing entity.
Link error: The completion is inconsistent with the prefix or the factual completion in the way that different statements are linked together (causal/temporal links).
3.2 Generating FACTOR Benchmarks
Given an evaluation corpus, we generate a FACTOR benchmark automatically. The process is designed to meet the requirements presented in §3.1, and follows a four-stage pipeline: (1) prefix and completion selection, (2) non-factual completion generation, (3) non-factual completion filtering, and (4) non-factual completion selection.
We select a single sentence from each document as a factual completion. We exclude headlines and sentences with fewer than 10 words. The prefix is the entire text preceding the completion in the document.
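A minimal sketch of this selection step follows. The text does not say how the single sentence is picked or how sentences are segmented, so uniform random choice, the naive regex splitter, and skipping the first sentence (to avoid an empty prefix) are all assumptions; headline detection is omitted.

```python
import random
import re

MIN_WORDS = 10  # sentences with fewer words are excluded, as stated in the text


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def select_example(document, rng):
    """Return (prefix, completion), or None when no sentence qualifies."""
    sentences = split_sentences(document)
    # Assumption: skip index 0 so that the prefix is never empty.
    candidates = [i for i, s in enumerate(sentences)
                  if i > 0 and len(s.split()) >= MIN_WORDS]
    if not candidates:
        return None
    i = rng.choice(candidates)  # assumption: uniform choice among eligible sentences
    # Joining with single spaces approximates "the entire text preceding the completion".
    return " ".join(sentences[:i]), sentences[i]


doc = ("Short one. This sentence has clearly more than ten words in it for sure. Tiny.")
example = select_example(doc, random.Random(0))
```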
3.2.2 Non-factual Completions Generation
Given a prefix and its original completion, we use InstructGPT (davinci-003; Ouyang et al. 2022) to generate a set of contradictory completions. We designed a specific prompt instructing the model to generate contradictions corresponding to each type of error. App. D lists the full prompts for each error type. We only apply each prompt to sentences that are relevant to its error type (determined through simple heuristics, see App. A.1). The prompts are designed as follows:
Multiple contradiction generation: the model is prompted to generate multiple subsequent contradictions in each sampling operation. Preliminary experiments showed that this sampling practice improves diversity compared to multiple independent completion sampling.
Edit planning: for each contradiction, the model first explicitly generates the planned edits over the original completion, and then applies those edits by writing the entire modified completion (similar to chain-of-thought prompting; Wei et al. 2022). For instance, the coreference error in Table 2 is generated by explicitly writing the edits ("Changes: ‘his’ to ‘her’") and then the contradiction. This encourages the model to make minimal edits.
3.2.3 Non-factual Completions Filtering
We considered the set of generated completions as candidates for non-factual completions. We applied automatic tools to filter out (i) non-contradictory and (ii) non-fluent completions.
Given a candidate completion $c$, we assert that it is indeed contradictory to the original completion $c^{+}$ by applying an NLI model. We used a DeBERTa-large model He et al. (2021) fine-tuned on the MNLI dataset Williams et al. (2018) from Hugging Face: microsoft/deberta-large-mnli. The premise is set to be $c^{+}$ along with its near context (i.e., the last tokens of the prefix $t$; denoted by $t_{\text{near}}$). The hypothesis is set to be $c$, also preceded by $t_{\text{near}}$. We selected generations classified as contradictory by the NLI model with a probability higher than a threshold $\tau_{\mathrm{NLI}}$, i.e.:
$$p_{\mathrm{NLI}}\left(\text{contradiction} \mid [t_{\text{near}}; c^{+}],\, [t_{\text{near}}; c]\right) > \tau_{\mathrm{NLI}}$$
We chose $\tau_{\mathrm{NLI}}$ (lowering it to 0.3 for contradictions generated by the coreference error prompt) after a manual validation process detailed in App. A.2.
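The contradiction filter can be sketched as below. To keep the example self-contained, the NLI model is abstracted as a hypothetical callable that returns the probability of the "contradiction" label for a (premise, hypothesis) pair; `toy_scorer` is only a stand-in, not the DeBERTa model named in the text, and the threshold value in the demo is hypothetical.

```python
from typing import Callable, List

# Hypothetical interface: (premise, hypothesis) -> P(contradiction).
ContradictionScorer = Callable[[str, str], float]


def nli_filter(near_context: str, original: str, candidates: List[str],
               scorer: ContradictionScorer, threshold: float) -> List[str]:
    """Keep candidates whose contradiction probability exceeds the threshold."""
    premise = f"{near_context} {original}".strip()
    kept = []
    for cand in candidates:
        # The hypothesis is the candidate preceded by the same near context.
        hypothesis = f"{near_context} {cand}".strip()
        if scorer(premise, hypothesis) > threshold:
            kept.append(cand)
    return kept


def toy_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model: flags any hypothesis that differs from the premise."""
    return 0.9 if premise != hypothesis else 0.1


kept = nli_filter(
    near_context="He was elected in 1990.",
    original="He served two terms.",
    candidates=["He served two terms.", "He served three terms."],
    scorer=toy_scorer,
    threshold=0.5,  # hypothetical value for the demo
)
```

In practice the callable would wrap an NLI classifier; passing it in as a parameter keeps the filtering logic independent of any particular model library.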
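To verify that a candidate completion $c$ is fluent, we use GPT2-Small Radford et al. (2019) scores, similar to Gupta et al. (2022): we filter out generations whose mean log-likelihood is lower than that of the original completion $c^{+}$ by a fixed margin $\tau_{\mathrm{LM}}$, which we set using a manual validation (see App. A.2). Formally, we selected a completion $c$ if it satisfies:
$$\frac{\log p_{\mathrm{LM}}(c^{+})}{|c^{+}|} - \frac{\log p_{\mathrm{LM}}(c)}{|c|} < \tau_{\mathrm{LM}}$$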
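A minimal sketch of the fluency check, assuming per-token log-likelihoods from a scoring LM are already available. The margin and the numbers below are hypothetical, and whether the boundary case is kept or dropped is an assumption.

```python
from typing import Sequence


def mean_ll(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-likelihood."""
    return sum(token_logprobs) / len(token_logprobs)


def is_fluent(candidate: Sequence[float], original: Sequence[float], margin: float) -> bool:
    """Keep a candidate unless its mean log-likelihood trails the original's by `margin` or more."""
    return mean_ll(original) - mean_ll(candidate) < margin


# Hypothetical token log-likelihoods.
original = [-2.0, -2.0]           # mean -2.0
slightly_worse = [-2.5, -2.5]     # mean -2.5, gap 0.5
much_worse = [-4.0, -4.0]         # mean -4.0, gap 2.0

keep_a = is_fluent(slightly_worse, original, margin=1.0)
keep_b = is_fluent(much_worse, original, margin=1.0)
```

Comparing against the original completion rather than an absolute cutoff accounts for sentences that are inherently unlikely under the scoring LM.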
3.2.4 Non-factual Completion Selection
Finally, we select three non-factual completions from the filtered candidates. For increased error type diversity, we choose one completion per type, and repeat types only when not enough generations meet the criteria of §3.2.3.
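One way to realise this selection is a round-robin over error types, sketched below. The ordering of types and of candidates within a type is an assumption (here: input order), as is returning fewer than three items when too few candidates survive filtering.

```python
from collections import defaultdict, deque
from typing import List, Tuple


def select_completions(candidates: List[Tuple[str, str]], k: int = 3) -> List[Tuple[str, str]]:
    """Pick up to k (error_type, text) pairs, one per type before repeating any type."""
    by_type = defaultdict(deque)
    for error_type, text in candidates:
        by_type[error_type].append(text)

    selected = []
    # Each pass takes at most one candidate per type; later passes repeat types.
    while len(selected) < k and any(by_type.values()):
        for error_type in list(by_type):
            if by_type[error_type] and len(selected) < k:
                selected.append((error_type, by_type[error_type].popleft()))
    return selected


# Hypothetical filtered candidates with only two distinct error types.
filtered = [("entity", "a"), ("entity", "b"), ("predicate", "c"), ("entity", "d")]
chosen = select_completions(filtered)
```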
3.3 Applying FACTOR to Knowledge Intensive Domains
We focused on three knowledge intensive domains: Wikipedia (encyclopedic knowledge), news (current events) and long-form question answering in specific domains. We constructed the following evaluation datasets:
Wiki-FACTOR: based on the Wikipedia section of The Pile’s validation split Gao et al. (2021), containing examples.
News-FACTOR: based on Reuters articles published after 1/10/2021, extracted from The RefinedWeb Dataset Penedo et al. (2023). The dataset consists of examples.
Expert-FACTOR: based on the validation and test splits of ExpertQA Malaviya et al. (2023), a long-form expert-curated question answering dataset spanning various fields, which suits the motivation of FACTOR to evaluate rare facts. Each document in the corpus is a concatenation of a question-answer pair. The dataset consists of examples.
To validate that our FACTOR benchmarks meet the required properties detailed in §3.1, we manually evaluated a sub-sample from each dataset. We sampled examples from Wiki-FACTOR, examples from News-FACTOR and examples from Expert-FACTOR, containing , and generations overall. Each generation was annotated w.r.t. the properties listed in §3.1, namely whether it was (1) non-factual, (2) fluent, and (3) self-consistent. To assess dataset diversity, we annotated the contradictions in accordance with the error typology of Pagnoni et al. (2021), described in §3.1. We verified that the non-factual completions are minimally edited variants of the factual completion by measuring mean edit distances.
Validation results in Table 2 show that for all datasets, almost every generated completion indeed contradicted the original one, was fluent, and was self-consistent. Table 3 shows the error type distribution, indicating that FACTOR yields diverse contradiction types. Semantic frame errors (Entity, Predicate, and Circumstance) were more prevalent than discourse errors (Link and Coreference), as more sentences are suited for these types of errors.
We used FACTOR benchmarks to evaluate factual knowledge of LLMs across varying model families. We describe the experimental setup below.
The Wiki-FACTOR, News-FACTOR and Expert-FACTOR datasets are described in §3.3. For perplexity evaluation (§5.3), we selected a subset of Wikipedia articles from the documents Wiki-FACTOR is based on (K tokens).
4.2 Models
We performed our experiments over a set of open source models: four models of the GPT-2 family (110M–1.5B; Radford et al. 2019), five models from the GPT-Neo family (125M–20B; Black et al. 2021, 2022; Wang and Komatsuzaki 2021), and eight models of the OPT family (125M–66B; Zhang et al. 2022). We capped the sequence length at tokens to compare all models directly.
The corpora that our FACTOR benchmarks were constructed from were not used for training any of the examined models. News-FACTOR is based on articles published after 1/10/2021, while Expert-FACTOR is based on examples written in 2023. Both are beyond the models’ data cutoff date. Wiki-FACTOR is based on Wikipedia documents from The Pile’s validation split, which is not part of any of the models’ training sets (OPT and GPT-Neo models were trained on The Pile’s training split; GPT-2 models were not trained on Wikipedia).
4.3 Retrieval-Augmented Models
This section describes the experimental evaluation of LLM factuality using our FACTOR benchmarks. In §5.1 we show that FACTOR accuracy increases with model size but also depends on the training data (different model families differ in scores). In §5.2, we show that retrieval augmentation of the LM improves FACTOR accuracy, positioning it as the first automatic measure of factuality improvement for retrieval augmented LMs. Finally, in §5.3, we show that the pairwise model ranking of corpus perplexity and FACTOR accuracy can differ significantly. This outcome, along with manual validation of the correlation between FACTOR accuracy and factual generation in §6, solidifies FACTOR accuracy as a novel automatic measure for evaluating the proneness of an LM to generate factual information in a certain domain.
We evaluate GPT-2, GPT-Neo, and OPT models on Wiki-FACTOR, News-FACTOR and Expert-FACTOR (Figure 3). Larger models generally outperform smaller ones within the same model family. However, even the largest models are capped at % (GPT-NeoX-20B), % (OPT-66B) and % (OPT-30B) on Wiki-FACTOR, News-FACTOR and Expert-FACTOR respectively, indicating the benchmarks are challenging. Recent works Chuang et al. (2023); Kai et al. (2024) use Wiki-FACTOR and News-FACTOR to evaluate models from the LLaMA family Touvron et al. (2023) and show similar trends.
We observe that all models achieve higher FACTOR accuracy on news compared to the other two domains. This may be because news articles cover specific events, making the prefix more useful for detecting factual completions (further discussion in App. B.2). When comparing different model families, we find that the OPT models lead on News-FACTOR, while the GPT-Neo family leads on Wiki-FACTOR. This implies that the different data sources used for training these two model families are suited to different domains.
5.2 The Effect of Retrieval Augmentation on Factual Knowledge
Next, we ask: Can FACTOR accuracy be improved by augmenting models with a retrieval component? Importantly, while a clear motivation for retrieval augmentation is factual grounding of LMs, no existing metrics allow direct measurement of it in a text generation setting. We propose FACTOR accuracy as an alternative to the coarse measure of LM perplexity, which is often used to assess these methods Khandelwal et al. (2020); Borgeaud et al. (2022); Ram et al. (2023); Shi et al. (2023).
We compared the FACTOR accuracy of LLMs to that of their retrieval-augmented counterparts, implemented following the IC-RALM framework (§4.3; Ram et al. 2023). Figure 4 shows the results for GPT-Neo and OPT on Wiki-FACTOR. We observed consistent gains from augmenting the models with retrieval. These results highlight that grounding the model in an external corpus can improve its factuality. Since the retriever used in our experiments is used in an “off-the-shelf” manner, we speculate that further performance boosts may be gained by a retriever system specialized for this task (Izacard et al., 2022; Ram et al., 2023).
Another interesting finding is that the relative gains in FACTOR accuracy obtained by IC-RALM are more moderate than the relative gains in perplexity over WikiText-103 (Merity et al., 2016) reported by Ram et al. (2023). We explore the connection between the two in the next section.
5.3 Perplexity Correlates but is not Always Aligned with FACTOR Accuracy
We investigate whether FACTOR accuracy adds additional information beyond perplexity, when used as a comparative metric for selecting which LM to use within a certain corpus. Figure 2 shows the FACTOR accuracy of models on Wiki-FACTOR, compared to their token-level perplexity on the Wikipedia section of The Pile’s validation set (§4.1) (App. B.1 includes all evaluated models). Overall, we observe a high correlation between the two metrics. However, there are cases where they disagree (i.e., a pair of models where one is better when measured by perplexity but worse in terms of FACTOR accuracy). For example, GPT-Neo-2.7B is significantly better than OPT-2.7B in terms of perplexity ( vs. ), but slightly worse in terms of FACTOR accuracy (% vs. %). In addition, GPT-J-6B has lower perplexity compared to OPT-66B ( vs. ), while OPT-66B is significantly better in terms of FACTOR accuracy (% vs. %). This finding suggests that (i) FACTOR accuracy offers a complementary view of models’ performance, not necessarily captured by perplexity, and (ii) improvements in perplexity do not necessarily imply better factuality.
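The notion of two metrics "disagreeing" on a pair of models can be made concrete with a short sketch that lists discordant pairs: pairs where the model with the better (lower) perplexity has the worse (lower) FACTOR accuracy. The model names and numbers below are hypothetical placeholders, not results from this paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def discordant_pairs(models: Dict[str, Tuple[float, float]]) -> List[Tuple[str, str]]:
    """Return model pairs ranked differently by perplexity and by FACTOR accuracy.

    `models` maps a name to (perplexity, factor_accuracy). Lower perplexity is
    better, higher accuracy is better, so a pair is discordant when both
    differences have the same sign.
    """
    pairs = []
    for a, b in combinations(sorted(models), 2):
        (ppl_a, acc_a), (ppl_b, acc_b) = models[a], models[b]
        if (ppl_a - ppl_b) * (acc_a - acc_b) > 0:
            pairs.append((a, b))
    return pairs


# Hypothetical values for illustration only.
models = {
    "model-A": (10.0, 40.0),
    "model-B": (12.0, 45.0),  # worse perplexity than A, yet higher accuracy
    "model-C": (15.0, 38.0),
}
disagreements = discordant_pairs(models)
```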
This section explores the connection between FACTOR accuracy and factuality in open-ended generation, via human annotations.
We selected tuples of prefix, original completion and non-factual completion from Wiki-FACTOR. We then manually identified the minimal factual claim modified by the non-factual completion (the minimal fact). For example, in the predicate error from Table 2, in which “became” was replaced with “declined the position of”, the edit relates to the minimal fact “Donne became Chief Justice of Nauru and Tuvalu”.
We let LLMs generate free text, conditioned on the prefix and on the completion up to the edit induced by the non-factual completion. Formally, we take the common prefix of the original and non-factual completions (in the predicate error example, this is “After completing his term, he”). The LLM is conditioned on the concatenation of the prefix and this common prefix. The LLM might generate the correct fact, text violating it, or another completion that does not refer to it. For each example, we manually annotated whether the generated text is true, false, or neutral w.r.t. the minimal fact.
We analyzed two models with a similar token-level perplexity but a significant gap in FACTOR accuracy: GPT-J 6B and OPT-66B (marked in a green circle in Figure 2). For each model, we considered two groups of examples: pairs of original and non-factual completions for which the model was right, i.e., the model assigns a larger mean log-likelihood to the original completion than to the non-factual one, and pairs for which the model was wrong (the complement set). We sampled three generations per example for examples from each group and for each model. Overall, we created generations. We filtered some of the samples due to ill-formatted generations or non-contradictory completions (% of all samples).
6.2 Results
We assess the models’ knowledge of the minimal facts through manual annotation. We only considered generations relevant to their minimal fact, excluding "neutral" generations (59.5% and 54.3% for GPT-J 6B and OPT-66B, respectively). For each model, we measure the percentage of generated texts that are true w.r.t. the minimal fact in the "right" and "wrong" subsets separately. We obtained the overall FACTOR accuracy by weighting the subsets’ results according to their distribution in Wiki-FACTOR. Results are shown in Table 4 (full results in App. B.2).
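The weighting step can be read as a two-term weighted average, sketched below under that assumption. `share_right` stands for the fraction of Wiki-FACTOR pairs on which the model was right; all names and numbers are hypothetical.

```python
def weighted_overall(true_rate_right: float, true_rate_wrong: float, share_right: float) -> float:
    """Combine per-subset rates of true generations into one overall estimate.

    true_rate_right / true_rate_wrong: fraction of non-neutral generations that
    are true in the "right" and "wrong" subsets.
    share_right: fraction of the benchmark falling into the "right" subset.
    """
    if not 0.0 <= share_right <= 1.0:
        raise ValueError("share_right must be a fraction between 0 and 1")
    return share_right * true_rate_right + (1.0 - share_right) * true_rate_wrong


# Hypothetical inputs for illustration only.
overall = weighted_overall(true_rate_right=0.75, true_rate_wrong=0.25, share_right=0.5)
```

Reweighting is needed because the two subsets were sampled in equal numbers for annotation, whereas their actual proportions in the benchmark depend on the model.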
For cases where models were wrong, they generated more false claims regarding their minimal fact. For example, OPT-66B only generated a true claim % of the times it was wrong, compared to % for when it was right. This suggests that FACTOR accuracy can shed light on the model’s ability to generate factual claims accurately.
Discussion
There were gaps in factuality annotation between OPT-66B and GPT-J 6B: OPT-66B generated true claims % of the time, while GPT-J 6B generated only %. This aligns with the models’ performance over Wiki-FACTOR, despite sharing similar perplexity on Wiki. This suggests that FACTOR is a better proxy for measuring model factuality in a specific domain.
This paper introduces FACTOR, a novel way to evaluate LMs’ factuality. FACTOR creates an evaluation benchmark from a corpus, consisting of factual statements and non-factual variations. By comparing the LM’s likelihood of factual claims with non-factual variants, FACTOR score captures the LM’s propensity to generate factual information.
Metrics for measuring factual knowledge over a given corpus are lacking. Prior works used perplexity, which may be affected by factors other than factual knowledge and does not contrast facts with false statements. FACTOR focuses the language modeling task on factuality by taking a contrastive approach. Our experiments show that FACTOR ranks models differently than perplexity and is more aligned with factuality in open-ended generation. These findings highlight the importance of negative examples for evaluating factuality. Moreover, they indicate that incorporating negative examples into training sets might also help optimize models to be more factual. We leave investigation of training with FACTOR-style data to future work.
Our work joins recent studies on factuality evaluation in a text-generation setting, which proposed to evaluate models by fact-checking the model’s generations Lee et al. (2022); Min et al. (2023). As FACTOR focuses on evaluation over a controlled set of facts, we see these two approaches as complementary; together, they yield a more holistic assessment of LM factuality.
We point to several limitations of our work. First, since FACTOR benchmarks are generated in an automated way, they may not fully comply with the requirements we define in §3.1, as analyzed in §3.3. Second, generating FACTOR benchmarks for different domains may pose new challenges. For instance, the selection of factual completions is straightforward in knowledge-intensive domains, where nearly every sentence in the corpus contains factual information. However, in general cases, a more intricate approach is needed to identify such sentences. Moreover, the generation of non-factual completions is based on a prompted model, specifically designed for the Wikipedia domain. While we observed those prompts applied well for the news domain, their effectiveness may vary in other, more specific domains.
Language models’ tendency to generate factually inaccurate text raises significant issues. FACTOR allows automatic evaluation of factuality, which can be used to efficiently measure and develop methods for mitigating these risks. However, we stress that when deploying such models in sensitive settings, automatic evaluations may not be sufficient, and human evaluation is required.
Appendix A Technical Details of FACTOR Data Pipeline
For each sentence, we identify the types of edits we can apply to it. First, we use a part-of-speech tagger to detect relevance for entity errors (detecting nouns), predicate errors (detecting verbs) and coreference errors (detecting pronouns). For circumstance errors, we use Named-Entity Recognition taggers to identify sentences containing location, date, and time entities. Finally, we search for temporal/causal link words from a predefined set of words, whose presence implies relevance for link errors.
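A toy sketch of these relevance heuristics follows. It assumes the sentence has already been tagged by some external POS and NER tagger; the tag names, entity labels, and link-word list are hypothetical stand-ins, since the text does not name the taggers or the predefined word set.

```python
from typing import List, Set

# Hypothetical tag inventories; real taggers may use different labels.
ENTITY_POS = {"NOUN", "PROPN"}
PREDICATE_POS = {"VERB"}
COREFERENCE_POS = {"PRON"}
CIRCUMSTANCE_NER = {"GPE", "LOC", "DATE", "TIME"}
LINK_WORDS = {"because", "therefore", "after", "before", "since", "then"}


def relevant_error_types(tokens: List[str], pos_tags: List[str],
                         ner_labels: List[str]) -> Set[str]:
    """Return the error types whose prompt would be applied to this sentence."""
    pos = set(pos_tags)
    types = set()
    if pos & ENTITY_POS:
        types.add("entity")
    if pos & PREDICATE_POS:
        types.add("predicate")
    if pos & COREFERENCE_POS:
        types.add("coreference")
    if set(ner_labels) & CIRCUMSTANCE_NER:
        types.add("circumstance")
    if {tok.lower() for tok in tokens} & LINK_WORDS:
        types.add("link")
    return types


# Pre-tagged toy sentence: "They married in 2010"
types = relevant_error_types(
    tokens=["They", "married", "in", "2010"],
    pos_tags=["PRON", "VERB", "ADP", "NUM"],
    ner_labels=["DATE"],
)
```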
A.2 Setting Filters Thresholds
As discussed in §3.2.3, we applied two filters to ensure the quality of the potential completions: an NLI filter (to filter out non-contradictory completions) and an LM filter (to filter out non-fluent completions). To choose the two thresholds, we manually annotated 40 samples w.r.t. the properties specified in §3.1 (i.e., (1) contradictory and (2) fluent and self-consistent). We tested thresholds 0.1-0.9, and chose the threshold that achieved the highest precision without filtering out too many samples (max 35% of the samples). For the NLI filter we used a DeBERTa-large model fine-tuned on the MNLI dataset. Best threshold was , with precision of 0.96. When manually evaluating the different contradiction types, we noticed this threshold was too harsh for coreference contradictions (87.5% of the completions were filtered out). Therefore, we reduced its threshold to 0.3, which filtered out 75% of the samples. For the LM filter we used GPT2-Small. Best threshold was , with precision of 0.78.
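The threshold search described above (maximise precision subject to filtering out at most 35% of samples) can be sketched as follows. The scores and annotations are hypothetical, the rule "keep a sample when its score exceeds the threshold" mirrors the NLI filter, and preferring the lower threshold on ties is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def choose_threshold(scores: Sequence[float], labels: Sequence[bool],
                     thresholds: Sequence[float],
                     max_filtered: float = 0.35) -> Optional[Tuple[float, float]]:
    """Return (threshold, precision) with the highest precision among admissible thresholds.

    scores: filter score per annotated sample.
    labels: True when the annotator judged the sample valid.
    A threshold is admissible if it filters out at most `max_filtered` of the samples.
    """
    n = len(scores)
    best = None
    for t in thresholds:
        kept: List[bool] = [lab for s, lab in zip(scores, labels) if s > t]
        if not kept or (n - len(kept)) > max_filtered * n:
            continue
        precision = sum(kept) / len(kept)
        if best is None or precision > best[1]:  # ties keep the earlier threshold
            best = (t, precision)
    return best


# Hypothetical annotated sample of ten candidates.
scores = [0.95, 0.9, 0.85, 0.75, 0.65, 0.55, 0.45, 0.32, 0.25, 0.15]
labels = [True, True, True, True, True, True, False, True, False, False]
grid = [i / 10 for i in range(1, 10)]  # thresholds 0.1 to 0.9
best = choose_threshold(scores, labels, grid)
```

In this toy data, thresholds of 0.5 and above would remove more than 35% of the samples, so they are skipped even if their precision were higher.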
Appendix B Extended Results and Discussion
Figure 5 presents Wiki-FACTOR scores versus LM perplexity on Wikipedia. The figure extends Figure 2, presenting all evaluated LMs: models from the GPT-Neo family (blue circle), OPT family (red triangle) and GPT2 family (yellow square).
B.2 Factuality in Open-ended Generation
Table 6 shows the extended results for the manual factuality annotation of the open-ended generation experiment (§6). In addition to the overall results, we include the distribution of Neutral/True/False annotations. Notably, most generations are neutral for both models. This highlights the limitation of sampling-based approaches for assessing a model’s factual knowledge.
B.3 Knowledge of Unseen Facts
As seen in Figure 3 in §5.1, FACTOR accuracy is often well above the random baseline of 25%, indicating that some models succeed in predicting unseen facts. It is possible that the knowledge of these facts is derived from another document in the training data (for example, Wikipedia contains many different articles related to each other, sharing similar factual statements). Another possibility is that an unseen fact is implied by the prefix. We hypothesize that this leads to higher FACTOR scores in the news domain, which often covers specific events, making the prefix more useful for detecting factual completions. Analysis of these cases is non-trivial, and is left for future work.
Appendix C Dataset Licenses
Table 5 details the license for each corpus we used in the paper:
Appendix D Prompts for Contradictions Generation
We prompted the model to generate multiple candidate completions for each of the five error types: entity (Table 7), circumstance (Table 8), coreference (Table 9), predicate (Tables 10 and 11) and link (Table 12). The prompts are concatenated to a given completion and its near context, with the exception of the link prompt, where only the completion is given (we found that the instruct model tends to repeat the context when it’s appended to this particular prompt). The prompts instruct the model to first plan its local edits, and then generate the contradiction.