Factuality Enhanced Language Models for Open-Ended Text Generation

Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, Bryan Catanzaro

Introduction

Large-scale pre-trained language models (LMs) have demonstrated impressive natural language generation results. However, generative LMs (e.g., GPT-3) are trained solely to model the statistical correlations between subword tokens, and have limited capability to generate factually accurate text, as illustrated in Table 1. As a result, there are increasing concerns about nonfactual generations from large-scale pre-trained LMs [e.g., 6, 7, 8], which need to be adequately addressed for safe deployment in real-world applications, e.g., content creation and dialogue.

In previous studies, different metrics and methods have been proposed to measure and improve the factual accuracy of language generation within different tasks, including text summarization [e.g., 12, 13, 14, 15], question answering [e.g., 16, 17, 18], and table-to-text generation [e.g., 19, 20]. However, these works focus on the faithfulness (or factuality) of fine-tuned LMs for particular downstream tasks (i.e., factual consistency between source and target text). Little exploration has been made to address the factual errors of pretrained LMs in general-purpose open-ended text generation, where the goal is to generate a coherent continuation from the given context (e.g., the use cases from GPT-2).

One of the popular methods for enhancing generation factuality is to incorporate external knowledge sources. Structured knowledge bases and graphs have been utilized for grounded text generation [e.g., 24, 25], where the LMs are trained to select and copy relevant facts from external knowledge sources. In contrast to the sizeable online text with factual information, structured knowledge graphs only encode a limited amount of knowledge, as they require expensive human annotations for high-quality construction. A method that can directly leverage plain-text knowledge (e.g., Wikipedia, encyclopedia books, peer-reviewed publications) would be desirable for factuality enhancement, as it can remove the human annotation bottleneck and easily scale up the amount of injected knowledge. Augmenting an LM with an information retrieval (IR) system is one possible way to leverage textual facts, but it comes at the cost of additional complexity and resource overhead for the model. Therefore, we explore an IR-free method that enhances the innate factuality of LMs by continued training on a factually rich plain-text corpus.

In this work, we focus on measuring and improving the factuality of large-scale pre-trained language models (LMs) for open-ended text generation. Specifically, we make the following contributions:

We build the benchmark and metrics to measure the factual accuracy of pre-trained LM for open-ended text generation. We demonstrate a good correlation between the proposed automatic metrics and human assessment of factuality. Based on that, we systematically study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B and find that large LMs have higher factual accuracy than smaller ones (e.g., named-entity factual error is reduced from 63.69% to 33.3%).

We study the decoding algorithms of LMs in terms of factual accuracy. We show that the popular nucleus sampling algorithm for open-ended text generation can easily mix up different named entities or randomly fabricate information due to the “uniform randomness” introduced at every decoding step. We propose the factual-nucleus sampling algorithm, which promotes generation factuality while maintaining quality and diversity.

We explore training methods that can effectively leverage text corpus with rich facts (e.g., Wikipedia). We find that directly continuing the training of LM on factual text data does not guarantee the improvement of factual accuracy. We propose factuality-enhanced training to address the underlying inefficiencies of this baseline. Our method consists of i) an addition of a TopicPrefix that improves the awareness of facts during training, and ii) a sentence completion task as the new objective for continued LM training [e.g., 30].

We demonstrate that the factual accuracy of large-scale LMs (up to 530B) can be significantly enhanced (i.e., named-entity factual error is reduced from 33.3% to 14.5%) after applying the proposed factuality-enhanced training with factual-nucleus sampling algorithm.

We organize the rest of the paper as follows. We discuss related work in § 2 and present our benchmark setup with evaluation protocol in § 3. We study the factual accuracy of LMs with respect to model size, prompt type, and choice of decoding algorithm in § 4. After that, we present factual-nucleus sampling algorithm in § 5, and factuality-enhanced training in § 6. We conclude the paper in § 7.

Related Work

Factuality vs. Model Size Lin et al. propose the TruthfulQA benchmark to measure the falsehood generations from different-sized LMs. The result suggests that bigger LMs pre-trained on web text are generally less truthful than smaller ones in terms of false beliefs or misconceptions. At first glance, this contradicts our observation; however, our work focuses on a different type of knowledge than TruthfulQA. The TruthfulQA benchmark focuses on conceptual knowledge, while our benchmark focuses on factual knowledge (according to Krathwohl, knowledge can be categorized into four types: i) factual knowledge, ii) conceptual knowledge, iii) procedural knowledge, and iv) metacognitive knowledge). Large LMs can be good at recalling factual knowledge given a substantial pre-training corpus, as suggested by previous studies on LMs’ parametric knowledge, but there still remains room for improvement in reasoning over conceptual knowledge.

Parametric Factual Knowledge A group of works addresses the factual errors in the parametric knowledge of LMs that is acquired from the training corpus. The correctness of the parametric knowledge is commonly tested in a cloze-style question answering format (e.g., Person X is born in __). Efforts have been made to fine-tune the pre-trained LM to “inject” more knowledge and improve its ability to answer factual questions without consulting an external knowledge source. Moreover, some works attempt to edit and fix the factual errors. However, it is unclear whether the improvement of an LM fine-tuned for a QA-style task helps to mitigate factual errors in open-ended text generation.

Hallucination in downstream NLG tasks There are active efforts to reduce the unfaithfulness or factual errors of task-specific LMs fine-tuned for various downstream natural language generation (NLG) tasks such as summarization, data-to-text, and dialogue systems. In contrast to these works, we focus on general-purpose LMs for the open-ended text generation task.

Human-in-the-loop Human feedback or demonstrations are valuable for improving the factual accuracy of LMs. For example, InstructGPT fine-tunes LMs with collected human feedback for truthful generation. WebGPT is trained to cite its sources when it generates output, thus allowing humans to evaluate factual accuracy by checking whether a claim is supported by a reliable source. In this work, we focus on a human-free solution to mitigate nonfactual generations, as it is less expensive and easier to scale.

FactualityPrompts and Evaluation Metrics

Our goal is to automatically measure and evaluate the factuality of large-scale pre-trained language models (LMs) for open-ended text generation. In NLP, factuality refers to being consistent with provided ground-truth knowledge sources. The biggest challenge in evaluating factuality for open-ended text generation is locating the relevant ground-truth knowledge within the myriad of world knowledge, since open-ended generation lacks ground-truth references. In this study, we set the scope of our ground-truth knowledge source to Wikipedia because this helps simplify the evaluation setup. Note that Wikipedia is one of the most commonly used, accessible, large-scale, good-quality, unstructured knowledge sources. Our proposed methods can easily generalize to other knowledge sources in plain text (e.g., arXiv papers, medical reports, reliable newspapers).

As illustrated in Fig 1, our evaluation framework consists of the following phases. In phase 1, LM generates the continuations from the provided test prompts (§3.1). In phase 2, we first identify check-worthy continuations, which refers to the generations with facts that require factuality evaluation. One may refer to Appendix B for details. This step is necessary as open-ended text generation may generate text that does not contain facts such as personal opinion or chitchat-style text (e.g., “I like eating apples!”). Then, we prepare relevant ground-truth knowledge required for factual verification of check-worthy continuations (§3.2). Lastly, we calculate the factuality and quality measures (§3.3).

We design our test prompts (FactualityPrompts) following a setup similar to RealToxicityPrompts, which has toxic and nontoxic prompts to evaluate the toxicity of LM continuations. FactualityPrompts consists of factual and nonfactual prompts that allow us to study the impact of a prompt’s factuality on the LM continuation; this simulates the real-world scenario where input texts are not guaranteed to be factual. The data construction and statistics are detailed in Appendix D, and we will release the constructed FactualityPrompts for future research.

3.2 Ground-Truth Knowledge Preparation

To evaluate the factuality of a given generation, we need to prepare relevant ground-truth knowledge. The required ground-truth knowledge can be either document-level or sentence-level, depending on the type of factuality metrics (discussed in §3.3). The correctness of factuality evaluation is crucially dependent on the correctness of the ground-truth knowledge. To ensure that our factuality evaluation is not distorted by the irrelevant provision of ground-truth knowledge, we do the following:

For document-level ground-truth knowledge, we directly use the Wikipedia document annotation from the FEVER dataset. This way, we can mitigate any potential error from automatic document retrieval.

For sentence-level ground-truth knowledge, we do automatic sentence selection by using two different methods to maximize the chance of recalling the relevant ground-truth knowledge. We treat the generated text as query $q$ and Wikipedia sentences as a pool of candidates $C=\{c_{1},c_{2},c_{3},\ldots,c_{N}\}$, where $N$ is the number of sentences in the Wikipedia document. One ground-truth sentence is retrieved by obtaining TF-IDF vector representations of $q$ and $C$ and selecting the $c_{i}$ with the highest cosine similarity with $q$. Another is retrieved by obtaining the contextual representations of $q$ and $C$ using SentenceTransformer and selecting the $c_{j}$ with the highest cosine similarity.
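The TF-IDF branch of the sentence selection can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `select_evidence`) are assumptions made for the example, and IDF statistics are computed over the query plus the candidate sentences only.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Assumed tokenizer: whitespace split, lowercase, strip edge punctuation.
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def tfidf_vectors(docs):
    # Sparse TF-IDF vectors (term -> weight) with a smoothed IDF (an assumption).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_evidence(query, candidates):
    # Index of the candidate sentence c_i with the highest cosine similarity to q.
    vecs = tfidf_vectors([tokenize(query)] + [tokenize(c) for c in candidates])
    q_vec, cand_vecs = vecs[0], vecs[1:]
    scores = [cosine(q_vec, c) for c in cand_vecs]
    return max(range(len(candidates)), key=scores.__getitem__)

candidates = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "He served as the 44th president of the United States.",
    "Obama graduated from Columbia University in 1983.",
]
best = select_evidence("Barack Obama was born in Hawaii", candidates)
print(candidates[best])
```

The contextual-embedding branch would follow the same select-by-cosine pattern, with the vectors produced by a sentence encoder instead of TF-IDF.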

3.3 Evaluation Metrics

We adapt commonly used metric designs from the hallucination literature: a named-entity (NE) based metric and a textual-entailment based metric. Each metric captures a different aspect of factuality, so we use both metrics for a better understanding of factuality.

Since NEs are one of the core building blocks of a “fact”, NE-related metric design is one of the common choices in the literature. In this work, we specifically adopt the NE-based metric that is designed with the belief that a model is hallucinating (making factual errors) if it generates an NE that does not appear in the ground-truth knowledge source.

We define our NE-based metric to be: $\textsc{NE}_{\textsc{Er}}=|\textsc{Hallu}_{\text{NE}}|~/~|\textsc{All}_{\text{NE}}|$, where $\textsc{All}_{\text{NE}}$ is the set of all the NEs detected in the LM generation, and $\textsc{Hallu}_{\text{NE}}$ is the subset of $\textsc{All}_{\text{NE}}$ that does not appear in the ground-truth Wikipedia document. Note that evaluating $\textsc{NE}_{\textsc{Er}}$ requires document-level ground truth. To ensure the quality of the metric, we also take the same precautions as prior work. For named entities consisting of multiple words, partial n-gram overlaps are also treated as a “match”. This ensures we can handle shortened forms of named entities – e.g., “Barack Hussein Obama II” vs. “Obama”. Note that stopwords (e.g., the, a) are not considered in the partial n-gram overlaps. The named entities are detected using a publicly available pre-trained NE detection model from Spacy.io.
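As a rough illustration of the $\textsc{NE}_{\textsc{Er}}$ computation, the sketch below assumes the named entities have already been detected (e.g., by an NE tagger) and approximates the partial n-gram matching by checking whether any non-stopword token of an entity occurs in the document. The stopword list and the helper names are illustrative assumptions, not the paper's exact procedure.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "and"}  # illustrative list only
PUNCT = ".,;:!?\"'()"

def _tokens(text):
    # Assumed tokenizer: whitespace split, lowercase, strip edge punctuation.
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def is_supported(entity, doc_tokens):
    # Partial match: any non-stopword token of the entity appears in the document.
    content = [t for t in _tokens(entity) if t not in STOPWORDS]
    return any(t in doc_tokens for t in content)

def ne_error(entities, document):
    # NE_Er = |Hallu_NE| / |All_NE| for one generation.
    if not entities:
        return 0.0  # generations without NEs are filtered out before this step
    doc_tokens = set(_tokens(document))
    hallucinated = [e for e in entities if not is_supported(e, doc_tokens)]
    return len(hallucinated) / len(entities)

doc = "Barack Hussein Obama II is an American politician born in Honolulu, Hawaii."
score = ne_error(["Obama", "Hawaii", "Kenya", "the Senate"], doc)
print(score)  # "Kenya" and "the Senate" are unsupported
```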

Entailment Ratio

Textual entailment (or natural language inference) is the task of determining whether a hypothesis is entailed by, refuted by, or neutral to a given premise. Entailment-based metrics are based on the rationale that a factual generation will be entailed by the ground-truth knowledge.

We define the entailment ratio as: $\text{Entail}_{\textsc{R}}=|\textsc{Entail}_{\text{gen}}|~/~|\textsc{All}_{\text{gen}}|$, where $\textsc{All}_{\text{gen}}$ is the set of all generations, and $\textsc{Entail}_{\text{gen}}$ is the set of generations that the entailment model classifies as entailed. To obtain the entailment scores, we leverage a publicly available pretrained entailment model, a RoBERTa model fine-tuned on the MNLI dataset (refer to the code snippet provided at https://pytorch.org/hub/pytorch_fairseq_roberta/). $\text{Entail}_{\textsc{R}}$ requires sentence-level ground truth because only a few Wikipedia sentences are relevant to the specific factual information in a given generation. For example, “Barack Obama was born in Hawaii” is only relevant to the Wikipedia sentence that mentions his birth location. Note that our $\text{Entail}_{\textsc{R}}$ is a stricter form of metric that does not treat the neutral class as factual.
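The ratio itself is simple to compute once each generation has an entailment label. The sketch below assumes the labels have already been produced by an NLI model and are given as the strings `"entailment"`, `"neutral"`, and `"contradiction"` (the label names are an assumption); it only shows that the neutral class is not counted as factual.

```python
def entailment_ratio(labels):
    # Entail_R = |Entail_gen| / |All_gen|; neutral and contradiction both count against.
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "entailment") / len(labels)

ratio = entailment_ratio(["entailment", "neutral", "contradiction", "entailment"])
print(ratio)
```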

Generation Quality Evaluation

We also evaluate the generation quality from three aspects: i) Fluency is an important aspect of text generation. We measure it by the mean perplexity of the generated continuations, evaluated with a large pretrained LM (a 1.3B LM in this work). ii) Diversity is an important characteristic of an LM that makes the generation more interesting and engaging – it is bland and boring to always generate the same text. It is measured using the mean number of distinct n-grams (we report 4-grams), normalized by the length of text among the 10 generations for each prompt (i.e., in total, 160,000 generations to evaluate the diversity of each method). iii) Repetition is a common and very undesirable form of degeneration. We measure the number of repetitive substrings generated at the end of the generations by using the publicly available metric code from Holtzman et al.
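One plausible reading of the diversity measure is sketched below: for each prompt, pool its generations, count distinct 4-grams, and divide by the total number of tokens; then average over prompts. The pooling and normalization details are assumptions for illustration, and the inputs are assumed to be pre-tokenized.

```python
def distinct_ngrams(generations, n=4):
    # generations: list of token lists produced for a single prompt.
    seen = set()
    total_tokens = 0
    for tokens in generations:
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return len(seen) / total_tokens if total_tokens else 0.0

def mean_diversity(per_prompt_generations, n=4):
    # Average the per-prompt score over all prompts.
    scores = [distinct_ngrams(g, n) for g in per_prompt_generations]
    return sum(scores) / len(scores) if scores else 0.0

repeated = [["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "e"]]
varied = [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"]]
print(distinct_ngrams(repeated), distinct_ngrams(varied))
```

Identical generations share all of their 4-grams, so the repeated set scores lower than the varied one.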

3.4 Correlation with Human Judgement

Although NE-based and entailment-based metrics have been used in downstream NLG tasks, they have not been utilized for evaluating factual accuracy in open-ended text generation. To ensure their validity, we collect human annotations to evaluate the correlation between our automatic factuality metrics and human judgement – i.e., are generations with higher $\text{Entail}_{\textsc{R}}$ and lower $\textsc{NE}_{\textsc{Er}}$ more likely to be perceived as factual by humans?

We obtained human annotations for 200 randomly chosen LM continuations of varying $\textsc{NE}_{\textsc{Er}}$ and $\text{Entail}_{\textsc{R}}$ scores. The annotators are asked to fact-check the LM continuations against Wikipedia and assign a factuality label (1 = Factual: can find supporting Wikipedia evidence. 0 = Non-factual: cannot find supporting Wikipedia evidence).

The fact-checking annotation is a challenging and time-consuming task, as it requires the annotator to carefully read multiple pieces of evidence and reason over them. To improve the annotation quality, we collect two types of annotations. The first type is two annotations from average English-speaking workers on the Appen.com platform, and the second type is one “expert” annotation from one of the authors, who is familiar with the task and spent a substantial amount of time checking each sample. Based on these three annotations, we do majority voting and report the Pearson correlation results in Table 2. We also report the correlation results using only the expert annotations, and show that there is a strong correlation between human judgement of factuality and the proposed automatic metrics $\textsc{NE}_{\textsc{Er}}$ and $\text{Entail}_{\textsc{R}}$. $\textsc{NE}_{\textsc{Er}}$ is negatively correlated with factuality because the lower the $\textsc{NE}_{\textsc{Er}}$, the better the factuality.
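The aggregation described above (majority vote over three binary annotations, then Pearson correlation with an automatic metric) can be sketched as follows. The scores and annotations below are toy values invented purely to exercise the code; they are not data from the study.

```python
import math

def majority_vote(annotations):
    # annotations: binary labels (1 = factual, 0 = non-factual) from an odd number of annotators.
    return 1 if sum(annotations) * 2 > len(annotations) else 0

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy inputs (hypothetical): NE_Er per continuation and three annotations each.
ne_er = [0.0, 0.2, 0.5, 0.8, 1.0]
annotations = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 0), (0, 0, 1)]
labels = [majority_vote(a) for a in annotations]
r = pearson(ne_er, labels)
print(labels, round(r, 3))
```

With these toy values the coefficient is negative, matching the expected direction for an error-rate metric.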

Factuality Analysis of Pretrained LMs

In this section, we perform a factuality analysis of LMs from three aspects: i) model size, ii) prompt type and iii) decoding algorithm.

Researchers have observed a trend of larger LMs outperforming smaller ones in various downstream tasks. However, contrary to these general observations, recent studies suggest that larger models tend to generate more misconceptions, and that zero-shot fact-checking performance tends to stagnate with LM scaling. We study the factuality of LMs with a range of parameter sizes (126M, 357M, 1.3B, 8.3B, 530B) to understand whether such a surprising trend also applies to open-ended text generation. Note that all LMs are pretrained on the same corpus. As shown in Table 3, generation factuality does improve with the scaling of model size, e.g., $\textsc{NE}_{\textsc{Er}}$ drops from 63.99% to 33.30% when parameter size scales up from 126M to 530B.

Prompt Type

Prompts provided to the LM are known to significantly affect the quality and characteristics of LM continuations. We use our factual and nonfactual prompts to test the behavior of LMs. Results in Table 3 show that both factual and nonfactual prompts can lead to nonfactual generations, although factual prompts always result in fewer nonfactual generations. Interestingly, the performance gap between factual and nonfactual prompts becomes more prominent as the model size increases ($4\%$ to $7\%$ in $\textsc{NE}_{\textsc{Er}}$ as parameter size increases from 126M to 530B). This could be because a larger LM can better understand the prompts and imitate the factual or nonfactual prompts in its continuations.

Decoding Algorithm

We investigate the choice of decoding algorithm and its impact on the factuality of generations. In particular, we compare two representative decoding algorithms: greedy decoding (i.e., maximizing generation likelihood) and nucleus sampling. The nucleus sampling algorithm (a.k.a. top-$p$) samples only from the top subword candidates with total cumulative probability $p$. It is popular for open-ended text generation because it solves the degeneration problems of the greedy decoding algorithm (e.g., repetition). However, the results in Table 3 show that top-$p$ decoding underperforms greedy decoding in terms of factuality, although it obtains higher generation diversity and less repetition. This intuitively makes sense because top-$p$ can be seen as adding “randomness” to encourage diversity, which, as a result, can lead to factual errors. It is important to understand that the factuality of a sentence can easily be altered by one wrong choice of word. For example, “Barack Obama was born in 1961” becomes nonfactual if “1961” is changed to “1962”. In the same sense, greedy decoding is more factual because choosing the word with the highest probability minimizes randomness and maximizes the utilization of the parametric knowledge of the LM. However, greedy decoding sacrifices generation diversity and quality.

Error Types

We conduct a qualitative analysis of the factual errors from greedy generation of 530B LM, to understand what are the remaining errors when the randomness from decoding choice is strictly restricted. The two notable error types were:

Named Entity Mix-up: Mixing up similar types of the named entity. For example, LM generated “The movie is based on the novel of the same name by Gayle Forman.” about a film called “The Best of Me”. However, the correct author’s name is “Nicholas Sparks”, not “Gayle Forman”. Note that Gayle Forman is also an American young adult fiction author who writes similar type of novels as Nicholas Sparks.

Fabricated Fact: Fabricating some random facts. For example, “Samuel Witwer’s father is a Lutheran minister.” Note that, the pretraining corpus contains non-factual or fictional information, which can also contribute to such fabricated facts.

Both error types can be viewed as wrong associations of entities that appear at different parts of the training corpus with similar context. Such behavior is unsurprising because these LMs are uniformly trained with the next subword prediction objective instead of a fact-related objective.

Factual-Nucleus Sampling

In this section, we propose a new sampling algorithm that achieves a better trade-off between generation quality and factuality than existing decoding algorithms.

We hypothesize that the randomness of sampling is more harmful to factuality when it is used to generate the latter part of a sentence than the beginning of a sentence. There is no preceding text at the start of a sentence, so it is safe for the LM to generate anything as long as it is grammatical and contextual. However, as the generation proceeds, the premise becomes more determined, and fewer word choices can make the sentence factual. Given the example “Samuel Witwer’s father is a Lutheran minister”, the beginning of the sentence “Samuel Witwer’s father is” is not nonfactual. However, the continuation “Lutheran minister” makes the sentence nonfactual. Therefore, we introduce the factual-nucleus sampling algorithm that dynamically adapts the “nucleus” $p$ along the generation of each sentence. In factual-nucleus sampling, the nucleus probability $p_{t}$ to generate the $t$-th token within each sentence is

$$p_{t}=\max\{\omega,\;p\times\lambda^{t-1}\},$$

where $\lambda$ is the decay factor for the top-$p$ probability, and $\omega$ lower-bounds the decay of the probability. Specifically, it has the following parts:

$\lambda$ -decay: Given that top- $p$ sampling pool is selected as a set of subwords whose cumulative probability exceeds $p$ , we gradually decay the $p$ value with decay factor $\lambda$ at each generation step to reduce the “randomness” through time.

$p$ -reset: The nucleus probability $p$ can quickly decay to a small value after a long generation. So, we reset the $p$ -value to the default value at the beginning of every new sentence in the generation (we identify the beginning of a new sentence by checking if the previous step has generated a full-stop). This reduces the unnecessary cost of diversity for any long generations.

$\omega$-bound: If $\lambda$-decay is applied alone, the $p$-value could become so small that sampling becomes equivalent to greedy decoding, which hurts diversity. To overcome this, we introduce a lower bound $\omega$ to limit how far the $p$-value can decay.

We will show the importance of each part with ablation studies.
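The three parts ($\lambda$-decay, $p$-reset, $\omega$-bound) can be combined in a short decoding loop. The following is a minimal sketch under stated assumptions rather than the authors' implementation: `next_token_probs` is a hypothetical callable standing in for the LM, tokens are plain strings, and a sentence boundary is detected by the literal token `"."`.

```python
import random

def nucleus_p(t, p=0.9, lam=0.9, omega=0.3):
    # p_t = max(omega, p * lam^(t-1)), with t the 1-based position inside the sentence.
    return max(omega, p * lam ** (t - 1))

def top_p_pool(probs, p):
    # Smallest prefix of tokens (sorted by probability) whose cumulative mass reaches p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in ranked:
        pool.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return pool

def sample_token(probs, p, rng):
    # Sample from the nucleus, with weights renormalized implicitly by random.choices.
    tokens, weights = zip(*top_p_pool(probs, p))
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(next_token_probs, max_len, p=0.9, lam=0.9, omega=0.3, seed=0):
    # next_token_probs(prefix) -> dict token -> probability (hypothetical LM interface).
    rng = random.Random(seed)
    output, t = [], 1
    for _ in range(max_len):
        probs = next_token_probs(output)
        token = sample_token(probs, nucleus_p(t, p, lam, omega), rng)
        output.append(token)
        t = 1 if token == "." else t + 1  # p-reset at the start of a new sentence
    return output

# With p=0.9, lam=0.9, omega=0.3 the nucleus shrinks until it hits the lower bound.
print([round(nucleus_p(t), 3) for t in (1, 2, 5, 12, 50)])
```

Setting `lam=1.0` recovers standard top-$p$ sampling, while a very small `omega` with strong decay approaches greedy decoding late in each sentence.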

5.2 Result

We report our decoding experimental results with the 1.3B LM in Table 4 (the 1.3B LM is mainly used because it is big enough to have good learning capacity yet not too resource-expensive). Adding $\lambda$-decay helps improve the factuality of top-$p$ 0.9 – for instance, with decay rate $\lambda$ = 0.5, there is a 12.5% drop in $\textsc{NE}_{\textsc{Er}}$ and a 10.1% gain in $\text{Entail}_{\textsc{R}}$. However, this makes the diversity and repetition similar to those of greedy decoding. $p$-reset mitigates the repetition issue and improves the diversity metric without losing much in the factuality metrics. The effect is more drastic for the $\lambda$ = 0.5 option, where it achieves a 0.26 gain in the diversity metric with negligible changes in factuality scores. By also adding the $\omega$-bound, we obtain the anticipated factuality performance (i.e., similar to greedy decoding), with a great improvement in generation quality over greedy; with $p$=0.9, $\lambda$=0.9, $\omega$=0.3, we achieve an $11\times$ improvement in diversity and a $4.6\times$ improvement in repetition over greedy. Although our factual-nucleus sampling still underperforms top-$p$ 0.9 in terms of diversity, we believe this is an acceptable trade-off to improve the factuality of LMs for factually sensitive open-ended generation tasks. Our proposed decoding does not harm sentence fluency; its perplexity does not exceed the perplexity of top-$p$. Refer to Appendix F for full perplexity results.

To further illustrate the underlying trade-off, we also compare the proposed factual-nucleus sampling against nucleus sampling with lower $p$ values, which are also expected to have lower randomness, and thus fewer factual errors, in generations. Specifically, we plot results for nucleus sampling with $p=\{0.9,0.7,0.6,0.5,0.4,0.3\}$, and factual-nucleus sampling with the following $p~|~\lambda~|~\omega$ choices: 0.9|0.9|0.7, 0.9|0.9|0.5, 0.9|0.9|0.4, 0.9|0.9|0.3, 0.9|0.7|0.3. Fig 2(a) and Fig 2(b) respectively show that factual-nucleus sampling has better trade-offs than top-$p$ in factuality-vs-diversity and factuality-vs-repetition. In other words, it always achieves a better factuality score at the same level of diversity and repetition.

Factuality-Enhanced Continued Training

This section introduces a factuality-enhanced method for continued training of LMs. We introduce the TopicPrefix for better awareness of facts, and the sentence completion loss as the training objective.

Unstructured factual knowledge typically exists at a document level (i.e., a group of factual sentences about an entity). This means that sentences can contain pronouns (e.g., she, he, it), making these sentences factually useless standalone. To illustrate with an example from Barack Obama’s Wikipedia page, “He previously served as a U.S. senator from Illinois from 2005 to 2008” cannot be a useful standalone fact because it is unclear who “He” is. Due to the GPU memory limit and computation efficiency, it is common to chunk documents in LM training corpus. This causes the “fragmentation” of information and leads to wrong associations of entities that appear in independent documents with similar contexts. As a remedy, we propose to prepend TopicPrefix to sentences in the factual documents to make each sentence serve as a standalone fact. In our experiments, we mainly utilize Wikipedia as the factual corpus and the Wikipedia document name as the TopicPrefix.

6.2 Sentence Completion Loss

We propose a sentence completion loss to address the incorrect association learned between entities. To explain our rationale, let us recall the nonfactual example from §5: “Samuel Witwer’s father is a Lutheran minister”. This sentence is nonfactual because LM failed to generate factually correct information after “is”. In other words, LM failed to accurately complete the sentence given the generated context. One reason is that the LM is uniformly trained to predict each subword token within the sentence, when ensuring the correct prediction at the latter section of sentence is more critical for factuality. Therefore, we construct a sentence completion loss, which makes the LM focus on predicting the subwords later in the sentence. For implementation, we determine a pivot $t$ for each sentence, and then apply zero-masking for all token prediction losses before $t$ . This pivot is only required during training (i.e., no pivot needed during inference time).

We emphasize that this loss masking is different from the input token masking applied in BERT or BART, and the LM is still trained in an autoregressive manner. Note that many BART-based summarization models are known to still suffer from factual errors, suggesting that masked prediction at the encoder level may not transfer effectively to autoregressive text generation.

In this work, we explore three strategies (from simple to complex) to determine the pivot $t$ :

$SC_{\textsc{half}}$ : pivot $t=0.5\times$ sentence-length.

$SC_{\textsc{random}}$ : random pivot, e.g., $t\sim\text{uniform}[0.25,0.75]\times$ sentence-length.

$SC_{\textsc{root}}$ : pivot $t=$ position of ROOT (relation) from dependency parsing.

Our experiments show that the simplest $SC_{\textsc{half}}$ performs on par with the more complex ones (such as $SC_{\textsc{root}}$); thus, we suggest that future work choose the $SC_{\textsc{half}}$ strategy.
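The loss masking described in this section can be sketched as a per-token weight vector applied to the usual next-token losses. This is an illustrative sketch only: it covers $SC_{\textsc{half}}$ and $SC_{\textsc{random}}$ ($SC_{\textsc{root}}$ would need a dependency parser), it takes per-token losses and sentence lengths as given, and averaging over the unmasked tokens is an assumption rather than a detail stated in the text.

```python
import random

def pivot_half(length):
    # SC_half: pivot at half of the sentence length.
    return int(0.5 * length)

def pivot_random(length, rng):
    # SC_random: pivot drawn uniformly from [0.25, 0.75] of the sentence length.
    return int(rng.uniform(0.25, 0.75) * length)

def completion_mask(sentence_lengths, pivot_fn):
    # Zero out the loss for tokens before the pivot of each sentence; keep the rest.
    mask = []
    for n in sentence_lengths:
        t = pivot_fn(n)
        mask.extend([0.0] * t + [1.0] * (n - t))
    return mask

def masked_loss(token_losses, mask):
    # Mean of the per-token losses that survive the mask (normalization is assumed).
    kept = sum(mask)
    return sum(l * m for l, m in zip(token_losses, mask)) / kept if kept else 0.0

# Two sentences of 4 and 6 tokens with made-up per-token losses.
mask = completion_mask([4, 6], pivot_half)
losses = [2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0]
print(mask, masked_loss(losses, mask))
```

The inputs to the model are unchanged; only the contribution of early-sentence tokens to the training loss is removed, so the model stays autoregressive.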

6.3 Results

The results are reported in Table 5, and experimental setups are reported in Appendix C.

Inefficiency of Domain Adaptive Training The pre-training corpus of the LM contains both factual texts (e.g., Wikipedia) and potentially nonfactual texts (e.g., rumors, fake news). The nonfactual portion of the training corpus could be the source of the problem. Thus, we conduct a baseline experiment that does domain-adaptive training with strictly factual domain text only (i.e., Wikipedia). Interestingly, we find that domain-adaptive training can hardly improve generation factuality.

Effect of TopicPrefix Continued pre-training of 1.3B LM with TopicPrefix preprocessed Wikipedia alone can already improve the factuality, especially in terms of $\textsc{NE}_{\textsc{Er}}$ . For example, it reduces the $\textsc{NE}_{\textsc{Er}}$ from $42.1\%$ to $27.6\%$ when we use the factual-nucleus decoding (0.9 | 0.9 | 0.3), which even outperforms the 1.3B with greedy decoding ( $\textsc{NE}_{\textsc{Er}}$ : $27.6\%$ vs. $39.9\%$ ) with much less repetition ( $8.0\%$ vs. $33.1\%$ ).

Effect of Sentence Completion Loss The proposed sentence completion loss further helps to improve the factuality, especially $\text{Entail}_{\textsc{R}}$. For example, when one uses factual-nucleus decoding on the trained 1.3B model, TopicPrefix + $SC_{\textsc{half}}$ improves $\text{Entail}_{\textsc{R}}$ from $8.7\%$ to $17.4\%$ compared with TopicPrefix alone, while reducing $\textsc{NE}_{\textsc{Er}}$ from $27.6\%$ to $23.6\%$. Note that the results show consistent improvement across different pivot selection strategies, suggesting that the sentence completion loss is robust. In particular, the simplest $SC_{\textsc{half}}$ performs as well as the others, or even better on several metrics. Thus we recommend it as the default option.

530B vs 1.3B As expected, our method on the 530B LM further reduces the factual errors and achieves the lowest $\textsc{NE}_{\textsc{Er}}$ ($14.5\%$) and the highest $\text{Entail}_{\textsc{R}}$ ($25.5\%$). Surprisingly, our method on the 530B LM leads to less diverse generation than on the 1.3B LM, despite the significant improvement in generation quality (i.e., near-perfect repetition scores of $0.1\%$ to $0.2\%$). We conjecture that this is the trade-off between factuality and diversity for the 530B LM.

Conclusion

In this work, we establish a benchmark to measure and analyze factuality in open-ended text generation tasks. We propose factual-nucleus sampling that improves generation factuality at inference time, and the combination of sentence completion loss and TopicPrefix pre-processing that improves factuality with continued training. We demonstrate that our methods are effective in improving the factuality. Lastly, our results shed light on the existence of the trade-off between diversity and factuality. We strongly believe this is an important insight that will help researchers make a better-informed decision about their model design - i.e., appropriately prioritize the desirable attribute of their LM (factuality vs. diversity) according to the final goal of their task. Potential future work would be to reduce the degree of the observed trade-offs.

References

Appendix A Generation Examples

We provide more generation examples from the pretrained 530B LM with greedy and top-$p$ sampling ($p=0.9$), and the factuality-enhanced 530B LM with factual-nucleus sampling (Ours). Green indicates factual, red indicates nonfactual, and struck-through text indicates repetition. Refer to Appendix G for more examples. Disclaimer: the authors tried to exhaustively check the factuality of the following generations; however, there is no 100% guarantee about the annotations.

Appendix B Details about Claim Filtering Step in §3

Open-ended text generation does not require every generation to contain “facts”. A generation can be perfectly grammatical and fluent yet contain no checkworthy content, as with personal opinions or everyday small talk. Thus, we filter out “not-checkworthy” sentences that have any of the following characteristics:

Contains no named entities, which are important building blocks of facts or information. E.g., “Check this out”, “To say that a person is an example of something is absurd.”

Contains first-person pronouns (i.e., I, we, and us), which are a strong signal of personal opinion or a casual chitchat style of writing. E.g., “I think…”, “I believe…”

Contains a question mark. E.g., “Do you want to hear something interesting?”, “Did you know?”, “What are your thoughts?”
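The three filtering rules above can be sketched as a small predicate. This is only an illustration, not the authors' implementation: the function names are hypothetical, and `crude_has_named_entity` is a capitalization heuristic standing in for a real named-entity recognizer, which this appendix does not specify.

```python
import re
from typing import Callable, Iterable, List

# First-person pronouns listed in the appendix (I, we, us).
# "US" in all caps is deliberately not matched, since it usually
# abbreviates "United States" rather than the pronoun.
_FIRST_PERSON = re.compile(r"\b(?:I|[Ww]e|[Uu]s)\b")


def crude_has_named_entity(sentence: str) -> bool:
    """Hypothetical stand-in for an NER model (assumption).

    Treats any capitalized word after the first token as a named entity.
    """
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
    return any(tok[0].isupper() for tok in tokens[1:])


def is_checkworthy(
    sentence: str,
    has_named_entity: Callable[[str], bool] = crude_has_named_entity,
) -> bool:
    """Return False if the sentence matches any of the three filter rules."""
    if "?" in sentence:                    # contains a question mark
        return False
    if _FIRST_PERSON.search(sentence):     # contains a first-person pronoun
        return False
    return has_named_entity(sentence)      # must contain a named entity


def filter_claims(
    sentences: Iterable[str],
    has_named_entity: Callable[[str], bool] = crude_has_named_entity,
) -> List[str]:
    """Keep only the sentences that are worth fact-checking."""
    return [s for s in sentences if is_checkworthy(s, has_named_entity)]
```

Passing the entity detector as a parameter keeps the sketch independent of any particular NER library; a real recognizer can be swapped in without changing the rules.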

Appendix C Experiment Details

Here, we provide an example of what the training corpus looks like when TopicPrefix is applied.

Consider the following Wikipedia paragraph about Barack Obama:

Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. He was the first African-American president of the United States. A member of the Democratic Party, he previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004.

With TopicPrefix applied, the topic is prepended to every sentence:

Barack Obama ==> Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. Barack Obama ==> He was the first African-American president of the United States. Barack Obama ==> A member of the Democratic Party, he previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004.
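The transformation in this example can be sketched as follows. The function names and the regex-based sentence splitter are assumptions made for illustration; the appendix does not say which sentence segmenter was used, and a production pipeline would likely need a more robust one.

```python
import re
from typing import List


def split_sentences(paragraph: str) -> List[str]:
    """Naive sentence splitter (assumption, not the paper's tool).

    Splits after '.', '!' or '?' when followed by whitespace and an
    uppercase letter, so abbreviations like "U.S. senator" stay intact.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())
    return [p.strip() for p in parts if p.strip()]


def add_topic_prefix(title: str, paragraph: str, sep: str = " ==> ") -> List[str]:
    """Prepend the topic and separator to each sentence of the paragraph."""
    return [f"{title}{sep}{sentence}" for sentence in split_sentences(paragraph)]
```

Joining the returned list with spaces reproduces the layout of the example above, e.g. `" ".join(add_topic_prefix("Barack Obama", paragraph))`.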

Hyper-parameters and training details

The hyper-parameters for factuality-enhancement training of the 1.3B model were: learning rate 2e-6, batch size 64, maximum input sequence length 2048. For the 530B model, the hyper-parameters were: learning rate 1e-5, batch size 512, maximum input sequence length 2048. The architecture details of the pre-trained LMs are in Table 6. During inference, we set the maximum subword sequence length to 150. The same Wikipedia corpus with TopicPrefix is used for all our factuality-enhancement training.
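For quick reference, the settings listed above can be collected in a small configuration object. The class and field names are hypothetical; only the numeric values come from this appendix, and settings that are not reported here (optimizer, schedule, number of steps) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported hyper-parameters."""
    learning_rate: float
    batch_size: int
    max_seq_len: int = 2048  # maximum input sequence length (both models)


# Values as reported for factuality-enhancement training.
CONFIGS = {
    "1.3B": TrainConfig(learning_rate=2e-6, batch_size=64),
    "530B": TrainConfig(learning_rate=1e-5, batch_size=512),
}

# Maximum number of subword tokens generated at inference time.
MAX_GENERATION_SUBWORDS = 150
```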

Detail about pre-trained LMs

All LMs with different sizes are pre-trained on the same corpus, following the experimental details in .

Appendix D FactualityPrompts Details

Since collecting high-quality fact-related data requires a lot of human effort, we instead use a well-established fact-related dataset, FEVER, to construct our factual and nonfactual prompts. FEVER is a fact-checking dataset consisting of claims that are supported, refuted, or unverifiable (NotEnoughInfo) by Wikipedia documents. These claims were created by annotators who were asked to alter or paraphrase sentences from Wikipedia. We use the supported and refuted claims from the FEVER validation set as the factual and nonfactual prompts, respectively. (The test set is not publicly released and can only be accessed through the FEVER workshop submission site, so it is common practice to use the validation set instead.) To further ensure the quality of the test set, we filter out claims that are not appropriate to serve as prompts, e.g., extremely short claims that do not provide enough context to the LM. The data statistics after filtering are reported in Table 7.
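A minimal sketch of this prompt construction is given below. It assumes each record is a dictionary with `claim` and `label` fields and that labels use the strings `SUPPORTS`, `REFUTES`, and `NOT ENOUGH INFO`, as in the public FEVER release. The function name and the `min_words` threshold are hypothetical: the appendix says only that extremely short claims are removed, without giving a cutoff.

```python
from typing import Dict, Iterable, List, Tuple


def build_prompts(
    records: Iterable[Dict[str, str]],
    min_words: int = 5,  # hypothetical cutoff for "extremely short" claims
) -> Tuple[List[str], List[str]]:
    """Split FEVER-style claims into factual and nonfactual prompts."""
    factual: List[str] = []
    nonfactual: List[str] = []
    for record in records:
        claim = record["claim"].strip()
        if len(claim.split()) < min_words:
            continue  # too short to give the LM any context
        if record["label"] == "SUPPORTS":
            factual.append(claim)
        elif record["label"] == "REFUTES":
            nonfactual.append(claim)
        # Unverifiable (NotEnoughInfo) claims are not used as prompts.
    return factual, nonfactual
```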

Appendix E Limitations and Societal Impact

Although factual-nucleus sampling requires the same amount of computation as regular top- $p$ sampling, the continued pre-training of large language models will incur a carbon footprint. However, the task itself (improving factuality) will bring more overall benefit to the community and society, by allowing language models to generate less false information and be safer to deploy. In terms of ethical considerations, to the best of our knowledge, Wikipedia contains no private personal information or inappropriate content (problematic discrimination against particular demographic groups, NSFW content, hate speech, etc.). So, fine-tuning our model on it will not encourage unfair, biased, or toxic output.

Appendix F Extended Experimental Results

A small-scale experiment using 3000 factual prompts was conducted to explore the stand-alone impact of sentence completion loss. As shown in Table 8, using sentence completion loss alone makes little difference compared with standard factual-domain adaptive training (i.e., a negligible difference in factuality). However, when used together with TopicPrefix, it results in a significant boost for both factuality metrics.

F.2 Experimental Results with Perplexity

In this subsection, we provide experimental results that include the perplexity (PPL) of the generated text, evaluated with the 1.3B pretrained LM, as a fluency measure. The results consistently indicate that our proposed decoding and training methods do not harm the fluency of the generation. For instance, in Table 9, all our decoding choices result in PPL scores of $1.9\sim 4.1$ , which are lower than the PPL score of $12.0$ for top- $p$ sampling with $p=0.9$ .
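As a reminder of what such a score measures, the sketch below computes perplexity under the standard definition: the exponential of the average negative log-likelihood per token. This is an assumption about the formula, since the subsection does not spell it out, and the function name is hypothetical. The per-token log-probabilities would come from the scoring LM.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities.

    PPL = exp(-(1/N) * sum_i log p(token_i | previous tokens)).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_log_likelihood)
```

Lower values mean the scoring LM finds the text more predictable, which is why PPL serves here as a proxy for fluency.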

To provide full details about the columns reported in Table 9 and Table 10, $\textsc{NE}_{\textsc{Er}}$ refers to the named-entity error, $\text{Entail}_{\textsc{R}}$ refers to entailment ratio, Div. refers to distinct 4-grams and Rep. refers to repetition. $\uparrow$ means the higher the better, and $\downarrow$ means the lower the better.
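To make the Div. column concrete, the sketch below computes a distinct 4-gram ratio over a set of tokenized generations. The function names are hypothetical, and the normalization (distinct 4-grams divided by total 4-grams) is an assumption; the exact normalization used in the tables is not stated in this appendix.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int = 4) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngram_ratio(generations: Sequence[Sequence[str]], n: int = 4) -> float:
    """Distinct n-grams divided by total n-grams across the generations."""
    all_ngrams = [gram for tokens in generations for gram in ngrams(tokens, n)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)
```

Under this normalization, a set of identical generations scores low, while generations that never share a 4-gram score 1.0.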