Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi
Introduction
Large language models (LLMs), particularly characterized by their substantial number of parameters, have arisen as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (Zhao et al., 2023c). With proper alignment techniques, such as supervised finetuning (SFT; Zhang et al., 2023b) and reinforcement learning from human feedback (RLHF; Ouyang et al., 2022; Fernandes et al., 2023), recent LLMs (OpenAI, 2023a; Touvron et al., 2023b; OpenAI, 2023b, inter alia) have exhibited strong capabilities in solving various downstream tasks.
Nonetheless, as exemplified in Figure 1, LLMs, despite their remarkable success, occasionally produce outputs that, while seemingly plausible, deviate from user input (Adlakha et al., 2023), previously generated context (Liu et al., 2022), or factual knowledge (Min et al., 2023; Muhlgay et al., 2023; Li et al., 2023a)—this phenomenon is commonly referred to as hallucination, which significantly undermines the reliability of LLMs in real-world scenarios (Kaddour et al., 2023). For instance, LLMs can potentially fabricate erroneous medical diagnoses or treatment plans that lead to tangible real-life risks (Umapathi et al., 2023).
While hallucination in conventional natural language generation (NLG) settings has been widely studied (Ji et al., 2023), understanding and addressing the hallucination problem within the realm of LLMs encounters unique challenges introduced by:
Massive training data: in contrast to carefully curating data for a specific task, LLM pre-training uses trillions of tokens obtained from the web, making it difficult to eliminate fabricated, outdated or biased information;
Versatility of LLMs: general-purpose LLMs are expected to excel in cross-task, cross-lingual, and cross-domain settings, posing challenges for comprehensive evaluation and mitigation of hallucination;
Imperceptibility of errors: as a byproduct of their strong abilities, LLMs may generate false information that initially seems highly plausible, making it challenging for models or even humans to detect hallucination.
In addition, the RLHF process (Ouyang et al., 2022), the vague knowledge boundary (Ren et al., 2023) and the black-box property of LLMs (Sun et al., 2022) also complicate the detection, explanation, and mitigation of hallucination in LLMs. There has been a notable upsurge in cutting-edge research dedicated to addressing the aforementioned challenges, which strongly motivates us to compile this survey.
We organize this paper as follows, as also depicted in Figure 2. We first introduce the background of LLMs and offer our definition of hallucination in LLMs (§2). Next, we introduce relevant benchmarks and metrics (§3). Subsequently, we discuss potential sources of LLM hallucinations (§4), and provide an in-depth review of recent work towards addressing the problem (§5). Finally, we present forward-looking perspectives (§6). We will consistently update the related open-source materials, which can be accessed at https://github.com/HillZhang1999/llm-hallucination-survey.
Hallucination in the Era of LLM
We begin this section by overviewing the history of LLMs (§2.1). Next, we present our definition of LLM hallucination, by breaking it down into three sub-categories (§2.2). In addition, we discuss the unique challenges of hallucination in LLMs (§2.3), and compare hallucination with other prevalent problems that are frequently encountered in the realm of LLMs (§2.4).
An important category of LLMs is autoregressive language models (Radford et al., 2019; Chowdhery et al., 2022; Touvron et al., 2023a, inter alia). These models take Transformers (Vaswani et al., 2017) as the backbone, and predict the next token based on previous tokens. Another variant of language models predicts masked tokens in a corrupted sequence (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019, inter alia). Prior to the widespread adoption of Transformers, autoregressive language models were built on the backbones of n-grams (Bickel et al., 2005; Pauls and Klein, 2011) and recurrent neural networks (Mikolov et al., 2010), and have been applied to various NLG tasks such as summarization (Nallapati et al., 2017) and dialogue generation (Chen et al., 2017).
Transformer-based LLMs have demonstrated exceptional performance across tasks, and have therefore shifted NLP from a paradigm centered on task-specific solutions to general-purpose pre-training (Devlin et al., 2019; Radford et al., 2019). The pretrained models are optimized on various self-supervision objectives (Devlin et al., 2019; Raffel et al., 2020; Lewis et al., 2020a, inter alia), using large-scale unlabeled corpora. Subsequently, the models are fine-tuned with labeled data on target downstream tasks. Representations from the pretrained models can typically reduce the demand for annotated data and achieve significant performance improvement across downstream tasks (Qiu et al., 2020; Min et al., 2021; Li et al., 2022b, inter alia).
In addition to performance improvement on downstream tasks, recent work has found that scaling up pretrained language models—both in terms of model parameter count and the volume of pre-training data—enables some remarkable abilities, including in-context learning (Brown et al., 2020), reasoning (Wei et al., 2022), and instruction following (Ouyang et al., 2022). The community has, to some extent, popularized the term large language models (LLMs) to differentiate them from their smaller counterparts. Notably, LLMs exhibit the potential to accurately comprehend human instructions and efficiently tackle a variety of complex tasks with only minimal or even no supervision (OpenAI, 2023a, b; Touvron et al., 2023b).
2.2 What is LLM Hallucination
While LLMs have demonstrated remarkable performance, they still inevitably encounter various problems in practical applications, among which hallucination is one of the most significant. The term hallucination had already been widely adopted in the NLP community before the emergence of LLMs, typically referring to the generation of content that is nonsensical or unfaithful to the provided source content (Ji et al., 2023).
We argue that the definition appears to have considerably expanded due to the versatility of LLMs. To this end, we categorize hallucination within the context of LLMs as follows:
Input-conflicting hallucination, where LLMs generate content that deviates from the source input provided by users;
Context-conflicting hallucination, where LLMs generate content that conflicts with information they have previously generated;
Fact-conflicting hallucination, where LLMs generate content that is not faithful to established world knowledge.
We present examples for each type of hallucinations in Table 1, and discuss them in detail below.
Input-conflicting hallucination arises when the content generated by LLMs deviates from user input. Typically, user input for LLMs comprises two components: task instruction (e.g., user prompt for summarization) and task input (e.g., document to be summarized). The contradiction between LLM response and task instructions typically reflects a misunderstanding of user intents. In contrast, when the contradiction arises between the generated content and task input, the hallucination is in line with the conventional definition in specific NLG tasks, such as machine translation (Lee et al., 2019) and summarization (Maynez et al., 2020; Pu et al., 2023). For instance, the first example in Table 1 highlights a contradiction between the generated content and task input: when users request the LLM to generate a summary, the LLM incorrectly replaces the person’s name in its response (Hill → Lucas), even though the general form can indeed be perceived as a suitable summary.
LLMs may exhibit self-contradictions when generating lengthy or multi-turn responses. This context-conflicting hallucination arises when LLMs lose track of the context or fail to maintain consistency throughout the conversation, potentially due to their limitations in maintaining long-term memory (Liu et al., 2023d) or identifying relevant context (Shi et al., 2023a). The second example in Table 1 demonstrates how a user request to introduce the NBA Commissioner leads to a context-conflicting hallucination. Specifically, the LLM initially introduces Silver (the current NBA commissioner), but later refers to Stern (the former NBA commissioner), demonstrating a lack of consistency in the generation.
Fact-conflicting hallucination occurs when LLMs generate information or text that contradicts established world knowledge. The source of fact-conflicting hallucinations can be multifarious and introduced at different stages of the LLM life cycle, as shown in Figure 2. We present an illustration in Table 1 (third example): in this case, the user asks the LLM about the mother of Afonso II. The LLM gives a wrong answer (Queen Urraca of Castile instead of Dulce Berenguer of Barcelona), which can easily mislead less knowledgeable users.
The focus of recent hallucination research in LLMs is predominantly on fact-conflicting hallucination, despite the importance of the other two types. Possible reasons include, but are not limited to: (1) input- and context-conflicting hallucinations have been extensively studied in conventional NLG settings (Ji et al., 2023), whereas fact-conflicting hallucination poses more complex challenges in LLMs due to the absence of an authoritative knowledge source as a reference; (2) fact-conflicting hallucinations tend to have more side effects on the practical applications of LLMs, leading to a greater emphasis in recent studies. In light of this research status, the following sections of our paper will primarily concentrate on fact-conflicting hallucinations, and we will explicitly say so when addressing the other two types of hallucinations.
2.3 Unique Challenge in the Era of LLM
Although the problem of hallucination has been extensively researched in conventional NLG tasks (Ji et al., 2023), hallucinations in LLMs bring forth a unique and complex set of challenges stemming from the training process and usage scenarios.
Unlike task-specific NLG models trained on limited-scaled datasets, LLMs are pre-trained on trillions of tokens. These pre-training corpora are automatically collected from the web and often contain a significant amount of fabricated, outdated, or biased information (Penedo et al., 2023). Such inadequate data may lead LLMs to generate hallucinated content. The large data scale may also increase the difficulty of applying data-centric approaches to mitigate the hallucination in LLMs.
Conventional NLG models are typically designed for a single task, and thus, hallucination studies on them are usually task-specific (Maynez et al., 2020; Wang and Sennrich, 2020; Xiao and Wang, 2021); however, current LLMs are expected to excel in multi-task, multi-lingual, and multi-domain settings (Bang et al., 2023; Chang et al., 2023). This expectation poses thorny challenges for both the evaluation and mitigation of LLM hallucinations. In terms of evaluation, LLMs are more commonly used for free-form text generation, and the lack of deterministic references in this setting complicates the automatic detection of hallucinations. Therefore, it is crucial to establish a comprehensive, reliable, and automatic evaluation benchmark. Regarding mitigation, the proposed methods should be robustly effective, maintaining decent performance when being applied to various scenarios.
Compared to traditional NLG models, LLMs possess a significantly enhanced writing capability and store a larger volume of knowledge. Consequently, the false information hallucinated by LLMs often appears highly plausible, to the extent that even humans may find it hard to detect. This amplifies the difficulty in detecting and reducing input- and context-conflicting hallucination, as we can no longer resort to simple superficial patterns. Regarding fact-conflicting hallucinations, we also need to consider leveraging more knowledge sources for verification. These factors collectively introduce substantial new challenges.
2.4 Other Problems in LLMs
Besides hallucination, LLMs also present other problems. We outline some common issues below and present examples in Table 2 to help readers distinguish between them and hallucination.
Ambiguity arises when the LLM response is ambiguous, lending itself to multiple interpretations. The response may not necessarily be incorrect, but it falls short of providing a useful answer to the user question (Tamkin et al., 2022). The first example in Table 2 exemplifies this issue. The desired answer is ‘Paris’, yet the LLM provides an ambiguous response.
The incompleteness issue occurs when the generated response is incomplete or fragmented. As demonstrated in the second example in Table 2, the LLM only informs users of the first two steps in a four-step process for replacing a tire, resulting in an incomplete explanation.
Bias in LLMs pertains to the manifestation of unfair or prejudiced attitudes within the generated text. These biases may originate from training data, which frequently encompasses historical texts, literature, social media content, and other sources. Such sources may inherently mirror societal biases, gender bias, stereotypes, or discriminatory beliefs (Navigli et al., 2023). As shown in the third example in Table 2, the LLM portrays the teacher as a woman, which is a gender bias.
Under-informativeness refers to the propensity of LLMs to evade answering certain questions or providing specific information, even when they should be capable of doing so. For instance, due to imperfections in the reward model, RLHF may lead to over-optimization of LLMs, potentially resulting in a state of under-informativeness (Gao et al., 2022). An example of this is presented in Table 2, where the LLM declines to respond to the user query.
Evaluation of LLM Hallucination
Previous research has primarily concentrated on evaluating hallucination in specific natural language generation tasks, such as machine translation (Guerreiro et al., 2023b; Dale et al., 2023), dialogue generation (Dziri et al., 2021), question answering (Durmus et al., 2020) and text summarization (Kryscinski et al., 2020; Maynez et al., 2020; Zhong et al., 2021). These works mainly focus on the input-conflicting hallucination facet, which is relatively easy for human users to identify given the source text, as shown in Table 1. Recently, the study of this kind of hallucination in traditional NLG tasks has seen significant advancements. However, evaluating it in the setting of LLMs becomes more challenging due to the free-form and often long-form nature of LLM generation. Regarding context-conflicting hallucination, Cui et al. (2021) and Liu et al. (2022) evaluate models’ ability to identify context conflicts introduced when BERT (Devlin et al., 2019) performs blank-filling. Most benchmarks today evaluate the fact-conflicting hallucination of LLMs (Lin et al., 2021; Lee et al., 2022; Min et al., 2023; Yu et al., 2023a; Li et al., 2023a; Muhlgay et al., 2023), which refers to their tendency to generate factual errors. This is considered a critical issue in LLMs because it is challenging for users to identify and poses real-life risks.
In the upcoming sections, we will review existing benchmark datasets and commonly used evaluation metrics in §3.1 and §3.2, respectively.
Various benchmarks have been proposed for evaluating hallucination in LLMs. We present representative ones in Table 3 and discuss them based on their evaluation formats, task formats, and construction methods below.
Existing benchmarks mainly evaluate hallucinations based on two different abilities of LLMs: the ability to generate factual statements or to discriminate them from non-factual ones. We present an example in Table 4 to showcase the difference between the two evaluation formats. Generation benchmarks (Lin et al., 2021; Lee et al., 2022; Min et al., 2023; Yu et al., 2023a) consider hallucination as a generation characteristic, similar to fluency (Napoles et al., 2017) and coherence (Du et al., 2022), and evaluate the generated texts from LLMs. For instance, TruthfulQA (Lin et al., 2021) evaluates the truthfulness of LLMs’ responses to questions, while FActScore (Min et al., 2023) scrutinizes the factual accuracy of biographies generated by LLMs for specific individuals. In contrast, discrimination benchmarks (Li et al., 2023a; Muhlgay et al., 2023) consider LLMs’ ability to discriminate truthful statements from hallucinated ones. Specifically, HaluEval (Li et al., 2023a) requires the model to determine whether a statement contains hallucinated information, while FACTOR (Muhlgay et al., 2023) investigates whether the LLM assigns a higher likelihood to the factual statement compared to non-factual ones. Note that TruthfulQA (Lin et al., 2021) also supports the discrimination format by offering a multiple-choice alternative to test a model’s ability to identify truthful statements.
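The likelihood-based discrimination format can be illustrated with a minimal sketch. It assumes that each example already carries one score for the factual completion and one score per non-factual completion; the dictionary keys, the toy scores, and the strict "highest score wins" rule are illustrative assumptions, not the exact FACTOR protocol.

```python
from typing import Dict, List, Sequence, Union

Example = Dict[str, Union[float, List[float]]]


def discrimination_accuracy(examples: Sequence[Example]) -> float:
    """Fraction of examples where the factual completion outscores every
    non-factual completion.

    Each example uses two hypothetical keys:
      "factual":     score (e.g. log-likelihood) of the factual completion
      "non_factual": list of scores of the non-factual completions
    """
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for ex in examples:
        if all(ex["factual"] > score for score in ex["non_factual"]):
            correct += 1
    return correct / len(examples)


# Toy scores, invented for illustration only.
examples = [
    {"factual": -12.3, "non_factual": [-15.1, -14.8, -13.0]},  # model prefers the fact
    {"factual": -20.4, "non_factual": [-19.9, -22.5, -23.1]},  # model prefers a non-fact
]
accuracy = discrimination_accuracy(examples)
```

In this toy case the model ranks the factual completion first in one of two examples. How the scores are obtained (for instance, whether they are length-normalized) is left open here.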
Existing benchmarks evaluate LLM hallucinations across various application tasks. Firstly, certain benchmarks (Lin et al., 2021; Li et al., 2023a) explore the issue of hallucination in the context of question-answering, evaluating the ability of LLMs to provide truthful answers to knowledge-intensive questions. Secondly, FActScore (Min et al., 2023) and HaluEval (Li et al., 2023a) employ task instructions, such as biography introduction instructions and 52K instructions from the Alpaca project (Taori et al., 2023), to prompt LLMs to generate responses. The factuality of these responses is then evaluated. Thirdly, a line of work (Lee et al., 2022; Muhlgay et al., 2023) directly prompts LLMs to complete text given a prefix, and diagnoses potential hallucination during the generation of informative and factual statements. For instance, FACTOR (Muhlgay et al., 2023) considers context prefixes in Wikipedia documents, while FactualityPrompt (Lee et al., 2022) designs prefixes specifically for factual or non-factual statements to elicit hallucinations. Table 5 provides samples under different task formats.
Most of the aforementioned benchmarks involve human annotators for dataset creation or quality assurance. TruthfulQA (Lin et al., 2021) carefully designs the questions to elicit imitative falsehoods, i.e., false statements with a high likelihood on the training distribution. The authors then hire human annotators to further validate the agreement of golden answers. FActScore (Min et al., 2023) conducts a manual annotation pipeline to transform a long-form model generation into pieces of atomic statements. HaluEval (Li et al., 2023a) employs two construction methods. For the automatic generation track, they design prompts to query ChatGPT to sample diverse hallucinations and automatically filter high-quality ones. For the human-annotation track, they hire human annotators to annotate the existence of hallucination in the model responses and list the corresponding spans. FACTOR (Muhlgay et al., 2023) first uses external LLMs to generate non-factual completions. Then, they manually validate whether the automatically created completions meet the predefined requirements, i.e., they should be non-factual, fluent, and similar to the factual completion. To construct the knowledge creation task, Yu et al. (2023a) build an annotation platform to facilitate fine-grained event annotations.
3.2 Evaluation Metrics
The free-form and open-ended nature of language generation makes it difficult to evaluate the hallucinations produced by LLMs. The most commonly used and reliable methods for evaluating hallucinations rely on human experts following specific principles (Lin et al., 2021; Lee et al., 2022; Min et al., 2023; Li et al., 2023a). It is worth noting that although existing benchmarks use human evaluation to ensure reliability, they also seek to support automatic methods to facilitate efficient and consistent evaluation.
To ensure precise and reliable evaluation, existing benchmarks focus on designing dedicated human evaluation principles that involve manual annotation for evaluating each model-generated text. TruthfulQA (Lin et al., 2021) proposes a human-annotation guideline, which instructs annotators to assign one of thirteen qualitative labels to the model output and verify answers by consulting a reliable source. Lee et al. (2022) conduct human annotation to verify the validity of the proposed automatic evaluation metrics. FActScore (Min et al., 2023) requires annotators to assign one of three labels to each atomic fact: "Supported" or "Not-supported" for facts that are supported or unsupported by the knowledge source, and "Irrelevant" for statements that are not related to the prompt. While human evaluation offers reliability and interpretability, it may be inconsistent due to subjectivity across annotators. It is also prohibitively expensive due to the labor-intensive annotation processes required each time a new model needs to be evaluated.
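As a minimal sketch of how such per-fact labels could be aggregated into a single factual-precision number, the snippet below computes the share of "Supported" facts. The function name is hypothetical, and excluding "Irrelevant" facts from the denominator is an assumption made for this illustration rather than something stated above.

```python
from collections import Counter
from typing import Optional, Sequence

ALLOWED_LABELS = {"Supported", "Not-supported", "Irrelevant"}


def supported_fraction(labels: Sequence[str]) -> Optional[float]:
    """Share of relevant atomic facts labeled "Supported".

    Assumption: "Irrelevant" facts are left out of the denominator.
    Returns None when no relevant fact remains.
    """
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    relevant = counts["Supported"] + counts["Not-supported"]
    if relevant == 0:
        return None
    return counts["Supported"] / relevant


# Toy annotation of one generated biography (labels invented for illustration).
labels = ["Supported", "Supported", "Not-supported", "Irrelevant"]
score = supported_fraction(labels)
```

Here two of the three relevant facts are supported. Averaging this quantity over many generations would give a corpus-level score under the same assumption.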
Several studies (Lin et al., 2021; Min et al., 2023; Zha et al., 2023; Mündler et al., 2023) have devised model-based methods as a proxy for human evaluation. Specifically, TruthfulQA (Lin et al., 2021) trains a GPT-3-6.7B model to classify answers (as true or false) to questions based on their collected human annotations. They observe that the fine-tuned GPT-judge model achieves a validation accuracy of 90-96% and effectively generalizes to new answer formats. AlignScore (Zha et al., 2023) establishes a unified function to evaluate the factual consistency between two texts. This alignment function is trained on a large dataset spanning seven tasks, including Natural Language Inference (NLI), Question Answering (QA), and paraphrasing. In contrast, Min et al. (2023) and Mündler et al. (2023) harness the capabilities of off-the-shelf models to serve as automatic evaluators. In particular, FActScore (Min et al., 2023) begins by employing a passage retriever, such as Generalizable T5-based Retrievers (Ni et al., 2022), to gather pertinent information. Subsequently, an evaluation model, such as LLaMA-65B (Touvron et al., 2023a), uses the retrieved knowledge to determine the truthfulness of a statement. They further adopt micro F1 scores and error rates to assess the reliability of the automatic metrics in comparison with human evaluation. Mündler et al. (2023) design dedicated prompts to ask an evaluator LLM (e.g., ChatGPT; OpenAI, 2023a) whether the LLM under evaluation contradicts itself under the same context, and report classification metrics, including precision, recall, and F1 score.
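The classification metrics reported for model-based evaluators can be computed as in the following sketch, which compares an evaluator's binary judgments against human labels. The function name and the toy labels are illustrative assumptions; `True` is taken to mean "contains a hallucination".

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of predicted positives against gold positives."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy judgments, invented for illustration only.
human_labels = [True, True, False, False, True]
evaluator_labels = [True, False, True, False, True]
precision, recall, f1 = precision_recall_f1(human_labels, evaluator_labels)
```

Returning 0.0 when a denominator is empty is a convention chosen for this sketch; other conventions (such as raising an error) are equally defensible.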
For discrimination benchmarks (Li et al., 2023a; Muhlgay et al., 2023), common rule-based classification metrics such as accuracy can be directly applied to evaluating the ability of LLMs to discriminate factual statements from non-factual ones. Bang et al. (2023) also compute accuracy to reflect the model’s ability to identify misinformation on scientific and social claims related to COVID-19. In contrast, another line of research (Lee et al., 2022; Yu et al., 2023a) focuses on devising heuristic methods specifically designed for assessing hallucination. FactualityPrompt (Lee et al., 2022) combines a named-entity-based metric and a textual entailment-based metric to capture different aspects of factuality. To evaluate knowledge creation, Yu et al. (2023a) devise a self-contrast metric to quantify model consistency in generating factual statements. They accomplish this by comparing model-generated texts with and without including golden knowledge as part of the prompts based on Rouge-L (F1) (Lin, 2004).
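The comparison underlying such a self-contrast metric can be sketched with a small Rouge-L (F1) implementation based on the longest common subsequence. Lowercased whitespace tokenization and the two example generations are assumptions made for illustration; the sketch is not the evaluation code of Yu et al. (2023a).

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Rouge-L F1 between two texts (assumed lowercased whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generations for the same prompt, with and without golden knowledge.
with_knowledge = "the cat lay on the mat"
without_knowledge = "the cat sat on the mat"
self_contrast = rouge_l_f1(without_knowledge, with_knowledge)
```

A higher score indicates that the generation without golden knowledge stays close to the one produced with it, which is the consistency the metric is meant to quantify.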
Sources of LLM Hallucination
In this section, we aim to explore the various factors that can induce hallucinations within LLMs. We identify four primary sources that span different stages of the LLM life cycle.
During the pre-training phase, LLMs amass a vast amount of knowledge from an enormous volume of training data, which is then stored within their model parameters. When asked to answer questions or complete tasks, LLMs often exhibit hallucinations if they lack pertinent knowledge or have internalized false knowledge from the training corpora.
Li et al. (2022c) discover that LLMs sometimes misinterpret spurious correlations, such as positionally close or highly co-occurring associations, as factual knowledge. Specifically, McKenna et al. (2023) investigate the hallucination problem within the context of the natural language inference (NLI) task and find a strong correlation between LLM hallucination and the distribution of the training data. For example, they observe that LLMs are biased toward affirming test samples where the hypotheses are attested in the training data. Besides, Dziri et al. (2022) argue that hallucination is also present in human-generated corpora, where it can manifest as outdated (Liska et al., 2022; Luu et al., 2022), biased (Chang et al., 2019; Garrido-Muñoz et al., 2021), or fabricated (Penedo et al., 2023) expressions. As a result, LLMs are prone to replicate or even amplify this hallucination behavior. Wu et al. (2023b) reveal that the memorizing and reasoning performance of PLMs for ontological knowledge is less than perfect. Sun et al. (2023a) put forward a benchmark named Head-to-Tail to evaluate the factual knowledge of LLMs for entities with different levels of popularity. Experimental results suggest that LLMs still perform unsatisfactorily on torso and tail facts. Furthermore, Zheng et al. (2023c) identify two additional abilities associated with knowledge memorization that enable LLMs to provide truthful answers: knowledge recall and knowledge reasoning. Deficiencies in either of these abilities can lead to hallucinations.
Some studies have been conducted with the aim of understanding whether language models can assess the accuracy of their responses and recognize their knowledge boundaries. Kadavath et al. (2022) conduct experiments that demonstrate LLMs’ ability to evaluate the correctness of their own responses (self-evaluation) and determine whether they know the answer to a given question. However, for very large LLMs, the distribution entropy of correct and incorrect answers could be similar, suggesting that LLMs are as confident when generating incorrect answers as when generating correct ones. Yin et al. (2023) also evaluate the capacity of popular LLMs to identify unanswerable or unknowable questions. Their empirical study reveals that even the most advanced LLM, GPT-4 (OpenAI, 2023b), shows a significant performance gap when compared to humans. Ren et al. (2023) note a correlation between accuracy and confidence, but such confidence often surpasses the actual capabilities of LLMs, a phenomenon known as over-confidence. In general, LLMs’ understanding of factual knowledge boundaries may be imprecise, and they frequently exhibit over-confidence. Such over-confidence misleads LLMs to fabricate answers with unwarranted certainty.
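Two quantities mentioned above, the entropy of an answer distribution and the gap between confidence and accuracy, can be made concrete with a short sketch. The function names, the toy numbers, and the definition of the gap as mean confidence minus accuracy are illustrative assumptions, not the measures used in the cited studies.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus empirical accuracy; positive means over-confident."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Toy values, invented for illustration only.
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # the model strongly prefers one answer
flat = entropy([0.25, 0.25, 0.25, 0.25])     # the model is maximally unsure
gap = overconfidence_gap([0.9, 0.8, 0.95, 0.85], [True, False, True, False])
```

If the entropy for incorrect answers is about as low as for correct ones, entropy alone cannot separate them, which is the difficulty described in the paragraph above.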
LLMs typically undergo an alignment process following pre-training, where they receive further training on curated instruction-following examples to align their responses with human preferences. However, when trained on instructions for which LLMs have not acquired prerequisite knowledge from the pre-training phase, this is actually a misalignment process that encourages LLMs to hallucinate (Goldberg, 2023; Schulman, 2023). Another potential issue is sycophancy, where LLMs may generate responses that favor the user’s perspective rather than providing correct or truthful answers, which can result in hallucination (Perez et al., 2022; Radhakrishnan et al., 2023; Wei et al., 2023b).
Today’s most advanced LLMs generate responses sequentially, outputting one token at a time. Zhang et al. (2023a) discover that LLMs sometimes over-commit to their early mistakes, even when they recognize they are incorrect. In other words, LLMs may prefer to compound an earlier error for the sake of self-consistency rather than recover from it. This phenomenon is known as hallucination snowballing. Azaria and Mitchell (2023) also contend that local optimization (token prediction) does not necessarily ensure global optimization (sequence prediction), and early local predictions may lead LLMs into situations where it becomes challenging to formulate a correct response. Lee et al. (2022) highlight that the randomness introduced by sampling-based generation strategies, such as top-p and top-k, can also be a potential source of hallucination.
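To make the role of sampling randomness concrete, the following minimal sketch applies top-k and top-p (nucleus) truncation to a toy next-token distribution before sampling. The vocabulary, the probabilities, and the function names are invented for illustration and do not come from any particular model or library.

```python
import random
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of most probable tokens whose mass reaches p, then renormalize."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy distribution for the next token after "The capital of France is".
next_token = {"Paris": 0.6, "Lyon": 0.25, "Rome": 0.1, "Mars": 0.05}
after_top_k = top_k_filter(next_token, k=2)
after_top_p = top_p_filter(next_token, p=0.9)
token = sample(after_top_p, random.Random(0))
```

Both truncations remove the least likely continuations, yet a non-factual token can remain in the candidate set and may still be drawn, which is one way sampling can introduce errors.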
Mitigation of LLM Hallucination
In this section, we provide an extensive review of recent studies focused on mitigating LLM hallucinations. To make the structure clear, we categorize existing mitigation works based on the timing of their application within the LLM life cycle.
Existing work (Zhou et al., 2023a) argues that the knowledge of LLMs is mostly acquired during the pre-training phase. The presence of noisy data such as misinformation in the pre-training corpus could corrupt the parametric knowledge of LLMs, which is a significant factor contributing to hallucinations, as previously discussed in §4. Akyürek et al. (2022) also demonstrate that it is possible to trace the factual knowledge acquired by language models back to their training data. Consequently, an intuitive approach to mitigating hallucinations could involve manually or automatically curating the pre-training corpus to minimize unverifiable or unreliable data as much as possible.
Before the LLM era, there existed a series of efforts dedicated to manually eliminating noisy training data to mitigate hallucinations. For instance, Gardent et al. (2017) focus on the data-to-text task and enlist human annotators to manually compose clean and accurate responses based on given knowledge bases. Training on such curated data has been shown to effectively reduce hallucinations. Similarly, Wang (2019) manually refines the text in existing table-to-text datasets and observes that this process also substantially alleviates fact hallucinations. Besides, Parikh et al. (2020) instruct annotators to revise verified sentences from Wikipedia rather than directly creating new sentences when constructing table-to-text training data. This approach has also been shown to improve the factuality of results.
With the advent of the LLM era, curating training data during pre-training has become increasingly challenging due to the vast scale of pre-training corpora (as exemplified in Table 6). For instance, Llama 2 (Touvron et al., 2023b) conducts pre-training on about two trillion tokens. Therefore, compared to manual curation, a more practical approach today could be automatically selecting reliable data or filtering out noisy data. For example, the pre-training data of GPT-3 (Brown et al., 2020) is cleaned by using similarity to a range of high-quality reference corpora. The developers of Falcon (Penedo et al., 2023) carefully extract high-quality data from the web via heuristic rules and show that properly curated pre-training corpora lead to powerful LLMs. Li et al. (2023f) propose phi-1.5, a 1.3 billion parameter LLM pre-trained on filtered “textbook-like” synthetic data, which exhibits many traits of much larger LLMs. In order to mitigate hallucinations, current LLMs tend to collect pre-training data from credible text sources. The developers of Llama 2 (Touvron et al., 2023b) strategically up-sample data from highly factual sources, such as Wikipedia, when constructing the pre-training corpus. Lee et al. (2022) propose to prepend the topic prefix to sentences in the factual documents to make each sentence serve as a standalone fact during pre-training. Concretely, they treat the document name as the topic prefix and observe this method improves LMs’ performance on TruthfulQA.
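The topic-prefix idea can be sketched in a few lines: each sentence of a document is prefixed with the document name so that it can be read as a standalone fact. The regular-expression sentence splitter, the `"name: sentence"` format, and the example document are assumptions made for illustration; the exact preprocessing of Lee et al. (2022) may differ.

```python
import re
from typing import List


def add_topic_prefix(document_name: str, text: str) -> List[str]:
    """Prefix every sentence of `text` with the document name.

    Sentences are split naively after ., ! or ? followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [f"{document_name}: {sentence}" for sentence in sentences]


# Hypothetical document, invented for illustration only.
prefixed = add_topic_prefix(
    "Barack Obama",
    "He was born in 1961. He served two terms as president.",
)
```

Without the prefix, a sentence such as "He was born in 1961." has no resolvable subject once it is separated from its document, which is the problem the prefix is meant to address.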
The mitigation of hallucinations during pre-training is primarily centred around the curation of pre-training corpora. Given the vast scale of existing pre-training corpora, current studies predominantly employ simple heuristic rules for data selection and filtering. A potential avenue for exploration could be devising more effective selection or filtering strategies.
2 Mitigation during SFT
As a common practice, current LLMs collectively undergo the process known as supervised fine-tuning (SFT) to elicit their knowledge acquired from pre-training and learn how to interact with users (Wang et al., 2023c; Zhang et al., 2023b). SFT generally involves first annotating or collecting massive-task instruction-following data (Chung et al., 2022; Taori et al., 2023), followed by fine-tuning pre-trained foundational LLMs on this data using maximum likelihood estimation (MLE) (Wei et al., 2021). By employing well-designed SFT strategies, many recent studies claim to have built LLMs that achieve performance on par with ChatGPT (Wang et al., 2023b).
Similar to pre-training, one potential approach to reduce hallucination during the SFT stage could be curating the training data. Given the relatively small volume of SFT data (refer to Table 7), both manual and automatic curation are viable options here. Zhou et al. (2023a) have meticulously constructed an instruction-tuning dataset, comprising 1,000 samples annotated by human experts. Some other studies (Chen et al., 2023b; Cao et al., 2023; Lee et al., 2023) have employed an automatic selection of high-quality instruction-tuning data, by leveraging LLMs as evaluators or designing specific rules. Experimental results on hallucination-related benchmarks, such as TruthfulQA (Lin et al., 2021), suggest that LLMs fine-tuned on such curated instruction data demonstrate higher levels of truthfulness and factuality compared to LLMs fine-tuned on uncurated data. Furthermore, Mohamed et al. (2023) propose the integration of domain-specific knowledge sets into the SFT data, which aims to reduce hallucinations that arise from a lack of relevant knowledge.
It is worth noting that Schulman (2023) underscored a potential risk of the SFT process: it could induce hallucination in LLMs due to behavior cloning. Behavior cloning is a concept in reinforcement learning (Torabi et al., 2018), which means the model learns directly from imitating the expert’s actions. The problem here is that this method simply mimics behavior without learning a strategy to achieve the final goal. The SFT process of LLMs can be viewed as a special case of behavior cloning, where LLMs learn the format and style of interaction by mimicking humans. As for LLMs, despite having encoded a substantial amount of knowledge into their parameters, there remains knowledge that surpasses their capacity (Yin et al., 2023; Ren et al., 2023). By cloning human behaviors during SFT, LLMs learn to respond to all questions with a predominantly positive tone, without assessing whether these questions exceed their knowledge boundaries (see Figure 3). As a result, during inference, if prompted to answer questions related to unlearned knowledge, they are likely to confidently produce hallucinations. One way to mitigate this problem is honesty-oriented SFT, which means introducing some honest samples into the SFT data. Honest samples are responses that admit incompetence, such as “Sorry, I don’t know”. The Moss project (Sun et al., 2023b) open-sourced their SFT data, which includes such honest samples. We observed that models tuned with them could learn to refuse to answer specific questions, thereby helping reduce hallucinations.
Curating the training data is one approach for mitigating hallucinations during the SFT phase. Thanks to the acceptable volume of SFT data, they can be manually curated by human experts. Recently, we have performed a preliminary human inspection and observed that some widely-used synthetic SFT data, such as Alpaca (Taori et al., 2023), contains a considerable amount of hallucinated answers due to the lack of human inspection. This calls for careful attention when researchers try to build SFT datasets based on self-instruct (Wang et al., 2023c).
Previous work also pointed out that the SFT process may inadvertently introduce hallucinations, by forcing LLMs to answer questions that surpass their knowledge boundaries. Some researchers have suggested honesty-oriented SFT as a solution. However, we argue this method has two main problems. Firstly, it exhibits limited generalization capabilities towards out-of-distribution (OOD) cases. Secondly, the annotated honest samples just reflect the incompetence and uncertainty of annotators rather than those of LLMs, as annotators are unaware of LLMs’ real knowledge boundaries. Such challenges make solving this issue during SFT sub-optimal.
3 Mitigation during RLHF
Nowadays, many researchers attempt to further improve the supervised fine-tuned LLMs via reinforcement learning from human feedback (RLHF) (Fernandes et al., 2023). This process consists of two steps: 1) train a reward model (RM) as the proxy for human preference, which aims to assign an appropriate reward value to each LLM response; 2) optimize the SFT model with the reward model’s feedback, by using RL algorithms such as PPO (Schulman et al., 2017).
Leveraging human feedback not only closes the gap between machine-generated content and human preference but also helps LLMs align with desired criteria or goals. One commonly used criterion today is “3H”, which denotes helpful, honest, and harmless (Ouyang et al., 2022; Bai et al., 2022; Zheng et al., 2023b). The honest aspect here just refers to the minimization of hallucinations in LLM responses. Current advanced LLMs, such as InstructGPT (Ouyang et al., 2022), ChatGPT (OpenAI, 2023a), GPT4 (OpenAI, 2023b), and Llama2-Chat (Touvron et al., 2023b), have collectively considered this aspect during RLHF. For example, GPT4 uses synthetic hallucination data to train the reward model and perform RL, which increases accuracy on TruthfulQA (Lin et al., 2021) from about 30% to 60%. Moreover, Lightman et al. (2023) use the process supervision to detect and mitigate hallucinations for reasoning tasks, which provides feedback for each intermediate reasoning step.
As discussed in the previous section, the phenomenon of behavior cloning during the SFT stage can potentially lead to hallucinations. Some researchers have attempted to address this issue by integrating honest samples into the original SFT data. However, this approach has certain limitations, such as unsatisfactory OOD generalization capabilities and a misalignment between human and LLM knowledge boundaries. In light of this, Schulman (2023) proposes to solve this problem during RLHF and designs a special reward function just for mitigating hallucinations, as shown in Table 8. “Unhedged/Hedged Correct/Wrong” here means the LLM provides correct or wrong answers with a positive or hesitant tone. “Uninformative” denotes safe answers like “I don’t know”. The core idea is to encourage LLMs to challenge the premise, express uncertainty, and admit incapability by learning from specially designed rewards. This method, which we refer to as honesty-oriented RL, offers several advantages over honesty-oriented SFT. The primary benefit is that it allows LLMs to freely explore their knowledge boundaries, thereby enhancing their generalization capabilities to OOD cases. Additionally, it reduces the need for extensive human annotation and eliminates the requirement for annotators to guess the knowledge boundaries of LLMs.
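The reward scheme described above can be sketched as a simple lookup from response category to scalar reward. The numeric values below are illustrative assumptions chosen only to respect the intended ordering (confident correct answers are best, confident wrong answers are worst); they should not be read as the exact entries of Table 8. The category labels are hypothetical and are assumed to come from some upstream judge.

```python
# Illustrative honesty-oriented reward. The numbers are assumptions that only
# preserve the ordering: unhedged wrong < hedged wrong < uninformative
# < hedged correct < unhedged correct.
HONESTY_REWARD = {
    "unhedged_correct": 1.0,
    "hedged_correct": 0.5,
    "uninformative": 0.0,
    "hedged_wrong": -2.0,
    "unhedged_wrong": -4.0,
}


def honesty_reward(category: str) -> float:
    """Return the scalar reward for a (hypothetical) response category label."""
    try:
        return HONESTY_REWARD[category]
    except KeyError:
        raise ValueError(f"unknown response category: {category!r}") from None


def expected_reward(p_correct: float, hedged: bool) -> float:
    """Expected reward of answering when the answer is right with prob. p_correct."""
    prefix = "hedged" if hedged else "unhedged"
    return (p_correct * HONESTY_REWARD[f"{prefix}_correct"]
            + (1.0 - p_correct) * HONESTY_REWARD[f"{prefix}_wrong"])
```

Under these assumed values, a confident answer has positive expected reward only when the model is right more than 80% of the time (p - 4(1 - p) > 0), so a policy optimized against such a reward is pushed toward the uninformative answer on questions outside its knowledge.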
Reinforcement learning can guide LLMs in exploring their knowledge boundaries, enabling them to decline to answer questions beyond their capacity rather than fabricating untruthful responses. However, we note this approach also poses unique challenges. For instance, RL-tuned LLMs may exhibit over-conservatism due to an imbalanced trade-off between helpfulness and honesty (Ouyang et al., 2022). An example of this is illustrated in Table 9. As observed in this case, ChatGPT tends to be overly hedged and refrains from providing a clear answer that it already knows, as evidenced in another dialogue turn. This could be attributed to the unreasonable design of the reward function or the poor quality of the training data for the reward model. We hope future work can take such problems into consideration.
4 Mitigation during Inference
Compared with the aforementioned training-time mitigation approaches, mitigating hallucinations at inference time could be more cost-effective and controllable. Therefore, most existing studies focus on this direction, which we introduce in detail in the following sections.
Decoding strategies, such as greedy decoding and beam search decoding, determine how we choose output tokens from the probability distribution generated by models (Zarrieß et al., 2021).
Lee et al. (2022) carry out a factuality assessment of content generated by LLMs using different decoding strategies. They find that nucleus sampling (a.k.a top-p sampling) (Holtzman et al., 2019) falls short of greedy decoding in terms of factuality. They argue that this underperformance could be attributed to the randomness introduced by top-p sampling to boost diversity, which may inadvertently lead to hallucinations since LLMs tend to fabricate information to generate diverse responses. In view of this, they introduce a decoding algorithm termed factual-nucleus sampling, which aims to strike a more effective balance between diversity and factuality by leveraging the strengths of both top-p and greedy decoding.
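One way to picture the balance between top-p and greedy decoding is a nucleus mass that shrinks as a sentence progresses and resets when a new sentence begins. The sketch below follows that idea as we understand it from Lee et al. (2022); the parameter names `decay` and `floor`, their default values, and the helper functions are assumptions for illustration, not the authors' implementation.

```python
import random


def decayed_top_p(step_in_sentence: int, p: float = 0.9,
                  decay: float = 0.9, floor: float = 0.3) -> float:
    """Nucleus mass at a 0-based step counted from the start of the sentence.

    The mass decays geometrically toward greedy-like behavior but never drops
    below `floor`; the caller resets the step to 0 at each new sentence.
    """
    return max(floor, p * decay ** step_in_sentence)


def nucleus_sample(probs: dict, top_p: float, rng: random.Random) -> str:
    """Sample from the smallest set of top tokens whose total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        mass += prob
        if mass >= top_p:
            break
    tokens, weights = zip(*nucleus)
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Early tokens of a sentence are sampled with a wide nucleus, which preserves diversity, while later tokens are drawn from an increasingly narrow one, which limits the room for fabricated continuations.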
Dhuliawala et al. (2023) develop a decoding framework known as the Chain-of-Verification (CoVe). This framework is based on the observation that independent verification questions typically yield more accurate facts than those presented in long-form answers. The CoVe framework initially plans verification questions, and then answers these questions to ultimately produce an enhanced, revised response. Experimental results on list-based questions, closed book QA, and long-form text generation demonstrate that CoVe can effectively mitigate hallucination.
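The plan, verify, and revise flow can be sketched as follows. The `llm` argument is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording is invented for illustration; neither is taken from Dhuliawala et al. (2023).

```python
from typing import Callable


def chain_of_verification(query: str, llm: Callable[[str], str]) -> str:
    """Draft an answer, verify its facts with independent questions, then revise."""
    draft = llm(f"Answer the question.\nQuestion: {query}")

    # Plan verification questions that probe the facts stated in the draft.
    plan = llm(
        "List verification questions, one per line, that check the facts "
        f"in this answer.\nQuestion: {query}\nAnswer: {draft}"
    )
    questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Answer each question on its own, without showing the draft, so that
    # errors in the draft are not simply repeated.
    checks = [(question, llm(question)) for question in questions]
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)

    return llm(
        "Revise the draft so it agrees with the verified facts.\n"
        f"Question: {query}\nDraft: {draft}\nVerified facts:\n{evidence}"
    )
```

Answering the verification questions in separate calls is the design choice that matters here: it reflects the observation that short, independent questions tend to yield more accurate facts than long-form answers.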
Another work, Li et al. (2023b), introduces a novel Inference-Time Intervention (ITI) method to improve the truthfulness of LLMs. This method is based on the assumption that LLMs possess latent, interpretable sub-structures associated with factuality. The ITI method comprises two steps: 1) fitting a binary classifier on top of each attention head of the LLM to identify a set of heads that exhibit superior linear probing accuracy for answering factual questions, and 2) shifting model activations along these factuality-related directions during inference. The ITI method leads to a substantial performance improvement on the TruthfulQA benchmark (Lin et al., 2021).
Distinct from the aforementioned studies, Shi et al. (2023b) instead concentrate on the retrieval-augmentation setting. Prior research has shown that LLMs sometimes fail to adequately attend to retrieved knowledge when addressing downstream tasks, particularly when the retrieved knowledge conflicts with the parametric knowledge of LLMs (Zhou et al., 2023b; Xie et al., 2023). To address this issue, Shi et al. (2023b) propose a straightforward context-aware decoding (CAD) strategy. The core idea of CAD is to perform a contrastive ensemble of $p_\theta(y_t \mid x, c, y_{<t})$ and $p_\theta(y_t \mid x, y_{<t})$, where $\theta$ represents the LM, $x$ is the input query, $c$ is the context, $y$ is the response, and $t$ is the time step. $p_\theta(y_t \mid x, c, y_{<t})$ means the generation probability distribution of the $t$-th token when given the context, while $p_\theta(y_t \mid x, y_{<t})$ denotes the distribution only considering the query. The CAD method aims to compel LLMs to pay more attention to contextual information instead of over-relying on their own parametric knowledge to make decisions. Experimental results show that CAD effectively elicits the ability of LLMs to exploit retrieved knowledge and thus reduces factual hallucinations on downstream tasks. Another work, DoLA (Chuang et al., 2023), also employs the idea of contrastive decoding to reduce hallucination. However, it contrasts the generation probabilities from different layers of LLMs, as the authors find that linguistic and factual information is encoded in distinct sets of layers.
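A minimal sketch of such a contrastive ensemble at a single decoding step is given below. It combines the two sets of next-token logits as (1 + alpha) times the with-context logits minus alpha times the query-only logits, which is our reading of the CAD formulation; the function names and the default `alpha` are assumptions, and in practice the logits would come from two forward passes of the same model.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_distribution(logits_with_context: Sequence[float],
                               logits_without_context: Sequence[float],
                               alpha: float = 0.5) -> List[float]:
    """Next-token distribution that amplifies what the context changes.

    alpha = 0 recovers ordinary decoding with the context in the prompt;
    larger alpha down-weights tokens favored by the query-only prior.
    """
    if len(logits_with_context) != len(logits_without_context):
        raise ValueError("both logit vectors must cover the same vocabulary")
    adjusted = [
        (1.0 + alpha) * with_ctx - alpha * without_ctx
        for with_ctx, without_ctx in zip(logits_with_context, logits_without_context)
    ]
    return softmax(adjusted)
```

For example, if the model without context strongly prefers one token but the context makes a second token equally likely, the adjusted distribution favors the second token, which is the intended bias toward retrieved evidence.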
Decoding strategies designed to mitigate hallucinations in LLMs during inference typically work in a plug-and-play manner. Therefore, this method is easy to deploy, making it promising for practical applications. However, most existing works in this direction require access to token-level output probabilities, while a substantial number of current LLMs can only return generated content through limited APIs (e.g., ChatGPT). Consequently, we encourage future research in this direction to explore a stricter black-box setting.
4.2 Resorting to External Knowledge
Using external knowledge as supplementary evidence to assist LLMs in providing truthful responses has recently emerged as a burgeoning solution (Ren et al., 2023; Mialon et al., 2023). This approach typically consists of two steps. The first step entails accurately obtaining knowledge related to the user instructions. Once useful knowledge has been obtained, the second step involves leveraging such knowledge to guide the generation of the responses. We provide a comprehensive review of the latest progress in this direction, focusing on the specific strategies employed in these two steps, respectively. We also present a summary of recent studies in Table 4.
LLMs have internalized vast amounts of knowledge into their parameters through extensive pre-training and fine-tuning, which can be referred to as parametric knowledge (Roberts et al., 2020). However, incorrect or outdated parametric knowledge can easily lead to hallucinations (Xie et al., 2023). To remedy this, researchers have proposed acquiring reliable, up-to-date knowledge from credible sources as a form of hot patching for LLMs (Lewis et al., 2020b; Li et al., 2022a). We summarize the two primary sources of such knowledge as follows.
External knowledge bases. The majority of existing works retrieve information from external knowledge bases, such as large-scale unstructured corpora (Cai et al., 2021; Borgeaud et al., 2022), structured databases (Liu, 2022; Li et al., 2023d), specific websites like Wikipedia (Yao et al., 2022; Peng et al., 2023a; Li et al., 2023c; Yu et al., 2023b), or even the entire Internet (Lazaridou et al., 2022; Yao et al., 2022; Gao et al., 2023a; Liu et al., 2023c). The evidence retrieval process typically employs various sparse (e.g., BM25 (Robertson et al., 2009)) or dense (e.g., PLM-based methods (Zhao et al., 2022)) retrievers. Search engines, such as Google Search, can also be viewed as a special kind of information retriever (Nakano et al., 2021; Lazaridou et al., 2022; Yao et al., 2022; Gao et al., 2023a). Besides, Luo et al. (2023c) propose the parameter knowledge guiding framework which retrieves knowledge from the parametric memory of fine-tuned white-box LLMs. Feng et al. (2023) try to teach LLMs to search relevant domain knowledge from external knowledge graphs to answer domain-specific questions.
External tools. In addition to solely retrieving information from knowledge bases, there are also many other tools that can provide valuable evidence to enhance the factuality of content generated by LLMs (Mialon et al., 2023; Qin et al., 2023; Qiao et al., 2023). For instance, FacTool (Chern et al., 2023) employs different tools to help detect hallucinations in LLMs for specific downstream tasks, such as search engine API for Knowledge-based QA, code executor for code generation, and Google Scholar API for scientific literature review. CRITIC (Gou et al., 2023) also enables LLMs to interact with multiple tools and revise their responses autonomously, which has been proven to effectively improve truthfulness.
Once relevant knowledge is obtained, it could be employed at different stages to mitigate hallucinations within LLMs. Existing methods for knowledge utilization can be roughly divided into two categories, as detailed below and illustrated in Figure 4.
Generation-time supplement. The most straightforward approach to utilize retrieved knowledge or tool feedback is to directly concatenate them with user queries before prompting LLMs (Shi et al., 2023c; Mallen et al., 2023; Ram et al., 2023). This method is both effective and easy to implement. Such knowledge is also referred to as context knowledge (Shi et al., 2023b). Existing studies have demonstrated that LLMs possess a strong capability for in-context learning (Dong et al., 2022), which enables them to extract and utilize valuable information from context knowledge to rectify nonfactual claims they previously generated.
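As a minimal illustration of this concatenation, the sketch below builds a single prompt from retrieved passages and the user query. The template wording and the function name are hypothetical; real systems differ in how they order, truncate, and cite the evidence.

```python
from typing import Sequence


def build_augmented_prompt(query: str, passages: Sequence[str]) -> str:
    """Prepend numbered evidence passages to the user query (hypothetical template)."""
    evidence = "\n".join(
        f"[{index}] {passage.strip()}"
        for index, passage in enumerate(passages, start=1)
    )
    return (
        "Use the evidence below to answer the question.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {query}\n"
        "Answer:"
    )
```

Numbering the passages is a small design choice that makes it easier to ask the model to cite which piece of evidence supports each claim.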
Post-hoc correction. Another common practice involves constructing an auxiliary fixer to rectify hallucinations during the post-processing stage (Cao et al., 2020; Zhu et al., 2021; Fabbri et al., 2022). The fixer can be either another LLM (Peng et al., 2023a; Zhang et al., 2023d; Chern et al., 2023; Gou et al., 2023) or a specific small model (Chen et al., 2023a). Such fixers first interact with external knowledge sources to gather sufficient evidence, and then correct hallucinations. For example, RARR (Gao et al., 2023a) directly prompts an LLM to ask questions about the content that needs to be corrected from multiple perspectives. Then it uses search engines to retrieve relevant knowledge. The LLM-based fixer finally makes corrections based on retrieved evidence. The Verify-then-Edit approach (Zhao et al., 2023a) aims to enhance the factuality of predictions by post-editing reasoning chains based on external knowledge sourced from Wikipedia. To achieve better performance, LLM-Augmenter (Peng et al., 2023a) prompts LLMs to summarize retrieved knowledge before feeding it into the fixer. Moreover, FacTool (Chern et al., 2023) and CRITIC (Gou et al., 2023) propose to utilize various external tools to obtain evidence for the fixer.
Resorting to external knowledge to mitigate hallucinations in LLMs offers several advantages. Firstly, this method circumvents the need for modifying LLMs, making it a plug-and-play and efficient solution. Secondly, it facilitates the easy transfer of proprietary knowledge (e.g., a company’s internal data) and real-time updated information to LLMs. Lastly, this approach enhances the interpretability of information generated by LLMs by allowing the tracing of generation results back to the source evidence (Gao et al., 2023b; Yue et al., 2023). However, this direction also presents some remaining challenges. We discuss some of them below.
Knowledge verification. In the era of LLMs, the external knowledge source could extend beyond a single document corpus or a specific website to encompass the entire Internet. However, information from the Internet is in the wild, which means it may be fabricated, or even generated by LLMs themselves (Alemohammad et al., 2023). How to verify the authenticity of knowledge retrieved from the Internet is an open and challenging problem.
Performance/efficiency of retriever/fixer. The performance of the retriever/fixer plays a vital role in ensuring the effects of hallucination mitigation. Future work may consider jointly optimising the whole working flow (retriever → LLM → fixer) via reinforcement learning (Qiao et al., 2023) or other techniques. Besides, the efficiency of the retriever/fixer is another important factor to be considered, as the generation speed of existing LLMs is already a significant burden (Ning et al., 2023).
Knowledge conflict. As introduced before, the retrieved knowledge may conflict with the parametric knowledge stored by LLMs (Qian et al., 2023). Shi et al. (2023b) reveal that LLMs may fail to sufficiently exploit retrieved knowledge when knowledge conflict happens. Xie et al. (2023) take a more cautious look at this phenomenon. How to fully utilize context knowledge is an under-explored question. For example, Liu et al. (2023d) find the performance of retrieval-augmented LLMs significantly degrades when they must access evidence in the middle of long contexts.
4.3 Exploiting Uncertainty
Uncertainty serves as a valuable indicator for detecting and mitigating hallucinations during the inference process (Manakul et al., 2023). Typically, it refers to the confidence level of model outputs (Jiang et al., 2021; Huang et al., 2023a; Duan et al., 2023). Uncertainty can assist users in determining when to trust LLMs. Provided that the uncertainty of LLM responses can be accurately characterized, users can filter out or rectify LLMs’ claims with high uncertainty since such claims are more prone to be fabricated ones (Lin et al., 2023).
Generally speaking, methods for estimating the uncertainty of LLMs can be categorized into three types (Xiong et al., 2023), as listed below. To facilitate understanding, we also present illustrative examples for these methods in Figure 5.
Logit-based estimation. The first method is the logit-based method, which requires access to the model logits and typically measures uncertainty by calculating token-level probability or entropy. This method has been widely used in the machine learning community (Guo et al., 2017).
Verbalize-based estimation. The second is the verbalize-based method, which involves directly requesting LLMs to express their uncertainty, such as using the following prompt: “Please answer and provide your confidence score (from 0 to 100).” This method is effective due to the impressive verbal and instruction-following capabilities of LLMs. Notably, Xiong et al. (2023) further suggest using chain-of-thought prompts (Wei et al., 2022) to enhance this method.
Consistency-based estimation. The third is the consistency-based method (Wang et al., 2022; Shi et al., 2022; Zhao et al., 2023a). This method operates on the assumption that LLMs are likely to provide logically inconsistent responses for the same question when they are indecisive and hallucinating facts.
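The logit-based and consistency-based estimates in the list above can be sketched with a few lines of standard-library Python. The function names are hypothetical, and the exact-string comparison used for consistency is a crude stand-in chosen for brevity; semantic similarity measures would normally replace it.

```python
import math
from collections import Counter
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_negative_log_prob(chosen_token_probs: Sequence[float]) -> float:
    """Average negative log-probability of the tokens the model actually emitted."""
    return -sum(math.log(p) for p in chosen_token_probs) / len(chosen_token_probs)


def disagreement(sampled_answers: Sequence[str]) -> float:
    """1 minus the share of the most common answer among several samples.

    Answers are compared after lowercasing and stripping whitespace, which is
    a simplification of real consistency metrics.
    """
    normalized = [answer.strip().lower() for answer in sampled_answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return 1.0 - top_count / len(normalized)
```

Higher values of any of the three scores indicate higher uncertainty. The first two require token probabilities from the model, whereas `disagreement` only needs several sampled responses and therefore also applies to black-box APIs.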
Several recent studies have leveraged uncertainty estimation for detecting and mitigating hallucinations in LLMs. SelfCheckGPT (Manakul et al., 2023) is the first framework to detect LLM hallucinations based on uncertainty measurement in a zero-resource and black-box setting. They employ a consistency-based approach for uncertainty estimation. A non-trivial challenge in SelfCheckGPT is determining how to measure the consistency of different responses. Manakul et al. (2023) perform experiments with BERTScore (Zhang et al., 2019), QA-based metrics (Wu and Xiong, 2023) and n-gram metrics. They finally find that a combination of these approaches yields the best results. Mündler et al. (2023) directly utilize an additional LLM to assess whether two LLM responses are logically contradictory given the same context (Luo et al., 2023b), which means at least one of them is hallucinated. Consequently, they employ another LLM to revise such self-contradictory hallucinations from two responses. Agrawal et al. (2023) further adopt the verbalize-based method to evaluate the hallucination rate of LLMs for fabricating references. Varshney et al. (2023), on the other hand, use the logit-based method to detect false concepts in LLMs’ responses with high uncertainty. They then fix such content with auxiliary retrieval-augmented LLMs.
Besides, Zhao et al. (2023b) present a Pareto optimal self-supervision framework. This framework utilizes available programmatic supervision to assign a risk score to LLM responses, which can serve as an indicator of hallucinations. Luo et al. (2023a) introduce a pre-detection self-evaluation technique, which aims to evaluate the familiarity of LLMs with the concepts in user prompts and prevent the generation of content about those unfamiliar concepts.
Exploiting uncertainty to identify and mitigate LLM hallucinations is a promising research direction today. Three primary approaches exist for estimating the uncertainty of LLMs, each presenting its unique challenges. Firstly, the logit-based method is becoming less applicable for modern commercial LLMs as they are usually closed-source and black-box, rendering their output logits inaccessible. Secondly, regarding the verbalize-based method, researchers have observed that LLMs tend to display a high degree of overconfidence when expressing their confidence (Xiong et al., 2023). Thirdly, the effective measurement of the consistency of different responses remains an unresolved issue in the consistency-based method (Manakul et al., 2023). We believe that leveraging uncertainty is crucial in developing trustworthy LLMs and encourage future research to address the aforementioned challenges in this field.
5 Other Methods
In addition to the above approaches, other techniques demonstrating the potential for reducing hallucinations are shown below.
Some recent research has sought to address the hallucination problem in LLMs from a multi-agent perspective, wherein multiple LLMs (also known as agents) independently propose and collaboratively debate their responses to reach a single consensus, as exemplified in Figure 6. Du et al. (2023) is a pioneering work in this line. They initially developed a benchmark for assessing the factual accuracy of prominent computer scientist biographies generated by LMs. Their findings reveal that an individual LLM can easily generate hallucinated information within this benchmark; however, such hallucinations can be mitigated by engaging multiple LLMs in a debate to achieve consensus. Besides, Cohen et al. (2023) ask one LLM to generate claims (acting as Examinee) and another to raise questions about these claims and check the truthfulness of them (acting as Examiner). Wang et al. (2023d) instead propose prompting a single LLM to identify, simulate, and iteratively self-collaborate with multiple personas, such as Harry Potter Fan and Jay Chou Fan. By leveraging an LLM as a cognitive synergist, it effectively reduces hallucinations with relatively low costs.
Existing research highlights that the behavior of LLMs can significantly vary based on the prompts given by users (Si et al., 2022; Zhu et al., 2023). In terms of hallucination, users may encounter an LLM that initially responds accurately but begins to hallucinate information when using different prompts. In light of this observation, Zhang et al. (2023a) endeavour to engineer more effective prompts to mitigate hallucination. Concretely, they employ the chain-of-thought prompt (Wei et al., 2022) to compel LLMs to generate reasoning steps before providing the final answers. However, chain-of-thought may introduce new challenges, one of which is the potential for hallucinated reasoning steps. Furthermore, a popular practice nowadays involves explicitly instructing LLMs not to disseminate false or unverifiable information when designing the “system prompt”, i.e., the special messages used to steer the behavior of LLMs. The following system prompt used for Llama 2-Chat (Touvron et al., 2023b) exemplifies this approach: “If you don’t know the answer to a question, please don’t share false information.”
Azaria and Mitchell (2023) contend that LLMs may be aware of their own falsehoods, implying that their internal states could be utilized to detect hallucinations. They propose Statement Accuracy Prediction based on Language Model Activations (SAPLMA), which adds a classifier on top of each hidden layer of the LLM to determine truthfulness. Experimental results indicate that LLMs might “know” when the statements they generate are false, and SAPLMA can effectively extract such information. The Inference-Time Intervention (ITI) method (Li et al., 2023b) is also grounded in a similar hypothesis. It further shifts model activations along factuality-related directions in selected attention heads during inference, and the authors find that this can mitigate hallucinations. These studies suggest that “the hallucination within LLMs may be more a result of generation techniques than the underlying representation” (Agrawal et al., 2023).
Zhang et al. (2023c) posit that a potential cause of hallucination in LLMs could be the misalignment between knowledge and user questions, a phenomenon that is particularly prevalent in the context of retrieval-augmented generation (RAG). To address this issue, they introduce MixAlign, a human-in-the-loop framework that utilizes LLMs to align user queries with stored knowledge, and further encourages users to clarify this alignment. By refining user queries iteratively, MixAlign not only reduces hallucinations but also enhances the quality of the generated content.
Several studies have explored modifying the architecture of LMs to mitigate hallucinations. Examples include the multi-branch decoder (Rebuffel et al., 2022) and the uncertainty-aware decoder (Xiao and Wang, 2021). Li et al. (2023g) suggest employing a bidirectional autoregressive architecture in the construction of LLMs, which enables language modeling from both left-to-right and right-to-left. They claim that this design strategy could contribute to the reduction of hallucinations by effectively leveraging bidirectional information.
Outlooks
In this section, we discuss a few unresolved challenges in the investigation of hallucinations within LLMs and offer our insights into potential future research directions.
Although considerable effort has been dedicated to building evaluation benchmarks for quantitatively assessing hallucination in LLMs, there are still issues that need to be solved. The automatic evaluation in generation-style hallucination benchmarks cannot accurately reflect performance or align with human annotation. Such inaccuracy is reflected in two ways: (1) the automatic metric does not perfectly align with human annotations (Lin et al., 2021; Min et al., 2023; Muhlgay et al., 2023); (2) the reliability of the automatic metric varies across texts from different domains or generated by different LLMs (Min et al., 2023), resulting in reduced robustness for generalization. Although discrimination-style benchmarks (Li et al., 2023a; Muhlgay et al., 2023) can evaluate a model’s ability to distinguish hallucinations relatively accurately, the relationship between discrimination performance and generation performance remains unclear. These issues all need more in-depth exploration.
Existing work in LLM hallucination primarily focuses on English, despite the existence of thousands of languages in the world. We hope that LLMs can possess the ability to handle various languages uniformly. Some previous studies have investigated the performance of LLMs on some multi-lingual benchmarks (Ahuja et al., 2023; Lai et al., 2023), and collectively found that their performance degenerates when generalizing to non-Latin languages. In terms of the hallucination problem, Guerreiro et al. (2023a) observe that multi-lingual LLMs predominantly struggle with hallucinations in low-resource languages in the translation task. Potential follow-up work could include systematically measuring and analyzing LLM hallucinations across a wide variety of languages. As shown in Table 11, we find that LLMs such as ChatGPT provide accurate answers in English but expose hallucinations in other languages, leading to multilingual inconsistencies. The transfer of knowledge within LLMs from high-resource languages to low-resource ones also presents an interesting and promising research direction.
In an effort to improve the performance of complex multi-modal tasks, recent studies have proposed replacing the text encoder of existing vision-large models with LLMs, resulting in large vision-language models (LVLMs) (Liu et al., 2023b; Ye et al., 2023). Despite their success, some research reveals that LVLMs inherit the hallucination problem from LLMs and exhibit more severe multi-modal hallucinations compared to smaller models. For instance, Li et al. (2023e) discuss the object hallucination of LVLMs, wherein LVLMs generate content containing objects that are inconsistent with or absent from the input image, such as the example in Figure 7. To effectively measure object hallucinations generated by LVLMs, Liu et al. (2023a) propose a GPT4-Assisted Visual Instruction Evaluation (GAVIE) benchmark. Gunjal et al. (2023) introduce a multi-modal hallucination detection dataset named M-HalDetect and further study unfaithful descriptions and inaccurate relationships beyond object hallucinations in LVLMs. Furthermore, in addition to images, some studies have extended LLMs to other modalities such as audio (Wu et al., 2023a; Su et al., 2023) and video (Maaz et al., 2023), making it interesting to investigate hallucination in these new scenarios.
As elaborated in Section 4, hallucinations in LLMs may primarily stem from the memorization of false information or the absence of correct factual knowledge. To mitigate these issues in LLMs with minimal computational overhead, the concept of model editing has been introduced (Sinitsin et al., 2020; De Cao et al., 2021). This approach modifies the behavior of models in a manner that is both data- and computation-efficient. At present, there are two mainstream paradigms for model editing. The first incorporates an auxiliary sub-network (Mitchell et al., 2022; Huang et al., 2023b), while the second directly modifies the original model parameters (Meng et al., 2022a, b). This technique may be instrumental in eliminating LLMs’ hallucinations by deliberately editing their stored factual knowledge (Lanham et al., 2023; Onoe et al., 2023). However, this emerging field still faces numerous challenges, including editing black-box LLMs (Murty et al., 2022), in-context model editing (Zheng et al., 2023a), and multi-hop model editing (Zhong et al., 2023).
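To make the second paradigm (direct modification of parameters) concrete, the sketch below applies a rank-one update to a single toy linear layer so that a chosen key vector maps to a new value vector. It is an illustration under simplifying assumptions and is not the procedure of Meng et al. or any other cited work: the helper names `rank_one_edit` and `matvec`, the 2x2 weight matrix, and the key and value vectors are all hypothetical. The update is W' = W + (v - Wk) kᵀ / (kᵀk), which guarantees W'k = v and leaves the output unchanged for any input orthogonal to k.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k = v.

    W is a list of rows, k is the key (input) vector, and v is the
    desired value (output) vector for that key.
    """
    kk = sum(x * x for x in k)
    if kk == 0:
        raise ValueError("key vector must be non-zero")
    residual = [vi - wki for vi, wki in zip(v, matvec(W, k))]
    # Add the outer product residual * k^T / (k^T k) to W, row by row.
    return [
        [w + r * x / kk for w, x in zip(row, k)]
        for row, r in zip(W, residual)
    ]


# Toy example: an identity layer whose response to one key is rewritten.
W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 0.0]       # hypothetical key standing in for the edited fact
v = [0.0, 2.0]       # hypothetical new value to associate with k
W_edited = rank_one_edit(W, k, v)

other = [0.0, 1.0]   # an input orthogonal to k
```

In this toy setting the edit is exact for `k` and has no effect on `other`. In a real LLM, keys for different facts are not orthogonal, which is one reason edits can have unintended side effects on related knowledge.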
As previously discussed, both researchers and companies have made significant efforts to ensure that LLMs produce truthful responses, ultimately improving the overall user experience. Cutting-edge commercial LLMs, such as GPT4 (OpenAI, 2023b), appear to have acquired a decent ability to generate proper responses to factuality-related queries. However, they are not invincible. Several studies show that LLMs can be manipulated using techniques like meticulously crafted jailbreak prompts to elicit arbitrary desired responses (Wei et al., 2023a; Zou et al., 2023), including hallucinations. Consequently, attack and defense strategies concerning induced hallucinations could also be a promising research direction. This is particularly important because the generation of fabricated information could breach relevant laws, leading to the forced shutdown of LLM applications. This direction is also closely tied to the robustness of existing hallucination mitigation methods.
Given that current research on hallucinations in LLMs is still in its early stages, there are many other intriguing and promising avenues for further investigation. For instance, researchers have begun to treat LLMs as agents for open-world planning in the pursuit of AGI (Park et al., 2023; Wang et al., 2023a). Addressing the hallucination problem within the context of LLMs-as-agents presents new challenges and holds considerable practical value. In addition, analyzing and tracing LLM hallucinations from a linguistic perspective is another interesting research topic. Rawte et al. (2023) show that the occurrence of LLM hallucination is closely related to linguistic nuances of the user prompts, such as readability, formality, and concreteness. We believe all these directions merit thorough exploration in future research.
Conclusion
With their strong understanding and generation capabilities in the open domain, LLMs have garnered significant attention from both academic and industrial communities. However, hallucination remains a critical challenge that impedes the practical application of LLMs. In this survey, we offer a comprehensive review of the most recent advances, primarily those following the release of ChatGPT, that aim to evaluate, trace, and eliminate hallucinations within LLMs. We also delve into the existing challenges and discuss potential future directions. We hope this survey will serve as a valuable resource for researchers intrigued by the mystery of LLM hallucinations, thereby fostering the practical application of LLMs.
Acknowledgments
We would like to thank Yu Wu and Yang Liu for their valuable suggestions.