Hallucinations in Large Multilingual Translation Models

Nuno M. Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, André F. T. Martins

Introduction

Recent advancements in large-scale multilingual machine translation have brought us closer to realizing a universal translation system: a single model capable of handling numerous languages and translation directions (Aharoni et al., 2019; Arivazhagan et al., 2019; Fan et al., 2020; Zhang et al., 2020; Wenzek et al., 2021; Goyal et al., 2022; NLLB Team et al., 2022). Concurrently, general-purpose large language models (LLMs) have exhibited a remarkable ability to generalize to new tasks, including translation, where they are becoming increasingly strong (Brown et al., 2020; Chowdhery et al., 2022; Hendy et al., 2023). Compared to traditional bilingual models, these systems can offer significant performance improvements and greatly simplify engineering efforts, as a single model can be used for all language pairs (Arivazhagan et al., 2019). As a result, they are an increasingly attractive choice for real-world applications. However, when deployed in the wild, these models may still generate hallucinations: highly pathological translations that can severely damage user trust and pose serious safety concerns (Perez et al., 2022; Kumar et al., 2022).

The problem of hallucinations has long been recognized by researchers (Ji et al., 2022), and recent studies have contributed towards better understanding, detection and mitigation of these pathological translations. However, these studies have been conducted on small bilingual models (<100M parameters) trained on a single English-centric high-resource language pair (Raunak et al., 2021; Ferrando et al., 2022; Guerreiro et al., 2022b, a; Dale et al., 2022; Xu et al., 2023). This leaves a knowledge gap regarding the prevalence and properties of hallucinations in large-scale translation models across different translation directions, domains and data conditions.

In this work, we aim to fill this gap by investigating hallucinations on two different classes of models. The first and main class in our analysis is the de facto standard approach of massively multilingual supervised models: we use the M2M-100 family of multilingual NMT models (Fan et al., 2020), which includes the largest open-source multilingual NMT model with 12B parameters. The second class is the novel and promising approach of leveraging generative LLMs for translation. Contrary to conventional NMT models, these models are trained on massive amounts of monolingual data in many languages, with a strong bias towards English, and do not require parallel data. In our analysis, we use ChatGPT (https://openai.com/blog/chatgpt; this system has not been documented, so details of training data and training regime are unknown), an LLM that has been shown to achieve surprisingly high translation quality over a wide range of language pairs (Hendy et al., 2023; Peng et al., 2023).

We organize our study by analyzing the two prevalent types of hallucinations in NMT considered in the literature: hallucinations under perturbation and natural hallucinations (Lee et al., 2018; Raunak et al., 2021; Guerreiro et al., 2022b). Firstly, we study hallucinations under perturbation and evaluate whether these translation systems are robust to source-side artificial perturbations. While previous studies have found that these perturbations (e.g., spelling errors and capitalization mistakes) can reliably induce hallucinations (Lee et al., 2018; Raunak et al., 2021), it is not clear whether those conclusions hold for large multilingual models. Secondly, we comprehensively investigate natural hallucinations, and evaluate their prevalence and properties in the outputs of the massively multilingual M2M models on a vast range of conditions, spanning from English-centric to non-English-centric language pairs, translation directions with little supervision, and specialized sensitive domains where hallucinations have devastating impact on user trust (e.g., medical data). Finally, we study a hybrid setup where other translation systems can be requested as fallback systems when an original system hallucinates, with the aim of mitigating hallucinations and improving overall translation quality.

Our analysis reveals several key insights on the prevalence and properties of hallucinations, including:

multilingual models predominantly struggle with hallucinations in low-resource language pairs and translating out of English, with hallucination rates well above 10% for some translation directions;

hallucinations in low-resource language pairs can manifest toxic patterns that can be traced back to the training data, posing serious safety issues;

smaller distilled models can mitigate hallucinations by incorporating modeling choices that discourage them, such as leveraging less potent shallow decoders that rely more on the encoder representations, and reducing bias towards higher-resource language pairs through uniform sampling of translation directions during distillation;

ChatGPT produces hallucinations that are qualitatively different from those of conventional translation models, mostly consisting of off-target translations, overgeneration, and even failed attempts to translate;

hallucinations are sticky and hard to reverse with models that share the same training data and architecture, whereas employing more diverse models as fallback systems can substantially improve overall translation quality and eliminate pathologies like oscillatory hallucinations.

We release all our code and make available over a million translations in more than 100 translation directions to spur future research. All resources will be made available at https://github.com/deep-spin/lmt_hallucinations.

Background

Massively multilingual neural machine translation has recently emerged as a powerful paradigm for building machine translation systems that can handle numerous languages (Akhbardeh et al., 2021; Wenzek et al., 2021; NLLB Team et al., 2022; Siddhant et al., 2022; Bapna et al., 2022; Chowdhery et al., 2022). These systems aim to translate directly with a single model for multiple language pairs without relying on any pivot language.

The dominant strategy for achieving these systems is to train large multilingual models on vast amounts of parallel data often obtained through a combination of data mining and data augmentation strategies, such as backtranslation (Sennrich et al., 2016; Edunov et al., 2018). Compared to classic bilingual models, the multilinguality of these systems results in significant improvements, particularly for low-resource and non-English-centric language pairs, as these benefit the most from multilingual transfer (Arivazhagan et al., 2019; Fan et al., 2020).

As an alternative, a novel and promising strategy is to leverage the emergent capabilities of large language models (LLMs). These systems are pretrained on massive nonparallel corpora and can be prompted to solve arbitrary tasks (Radford et al., 2019; Brown et al., 2020). In fact, this approach has led to impressive results across a wide variety of NLP tasks (Chowdhery et al., 2022; Zhang et al., 2022). Translation is no exception: LLMs can produce fluent and adequate translations, especially for high-resource English-centric language pairs, that are competitive with those of dedicated supervised translation models (Vilar et al., 2022; Peng et al., 2023; Garcia et al., 2023; Hendy et al., 2023; Bawden and Yvon, 2023).

Hallucinations in Machine Translation

Hallucinations lie at the extreme end of translation pathologies and present a critical challenge in machine translation, as they can severely compromise the safety and reliability of real-world applications.

Importantly, hallucinations in machine translation are unlike hallucinations in other natural language generation tasks (e.g., abstractive summarization and generative question answering) (Ji et al., 2022). While, for these other tasks, models often produce hallucinated outputs (Falke et al., 2019; Cao et al., 2022; Manakul et al., 2023), hallucinations in machine translation, possibly attributed to the more closed-ended nature of the task, are substantially rarer and hard to observe in clean, unperturbed data. This has led several previous studies to examine their properties by creating artificial scenarios where hallucinations are more likely to occur (e.g., introducing perturbations in the source text (Lee et al., 2018) or noise in the training data (Raunak et al., 2021)). To distinguish these two scenarios, hallucinations in machine translation are categorized into two types (Raunak et al., 2021): hallucinations under perturbation and natural hallucinations.

A model generates a hallucination under perturbation when it produces a significantly lower quality translation for a slightly perturbed input compared to the original input (Lee et al., 2018). Hallucinations under perturbation explicitly reveal the lack of robustness of translation systems to perturbations in the source text (e.g., misspellings or capitalization errors) by finding translations that undergo significant negative shifts in quality due to these changes.

Natural hallucinations, in contrast to hallucinations under perturbation, occur naturally without any perturbation of the source. As a result, they are rare and challenging to study. In this work, we follow the taxonomy introduced in Raunak et al. (2021) and later extended in Guerreiro et al. (2022b). Under this taxonomy, hallucinations are translations that contain content that is detached from the source text. To distinguish between different types of hallucinations, they can be categorized as largely fluent detached hallucinations or oscillatory hallucinations. The former refers to translations that bear minimal or no relation at all to the source, while the latter refers to inadequate translations that contain erroneous repetitions of words and phrases.

Experimental Suite

In this section, we provide an overview of the models, datasets and evaluation metrics used throughout our study.

We focus on two classes of models: (i) conventional supervised multilingual NMT models, and (ii) LLMs that can be prompted for translation.

For the supervised multilingual NMT models, we use the transformer-based (Vaswani et al., 2017) M2M-100 family of models (Fan et al., 2020), which consists of three variants with different sizes: M2M (S) with 418M parameters, M2M (M) with 1.2B parameters, and M2M (L) — the largest available open-source multilingual NMT model — with 12B parameters. These models were trained on a many-to-many parallel dataset comprising 7.5B sentences crawled from the web, and support 100 languages and thousands of translation directions. We also experiment with SMaLL100 (Mohammadshahi et al., 2022), a shallow multilingual NMT model with 330M parameters obtained via distillation of M2M (L). Unlike the M2M models, SMaLL100 was trained on a much smaller training set with uniform sampling across all language pairs to reduce the bias towards high-resource languages: only 100k parallel sentences from the original M2M training data were used for each translation direction, for a total of 456M parallel sentences. For decoding, we run beam search with a beam size of 4. All experiments were run on fairseq (Ott et al., 2019).

As for the alternative strategy using LLMs, we use ChatGPT (gpt-3.5-turbo; https://platform.openai.com/docs/models/gpt-3-5; we used the API in March 2023), a variant of GPT3.5 (a GPT-family large-scale model with 175B parameters; Radford and Narasimhan, 2018; Radford et al., 2019; Brown et al., 2020) that has been fine-tuned with human feedback in the style of InstructGPT (Ouyang et al., 2022). ChatGPT has been shown to achieve impressive results for multiple multilingual NLP tasks, including translation (Kocmi and Federmann, 2023; Lu et al., 2023; Fu et al., 2023; Hendy et al., 2023; Peng et al., 2023). To generate translations, we use the zero-shot prompt template used in Hendy et al. (2023) and keep the generation parameters as the default API parameters. (We encountered several API/server errors when prompting ChatGPT for translation with temperature 0, particularly for low-resource language pairs and languages with lower coverage scripts. Those errors are alleviated, although not entirely eliminated, when the default parameters are used.)

Datasets

We carefully selected datasets based on two main criteria: their familiarity to researchers and practitioners, and the avoidance of train/test overlap for the M2M models. (ChatGPT’s training data is not publicly available. As such, we cannot guarantee that it has not been exposed to the data we use in our analysis.) To this end, we chose to use premier translation benchmarks: Flores-101 (Goyal et al., 2022), WMT and TICO (Anastasopoulos et al., 2020). Flores-101 is a high-quality multi-parallel dataset that consists of Wikipedia text in 101 languages and allows for the assessment of hallucinations across a vast range of translation directions; we join the dev and devtest subsets for evaluation. For WMT, we used the same benchmarks as those used in the original M2M paper evaluation suite, as these were explicitly removed from the training data. Additionally, we selected recent WMT test sets from the WMT21 and WMT22 campaigns as they were released after the models were trained. In contrast to these general-purpose datasets, TICO is a specialized medical-domain multilingual benchmark that includes COVID-19 related data, such as medical papers and news articles; we join the dev and test sets. Full details about the datasets can be found in Appendix B.

Evaluation Metrics

Throughout our work, we focus mainly on sentence-level evaluation. Our main lexical metric is spBLEU (Goyal et al., 2022), as it has been widely employed in works on massively multilingual translation (Fan et al., 2020; Wenzek et al., 2021; Mohammadshahi et al., 2022; NLLB Team et al., 2022) and offers fairer evaluation for low-resource languages compared to BLEU (Papineni et al., 2002). We use spBLEU as implemented in Sacrebleu (Post, 2018), with signature nrefs:1|case:mixed|eff:yes|tok:flores101|smooth:exp|version:2.3.1. Moreover, we follow the most recent MT metrics shared-task recommendations (Freitag et al., 2022) and also adopt neural metrics. We use the latest reference-based and reference-free COMET variants: COMET-22 (Rei et al., 2022a) and CometKiwi (Rei et al., 2022b). Lastly, we use the cross-lingual encoder LaBSE (Feng et al., 2022) to obtain sentence similarity scores, as these have been successfully employed in prior research on detection of natural hallucinations (Guerreiro et al., 2022a; Dale et al., 2022).

Hallucinations under Perturbation

We start our analysis by focusing on artificially created hallucinations. We first provide an overview of our experimental setting, focusing on the construction of the perturbed data and detection approach. Then, we present our results and analyze the properties of these hallucinations across different resource levels and models.

To construct the perturbed source sequences, we apply the same minimal perturbations used in Xu et al. (2023): misspelling of words, insertion of frequent tokens at the beginning of the source sequence, and capitalization errors. For full details on the construction of the perturbed data, refer to Appendix C.1.
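The three perturbation types can be illustrated with a minimal Python sketch. The actual procedure is the one specified in Appendix C.1 and Xu et al. (2023); the function names, the adjacent-character swap used as a misspelling, the single-word case flip, and the token list below are all assumptions made purely for illustration.

```python
import random

# Hypothetical list of frequent tokens; the real list is defined in Appendix C.1.
FREQUENT_TOKENS = ["the", "of", "and", "to", "in"]


def misspell(sentence: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters of one random word (assumed scheme)."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep the first and last characters in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def insert_frequent_token(sentence: str, rng: random.Random) -> str:
    """Prepend a frequent token to the source sequence."""
    return f"{rng.choice(FREQUENT_TOKENS)} {sentence}"


def capitalization_error(sentence: str, rng: random.Random) -> str:
    """Flip the case of one randomly chosen word (assumed scheme)."""
    words = sentence.split()
    if not words:
        return sentence
    i = rng.randrange(len(words))
    words[i] = words[i].swapcase()
    return " ".join(words)
```

Passing an explicit `random.Random` instance keeps the perturbed data reproducible across models, so that every system is evaluated on the same perturbed sources.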

We use the Flores dataset for these experiments, and focus specifically on translation out of English. We selected all bridge languages, as well as additional low-resource languages that were underrepresented among bridge languages. (In the M2M paper (Fan et al., 2020), a bridge language is defined as a language that connects languages across language groupings, e.g., Romance or Slavic languages, and is mined against all other bridge languages. It is usually one of the most resourced languages within a language grouping and its purpose is to reduce the number of bitext pairs while preserving translation directions of practical interest.) Overall, we generate translations for 31 different language pairs (LPs). We present the language pairs and more details on our choice of languages in Appendix C.1.

Results

Our detection approach is inspired by that of previous works on hallucinations under perturbation (Lee et al., 2018; Raunak et al., 2021; Ferrando et al., 2022; Xu et al., 2023). The algorithm is a simple 2-rule process: we fix (i) a minimum threshold quality score for the original translations, and (ii) an extremely low maximum quality score for the perturbed translations. A model generates a hallucination under perturbation when both translations meet the thresholds. Crucially, rule (i) ensures that low-quality translations for unperturbed sources are not considered as candidates for hallucinations under perturbation. (Note that low-quality translations for unperturbed sources fall under the scope of the study on natural hallucinations, which follows in subsequent sections of the paper.)

We extend this algorithm to handle multiple models and language pairs by adapting rule (i). We first obtain source sentences for which all models produce translations that meet a minimum quality threshold (spBLEU > 9). Then, we sort them according to average quality across the different models, and select the top 20% as candidates. Finally, we apply rule (ii) and set the threshold to spBLEU < 3. We selected both thresholds based on the choices made in previous works (Raunak et al., 2021; Ferrando et al., 2022; Xu et al., 2023).

This approach ensures a fixed sample size across different language pairs, and that the sentences analyzed for each language pair are consistent across all models. Moreover, it allows us to effectively detect hallucinations under perturbation across multiple models in a multilingual scenario in a scalable manner, while accounting for the unique quality trends observed across different models and languages. (Note that detection of hallucinations under perturbation does not explicitly target detachment from the source text. We provide a broader discussion on the difference between this detection approach and that of natural hallucinations, introduced later in Section 5.1, in Appendix C.2.)
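The extended two-rule procedure can be sketched as follows. This is a minimal illustration, assuming sentence-level spBLEU scores have already been computed for every model on both the original and the perturbed sources; the function name and data layout are hypothetical, and taking the top 20% of the sentences that pass the first filter is one reading of the description above.

```python
from statistics import mean


def detect_hallucinations_under_perturbation(
    original_scores: dict[str, list[float]],
    perturbed_scores: dict[str, list[float]],
    min_original: float = 9.0,    # rule (i): spBLEU > 9 for every model
    max_perturbed: float = 3.0,   # rule (ii): spBLEU < 3 after perturbation
    top_fraction: float = 0.2,    # share of eligible sentences kept as candidates
) -> dict[str, list[int]]:
    """Return, per model, the indices of source sentences flagged as hallucinations.

    original_scores[model][i] and perturbed_scores[model][i] hold the spBLEU
    score of source sentence i before and after perturbation.
    """
    models = list(original_scores)
    n_sentences = len(original_scores[models[0]])

    # Rule (i), step 1: keep sentences where every model clears the quality bar.
    eligible = [
        i for i in range(n_sentences)
        if all(original_scores[m][i] > min_original for m in models)
    ]
    # Rule (i), step 2: rank by average quality across models, keep the top share.
    eligible.sort(
        key=lambda i: mean(original_scores[m][i] for m in models), reverse=True
    )
    candidates = eligible[: int(len(eligible) * top_fraction)]

    # Rule (ii): the perturbed translation must fall below the low-quality bar.
    return {
        m: [i for i in candidates if perturbed_scores[m][i] < max_perturbed]
        for m in models
    }
```

Because the candidate set is computed once from all models jointly, every model is judged on the same source sentences, which is what makes the per-model hallucination rates comparable.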

We show aggregated results in Table 1 and language-specific results in Figure 1. Overall, they reveal that perturbations have the potential to trigger hallucinations under perturbation, even in larger models. In what follows, we highlight several noteworthy trends found in our results.

Table 1 shows that all models, with the exception of ChatGPT, which we analyze separately below, exhibit lower hallucination rates as resource levels increase. This is expected and suggests that models are better equipped to handle source-side perturbations for language pairs with more parallel data during training. In fact, hallucinations under perturbation for high-resource languages are almost non-existent. However, Figure 1 reveals variability across languages, and even within the models in the same family that have been trained on the same data. For instance, when translating to Asturian (ast), M2M (L) and its distilled version SMaLL100 have significantly higher hallucination rates than the smaller M2M (S). Thus, hallucinations under perturbation may emerge in other non-trivial ways unrelated to the training data.

Recall that SMaLL100 was trained using uniform sampling across all language pairs to prevent bias towards higher resourced language pairs. The results in Table 1 may reflect one positive outcome from such an approach: despite being much smaller than M2M (L), SMaLL100 hallucinates less and for fewer languages than its teacher model for low- and mid-resource language pairs.

The common approach for detection of hallucinations under perturbation (see Section 4.1) raises an interesting question: are the original source sentences for which models produce higher quality translations less likely to lead to hallucinations when perturbed? Our analysis found a very weak correlation (according to Pearson correlation; see Appendix C.2) between hallucinations under perturbation and spBLEU scores for the original unperturbed sources across all models. This indicates that even minimal perturbations in the source text can cause models to undergo significant shifts in translation quality.
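The correlation analysis can be reproduced in spirit with a few lines of Python. This is only a sketch: it assumes one binary flag per source sentence (1 if the perturbed translation was detected as a hallucination, 0 otherwise) paired with the spBLEU score of the original translation; the exact setup used in the paper is described in Appendix C.2, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

With a binary flag as one of the variables, this quantity is the point-biserial correlation; a value close to zero means that a high original spBLEU score offers little protection against hallucinating once the source is perturbed.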

Table 1 shows that, contrary to traditional models, ChatGPT generates more hallucinations for mid-resource languages than for low-resource languages. In fact, it surprisingly produces fewer hallucinations for low-resource languages than any other model. Moreover, ChatGPT’s hallucinations are qualitatively different from those of other models: they often consist of off-target translations (identified via automatic language identification with the fasttext (Joulin et al., 2016) LID model lid.176.bin), overgeneration, or even failed attempts to translate (e.g., “This is an English sentence, so there is no way to translate it to Vietnamese”; we provide further examples in Appendix C.2). Furthermore, unlike traditional NMT models that frequently produce oscillatory character hallucinations, ChatGPT does not generate any such hallucinations under perturbation. This is further evidence that translation errors, even severely critical ones, obtained via prompting an LLM are different from those produced by traditional machine translation models (Vilar et al., 2022; Garcia et al., 2023; Hendy et al., 2023; Bawden and Yvon, 2023).

Interestingly, we also found that the vast majority of the hallucinations can be reversed with further sampling from the model. (We also found this to be the case with a one-shot prompt.) This connects to findings in Guerreiro et al. (2022b); Manakul et al. (2023): as with traditional NMT models, hallucinations with an LLM may not necessarily indicate model defect or incapacity to generate adequate translations, and may just result from “bad luck” during generation.

Natural Hallucinations

Let us now turn to investigating natural hallucinations. (From now on, we will use the terms natural hallucinations, covering both detached and oscillatory hallucinations, and hallucinations interchangeably.) We first provide an in-depth overview of our evaluation setting, focusing on the scenarios and detection methodology. Subsequently, we present a thorough analysis, exploring diverse properties of natural hallucinations such as their different types, the influence of translation direction, and prevalence of toxicity.

Analyzing massively multilingual translation models opens up several research scenarios that have not been studied in previous works that focused solely on bilingual models. We will take advantage of this opportunity and investigate natural hallucinations in three different evaluation scenarios, studying more than 100 translation directions in the main text alone.

We start with an English-centric scenario where we pair 32 different languages with English for a total of 64 translation directions. Then, we study a non-English-centric scenario inspired by Fan et al. (2020), where we explore 25 language pairs corresponding to real-world use cases of translation not involving English (e.g., translating Greek directly to Turkish). Finally, we assess the prevalence of hallucinations on sensitive medical data where they can have a devastating impact on user trust. We pair 9 different languages with English for a total of 18 directions. We present all the translation directions investigated in these setups in Appendix D.1. We report results for the first two setups using the Flores dataset in the main text and WMT in Appendix D.2. For the final setup, we use the medical-domain TICO dataset.

We integrate key findings from recent research on detection of hallucinations and focus on two main detectors: ALTI+ (Ferrando et al., 2022) for detached hallucinations, and top n-gram (TNG) (Raunak et al., 2021, 2022; Guerreiro et al., 2022b) for oscillatory hallucinations.

ALTI+ evaluates the relative contributions of both source and target prefixes to model predictions. As hallucinations are translations detached from the source sequence, ALTI+ can effectively detect them by identifying sentences with minimal source contribution. Notably, it faithfully reflects model behavior and explicitly signals model detachment from the source text in any translation direction (Ferrando et al., 2022). In previous works, this method has been successfully employed to detect hallucinated toxicity in a multilingual context in NLLB Team et al. (2022), and it has been validated on human-annotated hallucinations in Dale et al. (2022), where it was demonstrated that ALTI+ scores easily separate detached hallucinations from other translations. We followed the recommendations in Guerreiro et al. (2022b) and set model-based ALTI+ thresholds based on validation data where the models are expected to perform well. Specifically, we obtained the lowest 0.02% of the ALTI+ score distributions for high-resource WMT benchmarks, in line with natural hallucination rates reported in the literature (Raunak et al., 2022). Additionally, to ensure further trustworthy, high-precision measurements, we excluded detected candidates with LaBSE or CometKiwi scores exceeding the top 10% of scores on translations from the same WMT benchmarks, as these metrics have also been validated for detection of human-annotated detached hallucinations (Dale et al., 2022; Guerreiro et al., 2022a).
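The thresholding logic around ALTI+ can be sketched in Python as below. The sketch assumes that ALTI+ source-contribution scores, LaBSE similarities and CometKiwi scores have already been computed elsewhere; all function and parameter names are hypothetical, and reading "exceeding the top 10% of scores" as "falling within the top 10% of validation scores" is an interpretation rather than a confirmed detail.

```python
import math

ALTI_LOWER_FRACTION = 0.0002    # lowest 0.02% of validation ALTI+ scores
QUALITY_UPPER_FRACTION = 0.10   # top 10% of validation LaBSE / CometKiwi scores


def tail_threshold(scores, fraction, upper=False):
    """Boundary score of the lowest (or, with upper=True, highest) `fraction`."""
    ordered = sorted(scores, reverse=upper)
    # The small epsilon guards against floating-point round-off in fraction * n.
    k = max(1, math.floor(fraction * len(ordered) + 1e-9))
    return ordered[k - 1]


def is_detached_hallucination(alti, labse, cometkiwi, alti_max, labse_top, kiwi_top):
    """Flag a translation with very low source contribution, unless a quality
    signal places it among the best validation translations."""
    if alti > alti_max:
        return False
    return labse < labse_top and cometkiwi < kiwi_top
```

In this reading, `alti_max` would be obtained per model as `tail_threshold(validation_alti, ALTI_LOWER_FRACTION)`, while `labse_top` and `kiwi_top` would come from `tail_threshold(..., QUALITY_UPPER_FRACTION, upper=True)` on the same validation translations.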

TNG, on the other hand, is a straightforward, lightweight black-box heuristic targeting oscillatory hallucinations. It works by comparing the count of the top repeated translation n-gram to the count of the top repeated source n-gram, ensuring the difference is at least t. This approach has been validated on human-annotated hallucinations and found to identify oscillatory hallucinations with perfect precision (Guerreiro et al., 2022b). We follow previous work by using n = 4 and t = 2 (Raunak et al., 2021; Guerreiro et al., 2022b) and excluding translations that meet the reasonable quality threshold outlined in Section 4.1. Note that oscillatory hallucinations can be simultaneously detected with ALTI+ and TNG.
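A minimal Python sketch of the TNG heuristic is given below. Whitespace tokenization and the function names are assumptions made for illustration, and the additional exclusion of translations that meet the quality threshold is left out for brevity.

```python
from collections import Counter


def top_ngram_count(tokens: list[str], n: int) -> int:
    """Count of the most frequent n-gram in `tokens` (0 if there are no n-grams)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(ngrams.values(), default=0)


def is_oscillatory(source: str, translation: str, n: int = 4, t: int = 2) -> bool:
    """Flag a translation whose top repeated n-gram occurs at least `t` more
    times than the top repeated n-gram of the source."""
    src_count = top_ngram_count(source.split(), n)
    hyp_count = top_ngram_count(translation.split(), n)
    return hyp_count - src_count >= t
```

Comparing against the source count, rather than thresholding the translation count alone, avoids flagging faithful translations of sources that are themselves repetitive.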

We rely on ALTI+, a model-based detector, for reliable detection of detached hallucinations. Since we lack access to glass-box internal features from ChatGPT, we exclude it from our model selection to ensure consistency in our analysis. It is important to note that using alternative detectors could lead to misleading results and create discrepancies between the evaluation scenarios for ChatGPT and other models. Nonetheless, we will further examine ChatGPT in Section 6, exploring various aspects such as the generation of oscillatory hallucinations and translation quality in scenarios where other models produce hallucinations.

English-Centric Translation

We start by investigating natural hallucinations on English-centric language pairs. We reveal key insights on how properties of hallucinations change across resource levels, models and translation directions. We present language-pair specific results in Appendix D.2.

Table 2 shows that hallucinations occur frequently for low-resource directions, with all M2M models exhibiting average hallucination rates exceeding 10%. Furthermore, all models generate hallucinations for the vast majority of low-resource language pairs. Regarding the type of hallucinations, Figure 2 demonstrates that, in contrast to mid- and high-resource language pairs, oscillatory hallucinations are less prevalent, while detached hallucinations occur more frequently in low-resource languages. This reveals that models tend to rely less on the source context when translating to or from low-resource languages. Importantly, although massive multilingual models have significantly improved translation quality for low-resource languages, these findings not only suggest that there is considerable room for improvement, but also highlight potential safety concerns arising from translations in these directions.

Despite having the smallest number of parameters, SMaLL100 shows remarkably low hallucination rates across low- and mid-resource language pairs, hallucinating significantly less than its larger counterparts in low-resource settings. These improved rates may be attributed not only to the uniform sampling of language pairs during training, but also to architectural decisions. While SMaLL100 shares a 12-layer encoder with the other models to process source representations, it diverges by employing a shallow 3-layer decoder (instead of a 12-layer decoder) and placing the target language code on the encoder side. We hypothesize that this design encourages greater reliance on the more complex encoder representations, reducing the likelihood of detachment from the source. In fact, distinct patterns in ALTI+ scores (shown in Appendix D.2) support this hypothesis: SMaLL100 consistently demonstrates higher source contributions and similar patterns across all resource levels. In contrast, M2M models show a greater tendency to rely less on the source, especially in low-resource language pairs. Importantly, however, SMaLL100’s reduced hallucination rates do not necessarily imply superior translation quality compared to the other M2M models: we observed a strong correlation between M2M models’ corpus-level COMET-22 scores and their respective hallucination rates for low-resource languages, whereas for SMaLL100 the correlation is weak. This indicates that despite detaching less from the source content, SMaLL100’s translations are not necessarily of higher quality than those of other M2M models. This and other statistics can be found in Appendix D.2.

As shown in Table 2, increasing the size of the M2M family models results in consistent reductions in hallucination rates. Relative improvements are more pronounced for mid- and high-resource language pairs, with M2M (L) exhibiting fewer hallucinations and hallucinating for fewer languages than all other models.

Table 3 demonstrates that models are significantly more prone to hallucinate when translating out of English. In fact, in line with the observations of Ferrando et al. (2022), we found that models tend to detach more from the source text when translating out of English. This is evidenced by ALTI+ source contributions being lower across all language pairs in this direction compared to translating into English. Interestingly, we discovered that the translation direction can also influence the properties of hallucinations: (i) over 90% of off-target hallucinations occur when translating out of English, and (ii) nearly all hallucinations into English for mid- and high-resource language pairs are oscillatory.

Toxic text in translations can emerge in the form of hallucinations (NLLB Team et al., 2022). To assess the prevalence of toxic text in detected hallucinations, we utilized the toxicity wordlists provided by NLLB Team et al. (2022). We found that toxic text primarily appears in translations out of English and almost exclusively affects low-resource language pairs. For instance, over 1 in 8 hallucinations in Tamil contain toxic text. Interestingly, these toxic hallucinations not only exhibit high lexical overlap among them, but are repeated across models for multiple unique source sentences. Moreover, they are not necessarily reduced by scaling up the model size. These observations suggest that these hallucinations are likely to be traced back to toxic patterns in the training data, aligning with observations in Raunak et al. (2021); Guerreiro et al. (2022b). (Upon inspecting the Common Crawl corpora that were used to create the training data, we found reference translations that exactly match the toxic hallucinations.) Moreover, we also found that these hallucinations can be propagated through model distillation, as evidenced by SMaLL100 generating toxic hallucinations that are copies of those of its teacher model. This underlines the necessity of rigorously filtering training data to ensure safe and responsible use of these models in real-world applications.

3 Beyond English-Centric Translation

We shift our focus to translation directions that do not involve English, typically corresponding to directions with less supervision during training. We present language-pair specific results in Appendix D.3.

Table 4 reveals trends that largely mirror those observed in the English-centric setup: (i) hallucinations are more frequent in low-resource settings; (ii) SMaLL100 significantly outperforms the M2M models in low-resource language pairs; and (iii) scaling up to M2M (L) consistently yields substantial improvements over the smaller M2M models in low- and mid-resource directions. (Comparing absolute hallucination rates between the two setups is not advised, as they involve different translation directions, which may render such comparisons unreliable.) Additionally, the trends related to hallucination types also hold across the two setups: detached hallucinations are more prevalent in low-resource settings, while oscillatory hallucinations overwhelmingly dominate in mid- and high-resource directions (see Appendix D.3).

As expected, models struggle more with hallucinations for directions with less or even no supervision during training, such as ro-hy and af-zu. For instance, M2M (M) hallucinates for nearly half of the translations in these directions.

4 Translation on Specialized Domains

We now turn to investigating hallucinations in data from the medical domain, where they can have devastating consequences. Using the TICO dataset, we compare hallucination rates with the Flores dataset for 18 translation directions. We present language-pair specific results in Appendix D.4.

Table 5 reveals that hallucination rates for the TICO medical data do not consistently exceed those observed for the Flores Wikipedia data. This finding diverges from previous works that investigated hallucinations for specialized domain data (Wang and Sennrich, 2020; Müller et al., 2020). We hypothesize that, in contrast with the smaller models typically trained on limited datasets from a single domain used in those works, the concept of "domain shift" may not be as pronounced for M2M models. These models are not only much larger but, crucially, they are trained on a dataset containing over 7 billion parallel sentences gathered from the web, which encompasses a broad array of domains. This massive training set potentially mitigates the impact of domain shift and, consequently, reduces its influence on hallucinations.

Building upon our analysis of natural hallucinations in the previous section, we now explore the potential of reducing hallucinations and enhancing overall translation quality by employing a simple hybrid setup that can take advantage of multiple systems with possibly complementary strengths. Put simply, we leverage an alternative system as a fallback when the original model produces hallucinations. Our analysis in the main text is focused on the more extensive English-centric setup. We provide results on the non-English-centric setup in Appendix E.
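As a rough illustration of this hybrid setup, the sketch below routes a source sentence to a fallback system whenever a detector flags the original model's output. The callables `primary`, `fallback`, and `is_hallucination` are placeholders for whatever translation systems and detection method are used; none of these names come from the released code.

```python
from typing import Callable, Tuple

Translator = Callable[[str], str]
# (source, translation) -> True if the translation is flagged as a hallucination
Detector = Callable[[str, str], bool]


def translate_with_fallback(
    source: str,
    primary: Translator,
    fallback: Translator,
    is_hallucination: Detector,
) -> Tuple[str, bool]:
    """Translate with the primary system; switch to the fallback if flagged.

    Returns the chosen translation and whether the fallback was used.
    """
    hypothesis = primary(source)
    if is_hallucination(source, hypothesis):
        return fallback(source), True
    return hypothesis, False
```

The fallback is only invoked on flagged outputs, so the extra cost scales with the hallucination rate of the original model rather than with the size of the test set.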

We begin by analyzing the performance of same-family models when employed as fallback systems for one another (e.g., using SMaLL100, M2M (M), and M2M (L) as fallbacks for M2M (S)). For simplicity, we consider the distilled SMaLL100 as a model from the M2M family.

Figure 3 reveals that when employing M2M models as fallback systems, reversal rates—percentage of hallucinations from the original system that are corrected by the fallback system—are consistently higher for oscillatory hallucinations than for detached hallucinations. These findings not only corroborate those in Guerreiro et al. (2022b), where oscillatory hallucinations were found to be less related to model defects, but also further emphasize the close connection between detached hallucinations and training data. This connection can help explain their stickiness: since the M2M models share the same training data, reversing these hallucinations using other same-family models as fallbacks is more challenging. Interestingly, we also observe that M2M (L) particularly struggles to reverse the detached hallucinations generated by its distilled counterpart SMaLL100, suggesting that model defects can persist and be shared during distillation.
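The reversal rate defined above can be computed as in the following sketch. It assumes that a hallucination counts as corrected when the fallback translation for the same source is not itself flagged by the detector; that reading, and the function name `reversal_rate`, are assumptions made for illustration.

```python
from typing import Sequence


def reversal_rate(original_flags: Sequence[bool], fallback_flags: Sequence[bool]) -> float:
    """Percentage of original-system hallucinations not flagged for the fallback system.

    Both sequences are aligned by source sentence; True marks a detected hallucination.
    """
    if len(original_flags) != len(fallback_flags):
        raise ValueError("flag sequences must be aligned and of equal length")
    hallucinated = [i for i, flagged in enumerate(original_flags) if flagged]
    if not hallucinated:
        return 0.0
    reversed_count = sum(1 for i in hallucinated if not fallback_flags[i])
    return 100.0 * reversed_count / len(hallucinated)
```

Computing the rate separately over the oscillatory and detached subsets gives the per-type comparison discussed in this section.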

In line with our analysis in Section 5.2, Figure 3 shows that reversal rates using SMaLL100 as a fallback system are higher for detached hallucinations than for oscillatory hallucinations. In fact, although SMaLL100 is a distilled M2M model, its training data, training procedure, and architecture differ from those of the M2M models. This distinction may make it more complementary as a fallback system to other M2M models than simply scaling up within the same model family. This suggests that merely increasing the scale of models within the same family is not an effective strategy for mitigating hallucinations, and exploring alternative models with different architectures and trained on different data could yield more substantial improvements. In the next section, we will analyze this alternative strategy.

2 Employing external models as fallback systems

Motivated by the findings from the previous section, we will now study how models that are not from the M2M family can be employed to further mitigate hallucinations and improve translation quality. We will test this approach with two different models: (i) we will prompt ChatGPT as detailed in Section 3, and (ii) we will use a high-quality 3.3B parameter model from the NLLB family of multilingual NMT models (NLLB) proposed in NLLB Team et al. (2022).

Figure 4(a) demonstrates that external fallback systems, particularly NLLB, can significantly enhance translation quality of originally hallucinated translations compared to same-family models. This improvement is especially notable for low-resource languages, where both ChatGPT and NLLB consistently boost translation quality. Remarkably, NLLB generally outperforms ChatGPT as a fallback system for low- and mid-resource languages, aligning with the findings in Hendy et al. (2023), which revealed that GPT models have limited capabilities for lower-resourced languages and lag behind dedicated translation models in those settings. Nonetheless, ChatGPT still surpasses dedicated M2M translation systems in these resource levels when used as a fallback system, underscoring the limitations of relying on same-family models as fallback systems. For reference, in the English-centric setup, the averaged COMET-22 corpus-level scores for low-, mid-, and high-resource LPs obtained with M2M (L) are 73.63, 86.54, and 87.19, respectively. We provide fallback system quality scores for all language pairs in Appendix E.

From Figure 4(b), we see another benefit of employing external fallback systems: oscillatory hallucinations are almost entirely eliminated. Interestingly, consistent with our findings in Section 4, we observe that ChatGPT produces very few, if any, oscillations, slightly improving the rates obtained with NLLB. This provides further evidence that, although hallucinations obtained via prompting LLMs may still occur, they exhibit different properties and surface forms. Investigating and understanding these differences in hallucination properties presents an interesting research path for future work.

Conclusion

We have comprehensively investigated the phenomenon of hallucinations in massively multilingual translation models. By departing from the settings studied in previous work that focused on bilingual models trained on high-resource language pairs, we were able to explore a wide range of research scenarios that remained overlooked.

Our analysis revealed several key insights on the prevalence and properties of hallucinations across various models of different scale, translation directions, and data conditions, including: the prevalence of hallucinations across multiple translation directions across different resource levels and beyond English-centric translation; the emergence of toxicity in hallucinations; and the effect of scaling up within the same model family on the prevalence of hallucinations. Additionally, we explored how fallback systems can mitigate hallucinations and improve overall translation quality. We found that hallucinations can be sticky and difficult to reverse when using models that share the same training data and architecture. However, by leveraging other external models, we can significantly improve translation performance and virtually eliminate pathologies such as oscillatory hallucinations.

To support future research on this topic, we are open-sourcing our code and releasing over a million translations and detection results across several models and language pairs.

Our study mainly focuses on the M2M family of multilingual models. We chose this family of models as it includes several models at different sizes and the largest open-source multilingual NMT model. It is unclear how our findings generalize to other families of multilingual models (e.g., the NLLB family of models).

Our detection approaches inherit the limitations of the metrics they rely on. For instance, following previous work, we adopt BLEU to detect hallucinations under perturbation. However, this and other lexical metrics ranked worse than reference-based neural metrics in last year’s WMT22 Metrics Shared Task (Freitag et al., 2022).

We analyzed ChatGPT as it has demonstrated impressive capabilities for translation and other multilingual tasks, such as MT evaluation. Unfortunately, the model remains behind API walls and documentation is scarce. As such, we could not ensure that ChatGPT was not trained on our evaluation sets, nor could we evaluate the contribution of the source text to ChatGPT’s translations, which would have enabled detection of detached hallucinations. Despite these limitations, we believe our findings provide relevant insights into the properties of translations generated by the model.

We would like to thank Meta AI for open-sourcing the M2M models and maintaining libraries such as stopes (Andrews et al., 2022) and nllb (NLLB Team et al., 2022). The work is partially supported by the European Research Council (ERC StG DeepSPIN 758969), by EU’s Horizon Europe Research and Innovation Actions (UTTER, contract 101070631), by the FCT through contract UIDB/50008/2020, and by the projects MAIA and NextGenAI (LISBOA-01-0247-FEDER-045909 and 2022-C05i0102-02). Part of this work was performed using HPC resources from GENCI-IDRIS (Grant 2022-AD01101838).

Appendix A List of Languages

We summarize the languages used in our work in Table 6. We determine the resource levels following Goyal et al. (2022). We consider very-low resource languages as low-resource.

Appendix B Dataset Statistics

The Flores benchmark (Goyal et al., 2022) consists of a multi-parallel dataset in 101 languages with sentences extracted from Wikipedia. The sentences in the dataset are translated from original English sentences. We join the dev and devtest splits for a total of 2009 parallel sentences for each translation direction.

The TICO benchmark (Anastasopoulos et al., 2020) is a multilingual dataset specialized in the medical domain obtained by combining English open-source data from various sources, such as scientific articles, government health announcements, and others. We join the dev and test set for a total of 2100 parallel sentences for each translation direction.

We selected the same English-centric WMT benchmarks that were used as evaluation sets in the original M2M paper (Fan et al., 2020). Additionally, we selected recent WMT test sets from the WMT21 and WMT22 campaigns as they were released after the models were trained. A summary of the datasets used for each translation direction can be found in Table 7.

Appendix C Hallucinations under Perturbation

We randomly misspell words by changing the characters with a probability of 0.01 (using the ButterFingersPerturbation from the NL-Augmenter framework; Dhole et al., 2022); insert a token chosen randomly from the 50 most frequent tokens or punctuation in the test set; and randomly title-case words with a probability of 0.1, guaranteeing that at least one word’s capitalization is perturbed.
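The three perturbations can be sketched as follows. This is an illustrative approximation rather than the exact procedure: the misspelling step substitutes a uniformly random lowercase letter instead of calling the NL-Augmenter ButterFingersPerturbation, the insertion position is assumed to be uniformly random, and all function names are hypothetical.

```python
import random
import string
from collections import Counter


def misspell(sentence, rng, prob=0.01):
    """Replace each alphabetic character by a random lowercase letter with probability `prob`."""
    return "".join(
        rng.choice(string.ascii_lowercase) if ch.isalpha() and rng.random() < prob else ch
        for ch in sentence
    )


def most_frequent_tokens(corpus, k=50):
    """Return the k most frequent whitespace-separated tokens (punctuation included)."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    return [tok for tok, _ in counts.most_common(k)]


def insert_frequent_token(sentence, frequent_tokens, rng):
    """Insert one randomly chosen frequent token at a random position (assumed uniform)."""
    words = sentence.split()
    words.insert(rng.randint(0, len(words)), rng.choice(frequent_tokens))
    return " ".join(words)


def _title(word):
    # Avoids str.title(), which would also capitalize after apostrophes.
    return word[:1].upper() + word[1:].lower()


def title_case_words(sentence, rng, prob=0.1):
    """Title-case each word with probability `prob`, changing at least one word if possible."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if _title(w) != w]
    if not candidates:
        return sentence  # nothing can change, e.g. every word is already title-cased
    chosen = [i for i in candidates if rng.random() < prob]
    if not chosen:
        chosen = [rng.choice(candidates)]  # guarantee one perturbed word
    for i in chosen:
        words[i] = _title(words[i])
    return " ".join(words)
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the perturbed test sets reproducible across runs.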

We pair the following languages with English: af, ar, ast, bn, cs, cy, de, el, es, fa, fi, fr, he, hi, hr, hu, id, ja, ko, lt, nl, oc, pl, pt, ru, sv, sw, tl, tr, vi, zh. We selected all bridge languages (see Table 6) and additional low-resource languages, as they are underrepresented among the bridge languages. Although Tamil is a bridge language, we did not include it in our analysis, as we could not find enough candidates with reasonable quality for hallucinations under perturbation after applying rule (i) of the detection approach (see Section 4.1).

C.2 Supplementary Results

We present the statistics in Table 8. As mentioned in the main text, the Pearson correlation between the original translation quality and the detection outputs of hallucinations under perturbation is very weak across all models.
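For reference, the Pearson correlation between two paired series, such as per-item quality scores and detection outputs, can be computed as in the sketch below. How the two series are paired (for example, per language pair or per sentence) is an assumption left to the caller, and the function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two aligned sequences with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient close to zero, as reported here, indicates that original translation quality says little about whether a model hallucinates under perturbation.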

As we have mentioned in the main text, hallucinations under perturbation, in contrast to natural hallucinations, explicitly reveal the lack of robustness of translation systems. Put simply, hallucinations under perturbation are cases where the model is not robust and, as a result, undergoes significant negative shifts in translation quality due to perturbations. This means that they do not necessarily entail detachment from the source. Hallucinations under perturbation have been studied as such in previous work, and detection of these translations with the 2-step method that we adopted in our work (see Section 4.1) has been consistently used in all of previous research. We decided to keep the same designations, definitions and detection approach so as to be consistent with previous work. Nevertheless, we assess here whether the detected hallucinations under perturbation are, in fact, detached or oscillatory hallucinations. We found that more than 85% of hallucinations under perturbation would be detected with our detection approach for natural hallucinations. Inspection of the non-detected translations revealed that they usually contain critical mistranslation errors, but do not necessarily exhibit detachment from the source content (e.g., off-target translations or copies of the source).

We provide examples of hallucinations under perturbation in Figure 5.

Appendix D Natural Hallucinations

Hereinafter, we will use the terms natural hallucinations and hallucinations interchangeably.

We pair the following languages with English: ar, ast, az, bn, cs, cy, de, el, es, fa, fi, fr, he, hi, hr, hu, id, ja, ko, lt, nl, oc, pl, ps, pt, ru, sv, sw, ta, tr, vi, zh. Similarly to the choice for the experiments on hallucinations under perturbation, we selected all the bridge languages and additional low-resource languages.

We analyzed the following translation directions: hi-bn, it-fr, de-hu, it-de, cs-sk, nl-fr, fr-sw, ro-ru, ro-uk, de-hr, hr-sr, be-ru, hr-hu, hr-cs, el-tr, hr-sk, nl-de, af-zu, ro-hu, hi-mr, ro-tr, uk-ru, ro-hy, ar-fr, ro-de. We selected language pairs that were used in the analysis from the original M2M paper on real-world settings for many-to-many translation (Fan et al., 2020). They represent translation directions that are commonly used in countries that have official and regional languages that do not include English.

We pair the following languages with English: ar, fr, hi, id, ps, pt, ru, sw, zh. We selected languages from the TICO benchmark that are supported by the M2M models.

D.2 English-Centric Translation

We present spBLEU and COMET-22 corpus-level scores for the Flores setup in Table D.2.1.

We present the distribution of ALTI+ scores for all translation directions in Figure 6. For the M2M models, we find the same trends presented in Ferrando et al. (2022): smaller source contributions when translating out of English and for low-resource language pairs. Remarkably, it is also clear from the distributions that ALTI+ scores for SMaLL100 are not only significantly higher than those of the M2M models, but they are also more consistent across different resource levels.

D.2.2 Evaluation with WMT.

We present the statistics in Table 9. As we have remarked in the main text, SMaLL100’s reduced hallucination rates for low-resource language pairs do not necessarily equate to superior translation quality compared to the other M2M models. We observed a strong correlation between M2M models’ corpus-level COMET-22 scores and their respective hallucination rates for low-resource languages, whereas for SMaLL100 the correlation is weak. This indicates that despite detaching less from the source content, SMaLL100’s translations are not necessarily of higher quality than those of other M2M models. Interestingly, the correlations for the M2M models are significantly weakened on high-resource language pairs, mirroring observations from previous work (Lee et al., 2018).

We present a heatmap with hallucination rates for each model for all translation directions in Figure 7.

We present spBLEU and COMET-22 corpus-level scores for the WMT setup in Table D.2.2.

We present a heatmap with the relative prevalence of hallucinations detected only by TNG (oscillatory hallucinations) in Figure 9.

We present spBLEU and COMET-22 corpus-level scores in Table 13.

We present a heatmap with hallucination rates for each model for all translation directions in Figure 10.

D.4 Specialized Domains

We present a heatmap with the relative prevalence of hallucinations detected only by TNG (oscillatory hallucinations) in Figure 11. The trends follow those presented in the analysis in the main text (see Section 5.2).

We present spBLEU and COMET-22 corpus-level scores in Table D.4.

We present a heatmap with the relative prevalence of hallucinations detected only by TNG (oscillatory hallucinations) in Figure 13. Similarly to the non-English-centric setup, the trends follow those presented in the analysis in the main text (see Section 5.2).

We present COMET-22 scores on the original model hallucinated translations for every translation direction in Figure 14.

We present the results on employing external models as fallback systems in Figure 15, and COMET-22 scores on the original model hallucinated translations for each translation direction in the non-English-centric setup in Figure 16. Importantly, the trends analyzed in-depth in the main text largely hold in this setup as well. In contrast with the analysis for the English-centric setup, ChatGPT struggles to outperform all other models in non-English-centric directions, especially at low- and mid-resource levels. This is expected: as ChatGPT was trained on a non-parallel, heavily English-centric corpus, it is likely that it struggles more with translation directions that do not include English.

Appendix F Examples of Natural Hallucinations

We provide several examples of English-centric natural hallucinations generated by the M2M models in Figure 17, which spans 2 pages.