Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li

Introduction

With the increasing scale of parameters and training corpora, large language models (LLMs) have gained a universal ability to handle a variety of tasks via in-context learning (ICL; Brown et al., 2020), which allows language models to perform tasks given only a few exemplars and human-written instructions as context. One area where LLMs have shown outstanding potential is machine translation (MT). Previous studies have shown the surprising performance of LLMs on high-resource bilingual translation, such as English-German translation (Vilar et al., 2022; Zhang et al., 2022), even though these models are not specifically optimized on multilingual data.

However, the multilingual translation ability of LLMs remains under-explored. Multilingual machine translation (MMT) is a challenging task that involves translating text among many languages and requires semantic alignment between them (Fan et al., 2021; Costa-jussà et al., 2022; Yuan et al., 2023). It is also unclear how LLMs acquire translation ability and which factors affect it.

In this paper, we follow the ICL paradigm and study LLMs in multilingual machine translation by answering two questions: 1) How do LLMs perform on MMT across a massive number of languages? 2) Which factors affect the performance of LLMs?

For the first question, we evaluate several popular LLMs: English-centric LLMs, including OPT (Zhang et al., 2022), LLaMA2 (Touvron et al., 2023) and Falcon (Almazrouei et al., 2023), and multilingual LLMs, including XGLM (Lin et al., 2022), BLOOMZ (Scao et al., 2022), ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023). We consider 102 languages and 606 translation directions (202 English-centric, 202 French-centric and 202 Chinese-centric directions). Results show that the multilingual translation capabilities of LLMs are continually improving and that GPT-4 reaches a new performance height. Compared with the widely-used supervised MMT system NLLB (Costa-jussà et al., 2022), GPT-4 achieves higher performance in 40.91% of English-centric translation directions. But compared with the commercial translation system Google Translator, LLMs still have a long way to go, particularly on low-resource languages. French-centric and Chinese-centric translation are more challenging for GPT-4 than English-centric translation, which further indicates its unbalanced capability across languages.

For the second question, we find some new working patterns. First, LLMs are able to perform translation even with unreasonable instructions if in-context learning exemplars are given. However, if given mismatched translation pairs as in-context exemplars, LLMs fail to translate, which is similar to observations from concurrent studies (Wei et al., 2023). This shows the importance of exemplars in ICL for machine translation. Second, we find that cross-lingual translation pairs can be surprisingly good exemplars for low-resource translation, even better than exemplars in the same language. Third, we discover that LLMs can acquire translation ability in a resource-efficient way and generate moderate translations even for zero-resource languages.

The main contributions of this paper are summarized below:

We benchmark popular LLMs on MMT in 102 languages and 606 translation directions, covering English-centric, French-centric and Chinese-centric translation.

We systematically compare the results of LLMs and three strong supervised baselines (M2M-100, NLLB, Google Translator) and reveal the gap between two translation paradigms.

We find some new ICL working patterns of LLMs for MMT and discuss corresponding advantages and challenges.

Background

Language modeling, the task of predicting the probability of the next token, is a long-standing task in natural language processing (Bengio et al., 2000; Mikolov et al., 2010; Khandelwal et al., 2020). The Transformer (Vaswani et al., 2017) is the backbone of existing LLMs.

LLMs show great potential as universal multi-task learners. Radford et al. (2019) find that a causal decoder-only language model can be a multi-task learner with merely an unsupervised training corpus. Later, Kaplan et al. (2020) reveal the scaling law of LLMs, indicating that as the number of parameters and the amount of training data keep increasing, LLMs can be further strengthened. Wei et al. (2022b) show that scaling the language model also brings astonishing emergent abilities, e.g., in-context learning, which are only present in large models. Consequently, more and more effort has been put into scaling up language models (Brown et al., 2020; Hoffmann et al., 2022; Scao et al., 2022; Vilar et al., 2022; Ren et al., 2023). Among them, GPT-4 (OpenAI, 2023) and ChatGPT (OpenAI, 2022) are the most representative systems, which show impressive results in various NLP tasks.

Emergent Ability: In-context Learning

In-context learning is one of the well-known emergent abilities (Brown et al., 2020; Dong et al., 2022), which enables an LLM to learn target tasks according to the prompt without updating any parameters.

Specifically, the prompt is made up of in-context exemplars $\{(\mathcal{X}_{i},\mathcal{Y}_{i})\}_{i=1}^{k}$ and an in-context template $\mathcal{T}$. Exemplars are often picked from supervised data, where $\mathcal{Y}_{i}$ is the ground truth corresponding to the input sentence $\mathcal{X}_{i}$. The template $\mathcal{T}$ is usually a human-written instruction related to the target task. Wrapping the exemplars with the template and concatenating them together produces the final prompt:

$$\mathcal{P}=\mathcal{T}(\mathcal{X}_{1},\mathcal{Y}_{1})\oplus\mathcal{T}(\mathcal{X}_{2},\mathcal{Y}_{2})\oplus\cdots\oplus\mathcal{T}(\mathcal{X}_{k},\mathcal{Y}_{k})$$

where $\oplus$ denotes the concatenation symbol, e.g., whitespace or a line-break. During inference, the LLM is able to generate the corresponding output $\mathcal{Y}$ of the test sample $\mathcal{X}$ under the guidance of the prompt:

$$\mathcal{Y}\sim p_{\theta}\big(\mathcal{Y}\mid\mathcal{P}\oplus\mathcal{T}(\mathcal{X},\cdot)\big)$$

For label prediction tasks, the prediction $\mathcal{Y}$ can be obtained in one-step generation. For sequence generation tasks, e.g., machine translation, the prediction $\mathcal{Y}$ can be obtained through decoding strategies such as greedy search and beam search.
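The prompt construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names `wrap` and `build_prompt`, the `{src}`/`{tgt}` placeholder syntax, and the toy German-English pairs are all assumptions made for the example.

```python
from typing import List, Tuple


def wrap(template: str, src: str, tgt: str = "") -> str:
    """Apply the template T to one (source, target) pair.

    The {src}/{tgt} placeholder syntax is a hypothetical choice for this sketch.
    """
    return template.format(src=src, tgt=tgt)


def build_prompt(
    exemplars: List[Tuple[str, str]],
    test_source: str,
    template: str = "{src}={tgt}",
    sep: str = "\n",
) -> str:
    """Wrap every exemplar with the template and concatenate them with `sep`.

    The test sample is wrapped last with an empty target slot, so the model
    is expected to continue the prompt with the translation.
    """
    parts = [wrap(template, x, y) for x, y in exemplars]
    parts.append(wrap(template, test_source))
    return sep.join(parts)


# Toy data, invented for illustration only.
exemplars = [("Guten Morgen", "Good morning"), ("Danke", "Thank you")]
print(build_prompt(exemplars, "Hallo"))
```

Here `sep` plays the role of the concatenation symbol $\oplus$; passing `sep=" "` would concatenate with whitespace instead of a line-break.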

Experiment Setup

We benchmark multilingual translation on the Flores-101 dataset (Goyal et al., 2022), which enables an assessment of model quality on a wide range of languages. Considering the prohibitive API cost of evaluating a massive number of languages, we evaluate LLMs on the first 100 sentences of each direction’s test set in the benchmarking experiment. In the analysis experiment, we use the full test set.

LLMs

We evaluate the translation performance of eight popular LLMs: XGLM-7.5B (Lin et al., 2022), OPT-175B (Zhang et al., 2022), BLOOMZ-7.1B (Scao et al., 2022), Falcon-7B (Almazrouei et al., 2023), LLaMA2-7B (Touvron et al., 2023), LLaMA2-7B-chat (Touvron et al., 2023), ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023).

ICL strategy

For each model, we report its translation performance with eight randomly-picked translation pairs from the corresponding development set as in-context exemplars and “<X>=<Y>” as the in-context template, where “<X>” and “<Y>” are the placeholders for the source and target sentence. We use the line-break as the concatenation symbol. According to our experimental analysis, this ICL strategy serves as a simple but strong recipe. All implementation is based on OpenICL (https://github.com/Shark-NLP/OpenICL; Wu et al., 2023).
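As a rough sketch of this recipe, the snippet below draws eight random pairs from a development pool and joins them with line-breaks using an “=” template. It does not use OpenICL; the helper names `sample_exemplars` and `icl_prompt`, the fixed seed, and the toy development pool are assumptions introduced only for illustration.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def sample_exemplars(dev_pairs: List[Pair], k: int = 8, seed: int = 0) -> List[Pair]:
    """Pick k translation pairs uniformly at random, without replacement.

    A fixed seed (an assumption of this sketch) keeps the selection reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(dev_pairs, k)


def icl_prompt(exemplars: List[Pair], test_source: str) -> str:
    """Format each pair as 'source=target' and join everything with line-breaks."""
    lines = [f"{src}={tgt}" for src, tgt in exemplars]
    lines.append(f"{test_source}=")  # the model continues after the final "="
    return "\n".join(lines)


# Toy development pool, invented for illustration only.
dev_pool = [(f"source sentence {i}", f"target sentence {i}") for i in range(20)]
chosen = sample_exemplars(dev_pool, k=8, seed=0)
print(icl_prompt(chosen, "a new source sentence"))
```

With eight exemplars the prompt has nine lines, and the last line ends in “=” so that the continuation can be read off as the translation.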

Supervised baselines

We report the performance of the supervised models M2M-100-12B (Fan et al., 2021) and NLLB-1.3B (Costa-jussà et al., 2022) (distillation version), which are widely-used many-to-many MMT models. We also report the performance of the powerful commercial translation system Google Translator (https://translate.google.com/).

Metric

Following Goyal et al. (2022), we use SentencePiece BLEU (spBLEU; https://github.com/mjpost/sacrebleu) as the evaluation metric, which enables an evaluation of all languages. In addition, we also consider emerging metrics, COMET (Rei et al., 2020; we compute the score with the wmt22-comet-da model) and SEScore (Xu et al., 2022b; we compute the score with SEScore-2, Xu et al., 2022a), which have been shown to correlate well with human judgements.
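To make the BLEU computation concrete, the following is a simplified, self-contained sketch of corpus-level BLEU with a single reference per sentence. It is not the sacrebleu implementation: it tokenizes on whitespace and applies no smoothing, whereas spBLEU computes the same statistics over SentencePiece pieces. The function names are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU (0-100) with whitespace tokens and one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Counter intersection clips each n-gram count by the reference count.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses that are shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"]))
```

In practice the reported scores should come from the sacrebleu toolkit linked above; this sketch only illustrates the clipped n-gram precision and brevity penalty that underlie the metric.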

Benchmarking LLMs for Massively Multilingual Machine Translation

In this section, we report results on multilingual machine translation and introduce our main findings about LLMs’ translation ability.

Table 1 presents evaluation results grouped by language family. (Evaluating with SEScore leads to similar findings, so we report those results in Appendix A. Detailed results for each translation direction are listed in Appendix B.) Monolingual pre-trained LLMs present impressive multilingual translation ability, indicating the possibility of aligning multiple languages even with unsupervised data (Garcia et al., 2023). More encouragingly, the multilingual translation capabilities of LLMs are continually improving. The most recent LLMs are reaching new performance heights; for example, LLaMA2-7B outperforms previously released open-source LLMs, and GPT-4 surpasses ChatGPT. Overall, GPT-4 is the best translator among the evaluated LLMs and it achieves the highest average BLEU and COMET scores on most directions.

LLM’s capability is unbalanced across languages

In Table 1, we observe a similar trend for all evaluated LLMs: they perform better at translating into English than at translating into non-English languages. LLMs’ capability on non-English languages is also unbalanced. For languages that are similar to English, e.g., Indo-European-Germanic languages, LLMs achieve impressive results. For languages that are dissimilar to English, e.g., Sino-Tibetan languages, LLMs often produce weaker results.

Table 2 presents another clue, where we evaluate GPT-4 on French-centric and Chinese-centric translation. Compared to English-centric translation, GPT-4 faces greater challenges when it comes to non-English-centric translation, which again indicates LLMs’ unbalanced translation ability across languages.

LLMs still lag behind the strong supervised baseline, especially on low-resource languages

Figure 2 shows the translation performance of the supervised systems and GPT-4 on each language. In 40.91% of translation directions, GPT-4 achieves higher BLEU scores than NLLB, indicating the promising future of this new translation paradigm. But on long-tail low-resource languages, GPT-4 still lags behind NLLB, let alone Google Translator.

Data leakage issue should be considered before evaluating LLMs on public datasets.

We do not include BLOOMZ’s performance on Flores-101 in our report because BLOOMZ is instruction-tuned on the xP3 dataset (Scao et al., 2022), which includes the Flores-200 dataset. Thus BLOOMZ may have been exposed to test cases from Flores-101 during training. If so, the evaluation results cannot precisely reflect its translation ability (Elangovan et al., 2021).

To illustrate this concern, we take 1000 English sentences from the most recent news spanning August 2023 to October 2023 (collected from BBC news, Fox news, ABC news and Yahoo news), ask human experts to translate them into Chinese, and construct a bilingual no-leakage evaluation set named News2023. Figure 4 shows that BLOOMZ’s performance deteriorates significantly on this no-leakage set, whereas other models maintain a consistent performance across both datasets. This disparity underscores the risk of using Flores-101 for evaluating BLOOMZ. Through this example, we wish to draw the community’s attention to the potential data leakage issue when evaluating large language models.

Analyzing Factors That Influence LLM’s Translation Performance

To better understand how LLMs acquire translation ability and which factors influence their performance, we conduct an in-depth analysis. For the analysis, we choose XGLM-7.5B as an example, for three reasons: (1) XGLM has a multilingual focus and covers many languages, so it can be seen as a representative multilingual LLM. (2) XGLM-7.5B is an open-source medium-sized LLM, which makes it more affordable to run experiments with than a large-sized or closed-source LLM. (3) The composition of XGLM’s pre-training corpus is clear, allowing us to analyze the relationship between translation ability and corpus size. Note that, when studying a certain factor, we keep the remaining factors unchanged.

As the XGLM authors report the data distribution of their pre-training corpus, we can investigate the relationship between translation performance and corpus size (Figure 3). We find that for low-resource languages, e.g., Catalan (cat) and Swahili (swh), XGLM can generate moderate translations, showing that an LLM can build a bilingual mapping between non-English languages and English with only a small amount of non-English monolingual resources (less than 1% of English resources). Even on unseen languages, e.g., Occitan (oci) and Asturian (ast), XGLM can translate through ICL. These observations indicate a potential advantage of the novel translation paradigm: LLMs can learn to translate in a resource-efficient way.

Findings on In-context Template

The initial step of applying in-context learning for translation is determining the template. We find that translation performance varies greatly with different templates (Table 3), where the largest gap in average performance is up to 16 BLEU. The best template also differs across directions. Among these templates, “<X>=<Y>” achieves the highest average BLEU score. “[SRC]: <X> \n [TGT]: <Y>” achieves the lowest score, although it is a commonly-used template for prompting other LLMs, e.g., PaLM (Vilar et al., 2022) and GLM (Zhang et al., 2023). Such phenomena indicate that the template plays a vital role in ICL and that it may be challenging to design a universally optimal template for different LLMs and translation directions.

Even an unreasonable template can instruct LLMs to generate decent translations

A common intuition of ICL is that the template instructs LLMs to do the target task (Brown et al., 2020), e.g., the template “<X> can be translated to <Y>” instructs the LLM to perform the translation task.

However, we find that wrapping translation exemplars with a task-unrelated template can also serve as an effective prompt. For example, a template like “<X> can be summarized as <Y>” can also instruct the LLM to generate a translation, rather than guiding it to generate a summary. Given that these unreasonable templates are also effective, the community may not yet fully understand the role of the in-context template.
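The comparison of templates can be illustrated by rendering the same exemplars under several templates. The sketch below is illustrative only: the `{src}`/`{tgt}` placeholder syntax, the exact whitespace inside each template, the dictionary keys, and the toy sentence pairs are assumptions, and the snippet only builds prompts without calling any model.

```python
from typing import Dict, List, Tuple

# Hypothetical renderings of the templates discussed in this section.
TEMPLATES: Dict[str, str] = {
    "equals": "{src}={tgt}",
    "src_tgt": "[SRC]: {src}\n[TGT]: {tgt}",
    "translate": "{src} can be translated to {tgt}",
    "summarize": "{src} can be summarized as {tgt}",  # task-unrelated wording
}


def render(template: str, exemplars: List[Tuple[str, str]], test_source: str) -> str:
    """Wrap exemplars with the template; leave the final target slot empty."""
    lines = [template.format(src=s, tgt=t) for s, t in exemplars]
    lines.append(template.format(src=test_source, tgt="").rstrip())
    return "\n".join(lines)


# Toy data, invented for illustration only.
exemplars = [("Guten Morgen", "Good morning"), ("Danke", "Thank you")]
for name, template in TEMPLATES.items():
    print(f"--- {name} ---")
    print(render(template, exemplars, "Hallo"))
```

Whatever the wording of the template, every prompt still exposes the same source-target pairs, which is one plausible reading of why a task-unrelated template can still elicit translation.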

Findings on In-context Exemplar

The translation direction of the exemplar is a unique factor in machine translation. We find that using cross-lingual exemplars does not always cause worse performance, and we show two cases in Figure 5. When using cross-lingual exemplars for German-English translation, the translation performance degrades.

But when using cross-lingual exemplars for low-resource Chinese-English translation (illustrated in Appendix D), XGLM’s translation performance usually improves significantly, even when both the source and target languages are changed. This phenomenon indicates the potential usage of cross-lingual exemplars in a broader range of tasks (Lin et al., 2022), and we will explore this further in the future.

Semantically-related exemplars do not bring more benefits than randomly-picked exemplars

In this paper, we use the development set for exemplar selection, which has been found to be a high-quality candidate pool (Vilar et al., 2022), and we compare four ways of selecting in-context exemplars: Random (picking exemplars on a random basis), BM25 (selecting exemplars whose source sentences are similar to the test case’s source sentence according to BM25), TopK (selecting exemplars whose source sentences are similar to the test case’s source sentence according to the similarity of sentence embeddings) and Oracle (selecting exemplars whose target sentences are similar to the test case’s according to sentence embeddings, which can be seen as the upper bound of the selection strategy).

The effects of selecting a varying number of in-context exemplars with different approaches are shown in Figure 6. The general trend across all datasets is similar. As the number of exemplars grows from 1 to 8, the BLEU score increases rapidly. Afterwards, the translation performance plateaus regardless of the selection strategy. When more exemplars are added, e.g., 32 exemplars, the BLEU score usually starts to decline, which is the opposite of the observation in natural language understanding tasks (Li et al., 2023).

Compared to semantically-related exemplars, randomly-picked exemplars give comparable translation performance. Even the performance of oracle selection is on par with random selection. Based on these observations, we suggest that translation exemplars can teach an LLM to translate, but the LLM may struggle to acquire helpful translation knowledge from semantically-related exemplars.
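To make the contrast between Random and BM25 selection concrete, here is a minimal standard-library sketch. It uses the common Okapi BM25 formula with whitespace tokenization; the parameter values `k1=1.5` and `b=0.75`, the function names, and the toy candidate pool are assumptions of this sketch and are not taken from the experiments.

```python
import math
import random
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(query.lower().split())  # repeated query terms count once
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_bm25(test_source: str, pool: List[Pair], k: int) -> List[Pair]:
    """Pick the k pairs whose source side is most similar to the test source."""
    scores = bm25_scores(test_source, [src for src, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


def select_random(pool: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Pick k pairs uniformly at random."""
    return random.Random(seed).sample(pool, k)


# Toy candidate pool, invented for illustration only.
pool = [
    ("the cat sleeps on the sofa", "target 0"),
    ("stock markets fell sharply today", "target 1"),
    ("a cat chased the mouse", "target 2"),
    ("rain is expected tomorrow", "target 3"),
]
print(select_bm25("the cat sat on the mat", pool, k=2))
print(select_random(pool, k=2))
```

The TopK and Oracle strategies would follow the same ranking pattern, with a sentence-embedding similarity replacing the BM25 score (and, for Oracle, the target side replacing the source side).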

Exemplars teach LLM the core feature of translation task

To better understand how ICL exemplars influence LLM to understand the translation task, we observe LLM’s translation behaviour under abnormal in-context exemplars (Table 4).

We can see that the LLM completely fails when mismatched translations are used as exemplars, indicating that the LLM needs to learn from the context to keep the source and target sentences semantically consistent. Word-level and document-level translation exemplars (word pairs selected from the open-source fasttext dictionary and document translations selected from the Europarl dataset, respectively) degrade the LLM’s translation performance, which demonstrates that the translation granularity of the exemplars matters as well. Another interesting phenomenon is that the LLM performs worse when a duplicated translation is used as the exemplar, indicating that keeping in-context exemplars diverse is also important. In general, these comparison results show that the LLM learns the core feature of the translation task through in-context learning.
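Two of the abnormal settings, mismatched and duplicated exemplars, are easy to construct from a normal exemplar list. The sketch below shows one possible construction; how the abnormal exemplars were actually built for Table 4 is not specified here, so the rotation scheme, the function names, and the toy pairs are assumptions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def mismatched(exemplars: List[Pair]) -> List[Pair]:
    """Pair every source with the target of the next exemplar (rotation by one).

    Assumes at least two exemplars with distinct targets, so that no source
    keeps its own translation.
    """
    sources = [s for s, _ in exemplars]
    targets = [t for _, t in exemplars]
    rotated = targets[1:] + targets[:1]
    return list(zip(sources, rotated))


def duplicated(exemplars: List[Pair]) -> List[Pair]:
    """Repeat the first exemplar so that the prompt contains no diversity."""
    return [exemplars[0]] * len(exemplars)


# Toy data, invented for illustration only.
normal = [
    ("Guten Morgen", "Good morning"),
    ("Danke", "Thank you"),
    ("Gute Nacht", "Good night"),
]
print(mismatched(normal))
print(duplicated(normal))
```

Both variants keep the number of exemplars and the prompt format unchanged, so any change in translation quality can be attributed to the exemplar content rather than to the prompt length.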

The exemplar in the tail of the prompt has more impact on the LLM’s behaviour

During our analysis, we find that reversing the translation direction of exemplars causes the LLM to fail. Based on this observation, we conduct experiments to investigate the importance of different parts of the prompt (Table 5). We find that reversing exemplars in the tail of the prompt consistently produces worse results than reversing exemplars in the head, which suggests that exemplars in the tail of the prompt have a larger influence on the LLM’s behaviour.
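The head-versus-tail manipulation can be sketched as swapping the source and target of selected exemplars. This is a minimal illustration under stated assumptions: the number of reversed exemplars `n`, the function names, and the toy pairs are hypothetical, since the exact configuration behind Table 5 is not given in the text.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def reverse_direction(exemplars: List[Pair], positions: Set[int]) -> List[Pair]:
    """Swap source and target for the exemplars at the given positions."""
    return [(t, s) if i in positions else (s, t) for i, (s, t) in enumerate(exemplars)]


def reverse_head(exemplars: List[Pair], n: int) -> List[Pair]:
    """Reverse the first n exemplars (the head of the prompt)."""
    return reverse_direction(exemplars, set(range(n)))


def reverse_tail(exemplars: List[Pair], n: int) -> List[Pair]:
    """Reverse the last n exemplars (the tail of the prompt)."""
    k = len(exemplars)
    return reverse_direction(exemplars, set(range(k - n, k)))


# Toy data, invented for illustration only.
exemplars = [(f"src {i}", f"tgt {i}") for i in range(8)]
print(reverse_head(exemplars, n=4))
print(reverse_tail(exemplars, n=4))
```

Comparing translation quality under `reverse_head` and `reverse_tail` with the same `n` isolates the effect of position, because both prompts contain the same number of reversed exemplars.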

Related Work

Using LLMs for multilingual machine translation is attracting more and more attention. Lin et al. (2022) evaluate GPT-3 and XGLM-7.5B on 182 directions. Bawden and Yvon (2023) evaluate BLOOM on 30 directions. Bang et al. (2023), Jiao et al. (2023) and Hendy et al. (2023) evaluate ChatGPT on 6 to 18 directions. In this paper, we thoroughly evaluate the multilingual translation performance of popular LLMs on 102 languages and 606 directions and compare them with state-of-the-art translation engines, such as NLLB and Google Translator, which provides a more comprehensive benchmark result and highlights the challenges involved in optimizing this emerging translation paradigm.

To find a better ICL recipe for machine translation, many efforts have been put into designing exemplar selection strategies (Agrawal et al., 2022; Zhang et al., 2023; Moslem et al., 2023). Similar to the findings of Zhang et al. (2023), we find that random selection is a simple but effective strategy. We also find that even oracle selection cannot result in consistently better performance. Wei et al. (2022a) show that few-shot exemplars improve translation performance. We further demonstrate how translation performance varies with the number of in-context exemplars and with the usage of cross-lingual exemplars. Besides, Vilar et al. (2022) find that using a high-quality pool, e.g., the development set, for ICL exemplar selection is better, and Zhang et al. (2023) analyze why the quality of translation exemplars matters. In this paper, we reveal how in-context exemplars teach LLMs to translate by analyzing LLMs’ behaviour under different kinds of exemplars.

Multilingual machine translation

Developing a bilingual translation system for each direction becomes infeasible as the number of supported languages increases. Therefore, multilingual machine translation has been proposed (Johnson et al., 2017). But how to build a high-quality yet efficient MMT system remains an ongoing challenge (Costa-jussà et al., 2022; Yuan et al., 2023; Guerreiro et al., 2023). In this paper, we focus on LLMs and reveal their potential in MMT.

Conclusion

In this paper, we evaluate the multilingual translation ability of popular LLMs, including ChatGPT and GPT-4, on 102 languages and 606 directions, which presents the advantages and challenges of LLMs for MMT. We find that the translation capabilities of LLMs are continually improving and that GPT-4 reaches a new performance height. But even GPT-4 still faces challenges on low-resource languages. In our analysis, we find that LLMs exhibit new working patterns when used for MMT. For example, instruction semantics can be ignored during in-context learning, and cross-lingual exemplars can provide better task instruction for low-resource translation. More importantly, we find that LLMs can acquire translation ability in a resource-efficient way, which indicates the promising future of LLMs in multilingual machine translation.

Acknowledgement

We would like to thank Fei Yuan and Zhenyu Wu for their support of this project. Shujian Huang is the corresponding author. This work is partially supported by the National Science Foundation of China (No. 62376116, 62176120) and the Liaoning Provincial Research Foundation for Basic Research (No. 2022-KF-26-02).

References

Appendix A Evaluating LLM’s translation performance with SEScore

Table 6 presents average SEScore of LLMs on different language families. Currently, SEScore mainly supports evaluating English translation. Thus we evaluate LLM’s performance on translating other languages to English.

Appendix B Detailed Results on Each Language

We report detailed results of our evaluated models in Table 7 (BLEU), Table 8 (COMET), Table 9 (SEScore) and Figure 8. One thing that needs to be mentioned is that BLEU supports all translation directions, whereas COMET and SEScore only support a subset of these translation directions.

Appendix C Lists of Language

We evaluate 102 languages in this paper. Table 10 lists the name, ISO code and language family of these languages.

Appendix D Cross-lingual Exemplars

In Figure 5, we show an example of using cross-lingual in-context exemplars (Russian-English exemplars for Chinese-English translation).