Towards Making the Most of ChatGPT for Machine Translation

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, Dacheng Tao

Introduction

Recently, the emergence of ChatGPThttps://chat.openai.com has brought remarkable influence on natural language processing (NLP) tasks. ChatGPT is a large-scale language model developed by OpenAI, based on InstructGPT Ouyang et al. (2022a), that has been trained to follow instructions with human feedback. ChatGPT possesses diverse abilities of NLP, including question answering, dialogue generation, code debugging, generation evaluation, and so on Qin et al. (2023); Zhong et al. (2023); Wang et al. (2023a); Kocmi and Federmann (2023); Lu et al. (2023b); Wang et al. (2023b). We are particularly interested in how well ChatGPT can perform on the machine translation task.

Previous studies Jiao et al. (2023); Hendy et al. (2023) on translation tasks have found that ChatGPT performs competitively with commercial translation products (e.g., Google Translate and Microsoft Translator) on high-resource languages, but has limited capabilities for low-resource and distant languages. However, they only adopt simple prompts and basic settings regardless of the significant influence of the prompts’ quality Zhou et al. (2022), which may limit ChatGPT’s performance. In this paper, we aim to further elicit the capability of ChatGPT by revisiting the following three aspects and correspondingly propose an optimal temperature setting and two simple but effective prompts: Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP).

Temperature is an important parameter to ensure ChatGPT generates varied responses to human queries. Basically, decoding with higher temperatures displays greater linguistic variety, while the low one generates grammatically correct and deterministic text Ippolito et al. (2019). However, for tasks with a high degree of certainty, such as machine translation, we argue, a diverse generation may impede its translation quality. We evaluate the performance of ChatGPT at different temperatures to verify its effect and find the optimal temperature setting for the following experiments.

ChatGPT is fine-tuned on high-quality chat datasets and thus essentially a conversational system that has a certain distance from the translation system, we argue that the task inconsistency will limit its translation ability to a certain degree. In response to this problem, we proposed Task-Specific Prompts (TSP) to further emphasize the task information to bridge the task gap, i.e., conversation and translation.

Compared with traditional machine translation systems, ChatGPT can incorporate additional information, like human interactions, through the input prompts Dong et al. (2023). We argue that such flexible interaction may alleviate some classical MT challenges, e.g., cross-domain generalization Koehn and Knowles (2017). We, therefore, propose Domain-Specific Prompts (DSP) to introduce the domain navigation information to elicit ChatGPT’s generalization ability across different domains.

Through extensive experiments, we find that:

ChatGPT’s performance largely depends on the temperatures, especially in difficult languages. Generally, setting a lower temperature can result in higher performance.

Emphasizing the task information in prompts can further improve ChatGPT’s performance, especially in complex tasks.

Introducing the correct domain information consistently improves ChatGPT’s performance while wrong domain information leads to significant degradation in performance.

When tackling the non-English-centric tasks (both the input and expected output are non-English), ChatGPT may generate hallucinations, which should be paid more attention to by the MT/NLP community.

Furthermore, we explore the effects of several advanced in-context learning strategies Brown et al. (2020b). Specifically, we investigate ChatGPT’s few-shot in-context learning (ICL) and chain-of-thought (CoT) Wei et al. (2022c); Kojima et al. (2022) abilities on MT tasks. Experimental results show that few-shot ICL can further improve ChatGPT’s performance, which is identical to the findings of Hendy et al. (2023), and we also find a negative but interesting observation: CoT leads to word-by-word translation behavior, thus bringing significant translation degradation. Also, we call for improving ICL and CoT for MT upon ChatGPT by incorporating the philosophy of example-based and statistical MT Nagao (1984); Koehn (2009).

The remainder of this paper is designed as follows. We present the evaluation settings in Section 2. In Section 3, we revisit the performance of ChatGPT from three aspects (temperature, task, and domain information) and show the zero-shot translation performance of ChatGPT with our proposed advanced prompt recipes. Section 4 summarizes the few-shot in-context learning and chain-of-thought results. Section 6 presents conclusions.

Evaluation Setting

We provide a brief introduction of the evaluation setting, which mainly includes the used models, test set, and evaluation metrics.

We mainly compare ChatGPThttps://chat.openai.com/chat with the commercial translation product Google Translatorhttps://translate.google.com, which supports translation in 133 languages. By default, the results in this paper come from the gpt-3.5-turbo-0301 models, which power the ChatGPT.

For multilingual translation and in-context learning, we evaluate the performance of the models on the Flores-200 Goyal et al. (2022)https://github.com/facebookresearch/flores test sets, which consists of 1012 sentences translated into 204 languages. To evaluate the effect of cross-domain translation, we adopt the test set of WMT19 Biomedical Bawden et al. (2019), News Translation Task Barrault et al. (2019) and WMT22 E-Commerce task Kocmi et al. (2022). Table 1 lists the statistics of these test sets. We test all samples through OpenAI API.

The translation metrics shared task Freitag et al. (2022) recommends using neural network-based metrics since they have demonstrated a high correlation with human evaluation and are resilient to domain shift. Hence, we adopt the mostly used COMET Rei et al. (2020) as our primary metric and use the default parameters of "comet-compare" for significance testhttps://github.com/Unbabel/COMET. Specifically, we use the reference-based metric COMET-20 (wmt20-COMET-da). Additionally, we also report BLEU scores Papineni et al. (2002) and ChrF Popović (2015) using SacreBLEU Post (2018) for completeness, but notably, we mainly analyze the performance in terms of model-based metric COMET.

Zero-Shot Translation

In this section, we explore the performance of ChatGPT from three aspects: temperature, task information, and domain information, and correspondingly propose an optimal temperature setting and two simple and effective prompts to improve ChatGPT’s performance.

ChatGPT is a chatting machine designed to provide fluent and diverse responses to a wide range of human requests. It is intuitive that the diversity of responses may hinder its performance on tasks with a high degree of certainty, such as machine translation, to some extent.

To investigate the influence of diversity, we compare the performance of ChatGPT in different temperature settings, including 0, 0.2, 0.4, 0.6, 0.8, and 1, across three translation directions: English $\Rightarrow$ Romanian, English $\Rightarrow$ Chinese, and English $\Rightarrow$ German. The relationship between temperature and performance of ChatGPT is shown in Figure 1 and 2.

Figure 1 and 2 show that ChatGPT’s performance largely depends on the value of temperatures, and as the temperature rises, there is a clear degradation both in COMET and BLEU scores. Furthermore, it is noteworthy that ChatGPT’s sensitivity to the temperature varies depending on the language pair: the impact of temperature is relatively small when translating to high-resource languages, e.g., German, while for complex languages, e.g., Chinese, it has a large degradation in performance ( $-4.3$ COEMT points and $-3.7$ BLEU points for Chinese) when the temperature changes from 0 to 1. We speculate that the huge resource variance in training data leads to differences in the confidence of languages, which partially explains the different performances. In the following experiments, we adopt $T=0$ as our default setting to make the most of ChatGPT and ensure the stability of generation to avoid a result of noise.

2 The Effect of Task Information

Previous studies (Jiao et al., 2023; Hendy et al., 2023) have shown that ChatGPT can achieve exceptional performance in conversational domain translation, which is attributed to its ability to generate more natural and diverse spoken language. However, given that ChatGPT is deliberately designed as a general task solver Qin et al. (2023), when asking the ChatGPT to perform as a specific task engine, there will arise a task gap. This task inconsistency may limit ChatGPT’s effectiveness in translation tasks other than the spoken domain.

To bridge the task gap and generate more translation-like sentences, we propose Task-Specific Prompts (TSP) to emphasize the translation task information. Specifically, we prepend the sentence "You are a machine translation system." to the best translation template in Jiao et al. (2023), and adopt it to query ChatGPT. The templates of prompts present in Table 2, and [TGT] represents the target languages of translation.

We have compared the performance of various models on four language pairs, covering eight distinct translation directions. These languages comprise 1) German, which is one of the most non-English languages in the GPT training data, 2) Romanian, a less frequently encountered non-English language in the GPT training data, and 3) Chinese, a large-scale language with a script distinct from English. We also adopt Chinese-Romanian as a non-English-centric use case. Table 3 lists the full results, where we list both English-centric and non-English-centric language directions (marked with green), and also, among English-centric directions, we highlight the difficult pairs (EN-ZH and EN-RO with shadow) in terms of their resources and language distance.

We first consider the performance of ChatGPT in English-centric translation language pairs. Specifically, we conduct experiments in three language pairs: German $\Leftrightarrow$ English (high-resource), Romanian $\Leftrightarrow$ English (low-resource), and Chinese $\Leftrightarrow$ English (distant language).

Our results presented in Table 3 show that our TSP method achieves comparable results on COMET score compared to Google Translator and even outperforms it in some language pairs, e.g., English $\Rightarrow$ Romanian (92.9 v.s. 91.6). We also observe that our TSP method consistently improves the performance of vanilla ChatGPT, especially when translating to low-resource or distant languages. Specifically, our TSP method brings $+0.8$ and $+0.5$ COMET score improvements in English $\Rightarrow$ Chinese and English $\Rightarrow$ Romanian, respectively, and $+0.2$ on average when translating to English. We speculate that the high-resource training data can help the model better understand the specific task from a few task-related navigations, thereby reducing the need for additional task-specific information. Although our proposed TSP consistently improves the performance in terms of semantic metric, i.e., COMTE, notably, we have not consistently bridged the task gap in terms of lexical metrics (BLEU and ChrF), which is consistent with similar findings from Vilar et al. (2022) on PALM-540B model.

2.2 Non-English-Centric Language Pairs

We also evaluate the performance of ChatGPT in non-English-centric language pairs (since the pretraining process was dominated by the English tokens and the multilingual MT community argues it may harm the non-English-centric performance Costa-jussà et al. (2022); Zan et al. (2022a, 2023).). We have an important finding that, when tackling non-English-centric MT language pairs, ChatGPT tends to generate translation hallucinations, that is, some unrelated information obeyed some patterns followed the translation, such as "Translation may vary depending on context", which will greatly affect the MT performance. We used a post-processing method to remove irrelevant information from the generated text. Specifically, we summarize some templates about irrelevant sentences and remove them from the generation texts. Some templates are shown in Table 4 and the number of post-processed sentences is presented in Figure 3.

Figure 3 shows that lower temperature can reduce the number of hallucinations (especially in distant languages, e.g., Chinese) and our TSP method can further reduce its number, which suggests that our method can help ChatGPT to better serve as a machine translation system. The full results on Romanian $\Leftrightarrow$ Chinese lists are in Table 3. As seen, our TSP method can only slightly improve ChatGPT’s performance, which could be due to the difficulty in both understanding and generating the language pairs. Meanwhile, our used post-editing approach could only roughly remove the hallucination patterns, the NLP/MT community should pay more attention to the potential hallucination when using ChatGPT to tackle the non-English text.

The subsequent experiments will use ChatGPT with TSP as the default setting.

3 The Effect of Domain Information

Compared with traditional machine translation systems, ChatGPT can incorporate additional information through the prompts to further improve its performance. While previous studies have shown that ChatGPT has great robust translation capabilities Hendy et al. (2023), we believe that we can further enhance its performance by incorporating domain-specific guidance.

In this section, we simply explore the effects of advanced in-context learning (ICL) strategies, specifically, we investigate ChatGPT’s few-shot ICL and Chain-of-Thought (CoT) abilities on MT tasks.

In-context learning Brown et al. (2020b) has shown its remarkable ability for many NLP tasks Liu et al. (2023). To further explore the capabilities of the ChatGPT, we conduct experiments with different sample selection strategies. Specifically, we evaluate the performance of few-shot machine translation in the following three directions: English $\Rightarrow$ Chinese, English $\Rightarrow$ Romanian, and English $\Rightarrow$ German in Flores-200. We conducted experiments with randomly and TopK Liu et al. (2022) sampled demonstrations from development sets in the 1-shot and 3-shot settings.

Our results are listed in Table 3.3. As seen, in-context learning with random examples consistently improves the performance in both lexical metric (BLEU) and COMET score compared to the zero-shot approach, and increasing the number of shots can lead to further improvement, which is consistent with previous finding Hendy et al. (2023). The advanced sample-selection strategy like TopK, which chooses test-sample similar examples as demonstrations, can further improve the performance, even outperform Google Translator in some language pairs, e.g., English $\Rightarrow$ Romanian (94.0 v.s. 91.6) and English $\Rightarrow$ Chinese (68.8 v.s. 68.5).

We encouragingly find that the advanced sample-selection strategy for in-context learning for MT tasks upon ChatGPT is extremely similar to the design philosophy of example-based machine translation (EBMT, Nagao, 1984), where the EBMT is often characterized by its use of a bilingual corpus as its main knowledge base, at run-time. It is worthy of designing better ICL strategies inspired by EBMT in future work.

2 Chain-of-Thought

Chain-of-Thought (CoT) prompting Wei et al. (2022c) has been demonstrated to be effective in eliciting the reasoning ability of large language models. Previous studies have shown that CoT can improve the ChatGPT’s performance in natural language understanding tasks Zhong et al. (2023), but its influence on machine translation tasks has hardly been investigated.

To investigate this further, we randomly select 20 samples from the test set and adopt the zero-shot CoT technique Kojima et al. (2022) and the 1-shot CoT technique. Specifically, as shown in Table 8, for zero-shot CoT, we use the prompt "Please provide the [TGT] translation for the following sentence step by step" to extract step-by-step translation. We also add the sentence ‘and then provide the complete sentence:’ to the end of the prompting to ensure that ChatGPT can generate the complete translation. While for the 1-shot CoT, we provide the manual intermediate reasoning steps inspired by zero-shot CoT, as shown in Table 8. Here, [S] and [T] represent the corresponding source and target sentence in the demonstration, respectively, and [S_i] and [T_i] are the i-th matching tokens in the source and target sentence.

We conduct experiments in the following two translation directions: English $\Rightarrow$ German and English $\Rightarrow$ Chinese. The results are listed in Table 4.2, which shows that there is a significant degradation in COMET score with zero-shot CoT setting, especially in English $\Rightarrow$ Chinese, which drops 8.8 COMET points. 1-shot CoT prompting can consistently outperform zero-shot CoT but still lags behind zero-shot prompting on COMET.

We looked in detail at the sentences generated by different prompts, presented in Table 10, and we have a negative but interesting observation: the CoT prompt leads to word-by-word translation behavior, which is the main reason for the significant translation degradation.

For more CoT variants designed with different principles inspired by the philosophy in statistical MT Zens et al. (2002); Koehn (2009) will be explored in the future. For example, word-by-word and then reordering the translation Du and Way (2017); Ding et al. (2020), phrase-to-phrase Feng et al. (2018); Ding et al. (2021) and then reordering the translation, and structure-to-structure translation Kaplan et al. (1989).

Large language models (LLMs) usually refer to language models with hundreds of billions of parameters, which are trained on massive text data Zhao et al. (2023). LLMs usually can be classified into three groups based on model architectures: 1) encoder-only LLMs Devlin et al. (2019); Liu et al. (2019); Zhong et al. (2022), usually used for NLU tasks; 2) decoder-only LLMs Radford et al. (2019); Brown et al. (2020a), more suitable for NLG tasks; and 3) encoder-decoder LLMs Raffel et al. (2020); Lewis et al. (2020); Zan et al. (2022b); Peng et al. (2023), which can achieve better performance on conditional text generation tasks.

Traditionally, these PLMs can achieve remarkable performance in various natural language processing (NLP) tasks through fine-tuning on specific tasks. But with the scaling up and the development of LLMs Brown et al. (2020a); Ouyang et al. (2022b), decoder-only LLMs exhibit remarkable zero-shot and few-shot abilities, denoted emergent abilities Wei et al. (2022b), and achieve comparable results with other LLMs in NLU and conditional NLG tasks. Especially the emergency of ChatGPT, developed by OpenAI, takes LLMs a big step forward in both academia and industry. ChatGPT possesses diverse abilities of NLP and can generate human-like responses by instruction-tuning Wei et al. (2022a) and Reinforcement Learning from Human Feedback (RLHF) technique Ouyang et al. (2022b).

Conclusion

The ability of ChatGPT has been widely studied in various domains Qin et al. (2023); Zhong et al. (2023), but its ability on machine translation tasks has not been fully investigated. Jiao et al. (2023) and Hendy et al. (2023) first provided an evaluation on the performance of ChatGPT for machine translation, they found that ChatGPT can perform competitively with commercial translation products on high-resource European languages but lags behind significantly on low resource or distant languages. However, they usually adopt simple prompts and basic settings which cannot fully exploit the capabilities of ChatGPT, we first proposed that ChatGPT can achieve comparable results with proper settings and investigate how to make the most of ChatGPT for machine translation.

Subsequent work follows our work to further explore the performance of ChatGPT, Gao et al. (2023) and Lu et al. (2023a) introduce new information (e.g., POS or multilingual dictionaries), He et al. (2023) proposed a CoT-like framework to generation human-like translation.

In this paper, we investigate how to further mine ChatGPT’s translation ability from three perspectives, namely temperature, task, and domain information, and correspondingly propose an optimal temperature setting and two simple but effective prompts. We empirically demonstrated that there is a high correlation between temperature and ChatGPT’s performance, and a lower temperature usually can achieve better performance. Experimental results across various language pairs and domains proved the effectiveness of our proposed prompts. We further explore the effectiveness of advanced in-context learning strategies for ChatGPT, we find that the few-shot in-context learning method can consistently improve ChatGPT’s performance, while conventional Chain-of-Thought (CoT) prompting will degrade its performance because of its word-by-word translation behavior.

In future work, besides the aforementioned explorations (EBMT-inspired prompts designing, statistical MT-inspired chain-of-thought designing), we would like to investigate how to further elicit the ability of ChatGPT by designing more effective prompts (e.g., design human-like CoT to navigate the LLMs, and better demonstration selection algorithms in few-shot ICL) and investigate the ability of ChatGPT for more MT settings (e.g., document translation).

Our work has several potential limitations. First, we only propose some simple prompts that have not been carefully designed to investigate the capabilities of ChatGPT, which may not sufficiently elicit the power of ChatGPT. Second, we have not fully studied the performance of ChatGPT in few-shot scenarios, especially the effect of Chain-Of-Thought in machine translation. In future work, we would like to design different types of prompts to further improve ChatGPT’s performance in machine translation and conduct more in-depth analyses and discussions.

We take ethical considerations very seriously and strictly adhere to the EMNLP Ethics Policy. This paper focuses on exploring the translation ability of ChatGPT on open-sourced machine translation datasets, not involving any ethics problem. Both the compared models and evaluation datasets used in this paper are publicly available and have been widely adopted by researchers. Therefore, we believe that this research will not pose ethical issues.