Adapting Large Language Models for Document-Level Machine Translation

Minghao Wu, Thuy-Trang Vu, Lizhen Qu, George Foster, Gholamreza Haffari

Introduction

Large language models (LLMs) demonstrate impressive proficiency in a wide range of applications (Ouyang et al., 2022; Wei et al., 2022a; Sanh et al., 2022; Chung et al., 2022; OpenAI, 2023; Anil et al., 2023; Touvron et al., 2023a, b; Jiang et al., 2023). However, in the realm of translation tasks, only a few very large models, such as gpt-3.5-turbo and gpt-4-turbo, can match or surpass the performance of state-of-the-art supervised encoder-decoder models like NLLB (Costa-jussà et al., 2022), while they still under-perform in translating low-resource languages (Robinson et al., 2023; Jiao et al., 2023; Hendy et al., 2023). Consequently, a number of recent works attempt to bridge the gap between LLMs and supervised encoder-decoder models in translation tasks (Zhu et al., 2023; Yang et al., 2023; Zhang et al., 2023; Moslem et al., 2023; Xu et al., 2023; Kudugunta et al., 2023). Recent research also suggests that smaller, specialized models can outperform larger, general-purpose models in specific tasks (Gunasekar et al., 2023; Luo et al., 2023; Azerbayev et al., 2023). Therefore, we explore adapting LLMs for document-level machine translation (DocMT) in this study.

In this study, we analyze moderately-sized LLMs (with $7B$ parameters) across 18 translation tasks involving nine language pairs. We fine-tune three LLMs using Parameter-Efficient Fine-Tuning (PEFT) and Full Fine-Tuning (FFT). Comparisons with state-of-the-art translation models, using metrics like $s$BLEU, $d$BLEU, and COMET, confirm the superior translation capabilities of LLMs after fine-tuning. However, we identify a significant issue of off-target translations, observed even after exclusive fine-tuning on bilingual corpora. Additionally, we present an in-depth analysis of our LLM-based DocMT models from various perspectives: translation error distribution, discourse phenomena, training strategy, the scaling law of parallel documents, additional evaluations on WMT2023 test sets, and zero-shot cross-lingual transfer, aiming to enhance understanding and efficacy of LLMs in DocMT tasks.

We present extensive empirical evidence that highlights both the superior translation capabilities and limitations of the LLM-based DocMT models in this study, making several significant discoveries. Here are the main takeaways:

Selective Excellence in Translation Tasks: Our findings show that our moderately-sized LLMs outperform gpt-4-turbo in certain translation tasks, but struggle in others due to the off-target translation issue. Despite this, our DocMT models exhibit better context awareness and fewer errors, while maintaining comparable performance.

Fine-Tuning Strategies: Our research indicates that the PEFT approach outperforms the FFT approach overall. However, the FFT approach shows greater data efficiency, needing only about $1\%$ of the total dataset to reach the performance level of models trained on the entire dataset. In contrast, the PEFT approach requires $10\%$ of the total dataset for comparable results.

Evaluation on Recent Test Sets: We evaluate our models on recent test sets between English and German from WMT2023 (Koehn et al., 2023). Our empirical results show that, when the data leakage risks are mitigated, the LLM-based DocMT models generalize better on out-of-domain text, compared to the conventional DocMT models.

Advantage of Base LLMs for Task-Specific Supervised Fine-Tuning: Our study shows that base LLMs, when used as the backbone for task-specific supervised fine-tuning, perform better than instruction-tuned LLMs. They demonstrate more effective zero-shot cross-lingual transfer.

Related Work

In recent years, numerous approaches have been proposed for document-level machine translation (DocMT). These approaches include document embedding (Macé and Servan, 2019; Huo et al., 2020), multiple encoders (Wang et al., 2017; Bawden et al., 2018; Voita et al., 2018; Zhang et al., 2018), attention variations (Miculicich et al., 2018; Zhang et al., 2020; Maruf et al., 2019; Wong et al., 2020; Wu et al., 2023), and translation caches (Maruf and Haffari, 2018; Tu et al., 2018; Feng et al., 2022). Furthermore, Maruf et al. (2022) present a comprehensive survey of DocMT.

Large Language Models

Large language models (LLMs) have demonstrated remarkable proficiency across a wide range of Natural Language Processing (NLP) tasks (Brown et al., 2020; Chowdhery et al., 2022; Scao et al., 2022; Anil et al., 2023; Touvron et al., 2023a, b). Furthermore, recent research has shown that supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) can significantly enhance their performance when following general language instructions (Weller et al., 2020; Mishra et al., 2022; Wang et al., 2022; Shen et al., 2023; Li et al., 2023; Wu and Aji, 2023). More recently, there is a growing body of work exploring the translation capabilities of LLMs (Lu et al., 2023; Zhang et al., 2023; Xu et al., 2023; Robinson et al., 2023). However, it is important to note that these efforts have primarily focused on sentence-level machine translation (SenMT) and have not delved into document-level machine translation (DocMT). A noteworthy study in DocMT is conducted by Wang et al. (2023b), where they investigate the document-level translation capabilities of gpt-3.5-turbo, making it the most closely related work to our work.

Ours

In contrast to the work of Wang et al. (2023b), who primarily investigate the use of gpt-3.5-turbo for DocMT through prompting techniques, our study concentrates on analyzing the effectiveness of parameter-efficient fine-tuning (PEFT) and full fine-tuning (FFT) methods on moderately-sized LLMs in the context of DocMT.

Experimental Setup

In this study, we aim to adapt multilingual pre-trained large language models (LLMs) into a bilingual document-level machine translation (DocMT) model. In this section, we describe our experimental setup, including training strategy (Section 3.1), datasets (Section 3.2), models (Section 3.3), and evaluation (Section 3.4).

DocMT approaches typically begin by pre-training the translation model on sentence-level parallel corpora, subsequently refining it through fine-tuning on document-level parallel corpora (Voita et al., 2019; Maruf et al., 2019; Ma et al., 2020; Sun et al., 2022; Wu et al., 2023). More recently, Xu et al. (2023) propose a two-stage training strategy, which initially involves fine-tuning an LLM on monolingual text, followed by a second fine-tuning phase on parallel text. Given that most state-of-the-art open-sourced LLMs are trained on English-centric corpora, our approach begins with the fine-tuning of an LLM on monolingual documents, followed by fine-tuning on parallel documents. Following Xu et al. (2023), we omit the step of fine-tuning on sentence-level parallel datasets.

Existing LLMs are typically pre-trained on English-centric corpora. Recent research highlights that these LLMs often exhibit sub-optimal performance on multilingual benchmarks (Li et al., 2023; Chen et al., 2023; Scao et al., 2022). To address this limitation, our initial step involves fine-tuning all the parameters of LLMs using monolingual data from the target languages.

Fine-tuning on Parallel Documents

We fine-tune the model on document-level parallel corpora in this stage. Following Wang et al. (2023a), we condition each sentence pair on its context, consisting of the three preceding consecutive sentence pairs. As demonstrated by Wang et al. (2023b), the prompting strategy plays a significant role in translating documents using LLMs. However, they only investigate how the prompting strategies affect gpt-3.5-turbo and gpt-4-turbo at the inference stage. In our study, we first delve into how these prompting strategies impact the fine-tuning process, as shown in Figure 1, and we present our findings in Section 4.
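The context conditioning described above can be sketched as follows. This is a minimal illustration, assuming a document is a list of (source, target) sentence pairs; the names `build_examples` and `render_prompt` are hypothetical, and the rendered template is a placeholder that does not reproduce any of the four prompts in Figure 1.

```python
from typing import Dict, List, Tuple

CONTEXT_SIZE = 3  # three preceding consecutive sentence pairs, as stated in the text


def build_examples(
    doc: List[Tuple[str, str]], context_size: int = CONTEXT_SIZE
) -> List[Dict]:
    """Attach up to `context_size` preceding pairs to every sentence pair."""
    examples = []
    for i, (src, tgt) in enumerate(doc):
        # The first sentences of a document simply get a shorter context.
        context = doc[max(0, i - context_size):i]
        examples.append({"context": context, "source": src, "target": tgt})
    return examples


def render_prompt(example: Dict, src_lang: str, tgt_lang: str) -> str:
    """Hypothetical template (an assumption, not the paper's Prompt 1-4)."""
    lines = []
    for ctx_src, ctx_tgt in example["context"]:
        lines.append(f"{src_lang}: {ctx_src}")
        lines.append(f"{tgt_lang}: {ctx_tgt}")
    lines.append(f"{src_lang}: {example['source']}")
    lines.append(f"{tgt_lang}:")  # the model is trained to continue from here
    return "\n".join(lines)
```

During fine-tuning, the target sentence of each example would be appended after the rendered prompt; how the loss is masked over the context is not specified in this section and is left open here.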

Datasets

Following Zhang et al. (2022), we conduct experiments on IWSLT2017 translation tasks (Cettolo et al., 2017). IWSLT2017 comprises translation datasets sourced from TED talks, encompassing translations between English and nine other languages, including Arabic, German, French, Italian, Japanese, Korean, Dutch, Romanian, and Chinese. There are approximately $1.9K$ sentence-aligned parallel documents with about $240K$ sentences for each language pair. The dataset statistics can be found in Appendix A.

Monolingual Documents

We gather monolingual documents for all the target languages in our translation tasks, totaling ten languages. To manage computational limitations and address concerns about catastrophic forgetting that might result from excessive continued training, we leverage the data pruning technique suggested by Marion et al. (2023) to select $100M$ tokens for each language, including English, from the CulturaX corpus (Nguyen et al., 2023), totaling $1B$ tokens.

Models

The baseline models in this study can be classified into three categories, including state-of-the-art LLMs and SenMT models, and our re-implemented DocMT models:

State-of-the-art SenMT models: Our selection includes NLLB, which is available in three sizes: 600M, 1.3B, and 3.3B parameters (model signatures: facebook/nllb-200-distilled-600M, facebook/nllb-200-1.3B, and facebook/nllb-200-3.3B). We also incorporate the widely-used commercial translation system, Google Translate.

State-of-the-art LLMs: For our baseline LLMs in the context of DocMT, we utilize gpt-3.5-turbo and gpt-4-turbo (model signatures: gpt-3.5-turbo-1106 and gpt-4-1106-preview). We use Prompt 4, as detailed in Figure 1(d), during the translation process.

Our re-implemented DocMT models: We conduct full fine-tuning on the concatenation-based DocMT model (Tiedemann and Scherrer, 2017), as well as several recent DocMT baselines (Sun et al., 2022; Wu et al., 2023, 2024), initialized with mT5 (Xue et al., 2021). These models are available with parameters of 300M, 580M, and 1.2B, representing the strong DocMT baseline.

Ours

In this work, we utilize Llama2-7B, Bloom-7B, and Vicuna-7B as our backbones (model signatures: meta-llama/Llama-2-7b-hf, bigscience/bloom-7b1, and lmsys/vicuna-7b-v1.5, respectively). Note that Vicuna-v1.5 models are fine-tuned from Llama2. The Llama2 models are predominantly pre-trained on English text, while the Bloom models are pre-trained on multilingual text. The use of Vicuna models allows us to compare the differences between base models and instruction-tuned models (Llama2 vs. Vicuna). We denote the fully fine-tuned models as L-7B-fft, B-7B-fft, and V-7B-fft, and the models fine-tuned with LoRA (Hu et al., 2022) as L-7B-LoRA, B-7B-LoRA, and V-7B-LoRA. The optimization details can be found in Appendix B.

Evaluation

We evaluate the translation quality using sentence-level BLEU (Papineni et al., 2002) and document-level BLEU (Liu et al., 2020) using SacreBLEU (Post, 2018), denoted as $s$BLEU and $d$BLEU (BLEU signature: nrefs:1|case:mixed|eff:no|tok:[13a|ja-mecab-0.996-IPA|ko-mecab-0.996/ko-0.9.2-KO|zh]|smooth:exp|version:2.3.1). Furthermore, as conventional MT metrics like BLEU demonstrate poor correlation to human judgments (Freitag et al., 2022), we also evaluate the translation quality with the state-of-the-art neural evaluation metric COMET (Rei et al., 2020; COMET signature: Unbabel/wmt22-comet-da). Moreover, we use the average sentence-level BLEU $\mu_{s\textsc{BLEU}}$, the average document-level BLEU $\mu_{d\textsc{BLEU}}$, and the average COMET $\mu_{\textsc{COMET}}$ for the overall performance.
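The difference between the two BLEU variants lies in what counts as a segment. The sketch below assumes that document-level BLEU is obtained by joining all sentences of a document into a single segment before scoring, and that each sentence carries a document identifier; the helper name `to_document_segments` and the `doc_ids` input are illustrative assumptions, not part of the paper's tooling.

```python
from itertools import groupby
from typing import List, Sequence


def to_document_segments(
    sentences: Sequence[str], doc_ids: Sequence[str]
) -> List[str]:
    """Join consecutive sentences that share a document id into one segment."""
    if len(sentences) != len(doc_ids):
        raise ValueError("sentences and doc_ids must have the same length")
    paired = zip(doc_ids, sentences)
    return [
        " ".join(sentence for _, sentence in group)
        for _, group in groupby(paired, key=lambda item: item[0])
    ]
```

Under this assumption, the sentence-level score is computed over the original sentence lists, while the document-level score is computed over `to_document_segments(hypotheses, doc_ids)` and `to_document_segments(references, doc_ids)` with the same BLEU implementation.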

Inference

We use beam search with a beam size of 5 during translation. As shown in Figure 1(d), previous translations serve as the context for the current translation, so the test examples are translated in their original order, beginning with the first sentence, which has no context.
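The sequential decoding order can be sketched as a simple loop. This is an illustration only: `translate_fn` is a hypothetical callable standing in for the fine-tuned model with beam search, and the context window of three pairs is assumed to match the fine-tuning setup.

```python
from typing import Callable, List, Sequence, Tuple

Context = List[Tuple[str, str]]


def translate_document(
    source_sentences: Sequence[str],
    translate_fn: Callable[[Context, str], str],
    context_size: int = 3,  # assumed to mirror the fine-tuning context
) -> List[str]:
    """Translate sentences in order, feeding back the model's own outputs."""
    translations: List[str] = []
    for i, src in enumerate(source_sentences):
        start = max(0, i - context_size)
        # Context pairs combine earlier sources with their generated translations.
        context = list(zip(source_sentences[start:i], translations[start:i]))
        translations.append(translate_fn(context, src))
    return translations
```

Because each step depends on earlier outputs, sentences within a document cannot be decoded in parallel in this sketch, although different documents can.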

A Preliminary Study on Prompts

The prompt plays a crucial role in LLM research. Recent studies show that an optimal prompt can greatly enhance model performance and reveal unexpected model capabilities (Kojima et al., 2022; Wei et al., 2022b). Hence, our initial focus is on investigating the prompt’s impact during fine-tuning.

Displayed in Figure 1, our preliminary study features four prompt types. These designs aim to tackle two research questions: How does context structure impact translation quality? (Prompt 1 vs. Prompt 2) and How do natural language instructions influence translation quality? (Prompt 1 vs. Prompt 3). We also investigate the combined effect of these aspects in Prompt 4.

Results

Our investigation analyzes prompt variations using three PEFT models (L-7B-LoRA, B-7B-LoRA, and V-7B-LoRA) on four English-centric translation tasks involving German and Chinese. Overall results are presented in Table 1. Comparing Prompt 1 (Figure 1(a)) and Prompt 2 (Figure 1(b)), we find that models fine-tuned with Prompt 2 generally outperform those with Prompt 1, indicating Prompt 2’s effectiveness in enhancing LLM performance. Regarding our second research question (Figure 1(a) vs. Figure 1(c)), we observe varied performance. L-7B-LoRA and B-7B-LoRA perform better with Prompt 3, while V-7B-LoRA performs better with Prompt 1. These results highlight varying impacts of prompt variations across models and suggest natural language instructions are less effective when using instruction-tuned language models as model backbones. Finally, LLMs with Prompt 4 (Figure 1(d)) achieve the best overall performance, suggesting a positive compound effect of context structure and instructions.

Conclusion

As expected, the prompt plays a significant role in LLM performance. A well-structured prompt, which combines an appropriate context structure and natural language instructions, can significantly boost model performance. In this work, we use Prompt 4 (Figure 1(d)) in our other experiments, unless otherwise mentioned.

Main Results

In our results presented in Table 2, we observe that gpt-4-turbo and gpt-3.5-turbo significantly outshine all other models in performance. Notably, the NLLB variants, which are trained on a vast amount of parallel sentence pairs, also demonstrate superior performance among specialized machine translation (MT) models. In the context of DocMT, conventional DocMT models still outperform our LLM-based DocMT models for translations from English to other languages when evaluated using standard MT metrics. Conversely, for translations from other languages to English, our LLM-based DocMT models perform on par with or better than conventional DocMT models in $\mu_{s\textsc{BLEU}}$ and $\mu_{d\textsc{BLEU}}$, while the conventional DocMT models maintain superior performance in $\mu_{\textsc{COMET}}$.

LLM-based DocMT Models

As indicated in Table 2, our models incorporating LoRA typically outperform fully fine-tuned (FFT) LLMs, with one exception: V-7B-fft outperforms V-7B-LoRA in translating from other languages to English. The general advantage of LoRA is likely attributable to overfitting in the FFT models. In scenarios of extensive fine-tuning with a large corpus of parallel documents, the full fine-tuning of all parameters often leads to rapid overfitting on the training dataset. In contrast, the parameter-efficient fine-tuning approach, exemplified by LoRA, updates only a small number of parameters, which helps prevent the models from overfitting the training set. Furthermore, we observe that the L-7B and V-7B models exhibit comparable performance, suggesting that initializing with instruction-tuned models does not always enhance task-specific performance.

Breakdown Performance

We present the results for the translation tasks from other languages to English in Figure 2. For readability, the figure shows only the results of our models using LoRA. Our LLM-based DocMT models exhibit superior performance, sometimes even surpassing gpt-4-turbo in certain translation tasks. However, they fail completely in others. A manual review of translation tasks where our LLM-based DocMT models fail reveals that the primary cause of failure is off-target translation. We provide an in-depth analysis of the off-target translation problem in Section 6. A complete breakdown of the results is in Appendix E.

Analyses

In this section, we investigate the off-target problem and leverage gpt-4-turbo to analyze the translation errors. We also explore discourse phenomena, the training strategy, and the scaling law of parallel documents. Furthermore, we conduct additional evaluations on recent test sets from WMT2023 and examine crosslingual transfer.

In Figure 2, our LLM-based DocMT models excel in some translation tasks but struggle in others due to off-target translation issues. We investigate this problem using the fasttext library (Bojanowski et al., 2017) to identify translation languages and quantify off-target rates, which represent the proportion of translations that are off-target. Results are presented in Table 3, with off-target rates reaching up to $98.3\%$ in failing tasks. Notably, only B-7B-LoRA consistently maintains low off-target rates, likely due to Bloom-7B’s multilingual pre-training. These findings shed light on the main reason for translation failures in LLM-based DocMT models, offering insights for future research. Detailed off-target rates are provided in Appendix F.
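The off-target rate defined above can be computed with a few lines. In this sketch, `detect_lang` is a hypothetical callable that returns a language code for a piece of text; it stands in for a language identifier such as the fastText one used in the paper, whose exact invocation is not shown in this section.

```python
from typing import Callable, Iterable


def off_target_rate(
    translations: Iterable[str],
    expected_lang: str,
    detect_lang: Callable[[str], str],
) -> float:
    """Percentage of translations whose detected language is not the target."""
    translations = list(translations)
    if not translations:
        return 0.0
    off_target = sum(1 for text in translations if detect_lang(text) != expected_lang)
    return 100.0 * off_target / len(translations)
```

A rate close to 100 means the model almost never produces the requested language, which is the failure mode reported for the worst tasks in Table 3.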

Translation Errors

To comprehensively understand the translation capabilities of our LLM-based DocMT models, we select specific error types from the Multidimensional Quality Metrics (MQM) framework (Burchardt, 2013). Kocmi and Federmann (2023) demonstrate gpt-4 is capable of identifying error spans and achieving state-of-the-art MT evaluation accuracy, so we leverage gpt-4-turbo to analyze the translation errors of the text translated by these models. We focus on four models due to resource constraints: L-7B-LoRA, L-7B-fft, Doc2Doc-mT5-1.2B, and GoogleTrans, assessing translations from English to German, Romanian, and Chinese. The error identification prompt is detailed in Appendix D, and we present the frequency of error types in Figure 3. Notably, most errors are limited to individual sentences. Despite similar scores in metrics such as $s$BLEU, $d$BLEU, and COMET among the models, our LLM-based DocMT models (L-7B-LoRA and L-7B-fft) exhibit fewer context-independent and context-dependent errors. This highlights a limitation in current evaluation metrics, suggesting they may not sufficiently assess document-level translations. It also indicates that fine-tuning LLMs for machine translation holds promise for enhancing DocMT performance.

Discourse Phenomena

To evaluate our LLM-based DocMT model’s ability to leverage contextual information, we assess it using the English-German contrastive test set by Müller et al. (2018). This evaluation tests the model’s accuracy in selecting the correct German pronoun (“er”, “es”, and “sie”) from multiple translation options. Results, shown in Table 4, reveal that models initialized with Llama2-7B and Vicuna-7B outperform Doc2Doc-mT5-1.2B, while Bloom-7B-initialized models perform worse. We attribute this gap to the lack of German text in Bloom pre-training, as detailed by Scao et al. (2022), which indicates that contextual understanding is mostly acquired during pre-training.
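Contrastive evaluation of this kind is usually scored by checking whether the model assigns its highest score to the correct candidate. The sketch below follows that reading; `score_fn` is a hypothetical callable returning a model score (for example a log-probability) for a candidate translation given the source, and the example dictionary layout is an assumption made for illustration.

```python
from typing import Callable, Dict, List


def contrastive_accuracy(
    examples: List[Dict],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the correct candidate gets the top score."""
    if not examples:
        return 0.0
    correct = 0
    for example in examples:
        scores = [
            score_fn(example["source"], candidate)
            for candidate in example["candidates"]
        ]
        # Index of the highest-scoring candidate (first one wins on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == example["correct_index"])
    return correct / len(examples)
```

Since the candidates differ only in the pronoun, a model can only rank them reliably if it uses the preceding context that determines the antecedent.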

Training Strategy

In this study, we follow the two-stage approach of Xu et al. (2023), whereas traditional DocMT methods typically start with parallel sentence training. In this section, we examine whether this conventional step also benefits LLM-based DocMT models. To this end, we introduce a three-stage training strategy, involving (1) monolingual document fine-tuning, (2) parallel sentence fine-tuning, and (3) parallel document fine-tuning, applied to all parameters of Llama2-7B. The results in Table 5 indicate that the three-stage training strategy is unnecessary for both high-performing languages (Dutch and Romanian) and low-performing languages (Arabic and Chinese) with LLM-based DocMT models.

Scaling Law of Parallel Documents

In this section, we explore the scaling law for fine-tuning parallel documents. We focus on English to German, Romanian, and Chinese translations due to our models’ proficiency. Results for English-German translation are presented in Figure 4, and for English-Romanian and English-Chinese in Appendix G. While LLMs typically excel with minimal training data, different fine-tuning strategies show distinct scaling behaviors. Our LoRA models match full training set performance with just $10\%$ of the data (around $20K$ examples), while fully fine-tuned models achieve near-equivalent performance with only about $1\%$ of the data (approximately $2K$ examples). These insights are crucial for low-resource languages, as recent LLMs are predominantly pre-trained on English text.
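A data-scaling experiment of this kind needs reproducible subsets of the training set. The following sketch draws a random fraction of the examples with a fixed seed; the function name `subsample` and the use of uniform random sampling are assumptions, since the section does not say how the subsets were drawn.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], fraction: float, seed: int = 0) -> List[T]:
    """Draw a reproducible random subset containing `fraction` of the examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, round(len(examples) * fraction))
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return rng.sample(list(examples), size)
```

With a training set of roughly 200K examples, as implied by the figures quoted above, `subsample(train, 0.10)` yields about 20K examples and `subsample(train, 0.01)` about 2K.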

Evaluation on Recent Test Sets

Given their pre-training on extensive text corpora, LLMs may be susceptible to data leakage risks. We evaluate our models using recent test sets from WMT2023 (Koehn et al., 2023). These tests, conducted between English and German, not only evaluate the out-of-domain generalization of our models but also help mitigate the risks associated with data leakage. We use spaCy to segment documents and discard any parallel documents where the source and target sides have a differing number of sentences. Our findings, presented in Table 7, reveal that while Doc2Doc-mT5 models outperform LLM-based models in Table 2, LLM-based models excel in translating out-of-domain text on the WMT2023 test sets. These findings highlight the ability of LLM-based DocMT to generalize well to out-of-domain translation tasks.
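The filtering step described above can be sketched as follows. The paper segments with spaCy; to keep the example dependency-free, a naive regex splitter named `naive_split` is swapped in here as a stand-in, and any other segmenter can be passed through the `segment` parameter. Both names are illustrative assumptions.

```python
import re
from typing import Callable, Iterable, List, Tuple


def naive_split(text: str) -> List[str]:
    """Stand-in sentence splitter (NOT spaCy): split after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def keep_aligned(
    doc_pairs: Iterable[Tuple[str, str]],
    segment: Callable[[str], List[str]] = naive_split,
) -> List[Tuple[List[str], List[str]]]:
    """Keep only documents whose two sides have the same sentence count."""
    kept = []
    for src_doc, tgt_doc in doc_pairs:
        src_sents, tgt_sents = segment(src_doc), segment(tgt_doc)
        if len(src_sents) == len(tgt_sents):
            kept.append((src_sents, tgt_sents))
    return kept
```

Equal sentence counts are only a proxy for sentence alignment, but they allow the sentence-by-sentence decoding and sentence-level metrics to be applied without a separate alignment step.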

Zero-Shot Crosslingual Transfer

In this section, we explore the transferability of translation capabilities acquired from one language pair to others. We assess our English-German LLM-based DocMT models on English-to-other-language test sets, comparing their COMET scores to their base models in Table 6. Our results indicate that models with base LLM backbones (Llama2-7B and Bloom-7B) consistently exhibit positive transfer effects across all language pairs, while those with an instruction-tuned backbone (Vicuna-7B) benefit only a few languages. These findings suggest that LLMs are more likely to activate their inherent translation abilities during fine-tuning rather than developing new ones.

Conclusion

This study investigates the adaptation of large language models (LLMs) for document-level machine translation (DocMT) through extensive experimentation with two fine-tuning methods, three LLM backbones, and 18 translation tasks across nine language pairs. Results demonstrate that task-specific supervised fine-tuning on parallel documents significantly boosts the performance of moderately-sized LLM-based models (with $7B$ parameters) in DocMT, surpassing gpt-4-turbo in some cases. Our analysis offers insights into LLM-based DocMT models, providing a foundation for future advancements in the field of DocMT.

Limitations

Our research is confined to language models of a moderate size, specifically those with $7B$ parameters. This limitation is due to the constraints of our available resources. Consequently, it is crucial to acknowledge that the outcomes of our study might vary if conducted with larger models.

Instability in Training

The process of supervised fine-tuning for LLMs shows instability in our observations. As detailed in Figure 4, there are noticeable inconsistencies in performance. These variations are too significant to attribute solely to the randomness inherent in training. In some cases, the fine-tuning of LLMs fails to reach convergence. Unfortunately, our limited resources restrict us from investigating these failures in depth or devising potential remedies.

Influence of Prompting Techniques

Section 4 of our study highlights the significant role of prompting methods in fine-tuning. We experiment with four different prompting techniques. It is important to note that the prompt we recommend may not be the most effective, potentially leading to suboptimal performance of our models.

We acknowledge these limitations and leave them to future work.

References

Appendix A Statistics of Parallel Documents

We present the dataset statistics of parallel documents in Table 8.

Appendix B Optimization and Hyperparameters

We fine-tune all the parameters of large language models (LLMs) using a learning rate of $5\times 10^{-5}$ and a batch size of $256$. During the training process, we apply the linear learning rate schedule, which includes a warm-up phase comprising $10\%$ of the total training steps.
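The schedule can be written as a small function of the training step. This is a sketch under the assumption that "linear schedule" means a linear ramp from zero to the peak rate during warm-up followed by a linear decay back to zero; the decay endpoint is not stated in the text, and the function name `linear_schedule_lr` is illustrative.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-5,       # learning rate quoted above
    warmup_ratio: float = 0.10,  # warm-up share of total steps quoted above
) -> float:
    """Learning rate at `step` for linear warm-up followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Assumed: decay linearly from the peak to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

For a run of 1,000 steps, the rate rises over the first 100 steps, peaks at $5\times 10^{-5}$ at step 100, and falls to half of that at step 550.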

Fine-tuning on Parallel Documents

When fine-tuning L-7B-LoRA and V-7B-LoRA on parallel documents, we employ a learning rate of $5\times 10^{-5}$ and utilize a batch size of $64$. Additionally, we apply a linear learning rate schedule, with a warm-up phase comprising $10\%$ of the total training steps. The LoRA rank is set to $16$, impacting only $0.1\%$ of the parameters (about 8M parameters). We maintain the same hyperparameters for fine-tuning Doc2Doc-mT5 models, with the exception of using a learning rate of $5\times 10^{-4}$. In this phase, L-7B-LoRA and V-7B-LoRA are fine-tuned for a maximum of $3$ epochs, and Doc2Doc-mT5 models are fine-tuned for a maximum of $10$ epochs. Early stopping is applied on the validation loss.
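The figure of about 8M trainable parameters can be checked with a short calculation. The sketch assumes that LoRA adapters are attached to two square projection matrices per Transformer layer, which is not stated in the text, and uses the Llama2-7B dimensions (hidden size 4096, 32 layers). Each adapted matrix of shape $d \times d$ gains two low-rank factors with $r \cdot d$ parameters each.

```python
def lora_param_count(
    rank: int, hidden_size: int, num_layers: int, matrices_per_layer: int
) -> int:
    """Trainable parameters added by LoRA on square projection matrices."""
    # Factor A is (rank x hidden_size) and factor B is (hidden_size x rank).
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * matrices_per_layer * num_layers


# Assumed setup: rank 16 on two projections per layer of a Llama2-7B-sized model.
added = lora_param_count(rank=16, hidden_size=4096, num_layers=32, matrices_per_layer=2)
share = 100.0 * added / 7e9  # percentage relative to a nominal 7B parameters
print(added, round(share, 2))
```

Under these assumptions the adapters add 8,388,608 parameters, roughly 0.12% of a 7B-parameter model, which is consistent with the approximate figures quoted above.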

Appendix C Prompt Types

We present concrete examples of prompt variations in Figure 5.

Appendix D GPT-4 Prompts

We present the prompts used for error type analysis in Figure 6.

Appendix E Breakdown Results

We provide detailed breakdowns of the translation tasks from English to other languages, evaluated using $s$BLEU, $d$BLEU, and COMET. These are presented in Table 9, Table 10, and Table 11, respectively. Additionally, we present similar breakdowns for translations from other languages to English, assessed using the same metrics. These results can be found in Table 12, Table 13, and Table 14.

Appendix F Off-Target Translation

We present the complete results on the off-target translation problem in Table 15 and Table 16.

Appendix G Scaling Law of Parallel Documents from English to Romanian and Chinese

In Section 6, we find that our LLM-based DocMT models are highly efficient in terms of the amount of training data. To confirm our findings in Section 6, we conduct additional experiments on the translation tasks from English to Romanian and Chinese. As shown in Figure 7, we can confirm the superiority of LLM-based DocMT models with regard to data efficiency.