Large language models effectively leverage document-level context for literary translation, but critical errors persist

Marzena Karpinska, Mohit Iyyer

Introduction

Separate text from context and all that remains is a con.

Large language models (LLMs) such as ChatGPT OpenAI (2022) demonstrate remarkable performance as stand-alone translation systems, rivaling and sometimes surpassing commercial models on sentence-level benchmarks Vilar et al. (2022); Hendy et al. (2023); Jiao et al. (2023). Furthermore, LLMs are increasingly being deployed for document-level translation Book Maker (2023); Pawlak (2023), a scenario for which there are currently no reliable automatic evaluation methods. In this paper, we hire human translators to conduct a rigorous fine-grained evaluation of Gpt-3.5’s ability to translate paragraph-level texts from literary works across 18 different language pairs. Our results (Figure 4) demonstrate that Gpt-3.5 effectively leverages discourse-level context to produce higher-quality translations than when translating sentences in isolation. (We completed our annotations on translations from the text-davinci-003 checkpoint obtained prior to the API release of ChatGPT and GPT-4. Nevertheless, we include preliminary analysis of GPT-4’s translations in §7.)

Translating works of literature poses unique challenges due to the intricate nature of creative work and the importance of capturing the author’s voice and contextual nuances. Translators thus apply a wide range of translation techniques Chesterman (1997); Molina and Hurtado Albir (2004), from simple shifts in grammatical categories to more complex stylistic or content-based rearrangements that often cross sentence boundaries. Translators may also merge or split sentences, or even entire paragraphs, which renders the traditional sentence-level pipeline insufficient for capturing the full scope of the original text Toral and Way (2015); Taivalkoski-Shilov (2019b). (At least 55% of the reference target paragraphs used in our study split or merge sentences from the source text, as measured with an automatic sentence tokenizer.) Taken together, these properties make literary texts a good testbed for document-level machine translation (Thai et al., 2022); in our work, we focus on the paragraph as a minimal discourse-level unit. (We broadly define a paragraph as a distinct passage within the novel, focusing on a single theme.)

Why human evaluation?

The absence of rigorous document-level evaluations of LLM translators is striking but also somewhat understandable given the unreliability of automatic metrics (Thai et al., 2022) and the difficulty of properly conducting human evaluations Castilho (2021). Furthermore, evaluations of LLM translators are especially difficult due to data contamination Aiyappa et al. (2023), as it is unclear whether the models are pretrained on existing benchmarks (e.g., from WMT). We fill this gap by first collecting paragraphs from recently-published literary translations. Then, we provide human translators with two candidate machine translations of a given source paragraph and ask them to (1) mark error spans and categorize them based on a predefined schema inspired by MQM (Lommel et al., 2014b; Freitag et al., 2021), (2) make preference judgments of which of the two translations is of higher quality, and (3) provide free-form justifications of their preference judgments. In total, we collect such annotations on 720 pairs of translated paragraphs across 18 different language pairs (using three diverse target languages of English, Japanese, and Polish), which we then leverage for a fine-grained analysis of the behavior of different LLM translation methods.

How do we use LLMs to translate paragraphs?

We use three strategies to generate the paragraph-level translations for our evaluations, all of which rely on few-shot prompting with Gpt-3.5: (1) translating each sentence in the paragraph in isolation from the others (Sent); (2) translating each sentence in the paragraph when provided with the rest of the paragraph as context (Para_Sent); and (3) translating the entire paragraph at once (Para), not sentence-by-sentence. Finally, we also compare these methods to Google Translate (GTr).

LLMs produce better translations when provided with paragraph-level context:

Our evaluations reveal that using Gpt-3.5 to translate complete paragraphs (Para) yields translations of significantly higher quality than both the sentence-by-sentence Gpt-3.5 methods and Google Translate. Our detailed analysis of annotated translation errors and free-form comments shows that paragraph-level translations exhibit increased coherence, better preservation of literary style, and improved handling of context-dependent expressions (see Figure 2). That said, we also observe that Para still makes numerous critical mistranslations and other errors across different language pairs, which shows that LLM-based translators still have significant room for improvement, particularly when applied to translating contextually-rich literary texts.

Background

Before describing our dataset and evaluation, we first contextualize our work within both the existing body of document-level machine translation as well as recent papers on translation via large language models. (Note that the term “document-level” has been used in MT research to denote both multi-sentence passages as well as complete documents.)

Before the rise of neural machine translation, several attempts were made to incorporate discourse-level phenomena into statistical machine translation systems Hardmeier (2012); Carpuat and Simard (2012); Hardmeier et al. (2013); Ding et al. (2014). Neural MT systems condition sentence-by-sentence translation on discourse-level context via concatenation models Tiedemann and Scherrer (2017); Jean et al. (2017); Agrawal et al. (2018); Junczys-Dowmunt (2019); Lopes et al. (2020), hierarchical models Miculicich et al. (2018); Tan et al. (2019); Chen et al. (2020); Zheng et al. (2020), multi-pass models Mansimov et al. (2021), dynamic context models Kang et al. (2020), multi-source models Zhang et al. (2018); Feng et al. (2022), and transfer learning approaches Zhang et al. (2022). Despite sometimes obtaining clear gains from discourse-level context Voita et al. (2019), the machine translation community has not made much progress on this problem, particularly for non-English language pairs, due largely to the scarcity of parallel document-level corpora Zhang et al. (2022). This problem has been partially addressed by introducing a pivot language Cohn and Lapata (2007); Utiyama and Isahara (2007), but this approach can also lead to substantial information loss.

Translation with large language models

Many recent studies explore the potential that LLMs hold for translation, an especially attractive prospect given that training or fine-tuning on large parallel corpora is not necessary. (That said, parallel data is almost certainly included, at least for high-resource languages, in LLM pretraining data.) These works span paragraph-level post-editing with LLMs Thai et al. (2022), translating sentence-level inputs Vilar et al. (2022); Jiao et al. (2023), analyzing hallucinations in LLM-generated translations Guerreiro et al. (2023), and employing LLMs to evaluate machine translation Kocmi and Federmann (2023). Studies on prompt engineering for translation conclude that simple sentence-level English prompt templates are effective for paragraph translations Zhang et al. (2023). Other findings reveal that automatically-generated dictionaries assist translation Ghazvininejad et al. (2023), and that example quality outweighs lexico-semantic proximity to input Vilar et al. (2022). To the best of our knowledge, the only work other than ours that evaluates LLMs for paragraph-level translation is that of Hendy et al. (2023), which focuses on automatic evaluation of context-aware sentence-by-sentence translation. Unlike Hendy et al. (2023), we perform a fine-grained human evaluation of paragraph-level translation, which sheds more light on the concrete strengths and weaknesses of LLM translators in this setting.

Data & methods

Our work differs from existing research on translating with large language models in two key ways: we focus on literary text, and we translate at the paragraph level. In this section, we describe and motivate the paragraph-level translation dataset used in our study, which covers eighteen unique language pairs (three target languages) and is sourced from recently-published novels. Then, we outline the different ways in which we leverage Gpt-3.5 to translate these paragraphs at both the sentence and paragraph levels.

Literary texts (e.g., novels or short stories) pose unique challenges for translators due to their complex nature. Translators must interpret and honor the author’s voice with no objective reality to measure against, which can result in several equally valid translations Sager (1998). For machine translation systems, these challenges exacerbate the need for discourse-level context Thai et al. (2022): an author’s intended meaning or style is often unclear from just a single sentence.

How good are machines at translating literary paragraphs? To answer this question, we extract 20 paragraphs (dialogues and narrative texts) each from 18 recently-published translations of novels, and we manually align these paragraphs with corresponding paragraphs in the source novel (see Table 1). (In most cases, we purchase the source ebook and its corresponding translation before extracting aligned paragraphs, but for a few books, we utilized Amazon’s free preview functionality.) The target language of each translation is English, Polish, or Japanese (6 books for each), and we consider eight different source languages. Almost all of the translations were published after 2021 (see Table 2), which is important to avoid data contamination with the pretraining data of large language models. In sum, we obtain 360 aligned source-target paragraphs, which we use for all of the experiments described in the rest of the paper.

Paragraph length:

All paragraphs consist of at least two sentences, and the majority of them are between four and nine sentences long (mean=7.45, std=4.14). (A paragraph with fewer sentences is not necessarily short: for example, in the German novel “An Inventory of Losses,” sentences can be as long as 70 to 80 words, with the longest reaching 117 words.) As automatic sentence tokenizers are not always reliable for all of the languages considered in our study, we manually perform sentence tokenization to enable a direct comparison of sentence and paragraph-level translation systems. For more details about the dataset statistics, including token and sentence counts, see Table 8 and Table 9.

Target language selection:

We select English, Japanese, and Polish as the target languages of our study, as these languages differ considerably in many linguistic aspects. English is an analytic language that is widely spoken and extensively studied in the field of natural language processing, and it serves as the primary pretraining language of most large language models, including Gpt-3.5. (As of 2020, the reported distribution of languages featured in the present study within the Gpt-3 training data was as follows: English – 92.647% (1st), French – 1.818% (2nd), German – 1.469% (3rd), Russian – 0.188% (9th), Polish – 0.155% (11th), Japanese – 0.111% (15th), Chinese – 0.099% (17th), Czech – 0.071% (18th); see https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv. The current GPT-3.5 text-davinci-003 model is reported to incorporate data up to June 2021, and it is unclear what texts or languages were added to the original training data; see https://platform.openai.com/docs/models/gpt-3-5.) In contrast, both Japanese and Polish are comparatively under-explored. Japanese is an agglutinative language that employs three distinct writing systems: Kanji, Hiragana, and Katakana. As a high-context language, the translation of Japanese texts necessitates a profound comprehension of context and cultural nuances, rendering it a compelling choice for testing the limits of LLMs’ translation capabilities. Polish, on the other hand, is a fusional language characterized by a rich morphological system. Its complex word forms, grammatical gender, conjugation, and declension make it an apt choice for testing the accuracy and robustness of LLMs. (The first author is fluent in all three target languages.)

Source language selection:

As source languages, we select English (en), Polish (pl), Russian (ru), Czech (cs), French (fr), German (de), Japanese (ja), and Chinese (zh). These languages belong to a diverse array of language families – Indo-European (Romance, Germanic, Slavic), Sino-Tibetan, and Japonic – each with distinctive morphological traits – fusional, agglutinative, and analytic. Moreover, they employ a variety of writing systems such as the Latin alphabet, the Cyrillic alphabet, Hanzi, and Kanji/Hiragana/Katakana (see Table 7 in Appendix A for details). Finally, we carefully select source-target language pairs to ensure that our study encompasses both linguistically similar and dissimilar languages. For example, we paired cs-pl, as these languages are characterized by only 10% lexical distance (i.e., the percentage of non-cognates in the language pair) and have similar syntactic structures Jágrová and Avgustinova (2023). Conversely, we also include ja-pl, as the two languages have very little lexical overlap, vastly different grammars, and utilize distinct writing systems.

Translation with large language models

In this paper, we focus on translating the literary paragraphs in our dataset using large language models. More specifically, we use the Gpt-3.5 text-davinci-003 checkpoint, which has been further tuned to follow instructions based on human feedback (Ouyang et al., 2022). Hendy et al. (2023) demonstrate that GPT-3.5 produces translations of reasonable quality, though their focus was mostly at the sentence level. Since many LLMs including Gpt-3.5 are only accessible via black-box APIs, we adapt the model for translation via in-context learning (Brown et al., 2020).

We use few-shot prompting, in which a model is provided with a prompt consisting of five demonstrations. We manually curate the five demonstrations from literary texts for each of the 18 language pairs, resulting in 90 total demonstration examples. These demonstrations are sourced from novels that are not part of our translation dataset, resulting in potential differences in topic and style (see Table 10 in Appendix A for details). We further ensure that each set of five demonstrations includes both dialogues and narrative texts.

Prompting for translation:

We consider the following three prompting strategies for Gpt-3.5 that allow us to compare the model’s abilities to translate with and without discourse-level context (see Table 3 for templates and Appendix B for the exact prompts):

Gpt-3.5 sentence-level translation without context (Sent): Each sentence of the paragraph is translated in isolation from the others. To maintain consistency, we provide the same five sentence-level examples in each prompt for the given source-target language pair. (Sentence-level demonstrations for Sent are sampled from the demonstrations for paragraph-level translation. To ensure consistent quotation mark usage and enable a fair comparison with paragraph-level translations, quotation marks in sentence-level translations were manually adjusted.)

Gpt-3.5 sentence-level translation with context (Para_Sent): Each sentence of the paragraph is translated in context. The model is provided with the entire source paragraph as input, where the sentence to be translated is wrapped in opening and closing tags, in addition to a partially-translated target paragraph. The demonstrations in the prompt also contain opening and closing tags wrapped around one sentence per demonstration. For each demonstration in the prompt, a sentence in a different position was chosen (e.g., from the beginning, middle, and end of the paragraph).

Gpt-3.5 paragraph-level translation (Para): The entire source paragraph is passed into the model, and the output target paragraph is generated conditioned on this input (i.e., without any sentence tokenization). Demonstrations in the prompt are also paragraphs of translations from the respective source language into the target language in question. (The examples for Para and Para_Sent configurations are necessarily lengthier. Due to the Gpt-3.5 maximum context size, it is not always possible to include all five examples within the prompt. Consequently, around 10% of the data was translated using four or fewer examples.) (Initially, we experimented with GPT-3 by translating between two non-English languages using English as a pivot, as it is the primary language of the model. The model had access to the source text and its English translation. After manual evaluation and comparison to translations without a pivot language, we found no significant benefit in using English as the pivot. Consequently, we directly translated paragraphs into the target language. Refer to Appendix D for details and results of this preliminary study.)
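As a rough illustration of how the three prompting strategies differ, the following Python sketch assembles few-shot prompts for each of them. It is not the exact prompt format used in this study (see Table 3 and Appendix B for that): the tag strings `<translate>` and `</translate>`, the "source:" / "translation:" labels, and all function names are hypothetical choices made only for this sketch.

```python
from typing import List, Sequence, Tuple

# Hypothetical tag strings: the text only says the sentence to translate is
# wrapped in opening and closing tags; these names are an assumption.
OPEN_TAG, CLOSE_TAG = "<translate>", "</translate>"


def mark_sentence(sents: Sequence[str], idx: int) -> str:
    """Join sentences into a paragraph, wrapping sentence `idx` in tags."""
    return " ".join(
        f"{OPEN_TAG}{s}{CLOSE_TAG}" if i == idx else s
        for i, s in enumerate(sents)
    )


def block(src_lang: str, tgt_lang: str, source: str, target: str) -> str:
    """One source/target block; an empty target leaves a slot for the model."""
    text = f"{src_lang} source:\n{source}\n{tgt_lang} translation:\n{target}"
    return text.rstrip()


def plain_prompt(demos: Sequence[Tuple[str, str]], text: str,
                 src_lang: str, tgt_lang: str) -> str:
    """Sent (text is one sentence) or Para (text is a whole paragraph)."""
    parts = [block(src_lang, tgt_lang, s, t) for s, t in demos]
    parts.append(block(src_lang, tgt_lang, text, ""))
    return "\n\n".join(parts)


def para_sent_prompt(demos: Sequence[Tuple[List[str], List[str], int]],
                     src_sents: Sequence[str], idx: int,
                     translated_so_far: Sequence[str],
                     src_lang: str, tgt_lang: str) -> str:
    """Para_Sent: full source paragraph with one tagged sentence, plus the
    partially translated target paragraph for the model to continue."""
    parts = [
        # Each demo shows the target up to and including the tagged sentence.
        block(src_lang, tgt_lang, mark_sentence(s, i), " ".join(t[: i + 1]))
        for s, t, i in demos
    ]
    parts.append(block(src_lang, tgt_lang, mark_sentence(src_sents, idx),
                       " ".join(translated_so_far)))
    return "\n\n".join(parts)
```

In this sketch, Sent and Para share one template and differ only in whether a single sentence or a whole paragraph is passed as `text`, while Para_Sent must be called once per sentence with the translations produced so far, which is why it is the least efficient of the three.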

Using Google Translate (GTr) as a baseline:

In order to compare commercial-grade translation systems to LLM translators, we also translate all paragraphs in our dataset using Google Translate. (All paragraphs were translated in January 2023 using the GoogleTranslate API.) We opt for an off-the-shelf commercial system instead of a state-of-the-art system from, for instance, WMT competitions for two primary reasons. First, our experiments focus on literary translations. Given that WMT systems are predominantly evaluated on the news domain, it is uncertain which system would perform best, and some language pairs may not even be supported. Second, our main research question revolves around LLMs’ ability to incorporate contextual information, rather than merely comparing their performance with state-of-the-art translation systems. We employ GTr as a reasonably robust baseline to assess the extent to which context can enhance MT quality, rather than asserting that LLMs outperform all traditional MT systems.

Evaluating document-level literary translation

How do we compare the translation quality of the systems described above? Automatic metrics such as BLEURT and COMET are untested on document-level inputs as well as literary texts, and as such we do not consider them reliable, although we do report them in §5.1. (Automatic metrics developed specifically for document-level MT are also insufficient, as they either work best with one-to-one sentence-level alignments Vernikos et al. (2022); Hendy et al. (2023) or are available only for English Jiang et al. (2022).) Human evaluation is equally problematic, as direct assessments of translation quality (e.g., ‘‘rate the quality of this translation from 0-100’’) suffer from calibration issues that are exacerbated with longer texts Karpinska et al. (2021). Thus, we opt for a human evaluation inspired by Multidimensional Quality Metrics (Lommel et al., 2014b, MQM), in which annotators mark and classify error spans within the translation. Specifically, for each of the 18 language pairs studied in this work, we hire translators to identify all span-level errors in two competing translations. For each evaluated pair, the annotators were also asked to choose the better translation and provide a free-form rationale. For each source paragraph, the translators make three binary judgments of which translation is higher quality: Sent vs Para, Para_Sent vs Para, and GTr vs Para.

As our task is complex and requires fluency in both the source and target language, we hire translators to provide the annotations. We recruit 13 translators, each of whom is a native speaker of English, Polish, or Japanese, through the Upwork freelancing platform (https://www.upwork.com/). (The annotators for Czech-Polish and Russian-English were both native speakers of the respective source languages and highly proficient in their respective target languages. They collaborated with native speakers of the target languages, who possessed a basic understanding of the source language, to complete their annotations.) One translator, hired directly, was a bilingual speaker of English and Polish with advanced knowledge of German; as such, she performed the pl-en, de-en, and de-pl evaluations. Evaluation of ja-pl, pl-ja, and pl-en texts was done by the first author in collaboration with native speakers of Polish/Japanese to avoid any potential bias. Each translator was paid $2 per evaluated pair of candidate translations, with an additional $5 bonus to cover the time spent familiarizing themselves with the instructions. We asked them to compare three pairs of system translations (Para vs. Sent, Para vs. Para_Sent, Para vs. GTr) for 10 paragraphs per language pair; as such, 180 total source paragraphs were used in our evaluations. Altogether, we paid approximately $12 per hour, with a total cost of $955.

Annotation task:

First, we tasked the hired translators with annotating a subset of MQM translation errors identified through a pilot analysis and annotation of the system’s outputs. (They were presented with guidelines in their native language. The annotation task was performed using the LabelStudio annotation tool Tkachenko et al. (2020-2022). See Figure 11 for a screenshot of the interface.) Specifically, we ask them to highlight spans within the candidate translations that contain errors belonging to any of the following error categories:

mistranslation: accuracy errors that occur when the wrong target word or phrase is chosen to represent content from the source text. In addition to canonical mistranslations, we also include overly literal translation errors that occur when systems translate word-by-word into the target language even though the result is nonsensical. (We note that mistranslations in literary text are often not as grave as, for instance, in news articles. Human translators hold poetic license, which allows them to change some details to make the text more enjoyable for the reader. Is changing “bonito” into “tuna” incorrect? Or can it be perceived as a way to accommodate an English-speaking readership that is likely more familiar with the latter?)

grammar: grammatical errors, such as errors in conjugation or declension, wrong prepositions, etc.

untranslated: words or phrases that should have been translated into the target language but were either left in the source language or just transliterated into the target language.

inconsistency: use of different terms to refer to the same entity, or different words where the same word should be used for stylistic reasons (e.g., ‘‘Kasia’’ and ‘‘Kate,’’ ‘‘coat’’ and ‘‘jacket,’’ or ‘‘bad’’ and ‘‘awful’’).

register: a clear violation in the use of formal and informal language within the same text, only annotated in Japanese. (We only annotate cases where the level of formality changes abruptly within the same paragraph. It is possible that a given character would be more likely to use formal language but informal language is being employed. As long as this is consistent, we do not consider it an error, as this cannot be fully determined from the paragraph context.)

format: incorrect usage of punctuation (e.g., "." instead of "。").

After the span-level annotation is complete, we then ask the translators to further identify if any of the candidate translations contains significant content additions or omissions in relation to the source text. (Note that this task was simplified to a binary choice: either there were serious omissions/additions or not. We did not ask the annotators to further annotate them due to time restrictions.) Finally, they are asked to choose the better translation and provide a justification for their choice in two to five sentences. We instruct them to additionally mark whether their chosen translation is significantly superior, or if the decision was difficult because both translations are of roughly comparable quality (see Figure 3 and Appendix C for details).
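To make the shape of one annotation concrete, the sketch below models a single pairwise judgment as a small Python data structure. The six error categories, the binary addition/omission flag, the preference, the confidence mark, and the free-form justification all come from the task description above; the class and field names are hypothetical and do not reflect the LabelStudio export format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ErrorCategory(Enum):
    """The six span-level error categories described above."""
    MISTRANSLATION = "mistranslation"
    GRAMMAR = "grammar"
    UNTRANSLATED = "untranslated"
    INCONSISTENCY = "inconsistency"
    REGISTER = "register"  # only annotated for Japanese targets
    FORMAT = "format"


@dataclass
class ErrorSpan:
    start: int  # character offset into the candidate translation
    end: int    # exclusive end offset
    category: ErrorCategory


@dataclass
class Candidate:
    system: str  # "Para", "Sent", "Para_Sent", or "GTr"
    text: str
    spans: List[ErrorSpan] = field(default_factory=list)
    # Binary flag: serious additions or omissions relative to the source.
    significant_addition_or_omission: bool = False

    def count(self, category: ErrorCategory) -> int:
        """Number of annotated spans in the given category."""
        return sum(1 for s in self.spans if s.category is category)


@dataclass
class PairJudgment:
    first: Candidate
    second: Candidate
    preferred: str      # system name of the better translation
    confident: bool     # False when the annotator marked the choice as unsure
    justification: str  # free-form rationale, two to five sentences

    def __post_init__(self) -> None:
        if self.preferred not in (self.first.system, self.second.system):
            raise ValueError("preferred must name one of the two candidates")
```

Aggregate statistics such as per-category error totals or preference rates with and without "unsure" votes can then be computed by iterating over a list of `PairJudgment` records.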

Results

In this section, we compare our different literary translation methodologies using both automatic metrics and aggregate statistics from the human evaluations. Overall, we observe that the Para configuration outperforms competing methods across all evaluations and nearly all language pairs. These results demonstrate that Gpt-3.5 effectively leverages paragraph-level context to produce better translations than sentence-level methods, and also that the less efficient sentence-by-sentence translation with context (Para_Sent) is unnecessary to achieve high translation quality.

We assess the translations from all four systems using the reference-based Comet Rei et al. (2022), Bleurt Sellam et al. (2020), and BertScore Zhang et al. (2020) metrics, as well as the reference-free Comet-QE Rei et al. (2021) metric. (We use the newest wmt22-comet-da checkpoints for Comet, Bleurt-20 checkpoints for Bleurt, wmt20-comet-qe-da checkpoints for Comet-QE, and the HuggingFace implementation which employs roberta-large for BertScore.) Although these metrics were not explicitly designed for evaluating paragraph-level outputs and their results should be interpreted with caution, they prove more reliable than string-based metrics like Bleu, especially for literary translations Thai et al. (2022); Karpinska et al. (2022); Gehrmann et al. (2022). Table 4 shows the effectiveness of the Para translation method: a statistical analysis with linear mixed-effects models Baayen et al. (2008) demonstrates that Para significantly outperforms Sent and GTr based on Comet, Bleurt, and Comet-QE scores (p<.001), and surpasses GTr based on the BertScore results (p<.001). (We present more details of this analysis in Appendix E.)

Human evaluation also favors Para

Figure 5 contains human preference results comparing Para to Sent, Para to Para_Sent, and Para to GTr, aggregated across all 18 language pairs studied in this paper (i.e., 180 votes per system comparison). Table 12 breaks down these results for each language pair, and we observe the same trends for the vast majority of pairs. Overall, the translators significantly favored Para translations over the alternatives (p<.001, binomial test). Table 5 contains specific information about grammar and mistranslation errors split across the three target languages (see Table 6 and Table 13 for details), which we use to discuss the three comparison settings in more detail below.

Para is preferred by translators over Sent at a rate of 71.1% (p<.001, 95% CI [0.639, 0.776]). Additionally, when translators preferred Para, they were usually confident in the decision (i.e., it was clearly better than Sent); even if we exclude all ‘‘unsure’’ votes, the preference for Para translations remains significant at 78.5% (p<.001, 95% CI [0.695, 0.859]). The only language pair in which Sent is favored over Para is de-ja (see Figure 4). This result may be attributed to the fact that the German novel An Inventory of Losses by Judith Schalansky, used for this language pair, contains the longest sentences in our dataset (on average 45 tokens per sentence), which means that the intra-sentence context is likely more informative than in other books (see Table 8). Overall, Sent translations contain 29.5% more mistranslations, 65.4% more grammatical errors, over 12 times more inconsistency errors, and three times more register errors (see Table 5).
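The preference statistics above can be sanity-checked with an exact binomial test. The sketch below uses only the Python standard library and rests on two assumptions that are not stated in the text: that 71.1% of 180 votes corresponds to 128 votes for Para, and that the reported interval is a Clopper-Pearson (exact) interval. It illustrates the technique rather than reproducing the authors' analysis script.

```python
from math import comb


def pmf(k: int, n: int, p: float) -> float:
    """Binomial probability P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k)."""
    return sum(pmf(i, n, p) for i in range(0, k + 1))


def upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k)."""
    return sum(pmf(i, n, p) for i in range(k, n + 1))


def two_sided_p(k: int, n: int) -> float:
    """Exact two-sided test against the null p = 0.5 (symmetric null,
    so doubling the smaller tail is valid)."""
    return min(1.0, 2.0 * min(lower_tail(k, n, 0.5), upper_tail(k, n, 0.5)))


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) interval, found by bisection on the tails."""
    def root(f, increasing: bool) -> float:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if (f(mid) < alpha / 2) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: smallest p for which seeing >= k successes is plausible.
    lower = 0.0 if k == 0 else root(lambda p: upper_tail(k, n, p), True)
    # Upper bound: largest p for which seeing <= k successes is plausible.
    upper = 1.0 if k == n else root(lambda p: lower_tail(k, n, p), False)
    return lower, upper


votes_for_para, total_votes = 128, 180  # 128 is inferred from 71.1% of 180
p_value = two_sided_p(votes_for_para, total_votes)
ci_low, ci_high = clopper_pearson(votes_for_para, total_votes)
print(f"rate={votes_for_para / total_votes:.3f} p={p_value:.2e} "
      f"CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

The same functions apply to the other two comparisons by substituting the corresponding vote counts, and to the analyses that exclude "unsure" votes by reducing both the vote count and the total.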

Para is clearly better than GTr:

Para translations are overwhelmingly preferred over those from Google Translate (GTr), with an 82.8% preference rate (p<.001, 95% CI [0.765, 0.880]). Even after removing the ‘‘unsure’’ votes, the preference for Para remains significant at 88.0% (p<.001, 95% CI [0.812, 0.930]). In the fr-ja, pl-ja, zh-ja, and cs-pl language pairs, Para received all ten votes over GTr. Part of this advantage may be attributed to GTr sometimes using English as a pivot language, which can result in information loss. Our Czech translator observed that mistakes in GTr translations suggest the text was first translated into English. (For the cs-pl language pair, we separately annotated mistranslations arising from pivot translation. These errors accounted for over 50% of all mistranslations in that language pair.) The elimination of the need for parallel data may therefore be beneficial for translating between lower-resource languages, where sufficient parallel data is often unavailable, necessitating pivot translation. Overall, GTr translations result in 57.7% more mistranslations, 37.3% more grammatical errors, over twice as many inconsistency errors, and ten times more register errors (see Table 5). Additionally, GTr produced 125 format errors while Para produced perfect outputs in this regard. Finally, it is worth noting that GTr left fewer words untranslated, though this is inflated by the fact that in one German text, the word ‘‘Bauer’’ (‘‘farmer’’) was untranslated 14 times in the Para translation.

Para is slightly preferred over Para_Sent:

Our evaluations show that Para is better than Para_Sent, but the gap is smaller than it is for the other two methods. Para is still preferred at a 66.1% rate (p<.001, 95% CI [0.587, 0.730]). After removing the ‘‘unsure’’ votes, Para remains the preferred option at a rate of 67.8% (p<.001, 95% CI [0.569, 0.774]). Notably, the error distribution of both translations is more similar than in previous cases. Both Para and Para_Sent result in a comparable number of mistranslations (480 vs 465), grammar errors (102 vs 110), and inconsistencies (2 vs 3) (see Table 5). While Para_Sent leaves around 22% more words untranslated, it appears to leverage the contexts and even occasionally selects better equivalents in the target language, as evidenced by translator comments. One major issue with Para_Sent is that it occasionally repeats sentences, whereas Para never does so.

What do translators think about Para?

To wrap up this section, we provide a qualitative analysis of the free-form comments written by translators to justify their preference judgments. Overall, the translators praise Para for its more skillful use of rhetoric devices, and for ‘‘surpas[sing] Sent as a literary rendition.’’ They also mention that Para uses more of a poetic license but this makes it stylistically much smoother than Sent. Furthermore, translators state that Para clearly better reflects the content and style of the original when compared to GTr, and that it stays consistent within the paragraph. Inevitably, translations are not flawless, and there are instances where both compared systems fall short, as highlighted by one of the translators when assessing Para against Sent: ‘‘Nightmare, a mistake upon mistake (…) Despite all these mistakes, I can understand the [Para] translation better but they are equally miserable.’’

Analyzing translation errors

The aggregate statistics from the previous section confirm that Para-level translation via Gpt-3.5 is the strongest literary translator of the methods that we study. Translations produced by Para are favored by both automatic metrics and human translators, and it makes fewer errors than competing methods. In this section, we dive deeper into specific types of errors that are made within each high-level category (e.g., grammar, mistranslation), and we present examples of errors associated with lack of context understanding made by Sent and GTr that are fixed by Para.

We begin by analyzing the types of grammatical errors that are made by the studied translation methods in all three target languages. (There are some differences in the paragraph lengths between the three target languages that should be taken into consideration when analyzing raw numbers. However, the general tendencies remain intact.)

Perhaps not surprisingly, translations into English contain notably fewer grammatical mistakes than Japanese or Polish (see Table 5). The most prominent mistakes in English are incorrect articles, which is most frequent in the outputs of Sent and GTr. This is to be expected, as the choice between the definite and indefinite article in English depends heavily on the context. Other mistakes include wrong or omitted prepositions, wrong parts of speech, and incorrect word order (see Table 6).

Japanese:

Translations into Japanese contain considerably more mistakes. Most notably, the systems struggle with the correct choice of particle: Para and Sent produce twice as many mistakes in this regard as Para_Sent and GTr (see Table 6). Other mistakes include incorrect tense, verb finite form within the sentence, or incorrect word order, the latter of which is much more frequent in GTr than in any of the GPT-3.5 translations.

Polish:

GPT-3.5 exhibits more difficulty with Polish than with Japanese, as evidenced by 55 vs 42 errors for Para, 86 vs 50 for Sent, and 64 vs 37 for Para_Sent (see Table 5). We notice that GPT-3.5 translations frequently generate incorrect gender, case, or prepositions (see Table 6). We also observe instances in which GPT-3.5 alters the gender of a noun, such as producing grilla, a non-existent feminine form, in place of the masculine grill, while accurately modifying all adjectives and verbs to match the novel feminine noun. (It is worth noting that grilla can also be the genitive form of the masculine noun grill; however, the agreement of surrounding verbs and adjectives with the feminine noun suggests that the system likely treated the word as feminine.) In contrast, the performance of GTr is comparable for Polish and Japanese in terms of grammar, with 59 and 63 errors respectively. Intriguingly, GTr seems to struggle with Polish aspect, leading to 12 errors, in contrast to 1 error in both Para and Para_Sent, and 5 errors in Sent within the same category (see Table 6).

In summary, although GPT-3.5 is primarily trained on English, it is competitive with GTr at Polish and Japanese grammar proficiency. In fact, Para generates the fewest grammatical errors of any system, with a total of 97 for both languages. This is in contrast to 136 errors made by Sent, 101 errors by Para_Sent, and 122 errors by GTr (see Table 5). That said, none of these systems delivers translations devoid of grammatical inaccuracies, even for English.
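The totals quoted in this summary follow from the per-language grammar error counts given for Polish and Japanese above. The short Python check below re-adds them; the dictionary layout and variable names are illustrative only, while the counts themselves are the ones reported in the text.

```python
# Grammar error counts per system, as reported in the text (Polish, Japanese).
grammar_errors = {
    "Para": {"pl": 55, "ja": 42},
    "Sent": {"pl": 86, "ja": 50},
    "Para_Sent": {"pl": 64, "ja": 37},
    "GTr": {"pl": 59, "ja": 63},
}

# Sum over the two non-English target languages for each system.
totals = {system: sum(by_lang.values()) for system, by_lang in grammar_errors.items()}

# The system with the fewest combined grammar errors.
best_system = min(totals, key=totals.get)

print(totals)
print(best_system)
```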

Context-related errors

We manually classify all annotated mistranslations (2,324 instances) into subcategories, several of which include instances where the absence of discourse-level context is clearly a contributing factor (see Table 13 for detailed classification). We also further analyze all translations in terms of content-related issues. Overall, we observe that context is indeed incorporated into the translations for both Para and Para_Sent outputs, which results in fewer context-dependent issues (see Figure 6).

Unsurprisingly, the absence of discourse context results in the incorrect translation of pronouns. Consider the following example, with English glosses of important words provided in [brackets]:

И ветер [wind] то начинал шуметь в голых деревьях, то замолкал, так же как и я прислушиваясь к течению ночи. Но он [he] не уходил, он [he] был здесь.

—Russian Source (from The Story of a Life)

The wind would start to rustle in the bare trees and then fall silent, just as I listened to the flow of the night. But he didn’t leave, he was here.

The wind would start to rustle in the bare trees, then die down, just like me, listening to the flow of the night. But it didn’t go away, it was still here.

In Russian, nouns have grammatical gender. ‘‘Wind’’ in the first sentence of the source text is a masculine noun, so it is later referred to as ‘‘he’’ in the second sentence. Without access to the context, the Sent model incorrectly translates it as ‘‘he’’ into English (the first translation above), while the Para translation correctly modifies the pronoun to ‘‘it’’ (the second translation above).

When translating from Russian into Polish, another language with grammatical gender, we observe issues when the gender of Russian and Polish nouns differs. Consider the following example:

Романы, как известно, печатались на разной бумаге [paper]. И гореть она [she] может по-разному.

Romany, jak wiadomo, drukowano na różnym papierze [paper]. I może ona [she] tęsknić na różne sposoby.

Jak wiadomo, powieści drukowano na różnym papierze [paper]. I może on [he] palić się na różne sposoby.

Although both Russian and Polish nouns possess grammatical gender, ‘‘paper’’ is feminine in Russian and referred to as ‘‘she’’ in the source, whereas it is a masculine noun in Polish and should be referred to as ‘‘he,’’ as in the Para translation (the second translation above). The absence of context in Sent leads to an incorrect translation (the first translation above).

Cultural nuances:

Assigning appropriate pronouns without context becomes even more challenging when translating from languages like Japanese, in which speakers frequently refer to the listener (or themselves) in the third person rather than using second-person personal pronouns such as ‘‘you’’ in English. Consider the following example:

[lit. Ms./Mrs./Mr. Furukura works every day]

—Japanese Source (from Convenience Store Woman)

“No, no, (…). Furukura-san works hard every day without taking any shortcuts!”

“No, no, (…). You work every day, but you never slack off!”

From the context of this conversation, a Japanese listener can easily infer that ‘‘Furukura-san’’ or ‘‘Miss Furukura’’ in the last source sentence is used instead of the second-person ‘‘you’’ as per Japanese convention. (Note that the gender of neither character is apparent from the fragment alone.) Translating this sentence without context into English, a language in which third-person reference is not common, results in a confusing translation (the first translation above) that implies that the speaker refers to some other ‘‘Furukura’’ rather than their listener. (While third-person reference can be used in English, it is only used in rare circumstances, e.g., when addressing children.) However, when translating the sentence in context, the model correctly changes ‘‘Furukura’’ into ‘‘you’’ (the second translation above), which makes it clear whom the speaker refers to in English.

Ellipsis:

Another example where context helps is the translation of elliptical constructions. Consider the following example:

„Ne, teď uděláš nádobí!“ [(you) will do the dishes!]

— Nie, teraz zrobisz zmywanie! [(you) will do the washing]

— Nie, teraz umyjesz naczynia [(You) will wash the dishes]!

Czech uses the same collocation as English, ‘‘do the dishes,’’ which is invalid in Polish. Hence, the ellipses in the last two sentences of the source require broader context to be translated correctly. Para does it properly, translating both as ‘‘wash’’ (the second translation above), while Sent unsurprisingly fails to choose the correct collocation (the first translation above).

Subject ellipsis:

Similarly, context may be needed to attribute a state or an action to the correct character due to subject ellipsis. This is an obvious issue for languages like Japanese, which tend to omit the subject of the sentence and do not encode any relevant information in the verb form, but it can also arise in English. Consider the following example:

When we were done, the lipstick went back into some mother’s Fendi handbag. We watched her apply it, unaware.

Gdy skończyliśmy, szminka wróciła do jakiejś torebki Fendi należącej do matki. Patrzyliśmy, jak to robi, nieświadomi [unaware (we)] tego.

Kiedy skończyliśmy, szminka wróciła do torebki Fendi jakiejś matki. Patrzyliśmy, jak ją nakłada, nieświadoma [unaware (she)] naszych działań.

From the second sentence alone it is not clear who is ‘‘unaware’’ – the mother or the ‘‘we’’ (referring to children) watching her. Only from the broader context can we confidently deduce that it is in fact the mother, not the children, who is ‘‘unaware.’’ Para (the second translation above) correctly attributes the state of being ‘‘unaware’’ to the mother, which is exhibited by its usage of the singular feminine form of the adjective. In contrast, Sent (the first translation above) mistranslates it using the plural masculine form of the adjective ‘‘unaware,’’ which implies that it refers to ‘‘we’’ rather than the ‘‘mother.’’

Consistency:

Context is sometimes critical for preserving the overall consistency of the text. The simplest cases include referring to the same entity – a place or a person – in the same way. More interesting cases pertain to style and can enhance the reader’s experience. Consider the following example:

Alles zu vergessen, ist gewiss schlimm [bad]. Noch schlimmer [worse] ist, nichts zu vergessen (…).

—German Source (from An Inventory of Losses)

すべてを忘れることは確かに悲惨な[tragic]ことです。さらに悪い[worse]のは、何も忘れないことです。

すべてを忘れることは確かに悪い[bad]ことです。もっと悪い[worse]ことは、何も忘れないことです。

The German source translates into English as ‘‘To forget everything is bad, certainly. Worse still is to forget nothing.’’ (Excerpt taken from the official English translation by Jackie Smith (2020).) It is arguably important for the translation to repeat the same word, which is an equivalent of the German ‘‘schlimm’’ (‘‘bad’’). Para does it well, translating both as 悪い ‘‘warui,’’ or ‘‘bad’’ (the second translation above), in the exact same way as the human Japanese translator. Sent, on the other hand, uses two different words, ‘‘tragic’’ and ‘‘bad’’ (the first translation above), which, while technically correct, omits the intentional repetition that is meant to introduce an unexpected conclusion.

Polysemy:

The absence of context makes it difficult to interpret words or expressions that have multiple meanings in the source language. Consider the following example:

Все прошло хорошо. Книга прочитана идеально – не быстро и не медленно, минимум дыма. Классика. Я был в форме [in shape].

Wszystko poszło dobrze. Książka została przeczytana idealnie – nie szybko i nie wolno, minimalna ilość dymu. Klasyka. Byłem w mundurze [in uniform].

Wszystko poszło dobrze. Książka przeczytana idealnie – nie szybko i nie wolno, minimalna ilość dymu. Klasyka. Byłem w formie [in shape].

The ambiguity here stems from the multiple meanings of the Russian noun форма ‘‘forma,’’ which can mean either ‘‘shape’’ or ‘‘uniform.’’ Since one can be ‘‘in shape’’ as well as ‘‘in a uniform,’’ it is unclear from the sentence alone which meaning was intended by the author. From the preceding context, it is clear that ‘‘everything went well’’ for the narrator, who mastered the art of ‘‘book’n’grill,’’ a unique form of expression exclusive to this fictional world. Based on this, we can infer that in this instance, the term ‘‘forma’’ signifies ‘‘shape,’’ as in the second translation above, rather than ‘‘uniform,’’ as in the first.

Appropriateness:

Finally, context may help to choose the more appropriate equivalent for the given situation. Consider the following example:

—Japanese Source (from Convenience Store Woman)

"Ah, and one pack of cigarettes, number five."

“Ah, and one pack of cigarettes, number five.”

The conversation above is between a clerk and a customer. The Japanese expression かしこまりました ‘‘kashikomarimashita’’ is an honorific that literally means ‘‘understood.’’ However, when choosing the best equivalent, the translator needs to consider the situation at hand to best reflect its meaning in the target language. ‘‘Understood’’ in Sent is technically correct, but it is an unfortunate word choice for the clerk to employ. On the other hand, ‘‘right away’’ in Para fits much better in the context of this conversation. Had this been a series of commands (e.g., in a military context), ‘‘understood’’ would be the more favorable option.

Limitations

So far, we have shown that GPT-3.5 leverages paragraph-level context to produce translations that are better than those produced by sentence-level counterparts (Sent vs Para). However, there are still many issues with Para’s translations. From the annotations and translators’ comments, we observe that Para suffers from occasional omissions of content from the source paragraph. Sent and GTr are certainly not free of that problem either, but omission appears to be more prominent for Para translations (see Appendix C).

Moreover, Para still makes a sizeable number of mistranslations and grammatical errors, though fewer than Sent or GTr. We observe that Para occasionally merges sentences with two distinct subjects, attributing all states and/or actions to one of them. Very rarely, we also find cases where context possibly confuses the model, resulting in an incorrect translation. The following example illustrates this issue:

Le bois du bureau amplifie les battements de mon cœur. Le vieux mobilier Art déco conduit bien les émotions et les fatigues. Ruhlman ? Leleu ? Il [he] en a tant vu.

机の木材が私の心臓の鼓動を増幅している。古いアール・デコ家具は感情や疲労をうまく導いてくれる。ルールマン？レルー？彼ら [they] はそんなに多くを見てきた。

In the French text, the narrator wonders whether the brand of the desk was Ruhlman or Leleu, with both proper nouns possibly referring to a person. In the last sentence, the French text uses ‘‘il’’ (‘‘he’’), as ‘‘desk’’ is a masculine noun in French (‘‘le bureau’’). Para, on the other hand, appears to be confused by the two preceding names and incorrectly translates the singular pronoun as 彼ら, or ‘‘they.’’

Furthermore, we observe (very few) cases where the paragraph-level translation disregards the context. Most representative of this class of errors is when the model struggles to translate from Japanese in cases where the subject is omitted. The following example illustrates this issue:

ミホ [Miho] は、今では結婚して地元に中古の一戸建てを買っていて、そこに友達がよく集まっている。明日もアルバイトなので億劫に思う時もあるが、コンビニ以外の世界との唯一の接点であり、同い年の「普通の三十代女性」と交流する貴重な機会なので、ミホの [Miho’s] 誘いにはなるべく応じるようにしている。

—Japanese Source (from Convenience Store Woman)

Miho [Miho] wyszła za mąż i kupiła stary, jednorodzinny dom w swoim rodzinnym mieście. Przychodzą tam często jej znajomi. Mimo że Miho ma [Miho has] jutro pracę w konbini, zazwyczaj chętnie odpowiada [(she) responds] na jej [her] zaproszenia, bo to jedyna okazja, by spotkać się z innymi kobietami w jej [her] wieku.

Miho is now married and has bought an old house in her hometown, where her friends often gather. Though she often finds it a chore to work tomorrow, it is her only connection to the world outside the convenience store, and a valuable opportunity to interact with other “normal thirty-something women” her age, so she tries to accept Miho’s invitations as often as possible.

The Polish and English translations of the source text above share a common issue. The narrator begins the paragraph by talking about Miho and then proceeds to describe her own (the narrator’s) feelings about the situation, although the gender of the narrator is never revealed in the Japanese text. The second sentence should be written from a first-person perspective, particularly since it directly references Miho towards the end. However, both the Polish and English translations produced by Para are confused by this: by using the third-person perspective (‘‘she,’’ ‘‘her’’), both translations incorrectly imply that Miho is the subject of the second sentence. Sent and GTr translate this passage accurately, albeit with some clumsy phrasing.

Finally, it is important to acknowledge that the languages covered in the current study are either mid- or high-resource. Performance might be much worse when translating from or into a low-resource language, such as Zulu or Armenian.

Our preliminary experiments indicate that GPT-4 OpenAI (2023) sometimes generates better paragraph-level translations than those of GPT-3.5. For instance, it seems to have a better grasp of the inverted word order in German, though no broader conclusions should be drawn without further testing. Nevertheless, it does not resolve all of the issues discussed in our paper. Mistranslations and grammatical errors are still abundant across many language pairs. GPT-4 produces the following translation when fed the previous example paragraph as input; note that all of the issues still remain. (Although the given paragraph is already comprehensible for a human reader, we also attempted to enhance the translation by incorporating three additional preceding paragraphs for context. Intriguingly, when provided with this extended context, both GPT-3.5 and GPT-4 generated accurate translations.)

Miho is now married and has bought a used single-family home in her hometown where her friends often gather. Although she sometimes finds it a drag to work a part-time job the next day, she makes an effort to respond to Miho’s invitations because it’s a valuable opportunity to interact with ‘‘normal’’ women in their thirties like herself, apart from her convenience store job.

Para translations hold the potential to captivate readers, especially if LLMs continue to improve at their current pace. Indeed, some of our translators mentioned that they genuinely enjoyed the task, though integrating these paragraphs into a coherent novel still poses a considerable challenge. With all that said, literary translation involves more than just overall ‘‘correctness’’ or mere entertainment value. A translation that is perfectly ‘‘correct’’ and enjoyable might still fail to convey the author’s intentions or meaning skillfully hidden behind a simple phrase. Our fr-en translator shares her thoughts on this matter:

Both translations [Sent and Para] translate the words without the feeling; the original author’s voice is lost.

Conclusion

In this paper, we demonstrate that LLMs leverage paragraph-level context to produce translations that are more coherent and enjoyable than sentence-by-sentence translation while containing fewer mistranslations and grammatical issues. Our evaluations reveal that professional translators prefer paragraph-level translations over both sentence-level translations produced by the same language model and those generated by an off-the-shelf commercial system (GTr). We release our dataset and error annotations to help facilitate the development of new evaluation methodologies and automatic metrics for document-level machine translation. Finally, a full-length novel extends far beyond the confines of paragraph-level translation. In future work, we will focus on integrating individual paragraphs into cohesive chapters, which can then be expanded to encompass the entire novel.

Ethical considerations

Translating with LLMs: The rise of large language models has also brought many ethical concerns to the forefront of NLP research Blodgett et al. (2020); Bender et al. (2021). LLMs encode biases and exhibit toxicity, and these behaviors can be exacerbated by unconstrained prompting Gehman et al. (2020); Costa-jussà et al. (2022). Further ethical concerns arise in the context of machine translation, particularly literary translation, where multiple stakeholders – the author, the translator, and the audience – are involved Taivalkoski-Shilov (2019a). Low-quality output can influence the perception of the author’s work, impair the reader’s linguistic abilities, and hinder the transfer of ideas to the target language, while overrelying on machine translation can possibly threaten the role of human translators Drugan (2013); Ning and Domínguez (2016); Taivalkoski-Shilov (2019a). On the other hand, machine translation employed responsibly as an auxiliary tool holds the potential to alleviate the translator’s cognitive burden O'Brien (2012) and make the author’s work accessible to a broader audience more swiftly Besacier (2014). Contrary to the predictions in Eloundou et al. (2023), we do not view large language models as a substitute for human translators, but rather as a means to assist translators in their work.

Human Evaluation: The experiments involving human translators were IRB-approved, and all involved translators gave their consent to disclose their annotations, comments, and preference choices. In recognizing contributions, our acknowledgments only include the names of those translators who explicitly gave their consent to be acknowledged by their full name in this publication.

Acknowledgements

First and foremost, we would like to express our gratitude to the translators hired mostly on Upwork: Malgorzata Szymczak (fr-pl), Kinga Przekota (ru-pl), Michal Sikora (cs-pl), Paula Kurzawska (de-pl, de-en, pl-en), Kristy Darling Finder (fr-en), Timothy Shostak (ja-en), Shun Enoki (zh-ja), Takanori Kurokawa (fr-ja), Yoshiko Kikawa (en-ja), Shinnosuke Kasahara (ru-ja), and all those who wish to remain anonymous. We encourage any machine translation researchers working on these language pairs to contact these translators for human evaluations.

We would also like to show our appreciation to Jan Wislicki, Tom Gally, Nader Akoury, Kalpesh Krishna, Simeng Sun, Katherine Thai, and the entire UMass NLP group for insightful discussion, which helped to shape this project.

Finally, we would like to thank Sergiusz Rzepkowski (pl), Paula Kurzawska (pl, en), Hiroshi Iida (ja), Grégory Fleurot (fr), Peyton Bowman (en), Simeng Sun (zh), Igor Zapala (pl, de), Marvin Hoffmann (de), Kinga Przekota (pl, ru), and Yuki Mori (ja) for further consultations on their respective native languages.

This project was partially supported by awards IIS-1955567 and IIS-2046248 from the National Science Foundation (NSF) as well as an award from Open Philanthropy.

References

Appendix

Appendix A The Dataset

The selection of a particular paragraph was semi-random, with certain considerations in mind during the sampling process. We prioritized the following criteria: (1) for each source language we sample paragraphs so that there is a combination of dialogue and narrative texts; (2) the paragraph should be reasonably intelligible to a human translator without additional context; and (3) alignment between the source paragraph and human translation should be feasible, meaning no major content rearrangement across paragraphs.

Nonetheless, meeting all these requirements was not always possible. For instance, the source text of Convenience Store Woman (ja) is mostly written in the first-person narrative. Since Japanese does not encode the speaker’s gender in the verb forms, it is often impossible to determine whether the narrator is male or female. In cases where it was impossible to determine the gender of the character, we instructed translators to accept either option, provided that the translation remained consistent within the given paragraph (i.e., the gender did not change within the paragraph).

A note on literary translation:

It is important to understand the challenges a human translator faces when translating literary texts. Even a simple sentence may lead to substantial struggles. One of the most notable examples is the first sentence of the French novel The Stranger by Albert Camus. The story begins in a seemingly trivial way:

Aujourd’hui, maman est morte. Today, mother died.

While there is nothing particularly difficult in this sentence, five English translations of ‘‘The Stranger’’ have already been produced, and there is little consensus on what the ideal translation should be Kaplansky (2004).

This very first sentence is of the utmost importance as it introduces the reader to Meursault, the protagonist of the story, who will later kill an unnamed Arab without any apparent reason. Hence, it is crucial for the storyline that the reader understands, from this very beginning, who Meursault is and what affection he holds for his mother.

Stuart Gilbert (1946), Joseph Laredo (1982), and Kate Griffith (1982) all translate the beginning sentence as ‘‘Mother died today,’’ but this translation is problematic. The English word ‘‘mother,’’ while technically correct, is too formal to fully embrace the emotions conveyed by the French ‘‘maman.’’ Matthew Ward (1988) opts to leave the French ‘‘maman’’ untranslated. This is an understandable choice, as the English reader is likely to decipher the meaning from the surface similarity, though they may not fully grasp its sentiment. Conversely, Sandra Smith (2012) attempts to capture the intimacy of ‘‘maman’’ by rendering it as ‘‘my mother,’’ which is less formal than a simple ‘‘mother’’ but doesn’t possess the childlike connotation of the English ‘‘mom.’’

Literary translation is clearly a challenge that exceeds simple word equivalence. Professional translators face choices that current systems are unlikely to solve independently. However, such systems can assist translators in their tasks, much as computer-assisted translation (CAT) tools already do. This approach holds the potential to make more novels accessible to a wider audience, including novels that might otherwise have remained untranslated.

Appendix B Prompt Examples

Here we present examples of prompts employed for the translation with GPT-3.5. The prompt wording for Sent, Para_Sent, and Para, with one demonstration each, is presented in Figure 8, Figure 9, and Figure 10, respectively.

Appendix C Human Evaluation

In this section, we provide some further details about the human evaluation with a focus on the error annotation. First, we discuss the issue of subjectivity in error annotation. Next, we explain some choices we had to make when annotating ‘‘inconsistency’’ and ‘‘format’’ errors. Then, we discuss the issue of omissions in the produced translations. Finally, we present some details about the translators hired for the evaluation task.

Annotating and classifying errors in translations is inherently subjective Lommel et al. (2014a); Han (2020). For instance, translating French ‘‘corsage’’ (‘‘bodice’’) as a ‘‘blouse’’ can be seen as either a mistranslation or a permissible deviation from the original text; this is, in fact, how the ‘‘corsage’’ was translated by the human translator in our data.

Furthermore, sometimes there are multiple ways of annotating errors Thomson et al. (2023). Consider the following example:

We had to hide the running, though, in case our haste betrayed us, so truer to say we slipped out quietly. When one of my parents appeared, my technique was: pretend to catch sight of someone in the next room. Move in a natural manner toward this figment of my imagination, making a purposeful face.

—English Source (from A Children’s Bible)

The translation of the last sentence in the example above into Polish as an imperative can be considered a mistranslation. We would hypothesize that the system misinterpreted the source as an imperative form. However, using the infinitive form of the verb in the translation is less clear and raises questions about whether it is a mistranslation or a grammatical error. The distinction between the two lies in the point at which the mistake was made. If the original sentence was understood correctly but the resulting translation was ungrammatical, then it is a grammatical error. On the other hand, if the use of the infinitive form resulted from interpreting ‘‘move’’ as an infinitive, it may be considered a mistranslation as well.

Inconsistency:

For marking the ‘‘inconsistency’’ errors, we decided to take the minimal approach. For instance, if the same person is referred to in the translation as both ‘‘Piotr’’ and ‘‘Peter,’’ we would mark only the one that is less frequent. If ‘‘Piotr’’ appears once in the paragraph, while ‘‘Peter’’ is used twice, ‘‘Piotr’’ would be annotated as being inconsistent. The same strategy was applied for ‘‘register’’ errors, such as when both polite and casual forms were acceptable, but the translation used them randomly.
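The minimal marking rule described above can be sketched in a few lines of Python. The function name and the tie-breaking behavior (on equal counts, the variant seen first is kept as the reference) are assumptions made for illustration; the annotation guidelines do not specify how ties were handled.

```python
from collections import Counter


def minority_variants(mentions: list[str]) -> list[str]:
    """Return the variants that would be marked as inconsistent.

    The most frequent variant is treated as the reference form and every
    other variant is flagged. On a tie, the variant that appears first in
    the paragraph is kept as the reference (an assumption).
    """
    counts = Counter(mentions)
    if len(counts) < 2:
        return []  # Only one form is used, so nothing is inconsistent.
    # most_common orders by count; equal counts keep first-seen order.
    reference = counts.most_common(1)[0][0]
    return [variant for variant in counts if variant != reference]


# Example from the text: "Peter" is used twice and "Piotr" once.
print(minority_variants(["Peter", "Piotr", "Peter"]))
```

The same function could be applied to register labels (e.g., a list of ‘‘polite’’ and ‘‘casual’’ tags per sentence) rather than names.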

Format:

We did not label ‘‘format’’ errors for the Sent and Para_Sent translations, as we manually corrected the quotation marks during post-processing of the translations. This manual correction was done to ensure that Sent and Para_Sent could be compared to Para without relying too heavily on simple heuristics (i.e., incorrect usage of the quotation marks).

Omissions:

One thing we ought to discuss is the omission issue. Upon examining translations and annotator feedback, we observe that Para occasionally omits details that are crucial to the storyline. Preliminary investigation indicates that Para translations are more prone to omissions compared to Sent and GTr. Although Para_Sent appears to mitigate this problem to some extent, it still results in a higher number of omissions than the sentence-level approach while at the same time introducing some repetition issues.

Translators:

The translators in this study were hired on a freelancing platform, Upwork. All were highly proficient in the source language and most of them were native speakers of the target language. Only one translator reported familiarity with the book whose translation she evaluated. All translators were instructed to evaluate each paragraph in isolation without relying on any prior knowledge about the book. Details about the translators are reported in Table 11.

Appendix D Pivot Pilot

Table 14 shows the results of the pilot study evaluating the effect of pivoting on English on the quality of paragraph-level translations. The evaluation was done by the first author on all 20 passages for every language pair that did not include translation from or into English, as these do not require any pivoting. A total number of 200 pairs was evaluated employing simple preference judgments.

During the Para_Pivot translation process, the model utilized both the source text and its corresponding English translation (text-davinci-003, top-p=1.0, temp=0.3). This approach has the potential to mitigate the limitations associated with pivoting translations, where some information may be lost. For example, both Czech and Polish encode the gender information in the past tense form of the verb. English does not, so this information is lost and will most likely result in an erroneous translation. Indeed, we notice that adding the source text helps the model to overcome this shortcoming. However, we do not observe a clear gain from using English as a pivot language. Consider the following example:

Taťka mě zase zmlátil. Byl hrozně naštvanej, protože mamka řvala. On mě vždycky mlátí, když ona řve. Že prej jsem pořádně neudělala nádobí. Ale já vím, že jsem je udělala dobře. I ten hrnec jsem vydrhla pořádně. A ještě jsem to všechno utřela před koupáním. Ale možná jsem nevyždímala hadr.

Tata mnie znowu zbił. Był wściekły, bo mama krzyczała. On zawsze mnie bije, kiedy ona krzyczy. Że niby nie umyłam dobrze naczyń. Ale ja wiem, że umyłam je dobrze. I garnek też dokładnie wypolerowałam. I jeszcze wszystko wytrzepałam przed kąpielą. Ale może nie wyciśniłam ręcznika.

Tata mnie znów uderzył. Był wściekły, bo mama krzyczała. On zawsze mnie bije, kiedy ona krzyczy. Że niby nie umyłam dobrze naczyń. Ale ja wiem, że umyłam je dobrze. Nawet garnek dokładnie wytrzepałam. I jeszcze wszystko przed kąpielą wytrzepałam. Ale może nie wyżągnęłam mopa.

In each instance, the emphasized verbs could potentially be mistranslated when translated through English as the pivot language, as the speaker’s gender information would be lost. For instance, the past tense verb ‘‘washed’’ remains unchanged in English regardless of the gender of the speaker, with such details encoded only in the source (Czech) and target (Polish) languages. In this case, all verbs have been translated accurately with respect to grammatical gender, implying that incorporating the source language into the pivot pipeline does indeed improve the translation. However, Para_Pivot still selects less suitable verbs (highlighted in red) resulting in slightly more errors in this particular paragraph.

The only pair where pivoting seems to help is pl-ja. While it is unclear why this happens, it is possible that this outcome is due to the specifics of the Polish novel employed for the translation. Sword of Destiny by Andrzej Sapkowski uses a very distinct language with many archaic expressions. It is possible that translating into English, a language the GPT models were trained on, helps the model deal with these difficult phrases.

Since we do not observe any apparent gains from performing the translation via English as a pivot language (p=0.62, 95% CI [0.448, 0.591]) and doing so reduces the number of examples one can fit into the prompt, we continue our experiments with direct translation.

Appendix E Automatic Metrics

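We investigate the correlation of automatic metrics with human judgments in our evaluation. We consider (1) all the judgments, as well as (2) the subset of judgments where the annotator stated that they were sure that one translation is clearly better than the other. We compute both accuracy (i.e., the percentage of cases where the metric agrees with the human judgment) and the Kendall's Tau correlation coefficient, which is defined as follows:

$$\tau = \frac{\text{Concordant} - \text{Discordant}}{\text{Concordant} + \text{Discordant}}$$

where Concordant is the number of translation pairs on which the metric prefers the same translation as the human annotator, and Discordant is the number of pairs on which it prefers the other one.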
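To make the two agreement measures concrete, the following is a minimal sketch rather than the evaluation script used in the study. It assumes the pairwise form of Kendall's Tau, (concordant − discordant) / (concordant + discordant), and it assumes that pairs on which the metric assigns equal scores count as neither concordant nor discordant. The function name `pairwise_agreement`, the tuple layout, and the example scores are all hypothetical.

```python
from typing import List, Tuple

# One pairwise judgment: (human_choice, metric_score_a, metric_score_b),
# where human_choice is "a" or "b" (the translation the annotator preferred).
Judgment = Tuple[str, float, float]


def pairwise_agreement(judgments: List[Judgment]) -> Tuple[float, float]:
    """Return (accuracy, tau) of a metric against pairwise human preferences."""
    if not judgments:
        raise ValueError("need at least one judgment")

    concordant = 0
    discordant = 0
    for human_choice, score_a, score_b in judgments:
        if score_a == score_b:
            # Assumption: a metric tie is neither concordant nor discordant.
            continue
        metric_choice = "a" if score_a > score_b else "b"
        if metric_choice == human_choice:
            concordant += 1
        else:
            discordant += 1

    if concordant + discordant == 0:
        raise ValueError("the metric tied on every pair; tau is undefined")

    # Accuracy: share of all judgments where the metric agrees with the human.
    accuracy = concordant / len(judgments)
    # Pairwise Kendall's Tau: (concordant - discordant) / (concordant + discordant).
    tau = (concordant - discordant) / (concordant + discordant)
    return accuracy, tau


# Hypothetical toy data: the metric agrees with the annotator on 3 of 4 pairs.
example = [
    ("a", 0.8, 0.6),  # agree
    ("b", 0.7, 0.5),  # disagree
    ("b", 0.4, 0.9),  # agree
    ("a", 0.3, 0.2),  # agree
]
accuracy, tau = pairwise_agreement(example)
print(accuracy, tau)
```

Restricting the input list to the judgments marked as confident by the annotators would give the second of the two settings described above.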

Table 15 shows the correlation of automatic metrics with the human judgments obtained in this study. Comet exhibits the highest agreement with human judgments in terms of both accuracy (64.04% for all data, 72.78% for confident votes only) and Kendall's Tau (0.341 for all data, 0.456 for confident votes only).

Statistical Analysis:

We employ linear mixed-effects models Baayen et al. (2008) to analyze the scores produced by automatic metrics. We fitted the models in R using the lme4 package Bates et al. (2015); the p-values were obtained with the lmerTest package Kuznetsova et al. (2017). Linear mixed-effects models contain both fixed effects and random effects (random intercept and/or slope). The fixed effect here is the translation setup (Para, Sent, Para_Sent, GTr), with the source paragraph coded as the random effect. We inspect the residual plots to ensure that the variance across the fitted range is relatively constant. The results from the fitted models are presented in Table 16 (Bleurt), Table 18 (Comet), Table 20 (Comet-QE), and Table 22 (BertScore).
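As an illustration of the model structure described here, one plausible specification is written out below. It is a sketch under the assumption that the source paragraph enters as a random intercept only (no random slope); the symbols and the variable names `score`, `setup`, and `paragraph` are ours and may not match the fitted models exactly.

```latex
% y_{ij}: metric score for the translation of source paragraph j under setup i
% \beta_0: intercept (mean score of the reference setup)
% \beta_i: fixed effect of translation setup i (zero for the reference setup)
% u_j: random intercept for source paragraph j
\begin{align*}
  y_{ij} &= \beta_0 + \beta_i + u_j + \varepsilon_{ij}, \\
  u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{align*}
```

In lme4 formula notation, a model of this shape would be written as `score ~ setup + (1 | paragraph)`.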

We further perform a post hoc analysis using the emmeans package Lenth (2023) to obtain p-values for the pairwise comparisons between translation setups. The results of the post hoc analysis are presented in Table 17 (Bleurt), Table 19 (Comet), Table 21 (Comet-QE), and Table 23 (BertScore).