Prompting PaLM for Translation: Assessing Strategies and Performance
David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, George Foster
Introduction
Large language models (LLMs) trained to predict the next token from a lengthy context have demonstrated impressive machine translation capabilities, despite being trained on corpora that are overwhelmingly English, with no intentionally-included parallel text. In this paper, we carry out an in-depth investigation into the translation capabilities of LLMs, testing different prompting strategies and carefully assessing the resulting performance. We study the recently-introduced PaLM model Chowdhery et al. (2022), a 540B-parameter decoder-only language model trained on a heavily English-centric, multilingual corpus. It has achieved the strongest MT results among LLMs trained on non-parallel multilingual corpora.
To ensure a fair assessment of PaLM’s MT capability, we begin with an exploration of example selection methods for use with fixed prompt templates. We vary both the pool from which examples are chosen and the method for choosing them, comparing standard random selection to k-nearest-neighbour (kNN) selection that customizes prompts for specific inputs. Figure 1 highlights the importance of example selection by showing that two randomly-selected sets of examples can result in significantly different distributions of sentence-level bleurt scores.
Although Chowdhery et al. (2022) report interesting results on low-resource and non-English language pairs, their most striking findings concern high-resource pairs. Accordingly, we limit our investigation to French, German, and Chinese translation to and from English. We evaluate sentence-level translation quality using recommended practices for high-quality MT, specifically: (i) we use recent WMT test sets to guard against train/test data leakage, and to facilitate comparison with state-of-the-art (SOTA) MT systems; (ii) we use a SOTA automatic metric (bleurt) instead of bleu which has been demonstrated to be suboptimal for high-quality translations Kocmi et al. (2021); Freitag et al. (2021b); and (iii) we conduct an expert-based human evaluation with detailed categories to characterize the error patterns of the automatically generated translations.
We carry out the first systematic study of LLM prompting for MT, exploring both the example candidate pool and the selection strategy. We find that the quality of examples matters more than the domain from which they are drawn or their lexico-semantic proximity to the current input.
We evaluate the translation capability of LLMs with the procedure currently recommended by the MT community. We find that, although impressive, the sentence-level translation capacity of LLMs still lags behind SOTA MT.
Related Work
Inspired by the findings of Radford et al. (2019); Brown et al. (2020), prompting strategies for LLMs have become a topic of intense interest, generating work across a broad spectrum of methods and applications Liu et al. (2021). A basic distinction can be made between hard (explicit text) prompting such as we use, and soft prompting that seeks to learn embeddings Lester et al. (2021), activations Li and Liang (2021); Hambardzumyan et al. (2021), or attention weights Liu et al. (2022a) that condition the model to perform a desired task. The latter approach is more expressive and more efficient at inference time, but performance can be sensitive to initialization Hou et al. (2022), and some techniques require modifications to the model.
Hard prompts have the advantage of being easy to interpret and modify. Work in this area includes tools to facilitate development of handcrafted prompts Strobelt et al. (2022); Bach et al. (2022); algorithms to find optimal prompts through gradient-guided search Shin et al. (2020) or exhaustive search through labels Schick and Schütze (2021) or both labels and templates Gao et al. (2021); as well as studies on the effect of example order Kumar and Talukdar (2021); Lu et al. (2022). Hard prompts have also been used to analyze model capabilities Garg et al. (2022); Li et al. (2022a), the role of data Singh et al. (2022), and the nature of prompting itself Min et al. (2022); Wei et al. (2022).
With few exceptions, e.g. Li et al. (2022b); Liu et al. (2022b); Valvoda et al. (2022), early approaches to hard prompting tended to condition on the task rather than the specific input. Our approach for conditioning on the input was pioneered by Liu et al. (2022b), who used RoBERTa embeddings to identify relevant GPT-3 prompts for sentiment, table-to-text, and QA tasks. They found that kNN works better than a random-selection baseline, and that the advantage grows as the size of the (domain-controlled) example pool increases.
Work on prompting LLMs for MT began with the GPT-3 and PaLM papers Brown et al. (2020); Chowdhery et al. (2022), which adopted similar approaches, comparing 0-, 1-, and n-shot random selection (where n is 64 for GPT-3 and 5 for PaLM) of independent sentence pairs from WMT training corpora, and testing on older French, German, and Romanian WMT test sets traditionally used in ML, augmented in PaLM with French-German and Kazakh. For both models, performance increased with number of shots, and n-shot bleu scores were found to be competitive with previous unsupervised SOTA, and in some settings, particularly into English, supervised SOTA as well.
In other early MT work, Reynolds and McDonell (2021) experimented with prompt templates for GPT-3, and found that 0-shot prompts with carefully-chosen templates can outperform n-shot prompts with sub-optimal templates. Garcia and Firat (2022) explored using prompts with mT5 Xue et al. (2021) to control output attributes such as formality, and also examined the effect of using prompt-like natural-language tags during fine-tuning. Patel et al. (2022) proposed autoregressive prompting: concatenating only the first predicted word to a prompt and output prefix at each step.
Since our paper appeared on arXiv in November 2022, there has been a flood of work on using LLMs for MT, which we summarize briefly for completeness. A number of papers Agrawal et al. (2022); Zhang et al. (2023); Jiao et al. (2023); Hendy et al. (2023) investigate prompt quality and source proximity using methods similar to ours but with different LLMs, notably GPT-3.5, GPT-4 and their instruction-tuned counterparts. Their findings are in line with ours, with the exception of Agrawal et al. (2022), who achieve significant gains using lexical matching augmented with a diversity mechanism to select prompts. Apart from differences in model and setting, a potentially salient discrepancy is their emphasis on BLEU rather than neural metrics to measure performance. Other interesting work that conditions prompts on source segments uses dictionaries to supply translations in low-resource settings Ghazvininejad et al. (2023); Lu et al. (2023), or chain-of-thought inspired prompts that elicit keywords, topic, and related examples from the model itself He et al. (2023).
Further recent work looks at the role of data, attributing LLM MT capabilities to the presence of incidental bilingual examples Briakou et al. (2023), or showing that parallel data Schioppa et al. (2023), dictionaries Jones et al. (2023), or restriction to bilingual settings Garcia et al. (2023) can boost performance in smaller LMs. Another popular line aims at controlling various properties of translations such as formality or use of specified terminology, either statically Garcia et al. (2023); Moslem et al. (2023) or with human interaction Pilault et al. (2023). Finally, there is extensive work on analyzing the translation output of LLMs, generally finding that it is more fluent than accurate Hendy et al. (2023); Anonymous (2023), good at handling document context Wang et al. (2023); Karpinska and Iyyer (2023), but also prone to problems such as hallucination Zhang et al. (2023); Guerreiro et al. (2023), and frequently sub-par in low-resource settings Zhu et al. (2023); Bawden and Yvon (2023).
Prompting for Machine Translation
For a general task, prompting an LLM to generate a desired output from an input can involve many steps Liu et al. (2021), including template generation, slot filling, answer search, and answer mapping. In MT, the answer search and mapping processes are simplified because the answers generated by the LLM can be used directly; we simplify further by using a fixed template. What we explore in depth is the slot filling portion; in particular, we test a variety of methods to select few-shot examples for the prompt.
In initial experiments we determined that for few-shot prompting the exact form of the template is unimportant; see Appendix A for details. Following this observation, we decided to adopt simple templates where each example is prepended by the corresponding language name. This results in prompts of the form (for n-shot prompting):
[source]: X_1
[target]: Y_1
...
[source]: X_n
[target]: Y_n
[source]: X
[target]:
where [source] and [target] are instantiated with the names in English of the source and target languages, e.g. English and German. Note that this scheme has been found to be present in the training data as a marker for multilingual content Briakou et al. (2023). Each slot pair (X_i, Y_i) is filled with a translation example for these languages, and the final slot X is filled with the current source text. Our algorithm for n-shot translation from a source text x to a target text y is:
1. Choose translation example pairs (x_1, y_1) … (x_n, y_n). In general, these can depend on x.
2. Plug the example pairs and x into the template. Condition PaLM on the resulting string.
3. Perform a greedy search, stopping when the model outputs a newline. (We found that using a sampling temperature other than 0 tended to degrade translation quality.)
4. Output the predicted suffix verbatim as y.
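The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the exact "Language: text" line format, and the `generate` callable (standing in for greedy decoding with an LLM) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def build_prompt(examples: List[Tuple[str, str]], source_text: str,
                 source_lang: str = "English",
                 target_lang: str = "German") -> str:
    """Fill a fixed template with n example pairs plus the current source.

    The "Language: text" line format is an assumption for illustration.
    """
    lines = []
    for src, tgt in examples:
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
    lines.append(f"{source_lang}: {source_text}")
    lines.append(f"{target_lang}:")  # the model continues from here
    return "\n".join(lines)


def translate(generate: Callable[[str], str],
              examples: List[Tuple[str, str]],
              source_text: str, **langs: str) -> str:
    """Condition a model on the prompt and keep output up to the first newline.

    `generate` is a hypothetical stand-in for greedy decoding; it takes the
    prompt string and returns the model's continuation.
    """
    prompt = build_prompt(examples, source_text, **langs)
    continuation = generate(prompt)
    # Stop at the first newline and return the predicted suffix.
    return continuation.split("\n", 1)[0].strip()
```

Truncating at the first newline mirrors the stopping criterion in step 3; a real decoder would stop generating at that point instead of discarding the remainder afterwards.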
Example selection operates in two phases: first choose a pool containing parallel text, then choose examples from the pool. Choosing the pool lets us control global attributes of examples such as domain and average quality. Our baseline method for choosing examples is to select them randomly from the pool. We also experiment with selecting examples that are “closest” to the source text, on the hypothesis that such examples will help guide the model to produce similar translations.
To find relevant examples, we use k-nearest neighbor (kNN) search on the source side of our parallel pool, inspired by Khandelwal et al. (2021). We carry out the search itself using the method of Guo et al. (2020) (available at https://github.com/google-research/google-research/tree/master/scann), and investigate two possible representations of the sentences, with associated distance measures:
Bag-of-words (BOW): Each sentence is represented by a (sparse) vector of counts associated with words in the vocabulary. As the associated distance measure we use cosine distance. This representation focuses on the surface form of the words, and thus favors lexical similarity between the examples.
Roberta: Sentences are represented as embeddings in the space defined by Roberta Liu et al. (2019), a multilingual transformer-based model, with Euclidean distance used for retrieval. We expect these embeddings to reflect the semantics of the sentence, and thus retrieve prompts that are relevant to their subject matter. (Note that it would be conceivable to use PaLM itself as the embedding model, which would provide a representation, and associated similarity measure, closer to the application that we are targeting. However, due to the high computational cost and large amounts of data, since for some experiments we embed the totality of the WMT training data, we decided to use a smaller model.)
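To make the two selection strategies concrete, the following sketch implements random selection and bag-of-words kNN selection with cosine distance over a small in-memory pool. It is only an illustration: it uses whitespace tokenization and exhaustive search, whereas the paper relies on the approximate search method of Guo et al. (2020), and all function names here are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def bow(sentence: str) -> Counter:
    """Sparse bag-of-words count vector (whitespace tokens, lowercased)."""
    return Counter(sentence.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty sentences as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def select_knn(pool: List[Pair], source_text: str, k: int) -> List[Pair]:
    """Return the k pool pairs whose SOURCE side is closest to source_text."""
    query = bow(source_text)
    ranked = sorted(pool, key=lambda pair: cosine_distance(query, bow(pair[0])))
    return ranked[:k]


def select_random(pool: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Baseline: sample k pairs uniformly, independent of the input."""
    return random.Random(seed).sample(pool, k)
```

Note that `select_knn` inspects only the source side of each pair, so a close match can still carry a poor or misaligned target sentence.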
Data
We experiment with translation into and out of English for Chinese, French and German. After English (78.0%), German (3.5%) and French (3.3%) are the two largest languages in PaLM’s 780B token training corpus; Chinese (0.4%) is the 15th largest, and it also represents an inherently more difficult translation task. To facilitate comparisons with recent SOTA systems, and to minimize the chance of overlap with PaLM’s training corpus, we test on news data from the WMT 2021 evaluation campaign Akhbardeh et al. (2021). Since French was not included in WMT21, we use data from WMT14; apart from being older, these test sets are not purely source-original Freitag et al. (2019) like the more recent ones. Table 1 shows statistics for our test data.
For prompt selection, we use three distinct pools: the full WMT training corpus for each language pair (WMT-full), the corresponding WMT development sets (WMT-dev), and a manually-curated “high-end” pool. Sizes are shown in Table 2. The WMT-full pool is largest and offers the highest probability of close matches, but it is crawled text drawn from sources of varying quality. The WMT-dev pool has generally better quality, and is a closer domain match to our test set; to encourage PaLM to produce more natural text, we included only target-original texts (as identified by sacrebleu). For German-English and Chinese-English we include all the news test sets from 2010 to 2020. As English-French was discontinued after 2015, we used sets from 2010 to 2013, augmented with newsdiscussion2015.
The high-end pool comes from websites containing bilingual articles that we judged to be professionally edited, with native or near-native quality in both languages. The articles are drawn from various domains (biography, business, commentary, culture, fashion, food, news, and obituary), with the news domain of the test sets comprising less than 50% for each language. We treat these articles as symmetrical, and use them as prompt sources in both translation directions. Due to the non-literal nature of the translations, there is frequently no 1-1 correspondence between sentence pairs, so we extract aligned paragraphs for prompting. More detailed information about the high-end pool is provided in Appendix B.
Experiments
For compatibility with Chowdhery et al. (2022), we ran all experiments at the sentence level, translating each test sentence individually and in isolation from its context. This deprives PaLM of the ability to exploit the longer contexts it was exposed to during training, but it matches the operating mode of our baselines (including SOTA baselines), and facilitates evaluation. (Evaluation of document-level translations is complicated by potentially non 1-1 sentence correspondences, resulting in long translation units that are truncated by bleurt and can be difficult for humans to rate reliably.) We leave an exploration of potential gains from conditioning on longer histories to future work.
In preliminary experiments, we varied the number of shots from 0 to 10, and found clear performance gains as we increased the number of shots, with diminishing returns after 5 sentence pairs (see Appendix A). Accordingly we report all results on the WMT pools in the 5-shot setting, where each shot is a single sentence pair, matching the configuration in Chowdhery et al. (2022). For the high-end pool, lacking 1-1 sentence alignments, we use 1-shot examples, where each shot is a single paragraph pair. This provides roughly the same quantity of text as 5-shot with sentences, although it creates a stylistic mismatch with our test setup, as we still translate on a sentence-by-sentence basis, as in the other conditions.
When randomly selecting examples, we observed that there is little variability in automatic scores when selecting different samples (see Appendix C). Note that this holds for document-level scores; the effect on single sentences can still be very important, cf. Figure 1. For the results reported in this section, we let PaLM produce translations with 5 different seeds and we selected the run with the median bleurt score. Translation time was some orders of magnitude longer than that of a dedicated translation system.
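Picking the median run among several seeds can be sketched as follows. The helper name and the dictionary layout are assumptions for illustration, and the scores in the usage note are made-up placeholders, not results from the paper.

```python
from typing import Dict


def median_run(scores_by_seed: Dict[int, float]) -> int:
    """Return the seed whose score is the median of all runs.

    With an odd number of runs (e.g. 5) this is the exact median; with an
    even number it returns the upper of the two middle runs.
    """
    ranked = sorted(scores_by_seed, key=lambda seed: scores_by_seed[seed])
    return ranked[len(ranked) // 2]
```

For example, with hypothetical scores `{0: 71.2, 1: 70.9, 2: 71.5, 3: 71.0, 4: 71.3}`, the function returns seed 0, whose score lies in the middle of the five.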
Following recent recommendations Kocmi et al. (2021); Freitag et al. (2021a) we favour neural metrics (bleurt in our case) over bleu, although we also report bleu scores for completeness. We use a cased version of bleurt Sellam et al. (2020) that is based on Rembert Chung et al. (2020). We use bleu as implemented in sacrebleu Post (2018) (signature: nrefs:1|case:mixed|eff:no|tok:TOK|smooth:exp|version:2.1.0, where TOK is 13a or zh), with zh tokenization for English-Chinese, and 13a tokenization for all other languages.
To perform human evaluation, we hired professional translators (7 for EnDe, 5 for DeEn, 4 for ZhEn, and 4 for EnZh) and measured translation quality with a document-context version of mqm Lommel et al. (2014) which mimics the setup proposed in Freitag et al. (2021a). This includes using the same error categories, severity levels and error weighting schema. As suggested in the study, we weight each major error with 5 and each minor error with 1, except for minor punctuation errors which get a score of 0.1. We depart from Freitag et al. (2021a) in using only a single annotator per segment, and in not imposing a limit of 5 errors per sentence. Additionally, due to technical restrictions on the length of an evaluation session, we limited the mqm evaluation to the first 12 segments per document.
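As a rough illustration of how weighted error annotations turn into a score, the sketch below applies the weighting of Freitag et al. (2021a) as we understand it (major errors 5, minor errors 1, minor punctuation errors 0.1) and averages over segments, so lower is better. The data layout, the category label "Fluency/Punctuation", and the function names are assumptions, and special cases in the original scheme such as non-translation errors are omitted.

```python
from typing import List, Tuple

Error = Tuple[str, str]  # (category, severity), e.g. ("Accuracy/Omission", "major")


def error_weight(category: str, severity: str) -> float:
    """Weight of one annotated error (assumed weights: 5 / 1 / 0.1)."""
    if severity == "major":
        return 5.0
    if severity == "minor":
        # Minor punctuation errors are down-weighted.
        return 0.1 if category == "Fluency/Punctuation" else 1.0
    raise ValueError(f"unknown severity: {severity}")


def segment_score(errors: List[Error]) -> float:
    """Sum of error weights for one segment; 0.0 means no errors marked."""
    return sum(error_weight(cat, sev) for cat, sev in errors)


def system_score(segments: List[List[Error]]) -> float:
    """Average segment score over all rated segments (lower is better)."""
    if not segments:
        raise ValueError("no segments to score")
    return sum(segment_score(errs) for errs in segments) / len(segments)
```

Because the score is an error penalty, a system with fewer or less severe errors receives a lower value.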
We warm up by comparing example selection strategies on the two WMT pools, using automatic metrics to evaluate quality on English-German. Results are shown in Table 3. The main observation is that the choice of pool is much more important than the selection method: the results for WMT-dev are notably higher than those for WMT-full across all settings. When comparing selection methods, Roberta is more effective than BOW, but it does not provide a consistent advantage over random selection.
We conjecture that the quality of an example is more important than its proximity to the current source sentence. The larger size of the full WMT pool means that the kNN approaches will in general be able to find examples that are closer to each source sentence than those from the dev pool, but any resulting gain is offset by the greater risk that an example from the full pool will be a poor translation (since we match only on the source side). Interestingly, had we relied only on bleu, we would have concluded that the choice of pool is unimportant, and that random selection consistently outperforms kNN.
5.2 Results on all language pairs
Table 4 contains our main results, for German-English, Chinese-English, and French-English. For each language pair, we ran PaLM with random selection on all three pools and with Roberta on the WMT-full pool. We compared these systems to output from the best performing system in the 2021 WMT evaluation campaign for German and Chinese, and to off-the-shelf Google Translate for all six language pairs. We evaluate with bleu and bleurt as in the previous section, augmented with human mqm assessments for German and Chinese. French is a special case, as its evaluation set is eight years old, and it is difficult to ensure that any of the MT systems we evaluate have not been exposed to it during training. We include it mostly for the purposes of comparison to Chowdhery et al. (2022), and do not provide SOTA results or perform human evaluation.
Comparing PaLM results for German and Chinese, the pattern from the previous section holds up: random selection from the WMT-dev pool outperforms selection from the full pool. mqm scores correlate well with bleurt for these results. Despite domain and style mismatch, results for the high-end pool are very similar to those for WMT-dev—closer than any results on the full pool—adding support to the hypothesis that example quality is the main determinant of PaLM’s output quality.
The French results reverse the general pattern. For this language pair, random selection from the WMT-full pool does best, although the results for all methods are fairly similar, with a difference of approximately 0.5 bleurt between the best and worst. One potential explanation is the age and quality of newstest2014, as WMT test-set creation has dramatically improved since then.
Turning to a comparison between PaLM and conventional MT systems, the specialized SOTA systems have a substantial advantage of between 1 and 3 bleurt points over the best PaLM results, a gap that is reflected in their much lower mqm scores. The difference is narrower for the general-purpose Google Translate system: less than 1 bleurt except for Chinese-English (1.8), with French-English at parity. PaLM’s performance relative to the best MT system for each language pair is generally better when translating into English, where it is lower by 1.0, 2.3, and 0.0 bleurt for German, Chinese, and French, compared to drops of 2.1, 2.5, and 0.6 in the reverse direction.
The mqm results show some interesting characteristics of translations produced by PaLM. In all language pairs evaluated, fluency mqm scores for PaLM are generally similar to those for SOTA systems, while accuracy scores are lower. The accuracy gap is dominated by Major Accuracy/Omission errors, followed by inconsistent patterns of other Accuracy/* errors across language pairs. In some languages, the best-performing PaLM systems make fewer Style/Awkward errors than SOTA. Table 5 shows a selection of mqm error counts for PaLM WMT-dev random and SOTA systems; full details are provided in Appendix D.
5.3 Comparison to previous results
Our only results that are directly comparable to the few-shot results from Chowdhery et al. (2022) are the WMT-full bleu scores in Table 4(c) (WMT14 French test-set). Our result for French-English matches theirs exactly, but our score for English-French is lower by 1.7 (42.3 versus 44.0). We attribute this discrepancy to their use of the sacrebleu intl tokenizer; when we evaluate our output using this version, we obtain matching scores.
Our general finding that PaLM’s into-English performance is better than the reverse direction matches the conclusion from Chowdhery et al. (2022), while our comparison with recent SOTA systems on current test sets contrasts with their results indicating that PaLM can rival supervised performance in older settings.
Analysis
In this section we delve further into various aspects of PaLM’s MT performance.
To understand the performance difference between kNN Roberta and randomly-selected examples, we performed a qualitative analysis, choosing sentences with the largest bleurt difference between the two systems. Table 14(a) in Appendix F shows an example where the kNN system correctly retrieves relevant translation examples in the football domain, guiding PaLM to produce a better translation than the random selection system. This contrasts with the example in Table 14(b), where the retrieved source sentences are also from the relevant domain, but all have alignment errors, causing PaLM to generate hallucinated output. In general, random selection is also prone to landing on alignment errors, but as each prompt is selected independently, the odds that all examples will be errors are low. An informal analysis of kNN examples indicates that if one non-parallel prompt is selected, the others also tend to be of poor quality, perhaps due to corpus alignment errors that are concentrated in particular documents or topics. Since kNN matches only on the source side, it is not robust to this noise.
6.2 Example Translations
Example translations comparing PaLM and SOTA systems for German-English and English-Chinese are given in Appendix G, in Table 15 and Table 16, respectively. We compared the translations of both systems and chose examples that are short, but include the most frequent patterns that we also observed in longer translations. In general, PaLM’s translations are less literal when compared to supervised NMT systems. Even though this is one of the strengths of PaLM, it occasionally misses some important information in the source or hallucinates facts not present in the source sentence. The supervised models on the other hand are faithful to the source; this reduces the risk of omission and addition errors, but occasionally leads to translations that are not natural in the target language (e.g. translating street names or using the wrong time format). These findings are in line with the mqm results presented in section 5.2.
6.3 Overlap of test and training data
One major change with respect to Chowdhery et al. (2022) is our use of more recent WMT test sets, which are unlikely to overlap with PaLM’s training data. (Here we measure target-side overlap only; we assume there is no substantial parallel data in PaLM’s training corpus, and therefore no substantial parallel overlap.) We test this hypothesis using the technique from Chowdhery et al. (2022), which involves measuring high-order n-gram matches; specifically, we measure 15-gram overlap as tokenized by the mBERT tokenizer Devlin et al. (2019). (We selected the mBERT tokenizer, as opposed to PaLM’s sentence-piece tokenizer, because it decouples the measurement of overlap from the model under test.) For test sequences with fewer than 15 tokens, we consider them overlapping if the complete sequence is found as a subsequence of a training example. We report the degree of overlap by showing the percentage of original test examples that survive in the clean test set after removing overlap in Table 6. This confirms that the older French-English and German-English sets have substantial overlap with PaLM’s training data, while the newer test sets, whether into or out of English, have much smaller overlapping portions.
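The overlap criterion can be illustrated with a small sketch over pre-tokenized sequences. Several simplifications are assumptions made here: tokens are plain lists of strings (the paper uses the mBERT tokenizer), "subsequence" is read as a contiguous run of tokens, training data is scanned exhaustively in memory, and all function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contains_contiguous(haystack: Sequence[str], needle: Sequence[str]) -> bool:
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    m = len(needle)
    return any(list(haystack[i:i + m]) == list(needle)
               for i in range(len(haystack) - m + 1))


def is_overlapping(test_tokens: Sequence[str],
                   train_examples: List[Sequence[str]], n: int = 15) -> bool:
    """Apply the n-gram criterion, with the short-sequence fallback."""
    if not test_tokens:
        return False
    if len(test_tokens) < n:
        # Short test sequences overlap only if found whole in a training example.
        return any(contains_contiguous(t, test_tokens) for t in train_examples)
    test_grams = ngrams(test_tokens, n)
    return any(test_grams & ngrams(t, n) for t in train_examples)


def clean_percentage(test_set: List[Sequence[str]],
                     train_examples: List[Sequence[str]], n: int = 15) -> float:
    """Percentage of test examples that survive after removing overlap."""
    clean = [t for t in test_set if not is_overlapping(t, train_examples, n)]
    return 100.0 * len(clean) / len(test_set)
```

At the scale of a real training corpus one would index the training n-grams once, for example in a hash set, instead of recomputing them for every test example.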
Chowdhery et al. (2022) also measure the effect of test-set overlap on translation quality, comparing scores on the original test set to the clean set with overlapping examples removed. In Appendix H we report similar scores for the older test sets, and extend the analysis to calibrate the effect of overlap on MT evaluation, by comparing to an overlap-free off-the-shelf system.
Conclusion
We perform a careful assessment of the sentence-level MT capabilities of PaLM, which we compare to SOTA and a current off-the-shelf (COTS) MT system for three high-resource languages (German, Chinese, and French) into and out of English, using the latest test sets from WMT. We chose to focus on a small set of high-resource language pairs in order to test the claims of the original PaLM paper, which are most striking for these pairs. The time and expense of performing high-quality human evaluations precluded a broader investigation.
Comparing kNN and random strategies for selecting 5-shot translation examples to instantiate fixed prompt templates, we find that kNN’s potential advantage in identifying examples relevant to the source sentence is outweighed by its susceptibility to corpus noise. Choosing examples randomly from small, high-quality pools works well, and performance appears to be independent of the domain and translation style of the pool, suggesting that example quality is the most important factor.
Using both the bleurt metric and mqm human evaluations, we show that PaLM’s performance, while very impressive for a system never deliberately exposed to parallel text, still significantly lags that of competition-grade SOTA systems on recent WMT test sets, and to a lesser extent the performance of COTS systems as well. This contrasts with some of the findings of Chowdhery et al. (2022). As in that work, we find that performance into English is somewhat better than the reverse direction. Finally, we perform an extensive analysis of the characteristics of PaLM’s MT output, notably finding that in all languages we tested it tends to be creative and fluent but prone to omissions and other accuracy errors; broadly speaking, it matches the fluency but lags the accuracy of conventional NMT.
In future work we look forward to testing PaLM on document-level translation tasks, unleashing its formidable capacity for leveraging long contexts. We would also like to explore prompt tuning methods that are more sophisticated than the hard-prompt setting we adopted for this paper, particularly to see if these might offer a way to tighten up PaLM’s MT accuracy without destroying its impressive ability to generate highly-fluent text.
Limitations
As we use only a small number of language pairs, it is not clear how general our conclusions are; in particular, they pertain only to languages that are well represented in PaLM’s training corpus, and only to translation into and out of English. Our restriction to independent sentence-level translations may have caused us to underestimate PaLM’s true capabilities, since some of the accuracy problems we observed might be considered less severe in the context of whole-document translation where less literal translations are more typical. Our exploration of prompting barely scratches the surface of the many methods that have been proposed for adapting LLMs to particular tasks, and we may have missed a technique that produces higher-quality translations than we observed. Finally, the human evaluation we rely on to provide our most accurate results is necessarily subjective, and if we were to have carried out the evaluation with different raters and a different methodology, our conclusions might well have been different.
Ethical Considerations
Working with large language models comes with many ethical concerns that are discussed in detail in Brown et al. (2020) and Chowdhery et al. (2022). There, MT is often one task of many, while we focus on the question of proper example selection for few-shot prompting of MT, which adds a few specific concerns. Our conclusion that prompt quality is important could lead one to build a system with prompts drawn from a small set of trusted sources; indeed, our high-end set is one such example of this. In such a scenario, this small source will have an outsized impact on the output of the translation system, and one must be careful to manage issues of attribution and intellectual property. Furthermore, an editorial choice defining high-quality language can potentially reduce quality for groups and topics not typically discussed in this style Gururangan et al. (2022). Finally, by highlighting the power of few-shot examples, one might be tempted to turn example selection over to the users of a system. There, special steps must be taken to avoid exposing users to biased or toxic outputs, which may be triggered by unconstrained prompting Gehman et al. (2020); Costa-jussà et al. (2022).
References
Appendices
Appendix A Prompt Exploration
As preliminary experiments we tried different prompting templates:
This is the prompt template used in the paper (see Section 3). It prepends the examples with the corresponding language name in English.
Like “Language”, but instead of full English names, two-letter language codes are used (e.g. “en”, “de”).
Like “Language”, but the header “Translate following sentences:” is added.
A textual request for translating a sentence: “Translate from English into German: ”, where and are the translation examples, as in Section 3. The source sentence is given with the same template, but without specifying any translation.
Like “Language”, but the language names are given in German (“Englisch”, “Deutsch”).
No added text. Source and target examples are just input one after the other.
As shown in Table 7, the choice of a prompting strategy has a crucial impact when the number of shots is low, but the effect is reduced when we increase the number of examples shown. The number of examples also has a significant impact on translation quality. We chose to work with 5 examples, as there are diminishing returns when increasing the number of prompts, and choosing a higher number has additional practical implications (e.g. possibly exceeding the maximum input length).
Appendix B High-end pool
Table 9 describes the high-end pool. All listed articles were manually downloaded in June–August 2022, and semi-automatically divided into bilingual paragraphs. Our high-end pool consists of all paragraphs from all articles. The domain breakdown for each language pair is shown in Table 8.
Appendix C Variability of Random Runs
Table 10 shows the automatic scores for random runs for the German↔English language pair. The range of scores is quite small, less than 0.5 bleurt points for all language directions. For both directions, the use of WMT-dev, as opposed to WMT-full, for the random pool reduces the observed range in bleurt by at least .
Appendix D Detailed mqm Scores
Table 11 presents mqm scores for PaLM WMT-dev random and SOTA systems in the four language pairs evaluated, along with the breakdown of the scores into their Accuracy and Fluency components. Table 12 presents detailed mqm error counts for PaLM WMT-dev random and SOTA systems in en→de and de→en.
Appendix E Significance numbers
We calculate pairwise significance numbers based on PERM-BOTH pair-wise significance testing Koehn (2004); Deutsch et al. (2021). Results can be seen in Table 13.
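To illustrate the general idea behind permutation-based pairwise significance testing, the following is a minimal sketch of a generic paired permutation (sign-flip) test on segment-level scores. It is a simplified stand-in rather than the PERM-BOTH procedure of Deutsch et al. (2021); the function name and parameters are assumptions made for this example.

```python
import random
from typing import Sequence


def paired_permutation_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the difference in mean segment-level score.

    Under the null hypothesis the two systems are exchangeable, so the
    system labels can be swapped independently for every segment. Swapping
    the labels of a segment is the same as flipping the sign of its
    score difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    if not scores_a:
        raise ValueError("at least one segment is required")

    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / n)

    hits = 0
    for _ in range(n_resamples):
        total = 0.0
        for d in diffs:
            total += d if rng.random() < 0.5 else -d
        if abs(total / n) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value indicates that a difference in mean score as large as the observed one is unlikely to arise from randomly reassigning outputs between the two systems.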
Appendix F Example Prompts
Tables 14(a) and 14(b) show prompt examples where kNN and random selection do better, respectively, as described in section 6.1.
Appendix G Example Translations
Tables 15 and 16 show example translations for German→English and English→Chinese as described in section 6.2.
Appendix H Overlap Analysis
Chowdhery et al. (2022) show bleu differences between clean and original test sets, and provide some evidence that differences are not due to memorization, but it still isn’t clear how much overlap actually inflates a model’s score. We directly quantify the effect of train-test overlap on decision making by comparing 5-shot PaLM to Google Translate (GT) on our two sets with substantial overlap, testing under original, clean and ¬clean (including only overlapping examples) scenarios. (We chose Google Translate for comparison because it is non-trivial to build a SOTA baseline for older WMT scenarios. Through personal communication, we understand that Google Translate has no overlap with WMT test sets.) bleu and bleurt scores for the two systems and three test sets are shown in Table 17.
We can see that directly comparing original and clean results for a single system conflates differences from overlap with those from the increased difficulty of the clean subset. For example, for de→en bleu, comparing PaLM’s original and clean scores gives an overlap gap of 2.6 bleu, in line with the gaps reported by Chowdhery et al. (2022). However, the non-overlapping GT system also has lower scores on the clean set, indicating that it may simply be more difficult. (The difference in difficulty between clean and ¬clean for systems without overlap is not easily explained. A common difficulty indicator is sentence length, but average lengths, as measured by number of sacrebleu tokens per sentence, are similar between clean and ¬clean for both de→en (23.8 versus 23.0) and fr→en (21.1 versus 22.7).) It’s more useful to see that the original test indicated a 1.5-bleu difference between the two systems, while the clean test indicates a 2.0-bleu difference, meaning PaLM benefited from overlap by 0.5 bleu in this comparison. The fully overlapping ¬clean further distorts the difference between the two systems: the true (clean) delta of 2.0 bleu shrinks to only 0.4. Trends for fr→en are similar: though PaLM and GT are very close according to the original test set, the clean set reveals a delta of 0.8 bleu. Interestingly, bleurt may be less sensitive to overlap, with the original-versus-clean deltas hovering around 0 for fr→en regardless of the test subset, and de→en showing that PaLM benefits from an overlap bonus of only 0.3 bleurt.
In summary, overlap between the target side of the test data and the LLM training data can have an impact on both bleu and bleurt scores, altering the delta between two systems where one benefits from overlap and the other does not by up to 0.7 bleu or 0.3 bleurt at 20-30% overlap. However, we should emphasize that the differences due to overlap are small overall, and certainly much smaller than expected if one looked only at the difference between original and clean scores.
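The delta-based reasoning used in this appendix can be written out as a short worked example. The sketch below uses only the German-to-English bleu deltas quoted in the text (1.5 on the original set, 2.0 on the clean subset, 0.4 on the fully overlapping subset). The function name is hypothetical, and each delta is read here as the non-overlapping system's score minus the overlapping system's score, which is an assumption about the sign convention.

```python
def overlap_bonus(delta_reference: float, delta_other: float) -> float:
    """Estimate how much overlap helps the overlapping system.

    Each delta is the score of the non-overlapping system minus the score
    of the overlapping system on one version of the test set. Comparing
    deltas rather than raw scores cancels the difference in difficulty
    between the subsets, because both systems face the same sentences.
    """
    return delta_reference - delta_other


# Deltas quoted in the text for German-to-English bleu.
DELTA_ORIGINAL = 1.5
DELTA_CLEAN = 2.0
DELTA_OVERLAPPING = 0.4

# Overlap bonus hidden in the original test set.
bonus_original = overlap_bonus(DELTA_CLEAN, DELTA_ORIGINAL)
# Distortion when only the overlapping sentences are used.
bonus_overlapping = overlap_bonus(DELTA_CLEAN, DELTA_OVERLAPPING)

if __name__ == "__main__":
    print(f"bonus on original set:       {bonus_original:.1f} bleu")
    print(f"bonus on overlapping subset: {bonus_overlapping:.1f} bleu")
```

The first value reproduces the 0.5-bleu benefit stated in the text; the second shows how much larger the distortion becomes when every test sentence overlaps with the training data.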
Appendix I Fixed versus random prompts
The results from section 5.2 indicate that random selection from small, high-quality prompt pools can work better than trying to customize prompts for specific inputs. In this section we investigate the effect of using a single high-quality prompt for all inputs, chosen using a maximum-likelihood criterion. For convenience, we carried out experiments on the high-end pool with 1-shot paragraph prompts. For each prompt in the pool, we computed the probability of a set of held-out high-end paragraphs when PaLM was conditioned on that prompt. We select the prompt that resulted in the highest probability for each language pair.
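The selection criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: `log_prob` is a hypothetical user-supplied callable standing in for the model's log-probability of a text conditioned on a prompt, and the probability of the held-out set is taken to be the product of the per-paragraph probabilities, i.e. the sum of their log-probabilities.

```python
from typing import Callable, Sequence


def select_fixed_prompt(
    prompts: Sequence[str],
    held_out: Sequence[str],
    log_prob: Callable[[str, str], float],
) -> str:
    """Return the prompt under which the held-out set is most probable.

    log_prob(prompt, text) is assumed to return the model's log-probability
    of `text` when conditioned on `prompt` (hypothetical interface).
    """
    if not prompts:
        raise ValueError("the prompt pool is empty")

    def held_out_score(prompt: str) -> float:
        return sum(log_prob(prompt, text) for text in held_out)

    return max(prompts, key=held_out_score)


def toy_log_prob(prompt: str, text: str) -> float:
    """Toy scorer for demonstration only: penalise words unseen in the prompt."""
    vocab = set(prompt.split())
    return -float(sum(1 for word in text.split() if word not in vocab))


if __name__ == "__main__":
    pool = ["the cat sat", "stock markets fell"]
    held_out = ["markets fell sharply", "stock prices fell"]
    print(select_fixed_prompt(pool, held_out, toy_log_prob))
```

In practice the toy scorer would be replaced by a call to the language model, and the selection would be run once per language pair.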
Table 18 compares this method to random selection from the high-end pool. For all language pairs except Chinese→English, the fixed prompt does as well as or better than the average performance over 5 random runs where a different prompt is selected for each input during each run. In Chinese→English, the prompt that ranked 5th according to the probability criterion also outperformed the random average, suggesting problems with our held-out set for that language pair.
We conclude that using a single high-quality prompt can be a safer strategy than choosing a fresh randomly-selected prompt for each input. Model probability appears to be a reasonable criterion for judging quality, but we look forward to refining this heuristic in future work.