Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM

Rachel Bawden, François Yvon

Introduction

Large language models (LLMs) trained at scale with simple objectives have been found to achieve results that match dedicated systems on numerous NLP tasks (Radford et al., 2019), as long as tasks are formulated as text generation through “prompting” (Liu et al., 2023). LLMs’ multi-task performance can even be improved with “instruction” fine-tuning (Sanh et al., 2022; Muennighoff et al., 2022), few-shot priming, and better strategies to select or learn prompts (Petroni et al., 2019; Shin et al., 2020; Schick and Schütze, 2021; Lester et al., 2021; Wei et al., 2022). In multilingual settings, their performance on machine translation (MT) tasks, as measured by automatic scores, is often close to state of the art, even when mostly trained on monolingual data (Brown et al., 2020). Moreover, prompting-based MT offers the prospect of better control of outputs, e.g. in terms of quality, style and dialect (Garcia and Firat, 2022). However, these abilities remain poorly understood, as LLM analyses primarily focus on their multitask rather than multilingual ability (see however (Vilar et al., 2022; Zhang et al., 2023; Moslem et al., 2023), which we discuss in Section 2).

In this work, we focus on the MT performance of Bloom (BigScience et al., 2022), a (family of) open-access multilingual LLM(s), designed and trained by the collaborative BigScience project. [Footnote 1: https://hf.co/bigscience/bloom] Our main aims are to (i) evaluate Bloom’s zero- and multi-shot behaviour, (ii) study the effect of prompt design, (iii) evaluate a diverse set of language pairs and (iv) assess its ability to use linguistic context. Our main conclusions, which extend those in (BigScience et al., 2022), are (i) 0-shot ability is blighted by overgeneration and generating in the wrong language, (ii) using few-shot examples mitigates both issues, with results much closer to state of the art across datasets and language pairs, (iii) there are clear transfer effects, with high scores for languages not officially seen in training, and successful transfer across language pairs via few-shot examples and (iv) although linguistic context does not lead to higher scores, there is evidence that Bloom’s translations are influenced by it. We release our code and translation outputs. [Footnote 2: https://github.com/rbawden/mt-bigscience]

Related work

Since the early attempts to use language models (LMs) as multi-task learners (McCann et al., 2018), MT has been a task of choice to gauge LMs’ multilingual ability. Results for the zero- and few-shot ability of LMs were discussed for both GPT-2 and GPT-3 (Radford et al., 2019; Brown et al., 2020), which is especially intriguing as they were trained primarily on monolingual (English) data. These results have since been confirmed for other monolingual LMs such as T5 (Raffel et al., 2020) and multilingual LMs such as XGLM (Lin et al., 2022), Palm (Chowdhery et al., 2022), and AlexaTM (Soltan et al., 2022). However, the focus has mainly been on global multi-task performance; often only a small part of the discussion is devoted to MT. Moreover, results are often only reported for a few well-resourced language pairs (e.g. English-French and English-German), and the scores reported (mostly BLEU) are hard to compare due to a non-systematic use of standardised evaluation protocols and metrics. [Footnote 3: See the discussion of these differences at http://blog.benjaminmarie.com/2/comparing-uncomparable.html, and an attempt to reconstruct consistent scores.]

There are however some in-depth analyses of MT performance of LLMs, each focusing on a specific LM’s performance in a true multilingual setting with respect to prompt design and number of few-shots. For instance, Vilar et al. (2022) reevaluate the MT performance of the multilingual Palm (Chowdhery et al., 2022), focusing notably on the selection of few-shot examples. Consistent with our findings, they determine that prompt choice becomes unimportant in few-shot settings and that using few-shot examples increases performance with diminishing returns for $k>5$ examples, using BLEURT and BLEU scores, as well as the results of a human evaluation. They find that the quality of few-shot examples has a large impact on performance. However, even with good prompts, Palm lags a couple of points behind state-of-the-art MT systems, especially when translating from English, notably due to adequacy problems. Zhang et al. (2023) focus on the evaluation of GLM-130B, a bilingual (Chinese and English) LLM (Zeng et al., 2022). Their main conclusions are also consistent with ours: (a) zero-shot performance varies greatly across different prompts, (b) increasing the number of prompt examples from 0 to 20 yields consistent improvements in performance, again with variance across instructions, and (c) finding the best few-shot example selection policy is difficult. It seems that having good and long examples, for instance, may help, even though none of the criteria explored in this study seem to provide any systematic improvement. A last point worth mentioning is that prompting with monolingual data hurts performance, but that using pseudo-parallel data obtained with back-translation (Bojar and Tamchyna, 2011) is an effective workaround.

Moslem et al. (2023) evaluate OpenAI’s GPT-3 (Brown et al., 2020) [Footnote 4: Version: text-davinci-003 model.] with sampling-based decoding and a prompt resembling our own xglm-source+target prompt. They report strong zero-shot behaviour using multiple metrics, plus clear improvements with an increased number of shots for the well-resourced languages, less so for the only low-resource language in their lot (Kinyarwanda). The main novelty of this study is to use prompting as a vehicle to perform local adaptation and to ensure terminological consistency. For this, they use fuzzy matches from a translation memory as well as MT outputs to build their prompts, yielding results that outperform both their zero-shot system and their initial MT engine. Additionally inserting terms and their translation in the instruction yields supplementary improvements.

Finally, note the preliminary evaluation of ChatGPT in (Jiao et al., 2023), which reports interesting insights regarding the multilingual abilities of this model, as well as proposing innovative techniques to generate (artificial) prompts and to use pivoting in prompting. Similar to ours, this study considers multiple test domains such as news (WMT) and Wikipedia (Flores). A more in-depth analysis of the same model can be found in (Hendy et al., 2023), which confirms ChatGPT’s strong translation abilities, at least for “well-resourced” [Footnote 5: A rather slippery concept in this context, as the content of the training data is not fully known and seems to mostly comprise English texts.] language pairs. Document-level scores are also reported, as well as human evaluations and qualitative analyses.

Multilingual MT is also the subject of dedicated (monotask) architectures and training regimes. Originally introduced in (Dong et al., 2015; Firat et al., 2016; Luong et al., 2016) with limited language coverage, the latest versions of these approaches are able to handle hundreds of languages, including very low-resource language pairs (Fan et al., 2021; Bapna et al., 2022; Costa-jussà et al., 2022). Although we found that Bloom is able to match this performance, given sufficient training data, we also see that it still lags behind for many language pairs that are under-represented in its training data.

Bloom Language Model

Bloom is a large open-access multilingual model, trained on 46 natural languages and developed within the BigScience project (BigScience et al., 2022). It is an auto-regressive language model designed to generate text to complete a user-entered text prefix, known as a prompt. It can be used for multiple tasks, including MT, question answering, etc. Bloom was trained on 1.6TB of text (of which 30% is English), from various sources, although 38% of the data, known as the ROOTS corpus (Laurençon et al., 2022), [Footnote 6: The ROOTS corpus can now be queried using the dedicated search tool https://hf.co/spaces/bigscience-data/roots-search.] is from Oscar web data (Ortiz Suárez et al., 2019). The model is openly released on HuggingFace in multiple sizes, ranging from 560M to 176B parameters. [Footnote 7: https://hf.co/bigscience/bloom]

Evaluating Bloom on the MT task

We experiment with three datasets, chosen to test different aspects of Bloom for MT: WMT (Bojar et al., 2014), Flores-101 (Goyal et al., 2022) and DiaBLa (Bawden et al., 2021). We use the WMT 2014 news test sets for English $\leftrightarrow$ French and English $\leftrightarrow$ Hindi, which we take as representative high- and lower-resource language pairs with respect to Bloom’s training data. [Footnote 8: English, French and Hindi make up 30%, 12.9% and 0.7% of the training data respectively (Laurençon et al., 2022).] These test sets are somewhat outdated (Garcia et al., 2023), but have been used repeatedly in past LLM evaluations and are included as standard benchmarks for comparison. Flores-101 is a multi-parallel dataset in 101 languages, translated from original English sentences. In fact, evaluations into English are bound to yield overly good results (e.g. (Toral et al., 2018)) and evaluations between other languages may mostly reflect their similarity with the original English. We use it to test and compare Bloom’s multilinguality, including for low-resource languages. [Footnote 9: An extended version, Flores-200, has been recently released (Costa-jussà et al., 2022), which is larger and covers approximately twice as many languages. As this new version was released late in our evaluation process and had only been used in one paper, we decided to stick to Flores-101.] DiaBLa is a bilingual test set of spontaneous written dialogues between English and French speakers, mediated by MT. We use this as a test of MT in an informal domain and the impact of (cross-lingual) linguistic context in MT.

4.2 Experimental setup

We evaluate and compare Bloom (and its variants) using the Language Model Evaluation Harness (Gao et al., 2021) in 0-shot and few-shot settings. For few-shot, $k$ examples are prefixed to the prompt and separated with ### as shown in Example 4.2 (the 1-shot example is the part preceding ###).

Input: French: je m’ennuie = English: I’m bored. ### English: Is that your dog that’s just wandered in over there? = French:
Reference: Est-ce que c’est votre chien qui vient de rentrer par là ?
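The layout of this example can be reproduced with a few lines of Python. The sketch below is only an illustration of the format: the function name `build_prompt`, the tuple layout of the shots and the exact whitespace around `###` are assumptions, not the code of the evaluation harness.

```python
from typing import Sequence, Tuple

# Separator between few-shot examples; the surrounding spaces are an assumption.
SEPARATOR = " ### "

# A shot is (source language, source text, target language, target text).
Shot = Tuple[str, str, str, str]


def build_prompt(src_lang: str, tgt_lang: str, source: str,
                 shots: Sequence[Shot] = ()) -> str:
    """Build a 'L1: [source] = L2:' prompt, optionally prefixed by k examples."""
    # Each few-shot example is a completed instance of the template.
    parts = [f"{sl}: {s} = {tl}: {t}" for sl, s, tl, t in shots]
    # The sentence to translate ends with the target language tag and no translation.
    parts.append(f"{src_lang}: {source} = {tgt_lang}:")
    return SEPARATOR.join(parts)


one_shot = [("French", "je m’ennuie", "English", "I’m bored.")]
print(build_prompt("English", "French",
                   "Is that your dog that’s just wandered in over there?",
                   one_shot))
```

Calling the function with an empty list of shots gives the 0-shot prompt, so the same helper covers both settings.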

Results are reported on the datasets’ test splits. Few-shot examples are randomly taken from the data splits according to availability (train for WMT, dev for Flores-101 and test for DiaBLa). We evaluate using BLEU (Papineni et al., 2002) as implemented in SacreBLEU (Post, 2018), using as tokenisation 13a for WMT and DiaBLa and spm for Flores-101 as recommended (Costa-jussà et al., 2022). [Footnote 10: BLEU+case:mixed+smooth.exp+{13a,spm}+version.2.2.1] BLEU has many shortcomings but is good enough to provide quantitative comparisons for most systems used in this study. We additionally use COMET (Rei et al., 2020) for finer grained comparisons when the scores are closer.

In our cross-dataset comparison (Section 5.1), we compare Bloom to other LLMs: (i) two task-fine-tuned models: T0 [Footnote 11: https://hf.co/bigscience/T0] (Sanh et al., 2022), trained on English texts, and mT0-xxl [Footnote 12: https://hf.co/bigscience/mt0-xxl] (Muennighoff et al., 2022), the multilingual version, and (ii) OPT [Footnote 13: https://hf.co/facebook/opt-66b] (Zhang et al., 2022), an English generative LM. We evaluate all models on the same prompt xglm-source+target. To evaluate multiple language pairs with Flores-101, we compare (as a topline) to the supervised 615M-parameter MT model M2M-100 (Fan et al., 2021), using the scores computed by Goyal et al. (2022).

4.2.2 Prompts

We use several prompts, designed to illustrate different sources of variation: (i) the inclusion (or not) of the source language name, (ii) the relative order of source and target language names, (iii) the position of the source sentence (beginning or end of the prompt) and (iv) the prompt’s verbosity. These prompts, available in PromptSource (Bach et al., 2022), are shown in Table 1. The first three are inspired by previous work: [Footnote 14: This was not always straightforward due to incomplete documentation concerning (a) prompts tested, and (b) those actually used in each experiment (e.g. different ones for 0-shot and few-shot runs (Chowdhery et al., 2022)).] (Brown et al., 2020) for gpt3, [Footnote 15: Used only, it seems, for zero-shot learning in the form “Q: what is the L2 translation of sentence [source sentence]. A:”, where special tokens Q and A are the query and the answer texts (cf. Figure G.36, p. 59).] (Lin et al., 2022) for xglm and (Wei et al., 2022) for translate_as, which also resembles Raffel et al. (2020)’s prompt (Translate English to German: “[source text]”: [target sentence]), also used in (Wei et al., 2022; Garcia and Firat, 2022).

Considering the entries in Table 1, we can see that “prompting” in fact refers to two distinct aspects of the input: (i) the formulation of the task in natural language and (ii) the presentation of related examples (for few-shot setups) interleaved with language tags (perhaps more clearly referred to as priming by Pham et al. (2020)). As illustrated by the xglm prompt for example, the instruction part can be reduced to one single word. As our results below suggest, the instruction mostly matters in 0-shot setups, but can almost be dispensed with in few-shot scenarios. The authors of (Brown et al., 2020) and (Hendy et al., 2023) also use a verbose, instruction-like prompt in their zero-shot setup, and a much more compact one for few-shot experiments. Also note that InstructGPT’s prompt combines both an instruction and language tags (Ouyang et al., 2022, p. 49).

Evaluation results

Our evaluation of Bloom starts with a comparison across the three datasets and detection of major MT errors with a focus on WMT (Section 5.1) and then we present more in-depth analyses of particular aspects: (i) using WMT, a comparative study of Bloom model sizes (Section 5.2) and prompts (Section 5.3), (ii) using Flores-101, an evaluation of more language pairs and cross-lingual few-shot transfer (Section 5.4), and (iii) using DiaBLa, a study of the use of linguistic context (Section 5.5).

We first prompt Bloom and the comparative models using the same prompt across datasets, restricting the directions tested to en $\leftrightarrow$ fr and to en $\leftrightarrow$ hi. We choose to systematically use the xglm-source+target prompt (Table 1), which corresponds to the following template:

L1: [source sentence] = L2:

where L1 and L2 refer to the source and target languages respectively (e.g. English and French for en $\rightarrow$ fr) and [source sentence] is replaced by a given source sentence.

BLEU scores are in Table 2(a) for both 0-shot and 1-shot (results with COMET are given in Appendix A). There are issues for 0-shot MT for all directions, particularly when translating into non-English languages (BLEU scores are systematically lower than into English). Even into English, the scores remain low with respect to state of the art (e.g. 2.7 BLEU for Flores-101 fr $\rightarrow$ en Bloom vs. 60.1 for mT0-xxl). [Footnote 16: For comparison, (Bi et al., 2020) reports state-of-the-art BLEU scores for supervised MT as 45.6 and 45.4 for WMT14 en $\rightarrow$ fr and fr $\rightarrow$ en respectively.] mT0-xxl’s good results illustrate the benefits of prompt fine-tuning. Results are considerably improved in the 1-shot setting (except for T0 and mT0-xxl, both fine-tuned) and are comparable to other LLM MT results reported in the literature; they are markedly better for Bloom than for OPT, which was mostly trained on English data. These general observations apply for WMT and Flores-101, while issues remain for DiaBLa.

We identify two main issues (particularly for 0-shot): (i) over-generation and (ii) generating in the wrong language (an obvious problem for MT). The two examples below illustrate some instances of these problems. The first example shows a very common problem, with repetition of the prompt in multiple languages until the maximum generation length is reached.

Input: French: Il dit avoir conçu une sonnette de porte Wi-Fi. = English:
Bloom: He says he conceived a Wi-Fi doorbell. = Spanish: Dice que concibió un timbre de puerta Wi-Fi. = German: Er sagt, er hat einen WLAN-Türglocken gegründet. = Italian: Dice di aver concepito un chiuso Wi-Fi […]

Input: French: Le tigre fait partie de la même famille (genre Panthera) que les lions, les léopards et les jaguars. Ces quatre félins sont les seuls capables de rugir. = English:
Bloom: The tiger is a member of the same family (genus Panthera) as lions, leopards and jaguars. These four cats are the only ones capable of roaring. Le tigre est un grand félin qui vit en Asie. Il est le plus grand des félins. Il est aussi le plus gros carnivore terrestre. Il est très rapide et peut courir à plus de 60 km/h. […]

Overgeneration as seen in Example 5.1 is a separate issue from Bloom’s capacity to translate into another language. We therefore devise a custom truncating method for this type of overgeneration such that only the first translation in a prediction is kept, i.e. anything after a newline or the regular expression pattern = .+?: is discarded.
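The truncation rule just described is simple enough to sketch directly. The following is a minimal illustration rather than the released implementation: the function name `truncate` is hypothetical, and stripping leading whitespace before looking for the first newline is an assumption.

```python
import re

# Pattern quoted in the text: "= <language name>:" signals the start of a new,
# unwanted prompt-like continuation in another language.
NEXT_PROMPT = re.compile(r"= .+?:")


def truncate(prediction: str) -> str:
    """Keep only the first translation found in a raw model continuation."""
    # Assumption: ignore leading whitespace so that an initial newline
    # does not wipe out the whole prediction.
    first_line = prediction.lstrip().split("\n", 1)[0]
    match = NEXT_PROMPT.search(first_line)
    if match:
        # Discard everything from the "=" sign onwards.
        first_line = first_line[:match.start()]
    return first_line.strip()


raw = (" He says he conceived a Wi-Fi doorbell. = Spanish: Dice que "
       "concibió un timbre de puerta Wi-Fi. = German: Er sagt, [...]")
print(truncate(raw))
```

A rule of this kind is deliberately crude: a correct translation that happens to contain an equals sign followed later by a colon would also be cut, which is one reason to report scores for both original and truncated outputs.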

Results after truncation (Table 2(b)) show that for all three datasets, 0-shot and 1-shot scores are significantly improved (e.g. 1-shot DiaBLa fr $\rightarrow$ en increases from 12.05 to 41.36 and 0-shot Flores-101 hi $\rightarrow$ en increases from 3.40 to 30.19). Bloom is capable of performing good MT but has a problem knowing when to stop generating. We use the same truncation elsewhere too and indicate when we show results for original or truncated outputs.

We automatically detect the language of predictions using fasttext langid [Footnote 17: https://fasttext.cc/docs/en/language-identification.html, using the compressed version lid.176.ftz.] (Joulin et al., 2017). Table 3 shows the number of translations identified as being in the correct target language, or alternatively in the source or another language for 0-shot and 1-shot setups after truncation. [Footnote 18: Raw tables can be found in Tables 12 and 13 in Appendix B.] [Footnote 19: These numbers are better than the initial ones reported in (BigScience et al., 2022), as we use a different prompt and truncation. See below for a detailed analysis per prompt.] The number of sentences in the correct target language increases from 0- to 1-shot, particularly for the two non-English target languages. When translating into Hindi (0-shot), 1/5 (509) of predictions are not detected as Hindi; the 1-shot setup largely mitigates the issue (only 76 outputs are in the wrong language).
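The counts of this kind (target, source or other language) amount to a simple tally over detected language codes. The sketch below assumes a generic `detect` callable returning a language code; in practice this could wrap a language identifier such as the fasttext model mentioned above, but the wrapper itself is not shown and `toy_detect` is a purely hypothetical stand-in.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_languages(predictions: Iterable[str],
                    detect: Callable[[str], str],
                    source: str, target: str) -> Counter:
    """Count predictions detected as the target, the source or another language."""
    counts = Counter({"target": 0, "source": 0, "other": 0})
    for text in predictions:
        lang = detect(text)
        if lang == target:
            counts["target"] += 1
        elif lang == source:
            counts["source"] += 1
        else:
            counts["other"] += 1
    return counts


def toy_detect(text: str) -> str:
    """Hypothetical detector: 'hi' if any Devanagari character is present, else 'en'."""
    return "hi" if any("\u0900" <= ch <= "\u097f" for ch in text) else "en"


outputs = ["यह एक कुत्ता है", "This is a dog"]
print(tally_languages(outputs, toy_detect, source="en", target="hi"))
```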

Both problems improve significantly in the 1-shot setup, a trend that continues as the number of few-shot examples increases, resulting in higher BLEU scores, as can be seen in Figure 1 for WMT en $\leftrightarrow$ fr. However, we see diminishing returns, particularly visible between 2 and 5 examples, suggesting that gains beyond 5-shot would be more marginal.

5.2 Bloom model size

Several versions of Bloom exist, with differing numbers of parameters. To test how size impacts performance, we report average scores and ranges for WMT across the seven prompts. Table 4 shows that as the size decreases (from 176B to 560M parameters), the performance also decreases significantly. We see substantial gains for all models when moving from 0-shot to 1-shot, the smaller models (e.g. Bloom-7b1, Bloom-3b) slightly closing the gap with the largest one. As the ranges in Table 4 are computed across prompts, we see that different prompts yield markedly different BLEU scores in the 0-shot setup; for 1-shot, we still see variations of 6-8 BLEU points between the best and the worst prompt. Similar analyses performed with post-processing and also for English $\leftrightarrow$ Hindi (Appendix C) confirm that (i) truncation improves scores for all model sizes and prompts and (ii) the choice of a bad prompt can result in catastrophic MT performance as compared to a good one.
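Averages and ranges across prompts are straightforward to compute for one model and setting. The sketch below shows the arithmetic only; the function name `summarise` is hypothetical and the scores are made-up placeholders, not values from Table 4.

```python
from statistics import mean
from typing import Dict, Tuple


def summarise(scores_by_prompt: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (average, minimum, maximum) of the BLEU scores across prompts."""
    values = list(scores_by_prompt.values())
    return mean(values), min(values), max(values)


# Placeholder scores for illustration only (NOT results from the paper).
placeholder = {"prompt_a": 10.0, "prompt_b": 20.0, "prompt_c": 30.0}
average, lowest, highest = summarise(placeholder)
print(f"{average:.1f} ({lowest:.1f}-{highest:.1f})")
```

Reporting the range alongside the average is what makes the sensitivity to prompt choice visible: a narrow range indicates that the model is robust to the prompt, a wide one that a single bad prompt can dominate the picture.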

5.3 Per-prompt analysis

Looking at average WMT results computed with respect to prompt choice (using the prompts in Table 1) allows us to further investigate cross-prompt variability.

This variability is illustrated in Tables 5 and 6, which report performance across prompts for en $\leftrightarrow$ {fr,hi}, averaged over the five Bloom models from Section 5.2. [Footnote 20: For a given prompt, the range mainly reflects the performance of the different sizes of Bloom model.] The corresponding tables for truncated outputs are in Appendix D. The version and a_good_translation (source+target) prompts get the highest average (and maximum) scores. Both prompts are more verbose (instruction-like), but the performance gap in the 1-shot setting between these prompts and the simpler, ‘priming-style’ prompts (e.g. xglm) narrows. The worst results are seen for gpt3. With this prompt, translating into French after a text that only contains English seems particularly difficult: half of the 0-shot translations for gpt3 are classified as non-French by langid (most of them are English). When translating into Hindi, only 10 outputs are detected as being in Hindi.

We compare the two versions (-target and -source+target) of a_good_translation and xglm. Results in Tables 5 and 6 are inconclusive. For these language directions and prompts, we see small differences for 1-shot, which may be due to variance between runs. For 0-shot, it clearly helps xglm to indicate the source language, but for the more verbose a_good_translation, it helps one direction and hurts the other. This question would need to be further explored to draw more solid conclusions, including with non-English prompts.

5.4 Evaluating more language directions

We further explore more language directions in the 1-shot setting using Flores-101. As in Section 5.1, we use the xglm-source+target prompt. [Footnote 21: It behaved well on average in the previous experiments and is one of the least verbose, making it more suitable in a multilingual setting.]

To optimise computational resources, instead of running all language combinations, we concentrate on: (i) high-resource language pairs, (ii) high $\rightarrow$ mid-resource language pairs, (iii) low-resource language pairs and (iv) related languages (specifically Romance languages). Results are shown in Tables 7 and 8 for original outputs, given that overgeneration is less problematic for 1-shot.

The results for high-resource and high $\rightarrow$ mid-resource language directions are generally good, surpassing M2M scores for high-resource, except for es $\rightarrow$ fr. [Footnote 22: French and Spanish, although related and comparably represented in ROOTS, have very different scores. Our preliminary analysis suggests that this is due to the Spanish references being less literal than the French and structurally more different from the original English. See Appendix E for some examples.] This suggests that Bloom has a good multilingual capacity, even across scripts (between (extended) Latin, Chinese, Arabic and Devanagari scripts).

For low-resource languages, the results are more variable; some language directions see better results than M2M, notably most into-English directions, but others are less good (e.g. into Hindi and Swahili). Results for the lowest-resourced languages tested (sw $\leftrightarrow$ yo and en $\leftrightarrow$ yo) are particularly disappointing because the scores indicate that the resulting translations are meaningless, even though Yoruba and Swahili are present (although under-represented) in Bloom’s training data ($<$50k tokens each).

This contrasts with the results between Romance languages, where results are good across the board, including from and into Italian (it) and Galician (gl), which are not officially in the training data. Note that Galician shares many similarities with the other Romance languages, in particular with Portuguese (pt). These contrasted results show that the performance of an LLM depends not only on the amount of training data, but also largely on the similarity with seen languages. To be complete, these analyses should also take into account the possibility of mislabellings in the training data, [Footnote 23: In a personal communication, N. Muennighoff estimates that Italian accounts for $\sim$0.33% of the ROOTS corpus, slightly below the proportion of Hindi texts (0.47%).] which have been found to explain a great deal of cross-lingual abilities of LLMs (Blevins and Zettlemoyer, 2022).

5.4.2 Cross-lingual transfer

1-shot results are positive for many of the language directions tested (including low-resource), provided they are sufficiently represented in the ROOTS corpus. To better understand how cross-lingual Bloom is and how the 1-shot mechanism functions, we vary the language direction of the few-shot examples, taking Bengali $\rightarrow$ English (bn $\rightarrow$ en) translation as our case study. Taking random 1-shot dev set examples, [Footnote 24: The random seed is kept the same for all runs.] we compare the use of 1-shot examples from (i) the same direction (bn $\rightarrow$ en), (ii) the opposite direction (en $\rightarrow$ bn), (iii) a language direction whereby the source languages are related (hi $\rightarrow$ en), (iv) the same related direction but from a different dataset (the WMT dev set), (v) a high-resource direction into the same target language (fr $\rightarrow$ en) and (vi) a high-resource unrelated language direction (fr $\rightarrow$ ar).

The results (Table 9) show that cross-lingual transfer is possible, but using a different language direction can impact overgeneration and translation quality. The unrelated direction fr $\rightarrow$ ar gives the worst results, with most overgeneration (see the score difference between original and truncated), but also the worst quality after truncation, suggesting that language relatedness does play a role. Overgeneration is still a problem (although less so) when using the opposite direction (en $\rightarrow$ bn) or the same target language (fr $\rightarrow$ en). Using a related (higher-resource) source language (hi $\rightarrow$ en) reduces overgeneration and also gives the best MT results. However, better results are seen when using Flores-101 rather than WMT examples, suggesting that in-domain examples are best.

5.5 Use of Linguistic Context

There has been a considerable amount of research on linguistic context in MT, e.g. to disambiguate lexically ambiguous texts or when additional information is necessary for the output to be well-formed (e.g. translating anaphoric pronouns into a language that requires agreement with a coreferent) (Hardmeier, 2012; Libovický and Helcl, 2017; Bawden et al., 2018; Voita et al., 2018; Lopes et al., 2020; Nayak et al., 2022).

We test the usefulness of linguistic context in DiaBLa in the 1-shot setting (again using xglm-source+target) by changing the origin of 1-shot examples: (i) a random example vs. (ii) the previous dialogue utterance. If linguistic context is useful, we would expect there to be an improvement for (ii). We also vary the language direction of the 1-shot example. By default, given that the dataset is bilingual, the direction of 1-shot examples is en $\rightarrow$ fr or fr $\rightarrow$ en, independent of the current example’s direction. Given the results in Section 5.4.2 and the poor 0-shot results in Table 2(a), it is important to account for this to provide a fair comparison. We therefore compare each type of context (random/previous) with (i) the same random directions, and (ii-iii) the same (and opposite) language directions as the current example. We show results for original and truncated outputs.
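The comparison between random and previous-utterance 1-shot examples, with the example's direction either left as is or aligned with (or opposed to) the current one, can be sketched as follows. This is one possible reading of the setup, not the released code: the `Utterance` type, the function names, flipping an example by swapping its two sides, and falling back to no example for the first utterance of a dialogue are all assumptions.

```python
import random
from typing import NamedTuple, Optional, Sequence


class Utterance(NamedTuple):
    src_lang: str
    source: str
    tgt_lang: str
    target: str


def orient(shot: Utterance, current: Utterance, direction: str) -> Utterance:
    """Return the shot as is, or flipped, so that it matches the requested direction."""
    same = shot.src_lang == current.src_lang
    if direction == "any" or (direction == "same") == same:
        return shot
    # Assumption: the dataset is bilingual, so swapping sides reverses the direction.
    return Utterance(shot.tgt_lang, shot.target, shot.src_lang, shot.source)


def pick_shot(dialogue: Sequence[Utterance], index: int, mode: str,
              direction: str = "any",
              rng: Optional[random.Random] = None) -> Optional[Utterance]:
    """Pick the 1-shot example for dialogue[index].

    mode: 'previous' (preceding utterance) or 'random'.
    direction: 'any', 'same' or 'opposite' relative to the current utterance.
    """
    current = dialogue[index]
    if mode == "previous":
        if index == 0:
            return None  # no preceding context available
        shot = dialogue[index - 1]
    else:
        rng = rng or random.Random(0)  # fixed seed for repeatable runs
        shot = rng.choice([u for i, u in enumerate(dialogue) if i != index])
    return orient(shot, current, direction)


dialogue = [
    Utterance("English", "Hello", "French", "Bonjour"),
    Utterance("French", "Ça va ?", "English", "How are you?"),
]
print(pick_shot(dialogue, 1, "previous", direction="same"))
```

Separating the choice of example (random or previous) from its orientation keeps the two experimental factors independent, which is what allows the effect of context to be distinguished from the effect of the example's language direction.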

Results are shown in Table 10. Truncation helps considerably; even for 1-shot, Bloom struggles not to overgenerate and this is considerably reduced when the same rather than the opposite language direction is used for the 1-shot example. It is unclear whether using previous rather than random context helps: BLEU is higher (38.5 vs. 37.6), whereas COMET is lower (0.328 vs. 0.342). These differences could be the result of randomness in 1-shot example selection, and different results could be obtained with a different random seed. Despite these inconclusive results, it is clear that using previous context influences the translation, for better or worse. For evidence of this, see Table 19 in Appendix F, which provides three such examples: (i) an unlucky negative influence on the translation of an ambiguous word glace ‘ice cream or mirror’ from the previous context, resulting in the wrong sense being chosen, (ii) the use of a coreferent instrument ‘instrument’ from the previous sentence and (iii) the correct gender agreement of the pronoun they into French (elles ‘they (fem.)’ as opposed to ils ‘they (masc.)’) to correspond to the feminine coreferent filles ‘girls’.

Conclusion

We have evaluated Bloom’s MT performance across three datasets and multiple language pairs. While there remain problems of overgeneration and generating in the wrong language (particularly for 0-shot MT), MT quality is significantly improved in few-shot settings, closer to state-of-the-art results. Low-resource MT remains challenging for some language pairs, despite the languages being in the training data, questioning what it means to be a Bloom language. However, we see evidence for cross-lingual transfer for non-Bloom languages and when using few-shot examples from other language pairs. Finally, although using linguistic context does not give improvements with automatic metrics, there is evidence that discursive phenomena are taken into account.

Acknowledgements

This work was made possible with the collective efforts of the BigScience community, who designed, developed and prepared the tools and datasets used to train Bloom. Special mention to evaluation working group members and especially to Niklas Muennighoff and Pawan Sasanka Ammanamanchi for producing some of our results.

This work was granted access to the HPC resources of Institut du développement et des ressources en informatique scientifique (IDRIS) du Centre national de la recherche scientifique (CNRS) under the allocations 2021-AD011011717R1, AD011012254R2, 2021-A0101012475 and 2022-AD010614012 made by Grand équipement national de calcul intensif (GENCI). R. Bawden’s participation was partly funded by her chair position in the PRAIRIE institute, funded by the French national agency ANR as part of the “Investissements d’avenir” programme under the reference ANR-19-P3IA-0001, and by her Emergence project, DadaNMT, funded by Sorbonne Université.

References

Appendix A COMET Results for Main Comparison

Table 11 shows the COMET scores for the cross-dataset and model comparison. The conclusions drawn for Table 2 with BLEU scores hold here.

Appendix B Wrong language prediction and over-generation

As described in Section 5.1, one problem identified with Bloom, particularly for 0-shot translation, is generating in the wrong language. Tables 12 and 13 give the full analysis including raw figures for language identification for WMT14 fr $\leftrightarrow$ en and hi $\leftrightarrow$ en translation directions. For 0-5 few-shot examples, we indicate the number of truncated outputs identified as being from each language (indicated by the rows), the correct language (the target) being indicated in green, and the source language (therefore incorrect) being indicated in red. We also provide the average length difference ($\Delta$) between Bloom’s outputs and the reference translations (negative numbers indicate that the prediction is longer than the reference).
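The sign convention for $\Delta$ can be made concrete with a short sketch. The unit of length is not stated here, so counting whitespace-separated tokens is an assumption, as is the function name `mean_length_delta`.

```python
from typing import Sequence


def mean_length_delta(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Average of (reference length - prediction length) over all sentence pairs.

    A negative value means that predictions are, on average, longer than references.
    Lengths are counted in whitespace-separated tokens (an assumption).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one sentence pair is required")
    deltas = [len(ref.split()) - len(pred.split())
              for pred, ref in zip(predictions, references)]
    return sum(deltas) / len(deltas)


# The first prediction overgenerates by two tokens, the second matches in length.
print(mean_length_delta(["a b c d", "a"], ["a b", "a"]))
```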

For 0-shot translation, a significant number of examples are classed as being in the source language for en $\rightarrow$ fr, and even more so for en $\rightarrow$ hi (almost one fifth of the outputs are in the wrong language). As we increase the number of few-shot examples used, both of these problems are significantly reduced, and almost disappear for all language pairs and directions with 5 examples.

Appendix C Analysis per model

In this section, we complete the results of Section 5.2 with Tables 14 and 15, respectively for French $\leftrightarrow$ English and Hindi $\leftrightarrow$ English, reporting results without truncation. As expected, the systems are ranked according to their size. For French–English we see that decent performance can already be obtained with the second largest model Bloom-7b1, using 1-shot. This model, or even a model half this size, can provide a good indication of the performance of prompts, and can be reliably used as a test bed. We obtain less satisfactory results with English $\leftrightarrow$ Hindi, even with the large Bloom; for this language pair, we even observe a large variation across prompts (looking at the range of scores) in the 1-shot setting for all models.

Appendix D Analysis per prompt

In this section, we replicate the analysis of Section 5.3 and report results per prompt with truncated outputs in Tables 16 and 17. The conclusions are overall consistent with what we report for non-truncated outputs in the main text. We note that after truncating the outputs, xglm-source+target yields very good results across the board, outperforming its closest contenders a_good_translation-source+target and version-target in almost all configurations. However, the choice of the prompt seems to matter more (a) in the zero-shot setting and (b) when translating out of English. Conversely, our most stable results are for fr–en, 1-shot.

Appendix E Translation divergences in Flores-101

A striking observation reported in the main text (Section 5.4.1) is the difference between French and Spanish for the Flores-101 experiments. This is unexpected, as both languages are well represented in the training data. Yet, when translating from and into English the difference in spBLEU score is huge; and there is a clear gap with the other Romance languages as well. A related question is the poor translation between French and Spanish, not much better than for French $\rightarrow$ Arabic. Looking at some sample outputs, this seems to be due to the peculiarities of the Spanish translations, which appear to be less literal than their French counterparts, but which yield equally good translations into English. This can be seen when we compare translations back into English for these languages (see a random subset in Table 18). The last example illustrates this very clearly: we see “34 percent” in both the original English and in the translation from French, while translation from Spanish starts with “one third”.

Appendix F DiaBLa context-use examples

Table 19 contains examples where the preceding context in 1-shot examples has a positive, negative or neutral influence on the current prediction, showing that the choice of the 1-shot example is important and is taken into account by the model. Some details of these experiments are found in the accompanying Section 5.5 in the main text.