In-context Examples Selection for Machine Translation

Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, Marjan Ghazvininejad

Introduction

In-context learning Brown et al. (2020) has recently received a lot of attention from the NLP research community due to its remarkable ability to utilize only a few input-output examples to perform many NLP tasks Liu et al. (2021). For example, Lin et al. (2021) demonstrate that a 7.5B multilingual generative model, XGLM, outperforms a supervised sequence-to-sequence baseline in 45 translation directions on the FLORES-101 machine translation benchmark Goyal et al. (2022) using just $32$ randomly sampled translation examples as demonstrations. While these results are compelling, recent work has also shown that the performance and capability of a pre-trained language model (PLM) can be highly sensitive to many factors, such as the choice of in-context examples Liu et al. (2022b), their ordering Lu et al. (2022) and the template Jiang et al. (2020).

Typically, in-context learning for MT uses examples that are randomly sampled from a small development set that resembles the domain of the test dataset. The effect of the aforementioned factors (such as the choice of the examples) on the translation quality of the PLM hence remains unclear and unexplored. Yet another crucial gap in using in-context learning for MT in the current literature is the effect of the domain of in-context examples on translation quality since out-of-domain generalization is a known and important challenge in MT Koehn and Knowles (2017).

In this work, we systematically analyze how factors such as the choice and the number of few-shot in-context examples and their ordering impact MT output quality. We show that while a noisy or unrelated 1-shot example can have a significantly adverse effect on translation quality, a single prompt optimized to maximize the translation quality on a development set can sufficiently elicit task-based information from the PLM. Our analysis thus demonstrates the importance of selecting good examples for MT and raises the question: What are the properties of good in-context examples for MT? In that direction, our findings suggest that a well-formed, meaning-equivalent translation example results in higher-quality translation than randomly selected in-context examples.

Furthermore, motivated by the use of Translation Memory in Computer-Aided Translation Yamada (2011) and its usage in computational approaches to Machine Translation (Somers, 1999; Koehn and Senellart, 2010; Khandelwal et al., 2020, inter alia), we use BM25, an efficient unsupervised retriever, to retrieve examples similar to the test source from a datastore of source texts paired with their translations, providing additional context to the model. As the context window of the PLM is usually limited ( $\sim$ 3096 tokens, 16-20 examples), we propose a novel in-context example selection and re-ranking strategy that maximizes the coverage of the source n-grams in the selected examples. Experiments on WMT’19 English $\leftrightarrow$ German and English $\leftrightarrow$ Russian datasets show that our proposed re-ranking strategy consistently improves translation quality over the outputs generated using BM25-retrieved examples.

Combining an optimized 1-shot task-level prompt with example-specific in-context examples using a simple concatenation strategy further improves translation quality, outperforming state-of-the-art inference-adapted nearest-neighbor MT models (kNN-MT) on two out-of-domain datasets (Medical and IT) while being memory and compute efficient, as our approach does not require constructing and querying a dense token-level datastore.

Background: In-context Learning

Generating translations from large-scale multilingual language models like mGPT Shliazhko et al. (2022), XGLM Lin et al. (2021) or AlexaTM 20B Soltan et al. (2022) requires conditioning the decoder-only language model with in-context parallel examples. These examples serve two purposes: a) providing the model with the format and knowledge of the task (task-level) and b) guiding the output generation via providing useful information about the unseen source sentence (example-specific). This is different from the standard sequence-to-sequence models, where the task is always known, and the model learns generalizable patterns from the input-output examples to perform the task (in this case, translation) for the unseen source text.

Formally, given $k$ in-context examples $\{(x_{i},y_{i})\}_{1}^{k}$, the prefix input or prompt, $x_{j}^{p}$, is generated by concatenating the demonstration examples to the test input, $x_{j}^{s}$, according to a template, $P$ (see Table 1). The output, $\hat{y}$, is then generated by the PLM with parameters $\theta$ via greedy decoding as follows:
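The equation is not reproduced in this text. A form consistent with the definitions above, given here as an assumption, first builds the prompt with the template and then picks the most probable token at each step $t$ (the step index $t$ and the notation $\hat{y}_{<t}$ for the previously generated tokens are introduced only for this sketch):

$$x_{j}^{p}=P\big(\{(x_{i},y_{i})\}_{1}^{k},\,x_{j}^{s}\big),\qquad \hat{y}_{t}=\arg\max_{y}\;p_{\theta}\big(y\mid x_{j}^{p},\hat{y}_{<t}\big)$$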

Prompt Selection

Ideally, good in-context examples can trigger the pre-trained language model to generate the desired output and also elicit the information learned during pre-training Jiang et al. (2020). Min et al. (2022) show that, for classification tasks, the in-context examples provide information about the task (the distribution of the input text, the label space, and the format of the task) and that the model does not rely on these examples to generate the final output. However, their analysis is limited to a) classification tasks and b) randomly sampled in-context examples. Prior work has also shown that the order of these in-context examples can lead to high variance in downstream performance Zhang et al. (2022). However, less is understood about how these factors impact text generation tasks like MT. Do we need multiple in-context examples? What makes good in-context examples for MT? How sensitive is the model to the order of the prompts?

In this work, we aim to better understand the impact of prompt selection on the translation quality of the outputs. Given a training dataset consisting of $n$ parallel examples $D=\{(x_{i},y_{i})\}_{i=1}^{n}$, and a test source $x_{j}$, we select a subset of $m$ informative samples to form a prompt that provides task-level and/or example-specific information, as discussed below.

A good task-level in-context example should be able to elicit information learned during pretraining from the PLM. One way to measure the efficacy of an example as a prompt is to compute the translation quality of the outputs generated when prompting the PLM with that example. Hence, we select the task-level prompt as follows: For a given example sampled from the training dataset, $(x_{i},y_{i})\in D^{S}$, we create a prompt, $x_{i}^{p}$, by concatenating the example $\{(x_{i},y_{i})\}$ to each source in the development set. The system outputs are then generated using equation 1. We then rank examples from $D^{S}$ as task-level prompts based on the Bleu of the generated outputs against the references on this held-out development set, $D^{dev}=\{X,Y\}$ :
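The selection amounts to taking, over the candidates in $D^{S}$, the example whose 1-shot prompt yields the highest development-set Bleu. A minimal sketch of that ranking loop follows. The function name `rank_task_level_prompts` and the two injected callables are hypothetical: `translate(example, source)` stands in for prompting the PLM with one example, and `score(hypotheses, references)` stands in for a corpus-level metric such as Bleu.

```python
def rank_task_level_prompts(pool, dev_sources, dev_refs, translate, score):
    """Rank candidate 1-shot examples by the dev-set quality they induce.

    pool        : list of (source, target) candidate examples
    dev_sources : list of development-set source sentences
    dev_refs    : list of reference translations, aligned with dev_sources
    translate   : hypothetical callable (example, source) -> hypothesis
    score       : hypothetical callable (hypotheses, references) -> float
    """
    ranked = []
    for example in pool:
        # Use the same single example as the prompt for every dev source.
        hypotheses = [translate(example, src) for src in dev_sources]
        ranked.append((score(hypotheses, dev_refs), example))
    # Highest-scoring example first; ties keep their pool order.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

The top entry of the returned list plays the role of the task-level prompt for $p=1$; taking the first $p$ entries gives the top-$p$ variant. Note that the cost is one full development-set decoding pass per candidate.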

Example-specific In-context Examples

Prior work on retrieving good in-context example-specific prompts for tasks other than MT (like question answering or knowledge retrieval) either trains a dense-retriever Rubin et al. (2021) or utilizes samples that are closer to the test source in the embedding space of a PLM like BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), or XLNet models Liu et al. (2022b). While contextual models can generate a global sentence representation, they overlook rare lexicons which can be important for generating translations in unseen domains like medical or IT Wrzalik and Krechel (2021).

However, for MT, overlapping n-grams between the source and the retrieved sentences ensure informativeness, as the target associated with the retrieved sentence is likely to include partial translations of the source. We can thus use BM25 as an efficient unsupervised retrieval method to retrieve similar examples. Yet, as the examples are scored independently and BM25 favors rare word matches Robertson et al. (2009), the top retrieved candidates might not cover all the terms in the source text (Figure 1). Given that the context window of the PLM is usually limited ( $\sim$ 3096 tokens, 16-20 examples), maximizing the coverage of all the terms found in the test input might be favorable. Hence, we propose to re-rank the top $100$ candidates retrieved from BM25 using our algorithm outlined in Algorithm 1.
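To make the retrieval step concrete, the following is a minimal sketch of standard BM25 scoring over whitespace-tokenized source sentences. The function name `bm25_rank`, the defaults `k1=1.2` and `b=0.75`, and the whitespace tokenization are assumptions for illustration; the text does not say which BM25 implementation or settings were used.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return indices of docs sorted by BM25 score for query (best first).

    docs must be a non-empty list of strings (datastore source sentences).
    """
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    query_terms = set(query.split())  # each distinct query term counted once
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger IDF, so rare-word matches dominate.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

Because each document is scored independently of the others, several top-ranked candidates can all match the same rare term, which is the redundancy that the re-ranking step is meant to address.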

We extract all the word n-grams and their counts from the test source, $x_{j}^{s}$, and from the sources of the BM25 retrieved examples, $\{P_{j}(x_{i})\}_{1}^{k}$ (Algorithm 1). Let S and Q denote the set of the source n-grams and the n-grams from a BM25 retrieved example, respectively. We compute a recall-based (R) n-gram overlap score (Algorithm 1) using the following equation:
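The equation is not reproduced in this text. One form consistent with the description, given here as an assumption, computes a clipped recall for each n-gram order $n$ and combines the orders with a Bleu-style geometric mean:

$$R_{n}=\frac{\sum_{g\in S_{n}\cap Q_{n}}\min\big(\mathrm{Count}_{S}(g),\mathrm{Count}_{Q}(g)\big)}{\sum_{g\in S_{n}}\mathrm{Count}_{S}(g)},\qquad \mathrm{Score}=\exp\Big(\frac{1}{N}\sum_{n=1}^{N}\log R_{n}\Big)$$

Here $S_{n}$ and $Q_{n}$ (the n-grams of order $n$ in S and Q), the maximum order $N$, and the clipping of matched counts are notation and choices introduced for this sketch, not details confirmed by the text.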

The example with the maximum score is then added to the set of selected prompts, and the found n-grams from the test source are then down-weighted by a factor, $\lambda$, for the next iteration of selection (Algorithm 1). For example, setting $\lambda=0$ will, in the subsequent iteration, select the example that covers the n-grams from the test source that have not already been encountered. This process is then repeated over the retrieved pool until a set threshold of the score is reached.
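The greedy loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `ngram_counts`, `recall_score`, and `rerank` are hypothetical, tokenization is by whitespace, n-grams up to order 4 are used, per-order recalls are combined with an arithmetic mean (so that a candidate lacking matches at some order is not zeroed out), and the stopping threshold is expressed on a 0-1 scale.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all word n-grams of order 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def recall_score(weights, totals, cand_counts, max_n):
    """Mean over n-gram orders of (matched source weight / original source count)."""
    recalls = []
    for n in range(1, max_n + 1):
        if totals.get(n, 0) == 0:
            continue  # the source is too short to have n-grams of this order
        matched = sum(min(w, cand_counts[g])
                      for g, w in weights.items() if len(g) == n)
        recalls.append(matched / totals[n])
    return sum(recalls) / len(recalls) if recalls else 0.0


def rerank(source, candidates, lam=0.1, threshold=0.01, max_n=4, max_examples=16):
    """Greedily pick (src, tgt) candidates that cover not-yet-covered source n-grams."""
    src_counts = ngram_counts(source.split(), max_n)
    weights = {g: float(c) for g, c in src_counts.items()}
    totals = Counter()
    for g, c in src_counts.items():
        totals[len(g)] += c
    cand_counts = [ngram_counts(src.split(), max_n) for src, _ in candidates]

    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < max_examples:
        scores = {i: recall_score(weights, totals, cand_counts[i], max_n)
                  for i in remaining}
        best = max(remaining, key=lambda i: scores[i])  # ties keep retrieval order
        if scores[best] < threshold:
            break  # nothing left adds enough new coverage
        selected.append(candidates[best])
        remaining.remove(best)
        # Down-weight the source n-grams that the chosen example already covers.
        for g in cand_counts[best]:
            if g in weights:
                weights[g] *= lam
    return selected
```

With `lam=0.0`, an example that only repeats already covered n-grams scores zero and is never selected, so the loop stops as soon as no candidate adds new coverage. A small positive value still admits partially redundant examples, at a reduced score.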

Figure 1 shows the top-100 candidates retrieved via BM25 for the input: “Welche Risiken sind mit Poulvac FluFend H5N3 RG verbunden?”. The top few candidates provide the same information to the PLM, i.e., translation of the phrase “Poulvac FluFend H5N3 RG”. The examples including the other terms (“Welche Risiken sind mit verbunden ?”) from the input text, are ranked lower. On the other hand, our proposed re-ranking strategy can cover all the terms from the input text, in this case, with just the top-2 examples.

Evaluation Settings

We perform our in-domain evaluation on the WMT-19 German (de) $\Leftrightarrow$ English (en) and WMT-19 Russian (ru) $\Leftrightarrow$ English (en) datasets Barrault et al. (2019). For the out-of-domain evaluation, we use the multi-domain dataset from Aharoni and Goldberg (2020) for the following domains: Medical, Law, IT, and Koran. The dataset statistics are included in the Appendix (Table 11). We normalize punctuation using the Moses toolkit Koehn et al. (2007) and remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5 from the in-domain datasets. We evaluate the detokenized, length-truncated outputs generated by the model using sacreBLEU Post (2018) (https://github.com/mjpost/sacrebleu). We also report Comet Rei et al. (2020) scores for evaluating translation quality in Appendix Tables 15 and 16. The generated outputs from the PLM are truncated to twice the source length, as preliminary analysis suggested degeneration in a few ( $\sim$ 10-20) examples.
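The filtering and truncation rules above can be sketched in a few lines. The function names `keep_pair` and `truncate_output` are hypothetical, lengths are counted in whitespace tokens, and the length ratio is applied symmetrically (longer side over shorter side); all three are assumptions, since the text does not specify the tokenization or the direction of the ratio.

```python
def keep_pair(src, tgt, max_len=250, max_ratio=1.5):
    """Return True if a sentence pair passes the length and ratio filters."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return False  # drop empty sides to avoid dividing by zero
    if s > max_len or t > max_len:
        return False
    # Symmetric ratio: longer side divided by shorter side (an assumption).
    return max(s, t) / min(s, t) <= max_ratio


def truncate_output(src, hyp, factor=2):
    """Cut a generated hypothesis to at most factor * source length tokens."""
    limit = factor * len(src.split())
    return " ".join(hyp.split()[:limit])
```

Truncating to a multiple of the source length is a cheap guard against degenerate, repetitive generations; it leaves outputs of ordinary length unchanged.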

Experimental Conditions

We use the publicly available checkpoint of $\text{XGLM}_{7.5B}$, a decoder-only multilingual causal language model Lin et al. (2021), for all our experiments. The model has 32 layers and a hidden dimension of 4096.

Baselines and Comparisons

Random: $p$ random few-shot examples sampled from the training dataset (number of trials=3).

Task-level: top- $p$ examples that achieve the highest Bleu on the development set (§ 3.1).

Retrieved In-context (BM25): up to $q_{max}$ examples retrieved via BM25. Unlike task-level examples, there is no guarantee that exactly $q$ similar examples will be found in the training dataset for each input.

Retrieved Re-ranked In-context (R-BM25): $q_{max}$ re-ranked examples using our proposed approach as detailed in § 3.2.

We additionally compare our results with kNN-MT Khandelwal et al. (2020) for out-of-domain evaluation. In all our experiments, we use $\lambda=0.1$ and threshold= $1.0$, and order the examples according to their similarity to the source, with the most similar examples on the left, based on an initial hyperparameter search on the development dataset (Appendix Tables 12 and 13).

Results

Tables 2 and 3 summarize our main results for the in-domain evaluation when translating between English $\leftrightarrow$ German and English $\leftrightarrow$ Russian and for the four German-English out-of-domain datasets, respectively.

Our experiments suggest that, when translating into English, it is possible to elicit task-level knowledge from the large-scale language model using a single prompt instead of 16 random few-shot examples (Table 2). Using a single task-level prompt (optimized on the development set) improves Bleu over using $16$ random few-shot examples for 2 out of 4 translation directions (De-En, Ru-En). We hypothesize that when translating out of English, the model still benefits from exposure to multiple, diverse random few-shot examples, as the target language model is relatively weaker.

Multiple example-specific prompts are required to improve translation quality over a single task-level prompt.

Using a single task-level ( $p=1$ ) prompt attains higher Bleu than using a single example-specific prompt ( $q=1$ ; BM25, R-BM25) across the board. By contrast, using up to $16$ BM25 prompts ( $q_{max}=16$ ) significantly improves output quality over using task-level prompts, with an average gain of $1.41$ in Bleu.

Re-ranking BM25 retrieved examples improves Bleu.

Our proposed re-ranking strategy consistently improves Bleu across the board over BM25 for both values of $q_{max}=\{1,16\}$, showing that both the order and the choice of the in-context examples matter.

Task-level and R-BM25 examples provide complementary advantages, as combining them using a simple concatenation strategy improves output quality over using either task-level or R-BM25 examples alone. We leave the exploration of optimizing the number and the joint order of task-level and example-specific prompts to future work.

Out-of-domain Evaluation

As XGLM is trained on monolingual Common Crawl snapshots, translation in any domain and language could be considered an out-of-domain task from the model’s perspective. However, we hypothesize that translation in specific domains like medical, law, or IT could still be challenging for the PLM, as the model is less likely to have observed sufficient monolingual data for these specialized domains, in contrast to the news text found in WMT. Examples from these domains might require translating rare terminology and carry domain-specific idiosyncrasies, which are known to pose a challenge even for a well-trained supervised neural MT model Koehn and Knowles (2017). Hence, we also evaluate the PLM under these specialized out-of-domain scenarios.

Task-level in-context examples drawn from the domain of evaluation, i.e., domain-specific examples, obtain, on average, higher Bleu scores across the board than examples from a distant WMT corpus, as expected (Table 3), in both 1-shot ( $p=1$ : +1.4) and 16-shot ( $p=16$ : +2.7) settings.

Example-specific prompts significantly improve translation quality over task-level prompts.

Unlike the in-domain evaluation, retrieved and re-ranked example-specific prompts (R-BM25) improve the translation quality significantly across the board with up to 23 Bleu gain in the Law domain using just a single example as a prompt over a task-level prompt. This can be attributed to the high lexical overlap in the examples retrieved from the training data for these domains (Table 8).

Task-level and R-BM25 prompts are complementary.

Both task-level and R-BM25 prompts provide supporting information for a given test source sentence, as concatenating these sets of prompts improves output quality over using either method independently, outperforming a strong kNN-MT baseline on 2 out of 4 domains (Medical and IT) without requiring access to a strong base MT model or token-level retrieval during inference. Our manual analysis suggests that the higher gain obtained in the IT domain ( $+0.86$ ) when using both task-level and example-specific prompts can be explained by the observation that for $100$ test source sentences, there are no training examples with any lexical overlap with the test source. For these inputs, the task-level prompt can still elicit learned information from the PLM, compared with using no examples.

Analysis

We show the distribution of output quality as measured by Bleu when using 100 different examples as prompts in Figure 2. Across all four language pairs, there is a large variation in Bleu scores (up to 20 Bleu), where noisy or unrelated prompts can lead to significantly worse output quality. Given that most existing parallel corpora are web-crawled and the quality of bitext can vary significantly across different language pairs Kreutzer et al. (2022), randomly sampled examples can underestimate the translation quality attainable by prompting these PLMs.

Impact of Pool Size on Task-level Prompt Selection

We select the best task-level prompt based on the translation quality on the development set from a random sample of 100 examples (pool) as detailed in Section 3.1. However, one concern regarding the selection of the best task-level prompt in this fashion could be that we might still be underestimating the PLM's performance, as a larger pool size could result in better output quality. We study the impact of using a larger pool size in Table 5, where increasing the number of examples from $100$ to $1000$ leads to a gain of only 0.5 points in the maximum Bleu. From the same table, we can also observe that for any subset of $100$ random few-shot examples, we can extract a task-level prompt (Bleu: 36) with a small standard deviation in overall output quality ( $0.18$ ).

Translation direction

Figure 3 shows the correlation between output quality in forward ( $x\rightarrow y$ ) and reverse ( $y\rightarrow x$ ) translation directions when using 1-shot prompts. There is a moderate to high correlation in output quality for both language pairs, suggesting that the best and worst 1-shot prompts in one direction exhibit similar behavior in the opposite translation direction.

Properties of good Task-level prompts

Our manual analysis of the best task-level prompts suggests that any well-formed and meaning-equivalent translation Vyas et al. (2018); Briakou and Carpuat (2020) could make a good task-level prompt (see examples in Appendix Table 14). To quantify the meaning equivalence of the 1-best task-level prompt against random 1-shot examples, we report in Table 4 the percentage of aligned words between the source and reference translation (“% Aligned words”) using fastAlign Dyer et al. (2013) and the probability of generating the reference translation conditioned on the source using a pre-trained multilingual NMT model, Prism-src Thompson and Post (2020); Agrawal et al. (2021) (https://github.com/clab/fast_align, https://github.com/thompsonb/prism). Across all language pairs and both metrics, task-level examples achieve higher semantic similarity scores than random 1-shot examples, suggesting that task-level examples are relatively more equivalent in meaning than random examples.

Impact of Noise

Impact of Ordering

To further explore the sensitivity of MT quality to the order of few-shot prompts, we use all possible order permutations ( $4!=24$ per set) of four randomly sampled examples and of the top four task-level examples as prompts and report the translation quality as measured by Bleu in Table 7. Task-level prompts are less sensitive to prompt order, as suggested by the lower standard deviation achieved in all settings, and result in higher translation quality than randomly selected examples. Across the three different runs of randomly sampled examples, there is a significant difference in Bleu, further corroborating that the choice of in-context examples and their ordering matters.
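The permutation experiment can be sketched as a small helper that scores every ordering of a set of examples and summarizes the spread. The name `order_sensitivity` and the injected `evaluate` callable (standing in for "prompt the PLM with this ordering and compute Bleu") are hypothetical, and the use of the population standard deviation is an assumption.

```python
from itertools import permutations
from statistics import mean, pstdev


def order_sensitivity(examples, evaluate):
    """Score every ordering of examples and return (count, mean, std).

    evaluate is a hypothetical callable: ordered list of examples -> float.
    """
    scores = [evaluate(list(order)) for order in permutations(examples)]
    return len(scores), mean(scores), pstdev(scores)
```

For four examples this evaluates 24 orderings, so each call costs 24 decoding passes over the evaluation set. A lower standard deviation indicates a set of prompts that is less sensitive to ordering.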

Informativeness of Example-specific Prompts

To understand the benefit of retrieved examples in the out-of-domain evaluation, we measure the lexical overlap between the test input ( $x$ , $y$ ) and the prompts ( $I_{x},I_{y}$ ) using Bleu (Avg. Bleu ( $I_{x}$ , x), Avg. Bleu ( $I_{y}$ , y)), where $I_{x}$ and $I_{y}$ are the sources and target translations of the retrieved in-context examples. We also report the correlation against the translation quality $\textsc{Bleu}(\hat{y},y)$ . Table 8 shows that the source lexical overlap is a good indicator of the informativeness of a prompt for 3 out of 4 domains, with Koran as an exception. For Koran, while the retrieved sentences have a high overlap with the source (36.03), the target associated with the prompts ( $I_{y}$ ) does not get high Bleu with the reference (10.36) compared to other domains. We hypothesize that this might be due to a bias in the reference translations towards a particular output style. We provide an analysis of the impact of this phenomenon on MT quality in Section 7.

Output Analysis

We report two interesting findings when prompting the PLM with task-level and example-specific prompts:

One advantage of using a single task-level in-context example to prompt the PLM is that it allows us to systematically study how the choice of prompt influences the style of the generated translation. Table 9 illustrates one such example: we can observe that as the prompt includes a contraction (“we are” vs. “we’re”), the outputs generated by the PLM also include contractions and can be incorrectly penalized by Bleu while being meaning equivalent.

Template-based MT

Template-based translation in the medical, legal, IT, or e-commerce domains can be preferable, as it reduces the risk of errors in automatically generated translations. We present some examples in Table 10 of how the PLM can seamlessly use retrieved prompts to synthesize a translation from the template provided.

Size of the Datastore

Figure 4 shows Bleu when varying the size of the datastore used to retrieve similar in-context examples using BM25 on the Medical dataset. As the size of the datastore increases, the likelihood of retrieving a more similar example increases. However, similar output quality in Bleu can be achieved by using multiple in-context examples when a smaller in-domain datastore is available as multiple examples can provide better coverage of the source terms — Bleu @q=16 with a datastore size of 100k is equivalent to Bleu @q=1 with twice as many examples (200k).

Related Work

Garcia and Firat (2022) use natural language prompts (e.g. Translate to {language_name}: {text}) to control the target language in multilingual MT and investigate the impact of scale, number of languages, and their similarity on this phenomenon. Wang et al. (2022) utilize BM25 retrieved training examples in a supervised fashion to learn from similar examples during training. In contrast to prior work, we utilize similar examples to form a textual prompt that guides the generation of the translation output during inference, and we systematically study the properties of good in-context examples for MT.

Domain Adaptation for MT

Prior work on domain adaptation for machine translation uses out-of-domain bilingual or monolingual datasets to improve the translation quality of a pre-trained neural sequence-to-sequence MT model either during training Luong and Manning (2015); Freitag and Al-Onaizan (2016); Wang et al. (2017) or inference Zheng et al. (2021); Khandelwal et al. (2020). Similar to past work, our work utilizes out-of-domain datasets during inference to adapt a pre-trained generative language model to improve the translation quality on unseen domains. However, our approach does not rely on creating a domain-specific token-level datastore, but directly uses similar examples to provide additional context, hence is more compute and memory efficient.

Prompt Selection

The importance of selecting good in-context examples and their impact on downstream NLP task performance has been studied in prior work Liu et al. (2022b); Lu et al. (2022); Jiang et al. (2020); Min et al. (2022); Zemlyanskiy et al. (2022); Rubin et al. (2021); Liu et al. (2022a). However, how these examples and their properties impact MT quality remains unexplored, which we investigate in our work.

Conclusion

We investigate the selection of in-context examples for MT in both in-domain and out-of-domain settings. We propose a novel recall-based re-ranking approach to utilize similar training examples as prompts and show their efficacy across multiple datasets and domains. Our findings show that task-level prompts can provide a complementary advantage to example-specific prompts, outperforming a strong kNN-MT baseline in 2 out of 4 out-of-domain datasets while being memory and compute efficient. Our manual analysis of the generated outputs reveals that the PLM can mimic the style of the in-context examples provided and can be used for template-based translation synthesis. These results open space for future research to evaluate the potential of generating diverse and style-specific outputs for MT.

References

Appendix A Statistics of Datasets

Table 11 includes statistics of training, development and test sets used for the experiments discussed in the paper.

Appendix B Compute Infrastructure & Run time

Each experiment is run on a single machine with an Nvidia Tesla V100 (Volta) GPU and 32 GB of RAM. A single inference experiment on $2000$ test examples using XGLM with $16$ in-context examples takes around 3-4 hours to complete.

Appendix C Results using Second Metric: Comet

We report translation quality using Comet Rei et al. (2020) in Tables 15 and 16. We use the eamt22-cometinho-da model Rei et al. (2022) to generate the scores as it was shown to achieve higher correlations with human judgments than lexical overlap metrics while being computationally efficient. Our re-ranking strategy (with $q_{max}=16$ ) consistently performs the best across the board except for Koran, outperforming strong kNN-MT baselines on the multi-domain test set in 3 out of 4 settings. Adding a task-level prompt to 16 R-BM25 prompts via concatenation further improves quality in 5 out of 8 settings.

Appendix D Hyperparameter Search

We report the Bleu when using two different orderings of example-specific prompts on the development set for the medical domain. Ordering the examples with the most similar examples on the left attains higher Bleu than the right-to-left order. We note that the trend could vary depending on the noise in the training dataset, the degree of similarity, and the number of retrieved examples. We leave the exploration of the ordering of example-specific prompts to future work.

D.2 Choice of $\lambda$, Threshold

Table 13 shows the Bleu and the average number of in-context examples selected when varying $\lambda$ and the threshold described in Section 3.2. We select $\lambda=0.1$ and threshold value of 1.0 as it achieves the best Bleu on the Medical development set as shown below:

Appendix E Example Task-Level Prompts

Table 14 shows the best task-level in-context example selected by our method described in § 3.1 and the respective Bleu scores on the development set for the German-English and Russian-English tasks.