To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, Arul Menezes

Introduction

Automatic evaluation metrics are commonly used as the main tool for comparing the translation quality of a pair of machine translation (MT) systems (Marie et al., 2021). The decision of which of the two systems is better is often made without the help of human quality evaluation, which can be expensive and time-consuming. However, as we confirm in this paper, metrics badly approximate human judgement (Mathur et al., 2020b), can be affected by specific phenomena (Zhang and Toral, 2019; Graham et al., 2020; Mathur et al., 2020a; Freitag et al., 2021), or ignore the severity of translation errors (Freitag et al., 2021), and thus may mislead system development through incorrect judgements. Therefore, it is important to study the reliability of automatic metrics and follow best practices for the automatic evaluation of systems.

Significant research effort has been applied to evaluate automatic metrics in the past decade, including annual metrics evaluation at the WMT conference and other studies (Callison-Burch et al., 2007; Przybocki et al., 2009; Stanojević et al., 2015; Mathur et al., 2020b). Most research has focused on comparing sentence-level (also known as segment-level) correlations between metric scores and human judgements; or system-level (e.g., scoring an entire test set) correlations of individual system scores with human judgement. Mathur et al. (2020a) emphasize that this scenario is not identical to the common use of metrics, where instead, researchers and practitioners use automatic scores to compare a pair of systems, for example when claiming a new state-of-the-art, evaluating different model architectures, deciding whether to publish results or to deploy new production systems.

The main objective of this study is to find an automatic metric that is best suited for making a pairwise ranking of systems and measure how much we can rely on the metric’s binary verdicts that one MT system is better than the other. We design a new methodology for pairwise system-level evaluation of metrics and use it on – to the best of our knowledge – the largest collection of human judgement of machine translation outputs which we release publicly with this research. We investigate the reliability of metrics across different language pairs, text domains and how statistical tests over automatic metrics can help to increase decision confidence. We examine how the common use of BLEU over the past years has possibly negatively affected research decisions. Lastly, we re-evaluate past findings and put them in perspective with our work. This research evaluates not only the utility of MT metrics in making pairwise comparisons specifically – it also contributes to the general assessment of MT metrics.

Based on our findings, we suggest the following best practices for the use of automatic metrics:

Use a pretrained metric as the main automatic metric; we recommend COMET. Use a string-based metric, for instance ChrF, for unsupported languages and as a secondary metric. Do not use BLEU; it is inferior to other metrics and has been overused.

Run a paired significance test to reduce metric misjudgement by random sampling variation.

Publish your system outputs on public test sets to allow comparison and recalculation of different metric scores.

Data

In this section, we describe the test sets, the process for collecting human assessments, and the MT systems used in our analysis. We publish all human judgements, metadata, calculated metric scores, and the code, to allow replication of our findings and to promote further research. For legal reasons, we cannot release the proprietary test sets, and therefore cannot release the system outputs either. The collection is available at https://github.com/MicrosoftTranslator/ToShipOrNotToShip. Moreover, we plan to evaluate new metrics emerging in the future.

When evaluating our models, we use internal test sets where references are translated by professional translators from monolingual data. Freitag et al. (2020) have demonstrated that the quality of test set references plays an important role in automatic metric quality and correlation with human judgement. To maintain a high quality of our test sets, we create them by a two-step translation process: the first professional translator translates the text manually without post-editing followed by a second independent translator confirming the quality of the translations. The human translators are asked to translate sentences in isolation; however, they see context from other sentences.

The test sets are created from authentic source sentences, mostly drawn from news articles (news domain) or cleaned transcripts of parliamentary discussions (discussion domain). The news domain test sets are used in both directions, where the authentic side is mostly English, Chinese, French, or German. The discussion domain test sets are used in the direction from authentic source to translationese reference, e.g., we have two distinct test sets, one for English to Polish and second for Polish to English. Furthermore, some systems are evaluated using various other test sets.

We evaluate 101 different languages within 232 translation directions. We compare metrics only over the intersection of languages supported by all evaluated metrics, which means that we use only 39 different target languages when Prism is part of the evaluation. The size of the test sets can vary, and more than one test set or its subsets can be used for a single language direction. The average size of our test sets is 1017 sentences. The distribution of evaluated systems is not uniform: some language pairs are evaluated only a few times, while others are evaluated repeatedly with different systems. The majority of the language pairs are English-centric; however, we also evaluate a small set of French-, German-, and Chinese-centric systems (together only 90 system pairs). Details about the system counts of evaluated language pairs and average test set sizes can be found in the Appendix in Table 7.

2.2 Manual quality assessment

Our human evaluation is run periodically to confirm translation quality improvements by human judgements. For this analysis, we use human annotations performed from the middle of 2018 until early 2021. All human judgements were collected with identical settings with the same pool of human annotators. Thus, the human annotations should have similar distributions and characteristics.

The base unit of our human evaluation is called a campaign, in which we commonly compare two to four systems under equal conditions: we randomly draw around 500 sentences from a test set, translate them with each system, and send them to human assessment. Each human annotator annotates 200 sentences on average; thus, a system pair is evaluated by five different annotators (each annotating a distinct set of sentences translated by both systems).

We use source-based Direct Assessment (DA, Graham et al., 2013) for collecting human judgements, where bilingual annotators are asked to rate all translated sentences on a continuous scale from 0 to 100 against the source sentence, without access to reference translations. This eliminates reference bias from human judgement by design.

We use the implementation of DA in the Appraise evaluation framework (Federmann, 2018), the same as is used in WMT since 2016 for out-of-English human evaluation (Bojar et al., 2016a).

We do not use crowd workers as human annotators. Instead, we use paid bilingual speakers who are familiar with the topic and well-qualified in the annotation process. Moreover, we track their performance, and those who fail quality control (Graham et al., 2013) are permanently removed from the pool of annotators, as are their latest annotations. This increases the overall quality of our human quality assessment.

We have two additional constraints in contrast to the original DA. Firstly, each system is compared on the same set of sentences which removes the problem of a system potentially benefitting from an easier set of randomly selected sentences. Moreover, it allows us to use a stronger paired test that compares differences in scoring of equal sentences instead of an unpaired one that evaluates scores of both systems in isolation. We use the Wilcoxon signed-rank test (Wilcoxon, 1946) in contrast to the Mann-Whitney U-test (Mann and Whitney, 1947) originally suggested for DA (Graham et al., 2017). Secondly, each annotator is assigned the same number of sentences for each evaluated system which mitigates bias from different rating strategies as each system is affected evenly by each annotator.
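The paired test mentioned above can be illustrated with a minimal sketch. It assumes two aligned lists of per-sentence human scores (index i refers to the same source sentence for both systems) and uses the normal approximation without tie or continuity corrections; the function name `wilcoxon_signed_rank` is hypothetical and is not taken from the released code. In practice, a tested implementation such as `scipy.stats.wilcoxon` would be preferable.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences share their average rank. This is a rough
    sketch: no tie or continuity correction is applied.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of ranks of the positive differences (system A scored higher).
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, p_value
```

Because the test works on per-sentence differences, it is only applicable when both systems are rated on the same sentences, which is what the first constraint guarantees.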

When calculating the system score, we take the average of human judgements. We do not assume a normal distribution of an annotator’s annotations; therefore, we do not use the z-score transformation. We analyze human judgements for 4380 systems and 2.3 M annotated sentences. This data is one and a half orders of magnitude larger than the data used at WMT Metric Shared Tasks, which evaluate around 170 systems each year (see Section 6).

2.3 Systems

We evaluate competing systems against human judgement. The system pairs could be separated into three groups: (1) model improvements, (2) state-of-the-art evaluation, and (3) comparisons with third-party models. The first group contains system pairs where one system is a strong baseline (usually our highest quality system so far) and the second system is an improved candidate model; this group evaluates stand-alone models without additional pre- and post-processing steps (e.g., rule-based named entity matching). The second group contains pairs of the candidate for the new best performing system and the current best performing system. The third group compares our best-performing model at the time with a publicly available third-party MT system.

Analyzing the variety of systems, hyperparameters, training data, and even architectures is out of the scope of this paper. However, all models are based on neural architectures.

Automatic metrics

In this study, we investigate metrics that were shown to provide promising performance in recent studies (see Section 6), as well as the metrics currently most widely used in the MT field. YiSi, a highly correlating metric (Ma et al., 2019), was not publicly available at the time of our evaluation. We focus on language-agnostic metrics; therefore, we do not include metrics supporting only a small set of languages. The full list of evaluated metrics and their main features is presented in Table 1.

Two categories of automatic machine translation metrics can be distinguished: (1) string-based metrics and (2) metrics using pretrained models. The former compares the coverage of various substrings between the human reference and MT output texts. String-based methods largely depend on the quality of reference translations. However, their advantage is that their performance is predictable as it can be easily diagnosed which substrings affect the score the most. The latter category of pretrained methods consists of metrics that use pretrained neural models to evaluate the quality of MT output texts given the source sentence, the human reference, or both. They are not strictly dependent on the translation quality of the human reference (for example, they can better evaluate synonyms or paraphrases). However, their performance is influenced by the data on which they have been trained. Moreover, the pretrained models introduce a black-box problem where it is difficult to diagnose potential unexpected behavior of the metric, such as various biases learned from training data.

For all metrics, we use the recommended implementation. See Appendix A for implementation details. Most metrics aim to achieve a positive correlation with human assessments, but some error metrics, such as TER, aim for a negative correlation. We simply negate scores of metrics with anticipated negative correlations. Pretrained metrics usually do not support all languages, therefore to ensure comparability, we evaluate metrics on a set of language pairs supported by all metrics.

Evaluation

Most previous works studied the system-level evaluation of MT metrics in an isolated scenario correlating individual systems with human judgements (Callison-Burch et al., 2007; Mathur et al., 2020b). They have mostly employed Pearson’s correlation (see Section 6) as suggested by Macháček and Bojar (2014) and evaluated each language direction separately. However, Mathur et al. (2020a) suggest using a pairwise comparison as a more accurate scenario for the general use of metrics.

As the primary unit, we use the difference in metric (or human) scores between system A and B:

$$\Delta = \text{score}(\text{System A}) - \text{score}(\text{System B})$$

We gather all system pairs from each campaign separately as only systems within a campaign are evaluated under equal conditions. All campaigns compare two, three, or four systems, which results in one, three, or six system pairs, respectively.
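As a small illustration of this pairing step, the sketch below enumerates system pairs within each campaign only. The campaign and system identifiers are made up, and `campaign_pairs` is a hypothetical helper rather than part of the released code.

```python
from itertools import combinations


def campaign_pairs(campaigns):
    """Yield (campaign_id, system_a, system_b) for pairs inside one campaign.

    Systems from different campaigns are never paired, because they were
    not evaluated under equal conditions.
    """
    for campaign_id, systems in campaigns.items():
        for sys_a, sys_b in combinations(sorted(systems), 2):
            yield campaign_id, sys_a, sys_b


# Hypothetical campaigns with two, three, and four systems.
campaigns = {
    "c1": ["s1", "s2"],
    "c2": ["s1", "s2", "s3"],
    "c3": ["s1", "s2", "s3", "s4"],
}
pairs = list(campaign_pairs(campaigns))
pairs_per_campaign = {
    cid: sum(1 for c, _, _ in pairs if c == cid) for cid in campaigns
}
```

Here `pairs_per_campaign` comes out as one, three, and six pairs for the three campaigns, matching the counts stated above.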

To understand the relationship between metrics and absolute human differences, we plot these differences and calculate Pearson’s and Spearman’s correlations in Figure 1. All metrics exhibit a positive correlation with human judgements but differ in behavior. For example, COMET has the smallest deviation, which results in the highest correlation with human judgements. However, when we evaluate into-English and from-English language directions separately, we observe that COMET, Prism, and mainly BLEURT have inconsistent value ranges for different language pairs. A possible explanation for BLEURT is that it is trained on English only. But this does not explain other metrics.

Hence, we cannot assume equal scales for one metric across different language pairs, so we cannot use either Pearson’s or Spearman’s correlation in pairwise metric evaluation. Nonetheless, we provide both correlations in Appendix Table 8 for the complete picture.

4.2 Pairwise system-level metric quality

As standard correlation cannot be used, we investigate a different approach to evaluation. We advocate that the most important aspect of a metric is to make reliable binary pairwise decisions (i.e., which of two systems provides a higher translation quality) without the focus on the magnitude of difference. The value of score difference (e.g., a difference of 2 BLEU) is important mainly to measure the confidence of a ranking decision. Therefore, given the size of our data set, we propose to use accuracy on binary comparisons: which system is better when human rankings are considered gold labels.

We define the accuracy as follows. For each system pair, we calculate the difference of the metric scores (metric $\Delta$) and the difference in average human judgements (human $\Delta$). We calculate accuracy for a given metric as the number of rank agreements between metric and human deltas divided by the total number of comparisons:

$$\text{Accuracy} = \frac{|\operatorname{sign}(\text{metric}\,\Delta) = \operatorname{sign}(\text{human}\,\Delta)|}{|\text{all system pairs}|}$$
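The accuracy described above can be sketched in a few lines. The function name `pairwise_accuracy` is hypothetical. How a delta of exactly zero is handled is an assumption of this sketch: it agrees only with another zero.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(metric_deltas, human_deltas):
    """Share of system pairs where metric and human deltas agree in sign.

    Assumption: a delta of exactly zero only agrees with another zero.
    """
    if len(metric_deltas) != len(human_deltas) or not human_deltas:
        raise ValueError("need two aligned, non-empty lists of deltas")
    agreements = sum(
        sign(m) == sign(h) for m, h in zip(metric_deltas, human_deltas)
    )
    return agreements / len(human_deltas)


# Illustrative values only: the metric disagrees with humans on pair 2.
example_accuracy = pairwise_accuracy(
    [1.2, -0.5, 0.3, -2.0], [3.0, 1.0, 0.5, -4.0]
)
```

In this toy example three of the four pairs agree in sign, so `example_accuracy` is 0.75.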

Assuming human judgements as gold labels, accuracy gets an intrinsic meaning of how “reliable” a given metric is when making pairwise comparisons. On the other hand, accuracy does not take into account that two systems can have comparable quality, and thus the accuracy of a metric can be over-estimated by chance if a small human score difference has the same sign as the difference in a metric score. To overcome this issue, we also calculate accuracy over a subset of system pairs, where we remove system pairs that are deemed to not be different based on Wilcoxon’s signed-rank test over human judgements.

In order to estimate the confidence interval for accuracy, we use the bootstrap method (Efron and Tibshirani, 1994); for more details, see Appendix B. We consider all metrics that fall into the 95% confidence interval of the best performing metric to be comparable. We visualize the clusters of best-performing metrics in our analysis with a grey background of table cells.
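A minimal sketch of a percentile bootstrap for accuracy follows. It assumes that system pairs are resampled with replacement and that the interval is read from the sorted resampled accuracies; the exact procedure in Appendix B may differ. The function name, the number of resamples, and the seed are all hypothetical choices.

```python
import random


def bootstrap_accuracy_ci(agreements, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for pairwise accuracy.

    `agreements` holds one entry per system pair: 1 (or True) if the
    metric agrees with the human ranking, 0 (or False) otherwise.
    """
    if not agreements:
        raise ValueError("need at least one system pair")
    rng = random.Random(seed)
    n = len(agreements)
    accuracies = []
    for _ in range(n_resamples):
        # Resample system pairs with replacement.
        sample = rng.choices(agreements, k=n)
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    lower_idx = int(round(alpha / 2 * n_resamples))
    upper_idx = n_resamples - lower_idx - 1
    return accuracies[lower_idx], accuracies[upper_idx]
```

Under this reading, a metric would be treated as comparable to the best one if its accuracy is not below the lower bound computed for the best metric.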

Results

In this section, we examine all available system pairs and investigate which metric is best suited for making a pairwise comparison.

The results presented in Table 2 show that pretrained methods (except for Prism-src) generally have higher accuracy than string-based methods, which confirms findings from other studies (Ma et al., 2018, 2019; Mathur et al., 2020b). COMET reaches the highest accuracy and therefore is the most suited for ranking system pairs. The runner-up is COMET-src, which is a surprising result because, as a quality estimation metric, it does not use a human reference. This opens possibilities to use monolingual data in machine translation systems evaluation in an effective way. On the other hand, the second reference-less method Prism-src does not reach high accuracy, struggling mainly with into-English translation directions (see Figure 2 in the Appendix). In terms of string-based metrics, the highest accuracy is achieved by ChrF, which makes it a better choice for comparing system pairs than the widely used BLEU.

To minimize the risk of being affected by random flips due to a small human score delta, we also explore the accuracy after removing systems with comparable performance with respect to Wilcoxon’s test over human judgements. We incrementally remove system pairs not significantly different with alpha levels of 0.05, 0.01, and 0.001. As expected, removing pairs of most likely equal-quality systems increases the accuracy; however, no metric reaches 100% accuracy even for a set of strongly different systems with an alpha level of 0.001. This implies that either current metrics cannot fully replace human evaluation or the remaining systems are incorrectly assessed by human annotators. An alpha level of 0.001 could (mis)lead to the conclusion that 0.1% of human judgements are incorrect. However, the alpha level only determines if two systems are different enough and cannot be used to conclude that a human pairwise rank decision is incorrect. Moreover, we observe that the ordering of metrics by accuracy remains the same even after removing system pairs with comparable performance, which implies that accuracy is not negatively affected by non-significantly different system pairs. For this reason, wherever we analyze only subsets of the data, we use systems that are statistically different by human judgement with an alpha level of 0.05.

Ma et al. (2019) have observed that system outliers, i.e., systems easily differentiated from other systems, can inflate Pearson’s correlation values. Moreover, Mathur et al. (2020a) demonstrated that after removing outliers some metrics would actually have a negative correlation with humans. To analyze if outliers might affect our accuracy measurements and the ordering of metrics, we analyze a subset of systems with human judgement p-values between 0.05 and 0.001, i.e., removing system pairs that have equal quality and outlier system pairs that are easily distinguished. From column “Within” in Table 2, we see that the ordering of metrics remains unchanged. This shows that accuracy is not affected by outliers, making it more suitable for metrics evaluation than Pearson’s $\rho$.

5.2 Are metrics reliable for non-English languages and other scenarios?

The superior performance of pretrained metrics raises the question of whether unbalanced annotation data might be responsible; around half of the systems translate into English. Moreover, COMET and BLEURT are fine-tuned on human annotations from WMT on the news domain. This could lead to an unfair advantage when being evaluated w.r.t. human judgements. We double-checked and removed all campaigns containing test sets from WMT 2015 to 2020 from our work and analysis. To shed more light on metrics behavior and robustness, we analyze various subsets, including into and from English translation directions, languages with non-Latin scripts, and non-news domain.

We showed in Section 4.1 that some metrics perform differently for systems translating from and into English. Analyzing this scenario in Table 3 reveals that BLEURT does better (the second best metric) for “into English” translation compared to other metrics. It is surprising that BLEURT has a high accuracy for unseen “from English” pairs which suggests that BLEURT might have learned some kind of string-matching. We also observe in Table 3 gains for Prism for the “from English” directions. The overall ranking of metrics, however, remains similar which confirms that the high accuracy of pretrained methods compared to the string-based ones cannot be attributed to the abundance of system pairs with English as the target.

When investigating language pairs with non-Latin (Arabic, Russian, Chinese, …) or logogram-based scripts (Chinese, Korean and Japanese) as the target languages, we observe a slight drop in metric ranks for some pretrained metrics in contrast to a higher score for ChrF. This indicates that non-Latin scripts might be a challenge for pretrained metrics, but more analysis would be required here. For a summary of individual language pairs, refer to Table 9 in the Appendix.

We also investigate if some pretrained methods might have an unfair advantage due to being fine-tuned on human assessments in the news domain. For this, we analyze a subset of news test sets with target languages that were not part of WMT human evaluation (i.e., languages which those methods have not been fine-tuned on) and call this set “non-WMT”. We also analyze system pairs evaluated on proprietary test sets in the EU parliamentary discussions domain covering ten languages. Neither the results on the non-WMT set nor those on the discussion domain in Table 3 show a change in the ranking of metrics, suggesting that COMET is not overfitted to the WMT news domain or WMT languages. Somewhat surprisingly, we actually see a drop in accuracy for the string-based metrics for the discussion domain. We speculate this might be due to their inability to forgivingly match disfluent utterances to expected fluent translations (Salesky et al., 2019).

Overall, the results for various subsets show a similar ordering of metrics based on their accuracy, confirming the general validity of our results.

5.3 Are statistical tests on automatic metrics worth it?

Mathur et al. (2020a) studied the effects of statistical testing of automatic metrics and observed that even large metric score differences can disagree with human judgement. They have shown that even for a BLEU delta of 3 to 5 points, a quarter of these systems are judged by humans to differ insignificantly in quality or to contradict the verdict of the metric. In our analysis, we have 203 system pairs deemed statistically significant by humans (p-value smaller than 0.05) for which using BLEU results in a flipped ranking compared to humans. The median BLEU difference for these system pairs is 1.3 BLEU points. This is concerning as BLEU differences higher than one or two BLEU points are commonly and historically considered to be reliable by the field.

In this section, we corroborate that statistical significance testing can largely increase the confidence of the MT quality improvement and increase the accuracy of metrics. We compare how accurate a metric would be under two situations: either when not using statistical testing and solely trusting in the metric score difference; or when using statistical testing and throwing away systems that are not statistically different.

We evaluated the first situation in Section 5.1, and the results correspond to the first column of Table 2. For the second situation, we calculate accuracy only over the system pairs that are statistically different. We use paired bootstrap resampling (Koehn, 2004), a non-parametric test, to calculate the statistical significance for a pair of systems. Approximate randomization (Riezler and Maxwell III, 2005) can be used as an alternative test, and for metrics based on the average of sentence-level scores, we can also use tests such as Student’s t-test.
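The idea of paired bootstrap resampling can be sketched as follows. This sketch assumes a metric that is the average of sentence-level scores, so a resampled corpus score is just the sum over the drawn sentences; corpus-level metrics such as BLEU would have to be recomputed on every resample instead. The function name, the number of resamples, and the seed are hypothetical.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Share of bootstrap resamples in which system A outscores system B.

    Both lists must be aligned so that index i refers to the same
    source sentence for both systems.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two aligned, non-empty score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_resamples):
        # Draw sentence indices with replacement; reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins_a += 1
    return wins_a / n_resamples
```

One common reading is that system A is considered significantly better at an alpha level of 0.05 when the returned win rate is at least 0.95.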

Additionally, the second situation introduces type II errors which represent systems where the statistical significance test rejected a system pair as being non-significant, but humans would judge the given pair as significantly different. In other words, it shows how many system pairs are incorrectly rejected as non-significantly different. See Appendix C for a detailed explanation.

From the results in Table 4, we can see that if we apply paired bootstrap resampling on automatic metrics with an alpha level of 0.05, the accuracy increases by around 10% for all metrics in contrast to not using statistical testing. On the other hand, when using statistical testing, we introduce type II errors: for COMET, 17.3% of the system pairs rejected as non-significantly different are deemed significantly different by humans (using Wilcoxon’s test on human judgements with an alpha level of 0.05).
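The two quantities discussed above can be computed together, as in the sketch below. It follows one possible reading of the text: accuracy is measured only over pairs the metric test finds significant, and the type II rate is the share of metric-rejected pairs that humans find significantly different. The exact definition is given in Appendix C and may differ; the function name and the tuple layout are hypothetical.

```python
def accuracy_and_type_two_rate(pairs, alpha=0.05):
    """Accuracy over metric-significant pairs and type II error rate.

    Each pair is (metric_delta, metric_p, human_delta, human_p).
    Assumption: the type II rate uses the metric-rejected pairs as its
    denominator. A delta of zero is treated as non-positive.
    """
    kept = [p for p in pairs if p[1] < alpha]
    rejected = [p for p in pairs if p[1] >= alpha]

    # Sign agreement between metric and human deltas on the kept pairs.
    agree = sum((m > 0) == (h > 0) for m, _, h, _ in kept)
    accuracy = agree / len(kept) if kept else float("nan")

    # Rejected by the metric test, yet significantly different for humans.
    missed = sum(1 for _, _, _, human_p in rejected if human_p < alpha)
    type_two = missed / len(rejected) if rejected else float("nan")
    return accuracy, type_two


# Illustrative values only.
toy_pairs = [
    (1.0, 0.01, 2.0, 0.001),
    (0.5, 0.03, -1.0, 0.02),
    (0.2, 0.40, 0.1, 0.60),
    (-0.1, 0.30, -3.0, 0.01),
]
toy_accuracy, toy_type_two = accuracy_and_type_two_rate(toy_pairs)
```

In the toy data, the metric test keeps two pairs and is right on one of them, and one of the two rejected pairs is significant for humans, so both values are 0.5.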

In conclusion, we corroborate that using statistical significance tests largely increases reliability in automatic metric decisions. We encourage the usage of statistical significance testing, especially in the light of Marie et al. (2021) who show that statistical significance tests are widely ignored.

5.4 Does BLEU sabotage progress in MT?

Freitag et al. (2020) have shown that reference translations with string-based metrics may systematically bias against modeling techniques known to improve human-judged quality and raised the question of whether previous research has incorrectly discarded approaches that improved the quality of MT due to the use of such references and BLEU. They argue that the use of BLEU might have misled many researchers in their decisions.

In this section, we investigate the hypothesis that the usage of BLEU negatively affects model selection. To do so, we compare two groups of system pairs based on whether they could be directly affected by BLEU. The first group contains pairs of incremental improvements of our systems. We can assume that incremental models use similar architecture, data, and settings, although we do not study particular changes. We use BLEU as the main automatic metric to guide model development. If BLEU shows improvements, we evaluate models with human judgements to make a final deployment decision. Therefore, systems with degraded BLEU scores which would be deemed improved by humans are missing in this group, as we reject them based on BLEU scores during development. The second group contains independent system pairs, which use different architectures, data, and settings, and therefore BLEU has not been used to preselect them. In this group, we compare our systems with publicly available third-party MT systems.

We compare three models within the same campaign: two internal systems (the best model from the last year and our latest improved model) and one external system. Thus, the same annotators annotated the same sentences from all three systems under the same conditions. We call comparisons between the two internal models “incremental”, and comparisons between the newer internal model and the external model “independent”.

Over the past three years we carried out 333 campaigns across 17 language pairs (each campaign comparing three models), resulting in almost 530000 human annotations.

The results in Table 5 show that for independent systems, the ranking of the metrics is comparable with results in Table 3. Pretrained metrics generally outperform string-based ones and COMET is in the lead. However, when inspecting the incremental systems, BLEU wins. This indicates that BLEU influenced our model development and we rejected models that would have been preferred by humans.

Another possible explanation is that systems preselected by BLEU are easy to differentiate by all metrics. This could explain why all metrics have high accuracy in contrast to the “Independent” column and most of them are in a single cluster.

In conclusion, results showing BLEU as the metric with the highest accuracy, where we would expect pretrained metrics to dominate, suggest that BLEU affected system development and that we rejected improved models due to the erroneous degradation seen in the BLEU score. However, this is indirect evidence, as for sound conclusions we would need to evaluate those rejected systems with other metrics and human judgement as well.

Meta Analysis

We analyze findings from past research to put our results in a broader context. We focus on the results of system-level evaluation, although a large part of the research studied sentence-level evaluation. The largest source of metrics evaluation is the yearly WMT Metric Shared Task, held for more than ten years (Callison-Burch et al., 2007), where various methods are evaluated against human judgement over the set of submitted systems and language pairs in the WMT News Translation Shared Tasks. Recently, Freitag et al. (2021) reevaluated two translation directions from WMT 2020 with the multidimensional quality metric framework and raised a concern that the general crowd-sourced annotators used in into-English evaluation in WMT prefer literal translations and produce judgements of lower quality than some automatic metrics.

Past studies evaluate system-level correlations with Pearson’s correlation calculated for each translation direction separately. We are interested in how metrics correlate with human judgement in general across different language pairs. Thus, to generalize the past findings, we use the Hunter-Schmidt method (Hunter and Schmidt, 2004), which allows combining already calculated correlations with various sizes. We use it to generalize correlations within each study across all language pairs. For this purpose, Hunter-Schmidt is effectively a weighted mean of the raw correlation coefficients.
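As a minimal sketch of the weighted mean described above, the function below combines raw correlations using sample sizes as weights. Treating the number of systems per language pair as the sample size is an assumption of this sketch; the function name and the example values are hypothetical.

```python
def hunter_schmidt_mean(correlations, sample_sizes):
    """Sample-size-weighted mean of raw correlation coefficients.

    Assumption: the sample size of each correlation is the number of
    systems it was computed over.
    """
    if len(correlations) != len(sample_sizes) or not correlations:
        raise ValueError("need aligned, non-empty lists")
    total = sum(sample_sizes)
    if total <= 0:
        raise ValueError("sample sizes must sum to a positive number")
    weighted = sum(r * n for r, n in zip(correlations, sample_sizes))
    return weighted / total


# Illustrative values only: two language pairs with 10 and 30 systems.
combined_r = hunter_schmidt_mean([0.9, 0.5], [10, 30])
```

With these toy values the combined correlation is (0.9 * 10 + 0.5 * 30) / 40, which is 0.6: the language pair with more systems dominates the result.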

Although past studies evaluated a larger number of methods and their variants, we have selected a subset of metrics that are evaluated in more than one study or showed promising performance over other metrics in a given study. When a study evaluated several variants of a metric with various parameters, we selected the setting closest to either the recommended setting in the recent years, such as SacreBLEU, or a setting that is used in the later evaluation study, mainly in Mathur et al. (2020b).

Meta-analysis in Table 6 shows that pretrained methods outperform string-based methods as concluded by Mathur et al. (2020b); Ma et al. (2019, 2018). The second important observation is that there was not a single year where BLEU had a higher correlation than ChrF. This supports our conclusions and shows that the MT community had results supporting the deprecation of BLEU as a standard metric for several years. Comparing the pretrained methods, ESIM is the best performing method in general (Mathur et al., 2020b), while COMET is the best performing method when removing the suspicious system.

In the study by Mathur et al. (2020b), COMET under-performed other pretrained metrics. We found out that submitted COMET scores failed to score one English-Chinese system with tokenized output. However, we obtain valid COMET scores on that system output when replicating the results. Moreover, we have not seen any problems with COMET on Chinese. As this one system largely skews Pearson’s correlation, we also present analysis without English-Chinese systems in Table 6.

Discussion

We corroborate results from past studies that pretrained methods are superior to string-based ones. However, pretrained methods are relatively new techniques, and significant drawbacks may yet be discovered: for example, they could reflect biases from training data, fail on particular domains, or prefer fluency over adequacy. Another problem could arise if an MT system were trained on the same data as the metric or if it incorporated the same pretrained model, for example, XLM-R (Conneau et al., 2020) used by COMET. Pretrained methods support only a selected set of languages, and the quality can differ for each of them. Thus, we argue that a string-based method should be used as a secondary metric.

An interesting way to mitigate the potential drawbacks of any single metric would be for different research groups to preselect different primary pretrained metrics in advance, use them to guide their research decisions, and thereby discover improvements not apparent under other metrics. However, we fear that this could lead to “metric-hacking”, i.e., picking a metric that confirms the desired results. Therefore, we recommend using COMET as the primary metric and ChrF, the best performing string-based method, as a secondary metric and for unsupported languages.

A surprising result is the high accuracy of COMET-src, a reference-free metric. It allows automatic evaluation over monolingual domain-specific test sets, as suggested by Agrawal et al. (2021).

The limitations of BLEU are well known (Reiter, 2018; Mathur et al., 2020a). Callison-Burch et al. (2006) argued that the MT community is overly reliant on it, which Marie et al. (2021) confirmed by showing that 98.8% of MT papers use BLEU. We present indirect evidence that the over-use of BLEU negatively affects MT development, and we support the deprecation of BLEU as the evaluation standard.

We show that the reliability of metric decisions can be increased with statistical significance tests. However, Dror et al. (2018) point out that the assumption behind statistical significance tests, namely that data samples are independent and adequately distributed, is rarely true. Also, statistical significance tests do not account for random seed variation across training runs. Thus, one should be cautious when drawing conclusions from small metric improvements. Wasserstein et al. (2019) give recommendations for a better use of statistical significance testing.

Marie et al. (2021) have shown that almost 40% of MT papers from 2020 copied scores from different papers without recalculating them, which is a concerning trend. Also, new and better metrics will emerge, and there is no need to adhere permanently to a single metric. Instead, the simplest and most effective way to avoid copying scores or sticking to an obsolete metric is to always publish the translated outputs of test sets along with the paper. This allows anyone to recalculate scores with different tools and/or metrics and makes comparisons with past (and future) research easier.

There are some shortcomings in our analysis. We have only a handful of non-English systems; therefore, we cannot conclude anything about the behaviour of the metrics for language pairs without English. Similarly, the majority of our language pairs are high-resource; therefore, we cannot draw conclusions about the reliability of metrics for low-resource languages. Lastly, many of our translation directions are from translationese into authentic text, which, as Zhang and Toral (2019) showed, is the easier direction for systems to score high under human judgement. These are potential directions for future work.

Lastly, we assume that human judgement is the gold standard. However, we need to keep in mind that the method used for human judgement may have drawbacks, or that human annotators may fail to capture the true assessment, as Freitag et al. (2021) observe. For example, humans cannot explicitly mark critical errors in DA and instead usually assign low assessment scores.

Conclusion

We show that metrics can use a different scale for different languages, so Pearson’s correlation cannot be used. We introduce accuracy as a novel evaluation of metrics in a pairwise system comparison.

We use and release a large collection of human judgements, confirming that pretrained metrics are superior to string-based ones. COMET is the best performing metric in our study, and ChrF is the best performing string-based method. The surprising effectiveness of COMET-src could allow the use of large monolingual test sets for quality estimation.

We do not see any drawbacks of the metrics when investigating various languages or domains, especially for methods pretrained on human judgement. We present indirect evidence that the over-use of BLEU negatively affects MT development.

We show that statistical testing of automatic metrics largely increases the reliability of a pairwise decision based on automatic metric scores.

We endorse the recommendation for publishing translated outputs of research systems to allow comparisons and recalculation of scores in the future.

Acknowledgments

We are grateful to the many researchers who provided feedback on and reviews of the paper, namely: Shuoyang Ding, Markus Freitag, Hieu Hoang, Alon Lavie, Jindřich Libovický, Nitika Mathur, Mathias Müller, Martin Nejedlý, Martin Popel, Matt Post, Qingsong Ma, Richardo Rei, Thibault Sellam, Aleš Tamchyna, anonymous reviewers, and our colleagues.

References

Appendix A Metrics Implementation Details

We use the most common implementation with default or recommended parameters to simulate standard metric usage.

For the BLEU (Papineni et al., 2002), ChrF (Popović, 2015) and TER (Snover et al., 2006) metrics, we use the SacreBLEU implementation https://github.com/mjpost/sacrebleu/ version 1.5.0. We use the “mteval-v13a” tokenizer for all language pairs except Chinese and Japanese, which use their own tokenizers, as recommended.

For CharacTER (Wang et al., 2016), we use https://github.com/rwth-i6/CharacTER commit c4b25cb.

For EED (Stanchev et al., 2019), we use https://github.com/rwth-i6/ExtendedEditDistance commit f944adc.

For BERTScore (Zhang et al., 2020), we use https://github.com/Tiiiger/bert_score version 0.3.7.

For BLEURT (Sellam et al., 2020), we use the recommended model “bleurt-base-128” and the implementation https://github.com/google-research/bleurt version 0.0.1. It is important to mention that BLEURT is fine-tuned for English only. Additionally, we evaluated other variants, and “bleurt-large-512” performed better than the recommended variant; we report it in Table 8.

For COMET (Rei et al., 2020), we use the recommended model “wmt-large-da-estimator-1719”, and for COMET-src we use “wmt-large-qe-estimator-1719”. The implementation is https://github.com/Unbabel/COMET in version 0.0.6. We evaluated all other COMET models, but none performed better than the recommended model.

For Prism and Prism-src (Thompson and Post, 2020), we use https://github.com/thompsonb/prism commit 06f10da.

For ESIM (Mathur et al., 2019), we use https://github.com/nitikam/mteval-in-context.

Appendix B Confidence Interval for Metric Accuracy

To estimate the confidence interval for the best performing metric, we use the bootstrap method (Efron and Tibshirani, 1994). It creates multiple resamples (with replacement) from a set of observations and calculates accuracy on each of these resamples. We employ modified paired bootstrap resampling (Koehn, 2004), a method which we also use for testing statistical significance of the metric difference in Section 5.3. However, the usage is different.

The bootstrap resampling is calculated as follows. First, we denote the best performing metric on all system pairs from the collection as metric $\alpha$. We then create 10 000 resamples by drawing system pairs with replacement from the whole collection. For each resample, we calculate the accuracy of all metrics and note which metrics have equal or higher accuracy than metric $\alpha$ in that resample.

If metric $\alpha$ outperforms metric X in fewer than 95% of the resamples, we conclude that metric X performs on par with the winning metric $\alpha$ at the 95% confidence level.
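The procedure above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: each system pair is represented as a dictionary holding the human score difference and one score difference per metric, accuracy is taken to be the fraction of pairs where the sign of the metric difference matches the sign of the human difference, and the function names, metric names, and toy data are all hypothetical.

```python
import random


def accuracy(pairs, metric):
    """Fraction of system pairs where the metric and humans agree on the better system."""
    agree = sum(1 for p in pairs if (p["human"] > 0) == (p[metric] > 0))
    return agree / len(pairs)


def on_par_metrics(pairs, metrics, n_resamples=10_000, threshold=0.95, seed=0):
    """Return the best metric and the metrics it beats in fewer than `threshold` of resamples."""
    rng = random.Random(seed)
    best = max(metrics, key=lambda m: accuracy(pairs, m))
    # Number of resamples in which each other metric has equal or higher accuracy.
    not_worse = {m: 0 for m in metrics if m != best}
    for _ in range(n_resamples):
        resample = rng.choices(pairs, k=len(pairs))  # draw pairs with replacement
        best_acc = accuracy(resample, best)
        for m in not_worse:
            if accuracy(resample, m) >= best_acc:
                not_worse[m] += 1
    on_par = [m for m, count in not_worse.items()
              if 1 - count / n_resamples < threshold]
    return best, on_par


# Toy data (invented): 40 system pairs with three hypothetical metrics.
pairs = []
for i in range(40):
    human = 1.0 if i % 2 == 0 else -1.0
    pairs.append({
        "human": human,
        "metric_a": human,                         # always agrees with humans
        "metric_b": -human if i == 0 else human,   # disagrees on a single pair
        "metric_c": human if i < 20 else -human,   # disagrees on half of the pairs
    })

best, on_par = on_par_metrics(pairs, ["metric_a", "metric_b", "metric_c"],
                              n_resamples=2000)
print("best metric:", best)
print("on par with the best metric:", on_par)
```

In the toy data, the metric that disagrees with humans on a single pair ties with the winner in a sizeable share of resamples and is therefore reported as on par, whereas the metric that disagrees on half of the pairs is not.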

Appendix C Comparing Statistical Tests

Whether two systems have the same MT quality is still an open question. Applying statistical tests over the metric scores allows us to determine whether the difference in score is significant or due to random chance, given the set of translated sentences and an alpha level. To get the gold truth about system equivalence, we employ Wilcoxon’s test on human judgement with alpha level 0.05. We use the paired bootstrap resampling approach as the statistical test for automatic metrics. Unfortunately, we cannot directly compare the outputs of two statistical tests (for example, the Wilcoxon test on human judgements with the bootstrap resampling on metric scores) because, even with the same alpha level, these tests have different power. Therefore, we need to investigate the statistical test over metrics in isolation.
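For readers unfamiliar with paired bootstrap resampling (Koehn, 2004), the following minimal sketch shows the general idea. It assumes that the system-level score is the sum (equivalently, the mean) of segment-level scores, which does not hold for corpus-level metrics such as BLEU; those would have to be recomputed on every resample. The function name, its parameters, and the toy scores are hypothetical and serve only as an illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, alpha=0.05, seed=0):
    """Paired bootstrap resampling over segment-level scores of two systems.

    Returns (win_rate_a, win_rate_b, significant), where `significant` is True
    if one system wins in at least (1 - alpha) of the resamples.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = wins_b = 0
    for _ in range(n_resamples):
        # Draw the same segment indices for both systems (this is the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins_a += 1
        elif total_b > total_a:
            wins_b += 1
    win_rate_a = wins_a / n_resamples
    win_rate_b = wins_b / n_resamples
    significant = max(win_rate_a, win_rate_b) >= 1 - alpha
    return win_rate_a, win_rate_b, significant


# Toy segment-level scores (invented): system A is better on every segment.
scores_b = [(i % 10) / 10 for i in range(200)]
scores_a = [s + 0.1 for s in scores_b]

result_different = paired_bootstrap(scores_a, scores_b)
result_identical = paired_bootstrap(scores_b, scores_b)
print("A vs. B:", result_different)
print("B vs. B:", result_identical)
```

Drawing the same indices for both systems keeps the comparison paired, so the variation caused by easy or hard segments affects both systems equally within each resample.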

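The null hypothesis in our setting is that both evaluated systems have the same translation quality. There are two possible outcomes of a statistical test: fail to reject the null hypothesis (i.e. MT quality of systems is not significantly different) or reject the null hypothesis (i.e. MT quality of systems is significantly different). When observing outcomes of statistical tests over human judgement and over an automatic metric, we get four possible outcomes. They can be arranged in a two-by-two grid, with the outcome of the metric test in columns (significant on the left, non-significant on the right) and the outcome of the human test in rows (significant at the top, non-significant at the bottom): both tests find a significant difference (truly different, top left); only the human test finds a difference (type II error, top right); only the metric test finds a difference (type I error, bottom left); and neither test finds a difference (bottom right).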

There are two outcomes for the statistical test over a metric that we investigate separately.

In the first scenario, the bootstrap resampling confirms the statistical difference between systems. However, even when both tests agree that systems have statistically different MT quality, it still may happen that humans and metrics disagree on which system is better than the other. The goal is to evaluate how accurate metric decisions are if we employ statistical testing. Therefore, we are interested in the accuracy of a metric over system pairs that are deemed statistically different according to the paired bootstrap resampling, in other words, accuracy for system pairs that are either truly different (top left quadrant) or fall into type I error (bottom left quadrant).

In the second scenario, we want to find out how many system pairs are diagnosed as non-significant even though human judgements would deem them different. For this scenario, we investigate for how many system pairs bootstrap resampling fails to reject the null hypothesis. However, keep in mind that two statistical tests cannot be directly compared because different tests have different power and the type II error will differ based on that.
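Both scenarios reduce to simple counts once each system pair is annotated with the outcome of the two tests. The sketch below is a hypothetical illustration: the `SystemPair` record, its field names, and the toy data are assumptions made for this example and are not part of the released data format.

```python
from dataclasses import dataclass


@dataclass
class SystemPair:
    human_delta: float        # human score of system 1 minus system 2
    metric_delta: float       # metric score of system 1 minus system 2
    human_significant: bool   # the test on human judgement rejects the null hypothesis
    metric_significant: bool  # paired bootstrap on the metric rejects the null hypothesis


def accuracy_on_significant(pairs):
    """Scenario 1: metric accuracy over pairs the metric test deems different."""
    selected = [p for p in pairs if p.metric_significant]
    if not selected:
        return float("nan")
    agree = sum(1 for p in selected if (p.human_delta > 0) == (p.metric_delta > 0))
    return agree / len(selected)


def non_significant_counts(pairs):
    """Scenario 2: pairs where the metric test fails to reject the null hypothesis.

    Returns (all such pairs, those among them that humans deem different).
    """
    not_rejected = [p for p in pairs if not p.metric_significant]
    missed = sum(1 for p in not_rejected if p.human_significant)
    return len(not_rejected), missed


# Toy data (invented) covering all four quadrants.
pairs = [
    SystemPair(2.0, 0.5, True, True),     # truly different, metric agrees with humans
    SystemPair(1.5, -0.3, True, True),    # truly different, metric picks the other system
    SystemPair(-0.2, -0.4, False, True),  # type I error of the metric test
    SystemPair(1.0, 0.1, True, False),    # type II error of the metric test
    SystemPair(-0.1, 0.05, False, False), # neither test finds a difference
]

acc = accuracy_on_significant(pairs)
n_not_rejected, n_missed = non_significant_counts(pairs)
print(f"accuracy on metric-significant pairs: {acc:.3f}")
print(f"pairs not rejected by the metric test: {n_not_rejected}, "
      f"of which humans deem different: {n_missed}")
```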