COMET: A Neural Framework for MT Evaluation

Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie

Introduction

Historically, metrics for evaluating the quality of machine translation (MT) have relied on assessing the similarity between an MT-generated hypothesis and a human-generated reference translation in the target language. Traditional metrics have focused on basic, lexical-level features such as counting the number of matching n-grams between the MT hypothesis and the reference translation. Metrics such as Bleu Papineni et al. (2002) and Meteor Lavie and Denkowski (2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation.

Modern neural approaches to MT result in much higher quality of translation that often deviates from monotonic lexical transfer between languages. For this reason, it has become increasingly evident that we can no longer rely on metrics such as Bleu to provide an accurate estimate of the quality of MT Barrault et al. (2019).

While an increased research interest in neural methods for training MT models and systems has resulted in a recent, dramatic improvement in MT quality, MT evaluation has fallen behind. The MT research community still relies largely on outdated metrics and no new, widely-adopted standard has emerged. In 2019, the WMT News Translation Shared Task received a total of 153 MT system submissions Barrault et al. (2019). The Metrics Shared Task of the same year saw only 24 submissions, almost half of which were entrants to the Quality Estimation Shared Task, adapted as metrics Ma et al. (2019).

The findings of the above-mentioned task highlight two major challenges to MT evaluation which we seek to address herein Ma et al. (2019). Namely, that current metrics struggle to accurately correlate with human judgement at segment level and fail to adequately differentiate the highest performing MT systems.

In this paper, we present Comet (Crosslingual Optimized Metric for Evaluation of Translation), a PyTorch-based framework for training highly multilingual and adaptable MT evaluation models that can function as metrics. Our framework takes advantage of recent breakthroughs in cross-lingual language modeling Artetxe and Schwenk (2019); Devlin et al. (2019); Conneau and Lample (2019); Conneau et al. (2019) to generate prediction estimates of human judgments such as Direct Assessments (DA) Graham et al. (2013), Human-mediated Translation Edit Rate (HTER) Snover et al. (2006) and metrics compliant with the Multidimensional Quality Metric framework Lommel et al. (2014).

Inspired by recent work on Quality Estimation (QE) that demonstrated that it is possible to achieve high levels of correlation with human judgements even without a reference translation Fonseca et al. (2019), we propose a novel approach for incorporating the source-language input into our MT evaluation models. Traditionally only QE models have made use of the source input, whereas MT evaluation metrics rely instead on the reference translation. As in Takahashi et al. (2020), we show that using a multilingual embedding space allows us to leverage information from all three inputs and demonstrate the value added by the source as input to our MT evaluation models.

To illustrate the effectiveness and flexibility of the Comet framework, we train three models that estimate different types of human judgements and show promising progress towards both better correlation at segment level and robustness to high-quality MT.

We will release both the Comet framework and the trained MT evaluation models described in this paper to the research community upon publication.

Model Architectures

Human judgements of MT quality usually come in the form of segment-level scores, such as DA, MQM and HTER. For DA, it is common practice to convert scores into relative rankings (DARR) when the number of annotations per segment is limited Bojar et al. (2017b); Ma et al. (2018, 2019). This means that, for two MT hypotheses $h_{i}$ and $h_{j}$ of the same source $s$, if the DA score assigned to $h_{i}$ is higher than the score assigned to $h_{j}$, $h_{i}$ is regarded as a “better” hypothesis. (In the WMT Metrics Shared Task, pairs whose DA scores differ by no more than 25 points are excluded from the DARR data.) To encompass these differences, our framework supports two distinct architectures: the Estimator model and the Translation Ranking model. The fundamental difference between them is the training objective. While the Estimator is trained to regress directly on a quality score, the Translation Ranking model is trained to minimize the distance between a “better” hypothesis and both its corresponding reference and its original source. Both models are composed of a cross-lingual encoder and a pooling layer.
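The conversion from DA scores to relative rankings described above can be sketched as follows. This is a minimal illustration rather than the official WMT script: the function name `da_to_darr`, the tuple layout, and the `min_gap` parameter are assumptions made for this example; only the 25-point threshold comes from the text.

```python
from collections import defaultdict
from itertools import combinations


def da_to_darr(judgements, min_gap=25.0):
    """Turn DA-scored hypotheses into (s, h_plus, h_minus, r) tuples.

    judgements: iterable of (source, reference, hypothesis, da_score).
    Pairs whose DA scores differ by no more than min_gap are skipped.
    """
    # Group all scored hypotheses that share the same source and reference.
    by_segment = defaultdict(list)
    for source, reference, hypothesis, score in judgements:
        by_segment[(source, reference)].append((hypothesis, score))

    pairs = []
    for (source, reference), scored in by_segment.items():
        for (hyp_a, score_a), (hyp_b, score_b) in combinations(scored, 2):
            if abs(score_a - score_b) <= min_gap:
                continue  # difference too small to trust the ranking
            if score_a > score_b:
                better, worse = hyp_a, hyp_b
            else:
                better, worse = hyp_b, hyp_a
            pairs.append((source, better, worse, reference))
    return pairs


example = [
    ("src", "ref", "hyp A", 90.0),
    ("src", "ref", "hyp B", 60.0),
    ("src", "ref", "hyp C", 50.0),
]
darr = da_to_darr(example)
```

In this toy input, the pair (B, C) differs by only 10 points and is dropped, while (A, B) and (A, C) are kept with A as the “better” hypothesis.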

The primary building block of all the models in our framework is a pretrained, cross-lingual model such as multilingual BERT Devlin et al. (2019), XLM Conneau and Lample (2019) or XLM-RoBERTa Conneau et al. (2019). These models contain several transformer encoder layers that are trained to reconstruct masked tokens by uncovering the relationship between those tokens and the surrounding ones. When trained with data from multiple languages, this pretraining objective has been found to be highly effective in cross-lingual tasks such as document classification and natural language inference Conneau et al. (2019), generalizing well to unseen languages and scripts Pires et al. (2019). For the experiments in this paper, we rely on XLM-RoBERTa (base) as our encoder model.

2.2 Pooling Layer

The embeddings generated by the last layer of the pretrained encoders are usually used for fine-tuning models to new tasks. However, Tenney et al. (2019) showed that different layers within the network can capture linguistic information that is relevant for different downstream tasks. In the case of MT evaluation, Zhang et al. (2020) showed that different layers can achieve different levels of correlation and that utilizing only the last layer often results in inferior performance. In this work, we use the approach described in Peters et al. (2018) and pool information from the most important encoder layers into a single embedding for each token, $\bm{e}_{j}$, by using a layer-wise attention mechanism. This embedding is then computed as:

$$\bm{e}_{j}=\mu\,\bm{E}_{j}^{\top}\bm{\alpha}$$

where $\mu$ is a trainable weight coefficient, $\bm{E}_{j}=[\bm{e}_{j}^{(0)},\bm{e}_{j}^{(1)},\dots,\bm{e}_{j}^{(k)}]$ corresponds to the vector of layer embeddings for token $x_{j}$, and $\bm{\alpha}=\textrm{softmax}([\alpha^{(1)},\alpha^{(2)},\dots,\alpha^{(k)}])$ is a vector corresponding to the layer-wise trainable weights. In order to avoid overfitting to the information contained in any single layer, we used layer dropout Kondratyuk and Straka (2019), in which, with probability $p$, the weight $\alpha^{(i)}$ is set to $-\infty$.

Finally, as in Reimers and Gurevych (2019), we apply average pooling to the resulting word embeddings to derive a sentence embedding for each segment.
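The pooling steps in this section can be illustrated with a small dependency-free sketch. It assumes that a token embedding is the trainable scalar $\mu$ times the softmax-weighted sum of that token's layer embeddings, and that layer dropout replaces a weight with $-\infty$ before the softmax. All function names are hypothetical, and the fallback used when every layer would be dropped is a choice made for this example only.

```python
import math
import random


def softmax(weights):
    """Softmax that maps -inf entries to a weight of exactly 0."""
    peak = max(w for w in weights if w != -math.inf)
    exps = [0.0 if w == -math.inf else math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def layer_dropout(alphas, p, rng):
    """With probability p, set each layer weight to -inf."""
    dropped = [-math.inf if rng.random() < p else a for a in alphas]
    if all(a == -math.inf for a in dropped):
        return list(alphas)  # sketch-only fallback: never drop every layer
    return dropped


def scalar_mix(layer_embeddings, alphas, mu):
    """Combine one token's per-layer vectors into a single vector."""
    weights = softmax(alphas)
    dim = len(layer_embeddings[0])
    return [
        mu * sum(w * layer[d] for w, layer in zip(weights, layer_embeddings))
        for d in range(dim)
    ]


def average_pool(token_embeddings):
    """Average token vectors into one sentence embedding."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


rng = random.Random(0)
alphas = layer_dropout([0.0, 0.0], p=0.0, rng=rng)  # p=0 keeps every layer
token = scalar_mix([[1.0, 0.0], [0.0, 1.0]], alphas, mu=2.0)
sentence = average_pool([[1.0, 2.0], [3.0, 4.0]])
```

Because a dropped layer receives a softmax weight of zero, the remaining layers are renormalised automatically, which is the practical effect of setting $\alpha^{(i)}$ to $-\infty$.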

2.3 Estimator Model

Given a $d$ -dimensional sentence embedding for the source, the hypothesis, and the reference, we adopt the approach proposed in RUSE Shimanaka et al. (2018) and extract the following combined features:

Element-wise source product: $\bm{h}\odot\bm{s}$

Element-wise reference product: $\bm{h}\odot\bm{r}$

Absolute element-wise source difference: $|\bm{h}-\bm{s}|$

Absolute element-wise reference difference: $|\bm{h}-\bm{r}|$

These combined features are then concatenated to the reference embedding $\bm{r}$ and hypothesis embedding $\bm{h}$ into a single vector $\bm{x}=[\bm{h};\bm{r};\bm{h}\odot\bm{s};\bm{h}\odot\bm{r};|\bm{h}-\bm{s}|;|\bm{h}-\bm{r}|]$ that serves as input to a feed-forward regressor. The strength of these features is in highlighting the differences between embeddings in the semantic feature space.

The model is then trained to minimize the mean squared error between the predicted scores and quality assessments (DA, HTER or MQM). Figure 2 illustrates the proposed architecture.
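As a concrete illustration of the feature construction and training objective above, the sketch below builds the concatenated input $\bm{x}$ from plain Python lists and computes a mean squared error. The function names are hypothetical and the feed-forward regressor itself is omitted.

```python
def estimator_features(h, s, r):
    """Build x = [h; r; h*s; h*r; |h-s|; |h-r|] from sentence embeddings."""
    prod_source = [a * b for a, b in zip(h, s)]
    prod_reference = [a * b for a, b in zip(h, r)]
    diff_source = [abs(a - b) for a, b in zip(h, s)]
    diff_reference = [abs(a - b) for a, b in zip(h, r)]
    # The raw source embedding s is deliberately left out of the concatenation.
    return (list(h) + list(r) + prod_source + prod_reference
            + diff_source + diff_reference)


def mean_squared_error(predictions, targets):
    """Average squared difference between predicted and gold scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


x = estimator_features(h=[1.0, 2.0], s=[3.0, 4.0], r=[0.0, 2.0])
loss = mean_squared_error([1.0, 0.0], [0.0, 0.0])
```

For $d$-dimensional sentence embeddings the resulting vector has $6d$ entries, here 12 for $d=2$.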

Note that we chose not to include the raw source embedding ($\bm{s}$) in our concatenated input. Early experimentation revealed that the value added by the source embedding as extra input features to our regressor was negligible at best. A variation on our HTER estimator model trained with the vector $\bm{x}=[\bm{h};\bm{s};\bm{r};\bm{h}\odot\bm{s};\bm{h}\odot\bm{r};|\bm{h}-\bm{s}|;|\bm{h}-\bm{r}|]$ as input to the feed-forward regressor only succeeded in boosting segment-level performance in 8 of the 18 language pairs outlined in section 5 below, and the average improvement in Kendall’s Tau in those settings was +0.0009. As noted in Zhao et al. (2020), while cross-lingual pretrained models are adaptive to multiple languages, the feature space between languages is poorly aligned. On this basis we decided in favor of excluding the source embedding, on the intuition that the most important information comes from the reference embedding and that reducing the feature space would allow the model to focus more on relevant information. This does not, however, negate the general value of the source to our model; where we include combination features such as $\bm{h}\odot\bm{s}$ and $|\bm{h}-\bm{s}|$ we do note gains in correlation, as explored further in section 4 below.

2.4 Translation Ranking Model

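Our Translation Ranking model (Figure 2) receives as input a tuple $\chi=(s,h^{+},h^{-},r)$ where $h^{+}$ denotes a hypothesis that was ranked higher than another hypothesis $h^{-}$. We then pass $\chi$ through our cross-lingual encoder and pooling layer to obtain a sentence embedding for each segment in $\chi$. Finally, using the embeddings $\{\bm{s},\bm{h^{+}},\bm{h^{-}},\bm{r}\}$, we compute the triplet margin loss Schroff et al. (2015) in relation to the source and reference:

$$L(\chi)=L(\bm{s},\bm{h^{+}},\bm{h^{-}})+L(\bm{r},\bm{h^{+}},\bm{h^{-}})$$

where

$$L(\bm{s},\bm{h^{+}},\bm{h^{-}})=\max\{0,\;d(\bm{s},\bm{h^{+}})-d(\bm{s},\bm{h^{-}})+\epsilon\}$$

$$L(\bm{r},\bm{h^{+}},\bm{h^{-}})=\max\{0,\;d(\bm{r},\bm{h^{+}})-d(\bm{r},\bm{h^{-}})+\epsilon\}$$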

$d(\bm{u},\bm{v})$ denotes the Euclidean distance between $\bm{u}$ and $\bm{v}$ and $\epsilon$ is a margin. Thus, during training the model optimizes the embedding space so the distance between the anchors ($\bm{s}$ and $\bm{r}$) and the “worse” hypothesis $\bm{h^{-}}$ is greater by at least $\epsilon$ than the distance between the anchors and the “better” hypothesis $\bm{h^{+}}$.
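The training objective just described can be sketched in a few lines, assuming the standard triplet margin form with a hinge at zero and one term per anchor. The function names and the default margin of 1.0 are illustrative assumptions, not values taken from this paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_term(anchor, h_plus, h_minus, margin):
    """Zero once h_minus is at least `margin` farther from the anchor than h_plus."""
    return max(0.0, euclidean(anchor, h_plus) - euclidean(anchor, h_minus) + margin)


def ranking_loss(s, h_plus, h_minus, r, margin=1.0):
    """Sum of the triplet terms anchored on the source and on the reference."""
    return (triplet_term(s, h_plus, h_minus, margin)
            + triplet_term(r, h_plus, h_minus, margin))


# The "better" hypothesis is close to both anchors: no loss.
satisfied = ranking_loss(s=[0.0, 0.0], h_plus=[0.0, 1.0], h_minus=[3.0, 4.0], r=[0.0, 0.0])
# The roles are swapped: both anchors contribute to the loss.
violated = ranking_loss(s=[0.0, 0.0], h_plus=[3.0, 4.0], h_minus=[0.0, 1.0], r=[0.0, 0.0])
```

The second call shows the penalised case: the “better” hypothesis sits at distance 5 from each anchor and the “worse” one at distance 1, so each term contributes $5-1+1=5$.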

During inference, the described model receives a triplet $(s,\hat{h},r)$ with only one hypothesis. The quality score assigned to $\hat{h}$ is the harmonic mean between the distance to the source $d(\bm{s},\bm{\hat{h}})$ and the distance to the reference $d(\bm{r},\bm{\hat{h}})$:

$$f(s,\hat{h},r)=\frac{2\times d(\bm{r},\bm{\hat{h}})\times d(\bm{s},\bm{\hat{h}})}{d(\bm{r},\bm{\hat{h}})+d(\bm{s},\bm{\hat{h}})}$$

Finally, we convert the resulting distance, denoted $f(s,\hat{h},r)$, into a similarity score bounded between 0 and 1 as follows:

$$\hat{f}(s,\hat{h},r)=\frac{1}{1+f(s,\hat{h},r)}$$
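A minimal sketch of this inference step is given below. It assumes that the distance is mapped to a similarity with $1/(1+f)$, where $f$ is the harmonic mean of the two distances; the function names and the handling of the case where both distances are zero are choices made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def harmonic_mean_distance(s, h, r):
    """Harmonic mean of the hypothesis distances to source and reference."""
    d_source = euclidean(s, h)
    d_reference = euclidean(r, h)
    if d_source + d_reference == 0.0:
        return 0.0  # hypothesis coincides with both anchors
    return 2.0 * d_reference * d_source / (d_reference + d_source)


def rank_score(s, h, r):
    """Map the distance to a similarity score in (0, 1] (assumed 1 / (1 + f))."""
    return 1.0 / (1.0 + harmonic_mean_distance(s, h, r))


far = rank_score(s=[0.0, 0.0], h=[3.0, 4.0], r=[6.0, 8.0])
on_reference = rank_score(s=[0.0, 0.0], h=[3.0, 4.0], r=[3.0, 4.0])
```

Note that the harmonic mean is zero whenever either distance is zero, so a hypothesis whose embedding coincides with the reference embedding receives the maximum score of 1 regardless of its distance to the source.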

Corpora

To demonstrate the effectiveness of our described model architectures (section 2), we train three MT evaluation models where each model targets a different type of human judgment. To train these models, we use data from three different corpora: the QT21 corpus, the DARR from the WMT Metrics shared task (2017 to 2019) and a proprietary MQM annotated corpus.

The QT21 corpus is a publicly available dataset (https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2390) containing industry-generated sentences from either the information technology or the life sciences domain Specia et al. (2017). This corpus contains a total of 173K tuples with source sentence, respective human-generated reference, MT hypothesis (either from a phrase-based statistical MT or from a neural MT), and post-edited MT (PE). The language pairs represented in this corpus are: English to German (en-de), Latvian (en-lv) and Czech (en-cs), and German to English (de-en).

The HTER score is obtained by computing the translation edit rate (TER) Snover et al. (2006) between the MT hypothesis and the corresponding PE. Finally, after computing the HTER for each MT, we built a training dataset $D=\{s_{i},h_{i},r_{i},y_{i}\}_{i=1}^{N}$, where $s_{i}$ denotes the source text, $h_{i}$ denotes the MT hypothesis, $r_{i}$ the reference translation, and $y_{i}$ the HTER score for the hypothesis $h_{i}$. In this manner we seek to learn a regression $f(s,h,r)\rightarrow y$ that predicts the human effort required to correct the hypothesis by looking at the source, hypothesis, and reference (but not the post-edited hypothesis).
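To make the HTER label concrete, the sketch below approximates it with a word-level edit distance divided by the length of the post-edit. This is a deliberate simplification: real TER also allows block shifts, which are not modelled here, so the values can differ from those of the official tool. Function names and whitespace tokenisation are assumptions made for this example.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        current = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            substitution = previous[j - 1] + (0 if hyp_tok == ref_tok else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def approximate_hter(mt_hypothesis, post_edit):
    """Edits needed to turn the MT output into its post-edit, per post-edit word.

    Shift operations from full TER are NOT implemented in this sketch.
    """
    hyp_tokens = mt_hypothesis.split()
    pe_tokens = post_edit.split()
    if not pe_tokens:
        raise ValueError("post-edit must contain at least one token")
    return word_edit_distance(hyp_tokens, pe_tokens) / len(pe_tokens)


one_insertion = approximate_hter("the cat sat", "the cat sat down")
no_edits = approximate_hter("a b c", "a b c")
```

A training example is then the tuple of source, hypothesis, reference and this score; the post-edit is used only to compute the label and is not shown to the model.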

3.2 The WMT DARR corpus

Since 2017, the organizers of the WMT News Translation Shared Task Barrault et al. (2019) have collected human judgements in the form of adequacy DAs Graham et al. (2013, 2014, 2017). These DAs are then mapped into relative rankings (DARR) Ma et al. (2019). The resulting data for each year (2017-19) form a dataset $D=\{s_{i},h_{i}^{+},h_{i}^{-},r_{i}\}_{i=1}^{N}$ where $h_{i}^{+}$ denotes a “better” hypothesis and $h_{i}^{-}$ denotes a “worse” one. Here we seek to learn a function $r(s,h,r)$ such that the score assigned to $h_{i}^{+}$ is strictly higher than the score assigned to $h_{i}^{-}$ ($r(s_{i},h_{i}^{+},r_{i})>r(s_{i},h_{i}^{-},r_{i})$). This data contains a total of 24 high and low-resource language pairs such as Chinese to English (zh-en) and English to Gujarati (en-gu). (The raw data for each year of the WMT Metrics shared task is publicly available on the results page; 2019 example: http://www.statmt.org/wmt19/results.html. Note, however, that the README files highlight that this data is not well documented and that the scripts occasionally require custom utilities that are not available.)

3.3 The MQM corpus

The MQM corpus is a proprietary internal database of MT-generated translations of customer support chat messages that were annotated according to the guidelines set out in Burchardt and Lommel (2014). This data contains a total of 12K tuples, covering 12 language pairs from English to: German (en-de), Spanish (en-es), Latin-American Spanish (en-es-latam), French (en-fr), Italian (en-it), Japanese (en-ja), Dutch (en-nl), Portuguese (en-pt), Brazilian Portuguese (en-pt-br), Russian (en-ru), Swedish (en-sv), and Turkish (en-tr). Note that in this corpus English is always seen as the source language, but never as the target language. Each tuple consists of a source sentence, a human-generated reference, an MT hypothesis, and its MQM score, derived from error annotations by one (or more) trained annotators. The MQM metric referred to throughout this paper is an internal metric defined in accordance with the MQM framework Lommel et al. (2014). Errors are annotated under an internal typology defined under three main error types: ‘Style’, ‘Fluency’ and ‘Accuracy’. Our MQM scores range from $-\infty$ to 100 and are defined as:

$$\text{MQM}=100-\frac{I_{\text{Minor}}+5\times I_{\text{Major}}+10\times I_{\text{Crit.}}}{\text{Sentence Length}\times 100}$$

where ${I}_{\text{Minor}}$ denotes the number of minor errors, ${I}_{\text{Major}}$ the number of major errors and ${I}_{\text{Crit.}}$ the number of critical errors.

Our MQM metric takes into account the severity of the errors identified in the MT hypothesis, leading to a more fine-grained metric than HTER or DA. When used in our experiments, these values were divided by 100 and truncated at 0. As in section 3.1, we constructed a training dataset $D=\{s_{i},h_{i},r_{i},y_{i}\}_{i=1}^{N}$, where $s_{i}$ denotes the source text, $h_{i}$ denotes the MT hypothesis, $r_{i}$ the reference translation, and $y_{i}$ the MQM score for the hypothesis $h_{i}$.
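The target preprocessing mentioned above (division by 100, truncation at 0) amounts to the following small helper. The function name is hypothetical; the raw MQM score is taken as given.

```python
def normalise_mqm(raw_score):
    """Rescale a raw MQM score (at most 100) to the regression target range [0, 1]."""
    return max(0.0, raw_score / 100.0)


targets = [normalise_mqm(score) for score in (100.0, 85.0, -40.0)]
```

Truncation means that all segments with a negative raw score share the same target of 0, so the regressor is not asked to distinguish between degrees of very poor translations.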

Experiments

We train two versions of the Estimator model described in section 2.3: one that regresses on HTER (Comet-hter) trained with the QT21 corpus, and another that regresses on our proprietary implementation of MQM (Comet-mqm) trained with our internal MQM corpus. For the Translation Ranking model, described in section 2.4, we train with the WMT DARR corpus from 2017 and 2018 (Comet-rank). In this section, we introduce the training setup for these models and corresponding evaluation setup.

4.2 Evaluation Setup

We use the test data and setup of the WMT 2019 Metrics Shared Task Ma et al. (2019) in order to compare the Comet models with the top performing submissions of the shared task and other recent state-of-the-art metrics such as Bertscore and Bleurt. (To ease future research, we will also provide, within our framework, detailed instructions and scripts to run other metrics such as chrF, Bleu, Bertscore, and Bleurt.) The evaluation method used is the official Kendall’s Tau-like formulation, $\tau$, from the WMT 2019 Metrics Shared Task Ma et al. (2019), defined as:

$$\tau=\frac{\textit{Concordant}-\textit{Discordant}}{\textit{Concordant}+\textit{Discordant}}$$

where Concordant is the number of times a metric assigns a higher score to the “better” hypothesis $h^{+}$ and Discordant is the number of times a metric assigns a higher score to the “worse” hypothesis $h^{-}$ or the scores assigned to both hypotheses are the same.
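The Kendall’s Tau-like statistic follows directly from these definitions of Concordant and Discordant and can be sketched as below, assuming the usual normalised difference of the two counts. The function name and input layout are assumptions for this example; this is not the official evaluation script.

```python
def kendall_tau_like(score_pairs):
    """score_pairs: list of (metric score for h_plus, metric score for h_minus)."""
    concordant = sum(1 for better, worse in score_pairs if better > worse)
    # Ties count as discordant, as do pairs where the "worse" hypothesis wins.
    discordant = len(score_pairs) - concordant
    return (concordant - discordant) / (concordant + discordant)


mostly_right = kendall_tau_like([(1.0, 0.0), (0.9, 0.2), (0.7, 0.1), (0.3, 0.6)])
with_tie = kendall_tau_like([(0.9, 0.1), (0.5, 0.5), (0.2, 0.7), (0.8, 0.3)])
```

Because ties are penalised, a metric that outputs the same score for both hypotheses in every pair obtains $\tau=-1$ rather than 0.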

As mentioned in the findings of Ma et al. (2019), segment-level correlations of all submitted metrics were frustratingly low. Furthermore, all submitted metrics exhibited a dramatic lack of ability to correctly rank strong MT systems. To evaluate whether our new MT evaluation models better address this issue, we followed the described evaluation setup used in the analysis presented in Ma et al. (2019), where correlation levels are examined for portions of the DARR data that include only the top 10, 8, 6 and 4 MT systems.
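One way to build the top-$n$ subsets used in this analysis is sketched below. It assumes that systems are ranked by a system-level human score and that a DARR pair is kept only if both of its hypotheses come from top-$n$ systems; both the ranking criterion and the data layout are assumptions of this example rather than a description of the official procedure.

```python
def top_n_systems(system_scores, n):
    """Names of the n systems with the highest system-level human score."""
    ranked = sorted(system_scores, key=system_scores.get, reverse=True)
    return set(ranked[:n])


def restrict_to_top_n(darr_pairs, system_scores, n):
    """Keep pairs whose better and worse hypotheses both come from top-n systems.

    darr_pairs: list of (better_system, worse_system) identifiers.
    """
    keep = top_n_systems(system_scores, n)
    return [pair for pair in darr_pairs if pair[0] in keep and pair[1] in keep]


scores = {"A": 80.0, "B": 75.0, "C": 60.0, "D": 40.0}
pairs = [("A", "B"), ("A", "D"), ("C", "B")]
top2 = restrict_to_top_n(pairs, scores, n=2)
top3 = restrict_to_top_n(pairs, scores, n=3)
```

The correlation for each subset can then be computed on the retained pairs only, which shows how a metric behaves when all competing systems are strong.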

Results

Table 1 shows results for all eight language pairs with English as source. We contrast our three Comet models against baseline metrics such as Bleu and chrF, the 2019 task winning metric YiSi-1, as well as the more recent Bertscore. We observe that across the board our three models trained with the Comet framework outperform, often by significant margins, all other metrics. Our DARR Ranker model outperforms the two Estimators in seven out of eight language pairs. Also, even though the MQM Estimator is trained on only 12K annotated segments, it performs roughly on par with the HTER Estimator for most language-pairs, and outperforms all the other metrics in en-ru.

5.2 From X into English

Table 2 shows results for the seven to-English language pairs. Again, we contrast our three Comet models against baseline metrics such as Bleu and chrF, the 2019 task winning metric YiSi-1, as well as the recently published metrics Bertscore and Bleurt. As in Table 1, the DARR model shows strong correlations with human judgements, outperforming the recently proposed English-specific Bleurt metric in five out of seven language pairs. Again, the MQM Estimator shows surprisingly strong results despite the fact that this model was trained with data that did not include English as a target. Although the encoder used in our trained models is highly multilingual, we hypothesise that this powerful “zero-shot” result is due to the inclusion of the source in our models.

5.3 Language pairs not involving English

All three of our Comet models were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that our metrics generalize well we test them on the three WMT 2019 language pairs that do not include English in either source or target. As can be seen in Table 3, our results are consistent with observations in Tables 1 and 2.

5.4 Robustness to High-Quality MT

For analysis, we use the DARR corpus from the 2019 Shared Task and evaluate on the subset of the data from the top performing MT systems for each language pair. We included language pairs for which we could retrieve data for at least ten different MT systems (i.e. all but kk-en and gu-en). We contrast against the strong recently proposed Bertscore and Bleurt, with Bleu as a baseline. Results are presented in Figure 3. For language pairs where English is the target, our three models are either better or competitive with all others; where English is the source we note that in general our metrics exceed the performance of others. Even the MQM Estimator, trained with only 12K segments, is competitive, which highlights the power of our proposed framework.

5.5 The Importance of the Source

To shed some light on the actual value and contribution of the source language input in our models’ ability to learn accurate predictions, we trained two versions of our DARR Ranker model: one that uses only the reference, and another that uses both reference and source. Both models were trained using the WMT 2017 corpus that only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for both variants of the model, the training of the second variant includes English source embeddings. We then tested these two model variants on the WMT 2018 corpus for these language pairs and for the reversed directions (with the exception of en-cs because cs-en does not exist for WMT 2018). The results in Table 4 clearly show that for the translation ranking architecture, including the source improves the overall correlation with human judgments. Furthermore, the inclusion of the source exposed the second variant of the model to English embeddings which is reflected in a higher $\Delta\tau$ for the language pairs with English as a target.

Reproducibility

We will release both the code-base of the Comet framework and the trained MT evaluation models described in this paper to the research community upon publication, along with the detailed scripts required in order to run all reported baselines. These will be hosted at https://github.com/Unbabel/COMET. All the models reported in this paper were trained on a single Tesla T4 (16GB) GPU. Moreover, our framework builds on top of PyTorch Lightning Falcon (2019), a lightweight PyTorch wrapper that was created for maximal flexibility and reproducibility.

Related Work

Classic MT evaluation metrics are commonly characterized as $n$-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of $n$-grams that appear simultaneously in a candidate translation hypothesis and one or more human references. Metrics such as Bleu Papineni et al. (2002), Meteor Lavie and Denkowski (2009), and chrF Popović (2015) have been widely studied and improved Koehn et al. (2007); Popović (2017); Denkowski and Lavie (2011); Guo and Hu (2019), but, by design, they usually fail to recognize and capture semantic similarity beyond the lexical level.

In recent years, word embeddings Mikolov et al. (2013); Pennington et al. (2014); Peters et al. (2018); Devlin et al. (2019) have emerged as a commonly used alternative to $n$-gram matching for capturing word-level semantic similarity. Embedding-based metrics like Meteor-Vector Servan et al. (2016), Bleu2vec Tättar and Fishel (2017), YiSi-1 Lo (2019), MoverScore Zhao et al. (2019), and Bertscore Zhang et al. (2020) create soft-alignments between reference and hypothesis in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper-bound between human judgements and the scores produced by such metrics.

Learnable metrics Shimanaka et al. (2018); Mathur et al. (2019); Shimanaka et al. (2019) attempt to directly optimize the correlation with human judgments, and have recently shown promising results. Bleurt Sellam et al. (2020), a learnable metric based on BERT Devlin et al. (2019), claims state-of-the-art performance for the last 3 years of the WMT Metrics Shared task. Because Bleurt builds on top of English-BERT Devlin et al. (2019), it can only be used when English is the target language which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA which, due to a scarcity of annotators, can prove inherently noisy Ma et al. (2019).

Reference-less MT evaluation, also known as Quality Estimation (QE), has historically often regressed on HTER for segment-level evaluation Bojar et al. (2013, 2014, 2015, 2016, 2017a). More recently, MQM has been used for document-level evaluation Specia et al. (2018); Fonseca et al. (2019). By leveraging highly multilingual pretrained encoders such as multilingual BERT Devlin et al. (2019) and XLM Conneau and Lample (2019), QE systems have been showing auspicious correlations with human judgements Kepler et al. (2019a). Concurrently, the OpenKiwi framework Kepler et al. (2019b) has made it easier for researchers to push the field forward and build stronger QE models.

Conclusions and Future Work

In this paper we present Comet, a novel neural framework for training MT evaluation models that can serve as automatic metrics and easily be adapted and optimized to different types of human judgements of MT quality.

To showcase the effectiveness of our framework, we sought to address the challenges reported in the 2019 WMT Metrics Shared Task Ma et al. (2019). We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments, and show promising ability to better differentiate high-performing systems.

One of the challenges of leveraging the power of pretrained models is the burdensome weight of parameters and inference time. A primary avenue for future work on Comet will look at the impact of more compact solutions such as DistilBERT (Sanh et al., 2019).

Additionally, whilst we outline the potential importance of the source text above, we note that our Comet-rank model weighs source and reference differently during inference but equally in its training loss function. Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs.

Acknowledgments

We are grateful to André Martins, Austin Matthews, Fabio Kepler, Daan Van Stigt, Miguel Vera, and the reviewers, for their valuable feedback and discussions. This work was supported in part by the P2020 Program through projects MAIA and Unbabel4EU, supervised by ANI under contract numbers 045909 and 042671, respectively.

References

Appendix A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models a random seed was set to 3 in all libraries that perform “random” operations (torch, numpy, random and cuda).
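The seeding step described above can be sketched as follows. The helper name `seed_everything` is hypothetical, and NumPy and PyTorch are imported defensively so that the snippet still runs where they are not installed; the seed value of 3 is the one stated in the text.

```python
import random


def seed_everything(seed=3):
    """Seed Python's random module and, when installed, NumPy and PyTorch (incl. CUDA)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed in this environment
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed in this environment


seed_everything(3)
first_draw = random.random()
seed_everything(3)
second_draw = random.random()
```

Seeding makes pseudo-random draws repeatable, as the two identical draws illustrate, but some GPU operations may remain non-deterministic even with fixed seeds.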