Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing

Brian Thompson, Matt Post

Introduction

Machine Translation (MT) systems have improved dramatically in the past several years. This is largely due to advances in neural MT (NMT) methods, but the pace of improvement would not have been possible without automatic MT metrics, which provide immediate feedback on MT quality without the time and expense associated with obtaining human judgments of MT output.

However, the improvements that existing automatic metrics helped enable are now causing the correlation between human judgments and automatic metrics to break down Ma et al. (2019); Mathur et al. (2020), especially for BLEU Papineni et al. (2002), which has been the de facto standard metric since its introduction almost two decades ago. The problem currently appears limited to very strong systems, but as hardware, modeling, and available training data improve, it is likely BLEU will fail more frequently in the future. This could prove extremely detrimental if the MT community fails to adopt an improved metric, as good ideas could quietly be discarded or rejected from publication because they do not improve BLEU. In fact, this may already be happening.

We propose using a sentential, sequence-to-sequence paraphraser to force-decode and score MT outputs conditioned on their corresponding human references. Our model implicitly represents the entire (exponentially large) set of potential paraphrases of a sentence, both valid and invalid; by “querying” the model with a particular system output, we can use the model score to measure how well the system output paraphrases the human reference translation. Our model is not trained on any human quality judgements, which are not available in many domains and/or language pairs.

The best possible MT output is one which perfectly matches a human reference; therefore, for evaluation, an ideal paraphraser would be one with an output distribution centered around a copy of its input sentence. We denote such a model a “lexically/syntactically unbiased paraphraser” to distinguish it from a standard paraphraser trained to produce output which conveys the meaning of the input while also being lexically and/or syntactically different from it. For this reason, we propose using a multilingual NMT system as an unbiased paraphraser by treating paraphrasing as zero-shot “translation” (e.g., Czech to Czech). We show that a multilingual NMT model is much closer to an ideal lexically/syntactically unbiased paraphraser than a generative paraphraser trained on synthetic paraphrases. It also allows a single model to work in many languages, and can be applied to the task of “Quality estimation (QE) as a metric” Fonseca et al. (2019) by conditioning on the source instead of the reference. Figure 1 illustrates our method, which we denote Prism (Probability is the metric).

We train a single model in 39 languages and show that it:

Outperforms or ties with prior metrics and several contrastive neural methods on the segment-level WMT 2019 MT metrics task in every language pair (except for Gujarati, where we had no training data);

Is able to discriminate between very strong neural systems at the system level, addressing a problem raised at WMT 2019; and

Significantly outperforms all QE metrics submitted to the WMT 2019 QE shared task.

Finally, we contrast the effectiveness of our model when scoring MT output using the source vs the human reference. We observe that human references substantially improve performance, and, crucially, allow our model to rank systems that are substantially better than our model at the task of translation. This is important because it establishes that our method does not require building a state-of-the-art multilingual NMT model in order to produce a state-of-the-art MT metric capable of evaluating state-of-the-art MT systems.

We release our model, metrics toolkit, and preprocessed training data at https://github.com/thompsonb/prism

Related Work

Early MT metrics like BLEU Papineni et al. (2002) and NIST Doddington (2002) use token-level n-gram overlap between the MT output and the human reference. Overlap can also be measured at the character level Popović (2015, 2017) or using edit distance Snover et al. (2006). Many metrics use word- and/or sentence-level embeddings, including ReVal Gupta et al. (2015), RUSE Shimanaka et al. (2018), WMDO Chow et al. (2019), and ESIM Mathur et al. (2019). MEANT Lo and Wu (2011) and MEANT 2.0 Lo (2017) measure similarity between semantic frames and role fillers. State-of-the-art methods including YiSi Lo (2019) and BERTscore Zhang et al. (2019, 2020) rely on contextualized embeddings Devlin et al. (2019) trained on large (non-parallel) corpora. BLEURT Sellam et al. (2020) applies fine tuning of BERT, including training on prior human judgements. In contrast, our work exploits parallel bitext and doesn’t require training on human judgements.

Paraphrase Databases

Prior work explored using parallel bitext to identify phrase level paraphrases Bannard and Callison-Burch (2005); Ganitkevitch et al. (2013) including bitext in multiple language pairs Ganitkevitch and Callison-Burch (2014). Paraphrase tables were, in turn, used in MT metrics to reward systems for paraphrasing words Banerjee and Lavie (2005) or phrases Zhou et al. (2006); Denkowski and Lavie (2010) from the human reference. Our work can be viewed as extending this idea to the sentence level, without having to enumerate the millions or billions of paraphrases Dreyer and Marcu (2012) for each sentence.

Multilingual NMT

Multilingual NMT Dong et al. (2015) has been shown to rival performance of single language pair models in high-resource languages Aharoni et al. (2019); Arivazhagan et al. (2019) while also improving low-resource translation via transfer learning from higher-resource languages Zoph et al. (2016); Nguyen and Chiang (2017); Neubig and Hu (2018). An extreme low-resource setting is where the system translates between languages seen during training, but in a language pair where it did not see any training data, denoted ‘zero-shot’ translation. Despite evidence that intermediate representations are not truly language-agnostic Kudugunta et al. (2019), zero-shot translation has been shown successful, especially between related languages Johnson et al. (2017); Gu et al. (2018); Pham et al. (2019).

Generative Paraphrasing

Sentential paraphrasing can be accomplished by training an MT system on paraphrase examples instead of translation pairs Quirk et al. (2004). While natural paraphrase datasets do exist Quirk et al. (2004); Coster and Kauchak (2011); Fader et al. (2013); Lin et al. (2014); Federmann et al. (2019), they are somewhat limited. An alternative is to start with much more plentiful bitext and back-translate one side into the language of the other to create synthetic paraphrases on which to train Prakash et al. (2016); Wieting and Gimpel (2018); Hu et al. (2019a, b, c). Tiedemann and Scherrer (2019) propose using paraphrasing as a way to measure the semantic abstraction of multilingual NMT. They also propose using a multilingual NMT model as a generative paraphraser. We find that generating from a well-trained multilingual NMT system tends to produce copies of the input, as opposed to interesting paraphrases (see Appendix A).

Semantic Similarity

Parallel corpora in many language pairs have been used to produce fixed-size, multilingual sentence representations Schwenk and Douze (2017); Wieting et al. (2017); Artetxe and Schwenk (2018); Wieting et al. (2019); Raganato et al. (2019). LASER Artetxe and Schwenk (2018), for example, trains a variant of NMT with a fixed-size intermediate representation in 93 languages. Embeddings produced by the encoder can be compared to measure intra- or inter-lingual semantic similarity.

Method

We propose using a paraphraser to force-decode and estimate probabilities of MT system outputs, conditioned on their corresponding human references. Let $p(y_{t}|y_{i<t},x)$ be the probability our paraphraser assigns to the $t$-th token in output sequence $y$ , given the previous output tokens $y_{i<t}$ and the input sequence $x$ . Table 1 shows an example of how token-level probabilities from our model (described in § 4) penalize both fluency and adequacy errors given a human reference. We consider two ways of combining token-level probabilities from the model, sequence-level log probability ( $G$ ) and average token-level log probability ( $H$ ):

$$G(y|x) = \sum_{t=1}^{|y|} \log p(y_{t}|y_{i<t},x)$$

$$H(y|x) = \frac{1}{|y|} G(y|x)$$
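As a minimal sketch of the two combinations, assume the token-level log probabilities $\log p(y_{t}|y_{i<t},x)$ have already been obtained by force-decoding. The function names and the example probabilities below are illustrative assumptions, not part of the released toolkit.

```python
import math
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """G: sum of the token-level log probabilities of the output sequence."""
    return math.fsum(token_logprobs)


def average_token_logprob(token_logprobs: Sequence[float]) -> float:
    """H: G divided by the number of output tokens |y|."""
    if not token_logprobs:
        raise ValueError("output sequence must contain at least one token")
    return sequence_logprob(token_logprobs) / len(token_logprobs)


# Hypothetical token probabilities for a 4-token system output.
probs = [0.5, 0.25, 0.5, 0.25]
logps = [math.log(p) for p in probs]
G = sequence_logprob(logps)
H = average_token_logprob(logps)
```

Because $H$ normalizes by length, it does not automatically penalize longer outputs the way $G$ does.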

We postulate that the output sentence that best represents the meaning of an input sentence is, in fact, simply a copy of the input sentence, as precise word order and choice often convey subtle connotations. As such, we seek a model whose output distribution is centered around a copy of the input sentence, which we denote a “lexically/syntactically unbiased paraphraser.” While a standard generative paraphraser is trained to retain semantic meaning, it does not meet our criteria because it is simultaneously trained to produce output which is lexically/syntactically different than its input, a key element in generative paraphrasing Bhagat and Hovy (2013).

We propose using a multilingual NMT system as a lexically/syntactically unbiased paraphraser. A multilingual NMT system consists of an encoder, which maps a sentence into an (ideally) language-agnostic semantic representation, and a decoder, which maps that representation back to a sentence. The model has only seen bitext in training, but we propose to treat paraphrasing as a zero-shot “translation” (e.g., Czech to Czech).

Because our model is multilingual, we can also score MT system output conditioned on the source sentence instead of the human reference. This task is known as “quality estimation (QE) as a metric,” and was part of the WMT19 QE shared task Fonseca et al. (2019). We use “Prism-ref” to denote our reference-based metric and “Prism-src” to denote our system applied as a QE metric.

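Our final metric and QE metric are defined based on results on our development set (see § 5.2) as follows:

$$\text{Prism-ref} = \frac{1}{2} H(\text{sys}|\text{ref}) + \frac{1}{2} H(\text{ref}|\text{sys})$$

$$\text{Prism-src} = H(\text{sys}|\text{src})$$

where sys is the MT system output, ref is the human reference, and src is the source sentence.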

To obtain system-level scores, we average segment-level scores over all segments in the test set.
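The segment-to-system aggregation can be sketched as follows. This assumes that Prism-ref averages the average token-level log probability in both scoring directions (output given reference, and reference given output) and that Prism-src scores the output given the source. The `Scorer` type and `toy_h` are hypothetical stand-ins for a force-decoding model and are used only so the example runs.

```python
from statistics import fmean
from typing import Callable, Sequence

# Hypothetical scorer: h(output, condition) returns H(output | condition).
Scorer = Callable[[str, str], float]


def prism_ref(h: Scorer, hyp: str, ref: str) -> float:
    """Assumed symmetric combination of both scoring directions."""
    return 0.5 * h(hyp, ref) + 0.5 * h(ref, hyp)


def prism_src(h: Scorer, hyp: str, src: str) -> float:
    """QE-as-a-metric variant: score the output conditioned on the source."""
    return h(hyp, src)


def system_score(segment_scores: Sequence[float]) -> float:
    """System-level score: mean of the segment-level scores."""
    return fmean(segment_scores)


def toy_h(output: str, condition: str) -> float:
    """Toy stand-in for a model: penalize output tokens absent from the input."""
    out, cond = output.split(), condition.split()
    missing = sum(1 for tok in out if tok not in cond)
    return -missing / max(len(out), 1)


hyps = ["the cat sat", "a dog ran off"]
refs = ["the cat sat", "the dog ran away"]
segment_scores = [prism_ref(toy_h, hyp, ref) for hyp, ref in zip(hyps, refs)]
system_level = system_score(segment_scores)
```

Passing the scorer as a callable keeps the combination logic separate from the model, so the same functions serve both the reference-based and the source-based settings.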

Experiments

We train a multilingual NMT model and explore the extent to which it functions as a lexically/syntactically unbiased paraphraser. We then conduct several preliminary experiments on the WMT18 MT metrics data Ma et al. (2018) to determine how to best utilize the token-level probabilities from the paraphraser, and report results on the WMT19 system- and segment-level metric tasks Ma et al. (2019) and QE as a metric task Fonseca et al. (2019).

Our method requires a model, which in turn relies heavily on the data on which it is trained, so we describe here the rationale behind the design decisions made regarding the training data. Full details sufficient for replication are provided in Appendix B.

To encourage our intermediate representation to be as language-agnostic as possible, we choose datasets with as much language pair diversity as possible (i.e., not just en–* and *–en), as Kudugunta et al. (2019) have shown that encoder representation is affected by both the source language and target language. While it is common to append the target language token to the source sentence, we instead prepend it to the target sentence so that the encoder cannot do anything target-language specific with this tag. At test time, we force-decode the desired language tag prior to scoring.
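The tag placement can be illustrated with a short sketch. The `<cs>`-style tag format and the function names are assumptions made for illustration; only the placement of the tag on the target side is taken from the text.

```python
from typing import Tuple


def make_training_example(src: str, tgt: str, tgt_lang: str) -> Tuple[str, str]:
    """Leave the source untouched and prepend the language tag to the target,
    so the encoder never sees which language it is translating into."""
    return src, f"<{tgt_lang}> {tgt}"


def make_paraphrase_scoring_pair(reference: str, hypothesis: str, lang: str) -> Tuple[str, str]:
    """Zero-shot 'translation' within one language: the reference is the input
    and the tagged system output is the sequence to be force-decoded."""
    return reference, f"<{lang}> {hypothesis}"


train_pair = make_training_example("Hello .", "Ahoj .", "cs")
score_pair = make_paraphrase_scoring_pair("Dobrý den .", "Ahoj .", "cs")
```

Under this layout the decoder is the only component that sees the tag, which is why the tag must be force-decoded before the remaining tokens are scored.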

Noise

NMT systems are known to be sensitive to noise, including sentence alignment errors Khayrallah and Koehn (2018), so we perform filtering with LASER Schwenk (2018); Chaudhary et al. (2019). We also perform language ID filtering using FastText Joulin et al. (2016) to avoid training the decoder with incorrect language tags.

Number of Languages

Aharoni et al. (2019) found that performance of zero-shot translation in a related language pair increased substantially when increasing the number of languages from 5 to 25, with a performance plateau somewhere between 25 and 50 languages. We view paraphrasing as zero-shot translation between sentences in the same language, so we expect to need a similar number of languages.

Copies

We filter sentence pairs with excessive copies and partial copies, as multiple studies Ott et al. (2018); Khayrallah and Koehn (2018) have noted that MT performance degrades substantially when systems are exposed to copies in training.

4.2 Model Training

We train a Transformer Vaswani et al. (2017) model with approximately 745M parameters to translate between 39 languages. The full list of languages and data amounts used is provided in Appendix B, and model training details sufficient for replication are given in Appendix C. Training a single large model consumed the majority of our compute budget, thus performing ablations is beyond the scope of this work.

Our data comes primarily from WikiMatrix Schwenk et al. (2019), Global Voices (http://casmacat.eu/corpus/global-voices.html), EuroParl Koehn (2005), SETimes (http://nlp.ffzg.hr/resources/corpora/setimes/), and United Nations Eisele and Chen (2010). The data processing described above and in Appendix B results in 99.8M sentence pairs in 39 languages (for every sentence pair (a,b) in our 99.8M examples, we train on both (a,b) and (b,a)). The most common language is English, at 16.7% of our data, while the least common 20 languages account for 21.9%.

4.3 Baselines and Contrastive Methods

We compare to all systems from the WMT19 shared metrics task, as well as BERTscore Zhang et al. (2020) and the recent BLEURT method Sellam et al. (2020). We also explore several contrastive methods. Training details sufficient for replication for each model/baseline are given in Appendix C.

We compare scoring with our Prism model vs a standard, English-only paraphraser trained on the ParaBank 2 dataset Hu et al. (2019c). ParaBank 2 contains $\sim$ 50M synthetic paraphrastic pairs derived from back-translating a Czech–English corpus, and the authors report state-of-the-art paraphrasing results.

Auto-encoder

Auto-encoders provide an alternative means of training seq2seq models, without the need for parallel bitext. We compare to scoring with the “multilingual denoising pre-trained model” (mBART) of Liu et al. (2020), as it works in all languages of interest.

LASER

We explore using the cosine distance between LASER embeddings of the MT output and human reference, using the pretrained 93-language model provided by the authors (https://github.com/facebookresearch/LASER). We are particularly interested in LASER as it, like our model, is trained on parallel bitext in many languages.
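A minimal sketch of the embedding comparison is given below. It assumes the two sentence embeddings are already available as plain lists of floats; how they are produced by the LASER encoder is not shown, and the toy vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two fixed-size sentence embeddings."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    norm_u = math.sqrt(math.fsum(a * a for a in u))
    norm_v = math.sqrt(math.fsum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return math.fsum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance form: 0 for identical directions, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)


# Toy 2-dimensional "embeddings" standing in for MT output and reference.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Since a smaller distance means a closer match, the distance would be negated before being used as a metric score where higher is better; that sign convention is an assumption of this sketch.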

Language Model

We find qualitatively that LASER is fairly insensitive to disfluencies (see Table 1), so we also explore augmenting it with language model (LM) scores of the system outputs. We train a multilingual language model (see Appendix C) on the same data as our multilingual NMT system.

4.4 Paraphraser Bias

4.5 MT Metrics Evaluation

We report results and statistical significance using scripts released with the WMT19 shared task. Segment-level performance is reported as the Kendall’s $\tau$ variant used in the shared task, and system-level performance is reported as Pearson correlation with the mean of the human judgments. Bootstrap resampling Koehn (2004); Graham et al. (2014) is used to estimate confidence intervals for each metric, and metrics with non-overlapping 95% confidence intervals are identified as having a statistically significant difference in performance.
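The following sketch illustrates the two ingredients of the segment-level evaluation: a Kendall's $\tau$-like statistic over relative-ranking pairs and a percentile bootstrap confidence interval. It assumes the statistic has the form (Concordant − Discordant) / (Concordant + Discordant) and treats metric ties as discordant; both choices, along with the function names and toy data, are assumptions here and not a reimplementation of the shared-task scripts.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Each pair holds the metric scores of (human-preferred output, other output).
Pair = Tuple[float, float]


def kendall_tau_like(pairs: Sequence[Pair]) -> float:
    """(Concordant - Discordant) / (Concordant + Discordant)."""
    if not pairs:
        raise ValueError("need at least one relative-ranking pair")
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = len(pairs) - concordant  # assumption: ties count as discordant
    return (concordant - discordant) / len(pairs)


def bootstrap_ci(
    pairs: Sequence[Pair],
    statistic: Callable[[Sequence[Pair]], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap: resample pairs with replacement, recompute the
    statistic, and read off the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(pairs)
    stats: List[float] = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


pairs = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.6), (0.7, 0.5)]
tau = kendall_tau_like(pairs)
lower, upper = bootstrap_ci(pairs, kendall_tau_like)
```

Two metrics would then be declared significantly different when their intervals do not overlap, as described above.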

Results

5.2 Preliminary (Development) Results

5.3 Segment-Level Metric Results

Segment-level metric results are shown in Table 2. On language pairs into non-English, we outperform prior work by a statistically significant margin in 7 of 11 language pairs and are statistically tied for best in the rest (in en–ru, Prism-ref is statistically tied with YiSi-1, ESIM, and BERTscore), with the exception of Gujarati (gu) where the model had no training data. Into English, our metric is statistically tied with the best prior work in every language pair. Our metric tends to significantly outperform our contrastive LASER + LM and mBART methods, although LASER + LM performs surprisingly well in en–ru.

5.4 System-Level Metric Results

Table 3 shows system-level metric performance on the top four systems submitted to WMT19 compared to selected metrics. While correlations are not high in all cases for Prism, they are at least all positive. In contrast, BLEU has negative correlation in 5 language pairs, and BERTscore and YiSi-1 variants are each negative in at least two. BLEURT has positive correlations in all language pairs into English, but is English-only. Note that Pearson’s correlation coefficient may be unstable in this setting Mathur et al. (2020). For full top four system-level results see Appendix F.

We do not find the system-level results computed against all submitted MT systems (see Appendix G) to be particularly interesting; as noted by Ma et al. (2019), a single weak system can result in high overall system-level correlation even for a very poor metric.

5.5 QE as a Metric Results

We find that our reference-less Prism-src outperforms all QE as a metric systems from the WMT19 shared task by a statistically significant margin in every language pair at segment-level human correlation (Table 4), and outperforms or statistically ties them at system-level human correlation (Appendix G).

Analysis and Discussion

The fact that our model is multilingual allows us to explore the extent to which the human reference actually improves our model’s ability to judge MT system output, compared to using the source instead. The underlying assumption with any MT metric is that the work done by the human translator makes it easier to automatically judge the quality of MT output. However, if our model or the MT systems being judged were strong enough, we would expect this assumption to break down.

Comparing the performance of our method with access to the human reference (Prism-ref) vs our method with access to only the source (Prism-src), we find that the reference-based method statistically outperforms the source-based method in all but one language pair. We find the case where they are not statistically different, de–cs, to be particularly interesting: de–cs was the only language pair in WMT19 where the systems were unsupervised (i.e., did not use parallel training data). As a result, it is the only language pair where our model outperformed the best WMT system at translation. In most cases, our model is substantially worse at translation than the best WMT systems. For example, in en–de and zh–en, two language pairs where strong NMT systems were especially problematic for MT metrics, the Prism model is 6.8 and 19.2 BLEU points behind the strongest WMT systems, respectively (see Table 5 for the Prism model compared to the best system submitted in each WMT19 language pair). Thus the performance difference between Prism-ref and Prism-src would suggest that the model needs no help in judging MT systems which are weaker than it is, but the human references are assisting our model in evaluating MT systems which are stronger than it is. This means that we have not simply reduced the task of MT evaluation to that of building a state-of-the-art MT system. We see that a good (but not state-of-the-art) multilingual NMT system can be a state-of-the-art MT metric and judge state-of-the-art MT systems.

Finally, with the exception of de–cs discussed above, we see statistically significant improvements for Prism-ref over Prism-src both into English (where human judgments were reference-based) and into non-English (where human judgments were source-based). This suggests that the high correlation of Prism-ref with human judgements is not simply the result of reference bias Fomicheva and Specia (2016).

Does paraphraser bias matter?

Our lexically/syntactically unbiased paraphraser tends to outperform the generative English-only ParaBank 2 paraphraser, but usually not by a statistically significant margin. Analysis indicates the lexical/syntactic bias is only harmful in somewhat infrequent cases where MT systems match or nearly match the reference, suggesting it would be more detrimental with stronger systems or multiple references. Our multilingual training method is much simpler than the alternative of creating synthetic paraphrases and training individual models in 39 languages, and our model may benefit from transfer learning to lower-resource languages.

Does fluency matter?

Despite NMT being very fluent, our results suggest that fluency is fairly discriminative, especially in non-English: LM scoring outperforms sentenceBLEU at segment-level correlation in 7/10 language pairs to non-English languages (excluding Gujarati), for example. This is consistent with recent findings that LM scores can be used to augment BLEU Edunov et al. (2020).

Can we measure adequacy and fluency separately?

The proposed method significantly outperforms the contrastive LASER-based method in most language pairs, even when LASER is augmented with a language model. This suggests that jointly optimizing a model for adequacy and fluency is better than optimizing them independently and combining after the fact—this is unsurprising given that neural MT has shown significant improvements over statistical MT, where a phrase table and language model were trained separately.

Can we train on monolingual data instead of bitext?

The proposed method significantly outperforms scoring with the mBART auto-encoder, which is trained on large amounts of monolingual data, despite using substantially less compute power (1.3 weeks on 8 V100s for Prism vs 2.5 weeks on 256 V100s for mBART).
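The compute comparison can be made concrete as GPU-weeks, using only the figures quoted above. The variable names are illustrative, and the calculation ignores that the GPUs differ in memory (16 GB vs 32 GB per V100, per Appendix C), so it is a rough comparison.

```python
# GPU-weeks = training time in weeks x number of GPUs.
prism_gpu_weeks = 1.3 * 8     # about 10.4 GPU-weeks
mbart_gpu_weeks = 2.5 * 256   # 640 GPU-weeks

# How many times more GPU time mBART training used than Prism training.
compute_ratio = mbart_gpu_weeks / prism_gpu_weeks
```

Under these assumptions mBART used roughly 60 times more GPU time than Prism.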

Conclusion and Future Work

We show that a multilingual NMT system can be used as a lexically/syntactically unbiased, multilingual paraphraser, and that the resulting paraphraser can be used as an MT metric and QE metric. Our method achieves state-of-the-art performance on the most recent WMT shared metrics and QE tasks, without training on prior human judgements.

We release a single model which supports 39 languages. To the best of our knowledge, we are the first to release a large multilingual NMT system, and we hope others follow suit. We are optimistic our method will improve further as stronger multilingual NMT models become publicly available.

We compare our method to several contrastive methods and present analysis showing that we have not simply reduced the task of evaluation to that of building a state-of-the-art MT system; the work done by the human translator to create references helps the evaluation model to judge systems that are stronger (at translation) than it is.

Nothing in our method is specific to sentence-level MT. In future work, we would like to extend Prism to paragraph- or document-level evaluation by training a paragraph- or document-level multilingual NMT system, as there is growing evidence that MT evaluation would be better conducted at the document level, rather than the sentence level Läubli et al. (2018).

Acknowledgments

Brian Thompson is supported by the National Defense Science and Engineering Graduate (NDSEG) Fellowship.

References

Appendix A Generation Examples

Figure 3 shows sentences generated from both our model and the model trained on ParaBank 2.

We also contrast the conditional probabilities of three outputs for the same input: (1) the sequence generated by the model via beam search; (2) a copy of the input; and (3) a human paraphrase of the input. We use the English side of the zh–en newstest17 Bojar et al. (2017) as input, so that we can use the second human reference released by Hassan et al. (2018) as a human paraphrase. Table 6 shows the results of scoring a copy of the input, a human paraphrase of the input, and a model’s beam search output, for both our multilingual paraphraser and the ParaBank 2 model.

Appendix B Data Details for Replication

Much of our data comes from WikiMatrix Schwenk et al. (2019), a large collection of parallel data extracted from Wikipedia. For more domain variety, we added Global Voices (http://casmacat.eu/corpus/global-voices.html), EuroParl Koehn (2005) (random subset of up to 100k sentence pairs per language pair), SETimes (http://nlp.ffzg.hr/resources/corpora/setimes/), and United Nations Eisele and Chen (2010) (random sample of 1M sentence pairs per language pair). We also included Kazakh–English and Kazakh–Russian data from WMT, to be able to evaluate on Kazakh.

WMT Kazakh–English and Kazakh–Russian were limited to the best 1M and 200k sentence pairs, respectively, as judged by LASER. We used a margin threshold of 1.05 for WikiMatrix and a threshold of 1.04 for the remaining datasets, as we expect them to be cleaner. We find that FastText classifies many sentences as non-English when they contain mostly English but also contain a few non-English words, especially from lower-resource languages. To remedy this, we performed language identification (LID) on 5-grams and filtered out sentences for which LID did not classify at least half of the 5-grams as the expected language.
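The 5-gram LID rule can be sketched as below. The sketch assumes the 5-grams are taken over whitespace-separated tokens and that sentences shorter than five tokens are classified as a whole; neither detail is specified above. The `classify` callable is a hypothetical stand-in for a language identifier, and `toy_classify` exists only to make the example runnable.

```python
from typing import Callable, List


def token_ngrams(tokens: List[str], n: int) -> List[str]:
    """All contiguous n-grams of the token list, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def passes_lid(
    sentence: str,
    expected_lang: str,
    classify: Callable[[str], str],
    n: int = 5,
    min_fraction: float = 0.5,
) -> bool:
    """Keep the sentence if at least `min_fraction` of its n-grams are
    classified as the expected language."""
    grams = token_ngrams(sentence.split(), n) or [sentence]
    hits = sum(1 for gram in grams if classify(gram) == expected_lang)
    return hits / len(grams) >= min_fraction


def toy_classify(text: str) -> str:
    """Toy identifier: call the text English if it contains the word 'the'."""
    return "en" if "the" in text.split() else "xx"


keep = passes_lid("the cat sat on the mat today", "en", toy_classify)
drop = passes_lid("a b c d e f g the", "en", toy_classify)
```

Voting over n-grams makes the decision robust to a few foreign words, because those words affect only the n-grams that contain them.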

We filtered out sentences where there was more than 60% overlap in 3-grams or 40% overlap in 4-grams. Via manual inspection, this seemed to provide a good trade-off between allowing numbers and named entities to be copied, and filtering out sentences that were clearly not translated. We perform tokenization with SentencePiece Kudo and Richardson (2018) prior to filtering, using a 200k vocabulary for all language pairs, to account for languages like Chinese which do not denote word boundaries. Note that this vocabulary was used only for filtering, not for training the final model.
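A copy filter with these thresholds might look like the sketch below. The exact definition of overlap is not given above, so the sketch assumes it is the number of distinct shared n-grams divided by the smaller of the two n-gram sets. It operates on pre-tokenized input (whitespace tokens here, standing in for SentencePiece subwords).

```python
from typing import List, Set, Tuple


def ngram_set(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Distinct contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(src: List[str], tgt: List[str], n: int) -> float:
    """Assumed overlap: shared n-grams over the smaller n-gram set."""
    a, b = ngram_set(src, n), ngram_set(tgt, n)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_copy(src: List[str], tgt: List[str]) -> bool:
    """Flag pairs exceeding 60% 3-gram overlap or 40% 4-gram overlap."""
    return ngram_overlap(src, tgt, 3) > 0.6 or ngram_overlap(src, tgt, 4) > 0.4


english = "the cat sat on the mat".split()
french = "le chat est assis sur le tapis".split()
copied = is_copy(english, english)
translated = is_copy(english, french)
```

Using 3-grams and 4-grams rather than single tokens lets isolated numbers and named entities be shared between the two sides without triggering the filter.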

We limited training to languages with at least 1M examples, which resulted in 39 languages. Figure 4 shows the languages and amount of data in each language.

Appendix C Model Training Details for Replication

We train a SentencePiece Kudo and Richardson (2018) model with a 64k vocabulary size on the concatenation of all data, and filter sentences with length greater than 200 subwords. Multilingual NMT performance has been found to increase significantly with model size; for example, the best performance of Huang et al. (2019) is with their largest model, which has 6 billion parameters. Training such a model is well beyond the scope of this work, but we train a model as large as feasible given our compute budget constraints. We train a Transformer Vaswani et al. (2017) in fairseq Ott et al. (2019) with eight encoder layers, eight decoder layers, an embedding size of 1280, feed forward layer size of 12288, 20 attention heads, learning rate of $0.0004$ , batch size of 1800 tokens with gradient accumulation over 200 batches, gradient clipping of 1.2, and dropout of 0.1. The model has approximately 745M parameters for 39 languages. We train for 6 epochs, which takes approximately 9 days on a p3.16xlarge instance rented from Amazon AWS, which has 8 Volta V100 GPUs with 16 GB of memory each. No hyperparameters were swept, as training a single model used the majority of our compute budget (the total cost for training this model was approximately $13,000 USD). However, we did restart training after discovering that LID was not performing well and adding the 5-gram LID filtering.

C.2 ParaBank 2 Model

We train a contrastive, English-only paraphraser on the ParaBank 2 dataset Hu et al. (2019c). We train a Transformer with an 8-layer encoder, 8-layer decoder, $1024$ -dimensional embeddings, feed-forward size of $4096$ , and $16$ attention heads. We use a SentencePiece model with a 16k vocabulary size. Dropout is $0.3$ , label smoothing is $0.1$ , and learning rate is $0.0005$ . The model has approximately 253M parameters for 1 language. Batch size is 31200 tokens, and the model trains for approximately 6 weeks (33 epochs) on 4 Nvidia 2080 GPUs.

C.3 Language Model

We train a multilingual language model on the same data as our multilingual NMT system.

The model architecture is based on GPT-2 Radford et al. (2019), and we use the fairseq transformer_lm_gpt2_small implementation. We train for 200k updates (18 epochs) of approximately 131k tokens. The model has 369M parameters for 39 languages. We train with shared embeddings and a learning rate of $0.0005$ , and we stop gradients at sentence boundaries, using --sample-break-mode eos as the model will be used to evaluate individual sentences. Other parameters match the fairseq defaults. The model trained for approximately 4 weeks on 4 Nvidia TITAN RTX GPUs.

C.4 Autoencoder

We use the pretrained “multilingual denoising pre-trained model” (mBART) of Liu et al. (2020), as it works in all languages of interest. Their model is designed to be fine-tuned to translation tasks, and their fine-tuning introduces subtle changes to the decoder that are required for inference. In order to adapt it to our task, we therefore fine-tune for a single update with a learning rate of 0. We then produce scores with the model in the same manner as Prism-ref. The model has approximately 680M parameters for 25 languages. We did not train this model but note that doing so required substantial compute power: Liu et al. (2020) note that they trained for approximately 2.5 weeks on 256 Nvidia V100 GPUs, each with 32GB of memory.

C.5 Baselines

We compare to BLEURT Sellam et al. (2020) using the authors’ recommended “BLEURT-Base 128” (https://github.com/google-research/bleurt). We compare to BERTscore F1 Zhang et al. (2020) using the model and code provided by the authors (https://github.com/Tiiiger/bert_score). The remaining baseline results are computed using the metric scores as submitted to Ma et al. (2019) (http://data.statmt.org/wmt19/translation-task/wmt19-submitted-data-v3.tgz).

Appendix D WMT 2018 (Development set) Results: System-level, Segment-level, and Sweeps

Figure 5 shows results on the development set (WMT18) for sweeping various linear combinations.

Table 7, Table 8, Table 9, and Table 10 show full segment- and system-level results, into and out of English, for the WMT 2018 MT metrics shared task, along with all baselines and submitted systems.

Appendix E WMT 2019 Metric and QE as Metric Segment-Level Results

Table 11, Table 12, and Table 13 show segment-level metrics (excluding QE as a metric) results, for language pairs into, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems.

Table 14, Table 15, and Table 16 show segment-level QE as a metric results, for language pairs into, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems.

Appendix F WMT 2019 System-Level results for Top 4 Systems

Table 17, Table 18, and Table 19 show system-level results for just the top 4 systems, for language pairs into, out of, and not including English, for WMT 2019. We show statistical significance following the shared task but note it appears extremely noisy.

Appendix G WMT 2019 Metric and QE as Metric System-Level Results

Table 20, Table 21, and Table 22 show system-level results for metrics (excluding QE as a metric), for language pairs into, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems.

Table 23, Table 24, and Table 25 show system-level results for QE as a metric, for language pairs into, out of, and not including English, for the WMT 2019 MT metrics shared task, along with all baselines and submitted systems.