SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze

Introduction

Word alignments are essential for statistical machine translation and useful in NMT, e.g., for imposing priors on attention matrices Liu et al. (2016); Chen et al. (2016); Alkhouli and Ney (2017); Alkhouli et al. (2018) or for decoding Alkhouli et al. (2016); Press and Smith (2018). Further, word alignments have been successfully used in a range of tasks such as typological analysis Lewis and Xia (2008); Östling (2015b), annotation projection Yarowsky et al. (2001); Padó and Lapata (2009); Asgari and Schütze (2017); Huck et al. (2019) and creating multilingual embeddings Guo et al. (2016); Ammar et al. (2016); Dufter et al. (2018).

Statistical word aligners such as the IBM models Brown et al. (1993) and their implementations Giza++ Och and Ney (2003), fast-align Dyer et al. (2013), as well as newer models such as eflomal Östling and Tiedemann (2016) are widely used for alignment. With the rise of NMT Bahdanau et al. (2014), attempts have been made to interpret attention matrices as soft word alignments Cohn et al. (2016); Koehn and Knowles (2017); Ghader and Monz (2017). Several methods create alignments from attention matrices Peter et al. (2017); Zenkel et al. (2019) or pursue a multitask approach for alignment and translation Garg et al. (2019). However, most systems require parallel data (in sufficient amount to train high quality NMT systems) and their performance deteriorates when parallel text is scarce (Tables 1–2 in (Och and Ney, 2003)).

Recent unsupervised multilingual embedding algorithms that use only non-parallel data provide high quality static Artetxe et al. (2018); Conneau et al. (2018) and contextualized embeddings Devlin et al. (2019); Conneau et al. (2020). Our key idea is to leverage these embeddings for word alignments – by extracting alignments from similarity matrices induced from embeddings – without relying on parallel data. Requiring no or little parallel data is advantageous, e.g., in the low-resource case and in domain-specific settings without parallel data. A lack of parallel data cannot be easily remedied: mining parallel sentences is possible Schwenk et al. (2019) but assumes that comparable, monolingual corpora contain parallel sentences. Further, we find that large amounts of mined parallel data do not necessarily improve alignment quality.

Our main contribution is that we show that word alignments obtained from multilingual pretrained language models are superior for four and comparable for two language pairs, compared to strong statistical word aligners like eflomal even in high resource scenarios. Additionally, (1) we introduce three new alignment methods based on the matrix of embedding similarities and two extensions that handle null words and integrate positional information. They permit a flexible tradeoff of recall and precision. (2) We provide evidence that subword processing is beneficial for aligning rare words. (3) We bundle the source code of our methods in a tool called SimAlign, which is available.https://github.com/cisnlp/simalign An interactive online demo is available.https://simalign.cis.lmu.de/

Methods

We propose three methods to obtain alignments from similarity matrices. Argmax is a simple baseline, IterMax a novel iterative algorithm, and Match a graph-theoretical method based on identifying matchings in a bipartite graph.

Argmax. A simple baseline is to align $i$ and $j$ when $s^{(e)}_{i}$ is the most similar word to $s^{(f)}_{j}$ and vice-versa. That is, we set $A_{ij}=1$ if

and $A_{ij}=0$ otherwise. In case of ties, which are unlikely in similarity matrices, we choose the smaller index. If all entries in a row $i$ or column $j$ of $S$ are we set $A_{ij}=0$ (this case can appear in Itermax). Similar methods have been applied to co-occurrences Melamed (2000) (“competitive linking”), Dice coefficients Och and Ney (2003) and attention matrices Garg et al. (2019).

Itermax. There are many sentences for which Argmax only identifies few alignment edges because mutual argmaxes can be rare. As a remedy, we apply Argmax iteratively. Specifically, we modify the similarity matrix conditioned on the alignment edges found in a previous iteration: if two words $i$ and $j$ have both been aligned, we zero out the similarity. Similarly, if neither is aligned we leave the similarity unchanged. In case only one of them is aligned, we multiply the similarity with a discount factor $\alpha\in$ . Intuitively, this encourages the model to focus on unaligned word pairs. However, if the similarity with an already aligned word is exceptionally high, the model can add an additional edge. Note that this explicitly allows one token to be aligned to multiple other tokens. For details on the algorithm see Figure 2.

Match. Argmax finds a local, not a global optimum and Itermax is a greedy algorithm. To find global optima, we frame alignment as an assignment problem: we search for a maximum-weight maximal matching (e.g., Kuhn (1955)) in the bipartite weighted graph which is induced by the similarity matrix. This optimization problem is defined by

subject to $A$ being a matching (i.e., each node has at most one edge) that is maximal (i.e., no additional edge can be added). There are known algorithms to solve the above problem in polynomial time (e.g., Galil (1986)).

Note that alignments generated with the match method are inherently bidirectional. None of our methods require additional symmetrization as post-processing.

2 Distortion and Null Extensions

Distortion Correction [Dist]. Distortion, as introduced in IBM Model 2, is essential for alignments based on non-contextualized embeddings since the similarity of two words is solely based on their surface form, independent of position. To penalize high distortions, we multiply the similarity matrix $S$ componentwise with

where $\kappa$ is a hyperparameter to scale the distortion matrix $P$ between $[(1-\kappa),1]$ . We use $\kappa=0.5$ . See supplementary for different values. We can interpret this as imposing a locality-preserving prior: given a choice, a word should be aligned to a word with a similar relative position ( $\left(i/l_{e}-j/l_{f}\right)^{2}$ close to 0) rather than a more distant word (large $\left(i/l_{e}-j/l_{f}\right)^{2}$ ).

Null. Null words model untranslated words and are an important part of alignment models. We propose to model null words as follows: if a word is not particularly similar to any of the words in the target sentence, we do not align it. Specifically, given an alignment matrix $A$ , we remove alignment edges when the normalized entropy of the similarity distribution is above a threshold $\tau$ , a hyperparameter. We use normalized entropy (i.e., entropy divided by the log of sentence length) to account for different sentence lengths; i.e., we set $A_{ij}=0$ if

where $S^{h}_{ik}:=S_{ik}/\sum_{m=1}^{l_{f}}S_{im}$ , and $S^{v}_{kj}:=S_{kj}/\sum_{m=1}^{l_{e}}S_{mj}$ . As the ideal value of $\tau$ depends on the actual similarity scores we set $\tau$ to a percentile of the entropy values of the similarity distribution across all aligned edges (we use the 95th percentile). Different percentiles are in the supplementary.

Experiments

Static. We train monolingual embeddings with fastText Bojanowski et al. (2017) for each language on its Wikipedia. We then use VecMap Artetxe et al. (2018) to map the embeddings into a common multilingual space. Note that this algorithm works without any crosslingual supervision (e.g., multilingual dictionaries). We use the same procedure for word and subword levels. We use the label fastText to refer to these embeddings as well as the alignments induced by them.

Contextualized. We use the multilingual BERT model (mBERT).https://github.com/google-research/bert/blob/master/multilingual.md It is pretrained on the 104 largest Wikipedia languages. This model only provides embeddings at the subword level. To obtain a word embedding, we simply average the vectors of its subwords. We consider word representations from all 12 layers as well as the concatenation of all layers. Note that the model is not finetuned. We denote this method as mBERT[i] (when using embeddings from the $i$ -th layer, where means using the non-contextualized initial embedding layer) and mBERT[conc] (for concatenation).

In addition, we use XLM-RoBERTa base Conneau et al. (2020), which is pretrained on 100 languages on cleaned CommonCrawl data Wenzek et al. (2020). We denote alignments obtained using the embeddings from the $i$ -th layer by XLM-R[i].

2 Word and Subword Alignments

We investigate both alignments between subwords such as wordpiece Schuster and Nakajima (2012) (which are widely used for contextualized language models) and words. We refer to computing alignment edges between words as word level and between subwords as subword level. Note that gold standards are all word-level. In order to evaluate alignments obtained at the subword level we convert subword to word alignments using the heuristic “two words are aligned if any of their subwords are aligned” (see Figure 3). As a result a single word can be aligned with multiple other words.

For the word level, we use the NLTK tokenizer Bird et al. (2009) (e.g., for tokenizing Wikipedia in order to train fastText). For the subword level, we generally use multilingual BERT’s vocabulary3 and BERT’s wordpiece tokenizer. For XLM-R we use the XLM-R subword vocabulary. Since gold standards are already tokenized, they do not require additional tokenization.

3 Baselines

We compare to three popular statistical alignment models that all require parallel training data. fast-align/IBM2 Dyer et al. (2013) is an implementation of an alignment algorithm based on IBM Model 2. It is popular because of its speed and high quality. eflomalgithub.com/robertostling/eflomal (based on efmaral by Östling and Tiedemann (2016)), a Bayesian model with Markov Chain Monte Carlo inference, is claimed to outperform fast-align on speed and quality. Further we use the widely used software package Giza++/IBM4 Och and Ney (2003), which implements IBM alignment models. We use its standard settings: 5 iterations each for the HMM model, IBM Models 1, 3 and 4 with $p_{0}=0.98$ .

Symmetrization. Probabilistic word alignment models create forward and backward alignments and then symmetrize them Och and Ney (2003); Koehn et al. (2005). We compared the symmetrization methods grow-diag-final-and (GDFA) and intersection and found them to perform comparably; see supplementary. We use GDFA throughout the paper.

4 Evaluation Measures

Given a set of predicted alignment edges $A$ and a set of sure, possible gold standard edges $S$ , $P$ (where $S\subset P$ ), we use the following evaluation measures:

where $|\cdot|$ denotes the cardinality of a set. This is the standard evaluation Och and Ney (2003).

5 Data

Our test data are a diverse set of 6 language pairs: Czech, German, Persian, French, Hindi and Romanian, always paired with English. See Table 11 for corpora and supplementary for URLs.

For our baselines requiring parallel training data (i.e., eflomal, fast-align and Giza++) we select additional parallel training data that is consistent with the target domain where available. See Table 11 for the corpora. Unless indicated otherwise we use the whole parallel training data. Figure 5 shows the effect of using more or less training data.

Given the large amount of possible experiments when considering 6 language pairs we do not have space to present all numbers for all languages. If we show results for only one pair, we choose ENG-DEU as it is an established and well-known dataset (EuroParl). If we show results for more languages we fall back to DEU, CES and HIN, to show effects on a mid-resource morphologically rich language (CES) and a low-resource language written in a different script (HIN).

Results

Figure 4 shows a parabolic trend across layers of mBERT and XLM-R. We use layer 8 in this paper because it has best performance. This is consistent with other work Hewitt and Manning (2019); Tenney et al. (2019): in the first layers the contextualization is too weak for high-quality alignments while the last layers are too specialized on the pretraining task (masked language modeling).

2 Comparison with Prior Work

Contextual Embeddings. Table 2 shows that mBERT and XLM-R consistently perform well with the Argmax method. XLM-R yields mostly higher values than mBERT. Our three baselines, eflomal, fast-align and Giza++, are always outperformed (except for RON). We outperform all prior work except for FRA where we match the performance and RON. This comparison is not entirely fair because methods relying on parallel data have access to the parallel sentences of the test data during training whereas our methods do not.

Romanian might be a special case as it exhibits a large amount of many to one links and further lacks determiners. How determiners are handled in the gold standard depends heavily on the annotation guidelines. Note that one of our settings, XLM-R with Itermax at the subword level, has an F1 of .72 for ENG-RON, which comes very close to the performance by Östling (2015a) (see Table 3).

In summary, extracting alignments from similarity matrices is a very simple and efficient method that performs surprisingly strongly. It outperforms strong statistical baselines and most prior work in unsupervised word alignment for CES, DEU, FAS and HIN and is comparable for FRA and RON. We attribute this to the strong contextualization in mBERT and XLM-R.

Static Embeddings. fastText shows a solid performance on word level, which is worse but comes close to fast-align and outperforms it for HIN. We consider this surprising as fastText did not have access to parallel data or any multilingual signal. VecMap can also be used with crosslingual dictionaries. We expect this to boost performance and fastText could then become a viable alternative to fast-align.

Amount of Parallel Data. Figure 5 shows that fast-align and eflomal get better with more training data with eflomal outperforming fast-align, as expected. However, even with 1.9M parallel sentences mBERT outperforms both baselines. When adding up to 37M additional parallel sentences from ParaCrawl Esplà et al. (2019) performance for fast-align increases slightly, however, eflomal decreases (grey area in plot). ParaCrawl contains mined parallel sentences whose lower quality probably harms eflomal. fastText (with distortion) is competitive with eflomal for fewer than 1000 parallel sentences and outperforms fast-align even with 10k sentences. Thus for very small parallel corpora ( $<$ 10k sentences) using fastText embeddings is an alternative to fast-align.

The main takeaway from Figure 5 is that mBERT-based alignments, a method that does not need any parallel training data, outperforms state-of-the-art aligners like eflomal for ENG-DEU, even in the very high resource case.

3 Additional Methods and Extensions

We already showed that Argmax yields alignments that are competitive with the state of the art. In this section we compare all our proposed methods and extensions more closely.

Itermax. Table 4 shows results for Argmax (i.e., 1 Iteration) as well as Itermax (i.e., 2 or more iterations of Argmax). As expected, with more iterations precision drops in favor of recall. Overall, Itermax achieves higher $F_{1}$ scores for the three language pairs (equal for ENG-CES) both for mBERT and fastText embeddings. For Hindi the performance increase is the highest. We hypothesize that for more distant languages Itermax is more beneficial as similarity between wordpieces may be generally lower, thus exhibiting fewer mutual argmaxes. For the rest of the paper if we use Itermax we use 2 Iterations with $\alpha=0.9$ as it exhibits best performance (5 out of 6 wins in Table 4).

Argmax/Itermax/Match. In Table 3 we compare our three proposed methods in terms of $F_{1}$ across all languages. We chose to show the two best performing settings from Table 2: mBERT and XLM-R at the subword level. Itermax performs slightly better than Argmax with 6 wins, 4 losses and 2 ties. Itermax seems to help more for more distant languages such as FAS, HIN and RON, but harms for FRA. Match has the lowest $F_{1}$ , but generally exhibits a higher recall (see e.g., Table 5).

Null and Distortion Extensions. Table 5 shows that Argmax and Itermax generally have higher precision, whereas Match has higher recall. Adding Null almost always increases precision, but at the cost of recall, resulting mostly in a lower $F_{1}$ score. Adding a distortion prior boosts performance for static embeddings, e.g., from .70 to .77 for ENG-CES Argmax $F_{1}$ and similarly for ENG-DEU. For Hindi a distortion prior is harmful. Dist has little and sometimes harmful effects on mBERT indicating that mBERT’s contextualized representations already match well across languages.

Summary. Argmax and Itermax exhibit the best and most stable performance. For most language pairs Itermax is recommended. If high recall alignments are required, Match is the recommended algorithm. Except for HIN, a distortion prior is beneficial for static embeddings. Null should be applied when one wants to push precision even higher (e.g., for annotation projection).

4 Words and Subwords

Table 2 shows that subword processing slightly outperforms word-level processing for most methods. Only fastText is harmed by subword processing. We use VecMap to match (sub)word distributions across languages. We hypothesize that it is harder to match subword than word distributions – this effect is strongest for Persian and Hindi, probably due to different scripts and thus different subword distributions. Initial experiments showed that adding supervision in form of a dictionary helps restore performance. We will investigate this in future work.

We hypothesize that subword processing is beneficial for aligning rare words. To show this, we compute our evaluation measures for different frequency bins. More specifically, we only consider gold standard alignment edges for the computation where at least one of the member words has a certain frequency in a reference corpus (in our case all 1.9M lines from the ENG-DEU EuroParl corpus). That is, we only consider the edge $(i,j)$ in $A,S$ or $P$ if the minimum of the source and target word frequency is in $[\gamma_{l},\gamma_{u})$ where $\gamma_{l}$ and $\gamma_{u}$ are bin boundaries.

Figure 6 shows $F_{1}$ for different frequency bins. For rare words both eflomal and mBERT show a severely decreased performance at the word level, but not at the subword level. Thus, subword processing is indeed beneficial for rare words.

5 Part-Of-Speech Analysis

To analyze the performance with respect to different part-of-speech (POS) tags, the ENG-DEU gold standard was tagged with the Stanza toolkit Qi et al. (2020). We evaluate the alignment performance for each POS tag by only considering the alignment edges where at least one of their member words has this tag. Table 6 shows results for frequent POS tags. Compared to eflomal, mBERT aligns auxiliaries, pronouns and verbs better. The relative position of auxiliaries and verbs in German can diverge strongly from that in English because they occur at the end of the sentence (verb-end position) in many clause types. Positions of pronouns can also diverge due to a more flexible word order in German. It is difficult for an HMM-based aligner like eflomal to model such high-distortion alignments, a property that has been found by prior work as well Ho and Yvon (2019). In contrast, mBERT(Argmax) does not use distortion information, so high distortion is not a problem for it.

Figure 7 gives an example for auxiliaries. The gold alignment (“has” – “hat”) is correctly identified by mBERT (solid line). Eflomal generates an incorrect alignment (“time” – “hat”): the two words have about the same relative position, indicating that distortion minimization is the main reason for this incorrect alignment. Analyzing all auxiliary alignment edges, the average absolute value of the distance between aligned words is $2.72$ for eflomal and $3.22$ for mBERT. This indicates that eflomal is more reluctant than mBERT to generate high-distortion alignments and thus loses accuracy.

Related Work

Brown et al. (1993) introduced the IBM models, the best known statistical word aligners. More recent aligners, often based on IBM models, include fast-align Dyer et al. (2013), Giza++ Och and Ney (2003) and eflomal Östling and Tiedemann (2016). Östling (2015a) showed that Bayesian Alignment Models perform well. Neural network based extensions of these models have been considered Ayan et al. (2005); Ho and Yvon (2019). All of these models are trained on parallel text. Our method instead aligns based on embeddings that are induced from monolingual data only. We compare with prior methods and observe comparable performance.

Prior work on using learned representations for alignment includes Smadja et al. (1996); Och and Ney (2003) (Dice coefficient), Jalili Sabet et al. (2016) (incorporation of embeddings into IBM models), Legrand et al. (2016) (neural network alignment model) and Pourdamghani et al. (2018) (embeddings are used to encourage words to align to similar words). Tamura et al. (2014) use recurrent neural networks to learn alignments. They use noise contrastive estimation to avoid supervision. Yang et al. (2013) train a neural network that uses pretrained word embeddings in the initial layer. All of this work requires parallel data. mBERT is used for word alignments in concurrent work: Libovický et al. (2019) use the high quality of mBERT alignments as evidence for the “language-neutrality” of mBERT. Nagata et al. (2020) phrase word alignment as crosslingual span prediction and finetune mBERT using gold alignments.

Attention in NMT Bahdanau et al. (2014) is related to a notion of soft alignment, but often deviates from conventional word alignments Ghader and Monz (2017); Koehn and Knowles (2017). One difference is that standard attention does not have access to the target word. To address this, Peter et al. (2017) tailor attention matrices to obtain higher quality alignments. Li et al. (2018)’s and Zenkel et al. (2019)’s models perform similarly to and Zenkel et al. (2020) outperform Giza++. Ding et al. (2019) propose better decoding algorithms to deduce word alignments from NMT predictions. Chen et al. (2016), Mi et al. (2016) and Garg et al. (2019) obtain alignments and translations in a multitask setup. Garg et al. (2019) find that operating at the subword level can be beneficial for alignment models. Li et al. (2019) propose two methods to extract alignments from NMT models, however they do not outperform fast-align. Stengel-Eskin et al. (2019) compute similarity matrices of encoder-decoder representations that are leveraged for word alignments, together with supervised learning, which requires manually annotated alignment. We find our proposed methods to be competitive with these approaches. In contrast to our work, they all require parallel data.

Conclusion

We presented word aligners based on contextualized embeddings that outperform in four and match the performance of state-of-the-art aligners in two language pairs; e.g., for ENG-DEU contextualized embeddings achieve an alignment $F_{1}$ that is 5 percentage points higher than eflomal trained on 100k parallel sentences. Further, we showed that alignments from static embeddings can be a viable alternative to statistical aligner when few parallel training data is available. In contrast to all prior work our methods do not require parallel data for training at all. With our proposed methods and extensions such as Match, Itermax and Null it is easy to obtain higher precision or recall depending on the use case.

Future work includes modeling fertility explicitly and investigating how to incorporate parallel data into the proposed methods.

Acknowledgments

We gratefully acknowledge funding through a Zentrum Digitalisierung.Bayern fellowship awarded to the second author. This work was supported by the European Research Council (# 740516) and the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content. We thank Matthias Huck, Jindřich Libovický, Alex Fraser and the anonymous reviewers for interesting discussions and valuable comments. Thanks to Jindřich for pointing out that mBERT can align mixed-language sentences as shown in Figure 1.

References

Appendix A Additional Non-central Results

A more detailed version of Table 2 from the main paper that includes precision and recall and results on Itermax can be found in Table 7.

A.2 Rare Words

Figure 8 shows the same as Figure 6 from the main paper but now with a reference corpus of 100k/1000k instead of 1920k parallel sentences. The main takeaways are similar.

A.3 Symmetrization

For asymmetric alignments different symmetrization methods exist. Dyer et al. (2013) provide an overview and implementation (fast-align) for these methods, which we use. We compare intersection and grow-diag-final-and (GDFA) in Table 9. In terms of F1, GDFA performs better (Intersection wins four times, GDFA eleven times, three ties). As expected, Intersection yields higher precision while GDFA yields higher recall. Thus intersection is preferable for tasks like annotation projection, whereas GDFA is typically used in statistical machine translation.

A.4 Alignment Examples for Different Methods

We show examples in Figure 10, Figure 11, Figure 12, and Figure 13. They provide an overview how the methods actually affect results.

Appendix B Hyperparameters

We provide a list of customized hyperparameters used in our computations in Table 10. There are three options how we came up with the hyperparameters: a) We simply used default values of 3rd party software. b) We chose an arbitrary value. Usually we fell back to well-established and rather conventional values (e.g., embedding dimension 300 for static embeddings). c) We defined a reasonable but arbitrary range, out of which we selected the best value using grid search. Table 10 lists the final values we used as well as how we came up with the specific value. For option c) the corresponding analyses are in Figure 4 and Table 3 in the main paper as well as in §B.2 in this supplementary material.

B.2 Null and Distortion Extensions

In Figure 9 we plot the performance for different values of $\kappa$ . We observe that introducing distortion indeed helps (i.e., $\kappa>0$ ) but the actual value is not decisive for performance. This is rather intuitive, as a small adjustment to the similarities is sufficient while larger adjustments do not necessarily change the argmax or the optimal point in the matching algorithm. We choose $\kappa=0.5$ .

For $\tau$ in null-word extension, we plot precision, recall and $F_{1}$ in Figure 9 when assigning $\tau$ different percentile values. Note that values for $\tau$ depend on the similarity distribution of all aligned edges. As expected, when using the 100th percentile no edges are removed and thus the performance is not changed compared to not having a null-word extension. When decreasing the value of $\tau$ the precision increases and recall goes down, while $F_{1}$ remains stable. We use the 95th percentile for $\tau$ .

Appendix C Reproducibility Information

We did all computations on up to 48 cores of Intel(R) Xeon(R) CPU E7-8857 v2 with 1TB memory and a single GeForce GTX 1080 GPU with 8GB memory.

Runtimes for aligning 500 parallel sentences on ENG-DEU are reported in Table 12. mBERT and XLM-R computations are done on the GPU. Note that fast-align, GIZA++ and eflomal usually need to be trained on much more parallel data to achieve better performance: this increases their runtime.

All our proposed methods are parameter-free. If we consider the parameters of the pretrained language models and pretrained embeddings then fastText has around 1 billion parameters (up to 500k words per language, 7 languages and embedding dimension 300), mBERT has 172 million, XLM-R 270 million parameters.

C.2 Data

Table 11 provides download links to all data used.