Parameter Sharing Methods for Multilingual Self-Attentional Translation Models

Devendra Singh Sachan, Graham Neubig

Introduction

Neural machine translation (NMT; Sutskever et al. (2014); Cho et al. (2014)) is now the de-facto standard in MT research due to its relative simplicity of implementation, ability to perform end-to-end training, and high translation accuracy. Early approaches to NMT used recurrent neural networks (RNNs), usually LSTMs Hochreiter and Schmidhuber (1997), in their encoder and decoder layers, with the addition of an attention mechanism Bahdanau et al. (2014); Luong et al. (2015) to focus more on specific encoded source words when deciding the next translation target output. Recently, the NMT research community has been transitioning from RNNs to an alternative method for encoding sentences using self-attention Vaswani et al. (2017), represented by the so-called “Transformer” model, which both improves the speed of processing sentences on computational hardware such as GPUs due to its lack of recurrence, and achieves impressive results.

In parallel to this transition to self-attentional models, there has also been an active interest in the multilingual training of NMT systems Firat et al. (2016); Johnson et al. (2017); Ha et al. (2016). In contrast to the standard bilingual models, multilingual models follow the multi-task training paradigm Caruana (1997) where models are jointly trained on training data from several language pairs, with some degree of parameter sharing. The objective of this is two-fold: First, compared to individually training separate models for each language pair of interest, this maintains competitive translation accuracy while reducing the total number of models that need to be stored, a considerable advantage when deploying practical systems. Second, by utilizing data from multiple language pairs simultaneously, it becomes possible to improve the translation accuracy for each language pair.

In multilingual translation, one-to-many translation, that is, translation from a common source language (for example English) to multiple target languages (for example German and Dutch), is considered particularly difficult. Previous multi-task learning (MTL) models for this task broadly follow two approaches, as shown in Figure 1: (a) a model with a shared encoder and one decoder per target language (Dong et al. (2015), shown in Figure 1). This approach has the advantage of being able to model each target separately but comes at the cost of slower training and increased memory requirements. (b) a single unified model consisting of a shared encoder and a shared decoder for all the language pairs (Johnson et al. (2017), shown in Figure 1). This simple approach is trivially implementable using a standard bilingual translation model and has the advantage of having a constant number of trainable parameters regardless of the number of languages, but has the caveat that the decoder’s ability to model multiple languages can be significantly reduced.

In this paper, we propose a third alternative: (c) a model with a shared encoder and multiple decoders such that some decoder parameters are shared (shown in Figure 1). This hybrid approach combines the advantages from both the approaches mentioned above. It carefully moderates the types of parameters that are shared between the multiple languages to provide the flexibility necessary to decode two different languages, but still shares as many parameters as possible to take advantage of information sharing across multiple languages. Specifically, we focus on the aforementioned self-attentional Transformer models, with the set of shareable parameters consisting of the various attention weights, linear layer weights, or embedding weights contained therein. The full sharing and no sharing of decoder parameters used in previous work are special cases (refer to Section 2.2 for a detailed description).

To empirically evaluate the utility of this approach, we examine the case of translation from a common source language to multiple target languages, where the target languages can be either related or unrelated. Our work reveals that while full parameter sharing works reasonably well when using target languages from the same family, partial parameter sharing is essential to achieve the best accuracy when translating into multiple distant languages.

Method

In this section, we will first briefly describe the key elements of the Transformer model followed by our proposed approach of parameter sharing.

As is common in sequence-to-sequence (seq2seq) models for NMT, the self-attentional Transformer model (Figure 2; Vaswani et al. (2017)) consists of an embedding layer, multiple encoder-decoder layers, and an output generation layer. Each encoder layer consists of two sublayers in sequence: self-attentional and feed-forward networks. Each decoder layer consists of three sublayers: masked self-attention, encoder-decoder attention, and feed-forward networks. The core building blocks in all these layers consist of different sets of weight matrices that compute affine transforms.

Next, similarity scores ( $e_{ij}$ ) between query and key vectors are computed by performing a scaled dot-product.

Then, attention coefficients ( $\alpha_{ij}$ ) are computed by applying the softmax function over these similarity values.

The self-attention output ( $z_{i}$ ) is computed as the convex combination of the value vectors, weighted by the attention coefficients, followed by a linear transformation.
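For reference, the steps above can be written out as below. This is a sketch that assumes the standard scaled dot-product formulation of Vaswani et al. (2017). The input vectors $x_i$, the key dimension $d$, and the initial projection that produces $q_i$, $k_j$, and $v_j$ are assumptions made here for completeness; the matrix names $W_Q$, $W_K$, $W_V$, and $W_F$ follow the weight names used in the parameter sharing strategies later in this section.

```latex
\begin{align}
q_i &= x_i W_Q, \qquad k_j = x_j W_K, \qquad v_j = x_j W_V \\
e_{ij} &= \frac{q_i k_j^{\top}}{\sqrt{d}} \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{l} \exp(e_{il})} \\
z_i &= \Big( \sum_{j} \alpha_{ij} v_j \Big) W_F
\end{align}
```

Because the coefficients $\alpha_{ij}$ are non-negative and sum to one over $j$, the inner sum is a convex combination of the value vectors, as stated in the text.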

Residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) are applied on each sublayer and to the output vector from the final encoder and decoder layers.

2 Parameter Sharing Strategies

In this paper, our objective is to investigate effective parameter sharing strategies for the Transformer model using MTL, mainly for one-to-many multilingual translation. Here, we will use the symbol $\boldsymbol{\Theta}$ to denote the set of shared parameters in our model. These parameter sharing strategies are described below:

The base case consists of separate bilingual translation models for each language pair ( $\boldsymbol{\Theta}$ = $\emptyset$ ).

Use of a common embedding layer for all the bilingual models ( $\boldsymbol{\Theta}$ = { $\boldsymbol{W_{\textit{E}}}\}$ ). This will result in a significant reduction of the total parameters by sharing parameters across common words present in the source and target sentences Wu et al. (2016).

Use of a common encoder for the source language and a separate decoder for each target language ( $\boldsymbol{\Theta}$ = { $\boldsymbol{W_{\textit{E}}}$ , $\boldsymbol{\theta}_{\textit{ENC}}\}$ ). This has the advantage that the encoder will now see more source language training data Dong et al. (2015).

Next, we also include the decoder parameters among the set of shared parameters. While doing so, we will assume that the embedding and the encoder parameters are always shared between the bilingual models. Because there can be exponentially many combinations considering all the different feasible sets of shared parameters between the multiple decoders, we only select a subset of these combinations based on our preliminary results. These selected weights are shared in all the layers of the decoder unless stated otherwise. A schematic diagram illustrating the various possible parameter matrices that can be shared in each sublayer of our MTL model is shown in Figure 3.

Sharing the weights of the self-attention sublayer ( $\boldsymbol{\Theta}$ = { $\boldsymbol{W_{\textit{E}}}$ , $\boldsymbol{\theta_{\textit{ENC}}}$ , $\boldsymbol{W_{\textit{K}}^{1}}$ , $\boldsymbol{W_{\textit{Q}}^{1}}$ , $\boldsymbol{W_{\textit{V}}^{1}}$ , $\boldsymbol{W_{\textit{F}}^{1}}$ }).

Sharing the weights of the encoder-decoder attention sublayer ( $\boldsymbol{\Theta}$ = { $\boldsymbol{W_{\textit{E}}}$ , $\boldsymbol{\theta_{\textit{ENC}}}$ , $\boldsymbol{W_{\textit{K}}^{2}}$ , $\boldsymbol{W_{\textit{Q}}^{2}}$ , $\boldsymbol{W_{\textit{V}}^{2}}$ , $\boldsymbol{W_{\textit{F}}^{2}}$ }).

We share all the parameters of the decoder to have a single unified model ( $\boldsymbol{\Theta}$ = { $\boldsymbol{W_{\textit{E}}}$ , $\boldsymbol{\theta}_{\textit{ENC}}$ , $\boldsymbol{\theta}_{\textit{DEC}}$ }). Fewer parameters in the decoder indicate limited modeling ability, and we expect this method to obtain good translation accuracy mainly when the target languages are related Johnson et al. (2017).
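The sharing configurations listed above can be pictured as two decoders whose weight tables point at the same object for every name in $\boldsymbol{\Theta}$. The following is a minimal sketch of that idea using only the Python standard library, not the authors' implementation. The `Param` class, the helper names, and the feed-forward weight names `W_L1` and `W_L2` are hypothetical; only one decoder layer is modeled.

```python
# Hypothetical names for the weight matrices of one decoder layer.
# Suffix 1 = masked self-attention, suffix 2 = encoder-decoder attention;
# W_L1 / W_L2 are assumed names for the feed-forward weights.
DECODER_WEIGHTS = [
    "W_K1", "W_Q1", "W_V1", "W_F1",
    "W_K2", "W_Q2", "W_V2", "W_F2",
    "W_L1", "W_L2",
]


class Param:
    """Stand-in for a trainable weight matrix; only object identity matters here."""

    def __init__(self, name):
        self.name = name


def build_decoders(target_langs, shared):
    """Return {lang: {weight_name: Param}}; names in `shared` reuse a single Param."""
    shared_params = {name: Param(name) for name in shared}
    decoders = {}
    for lang in target_langs:
        decoders[lang] = {
            name: shared_params[name] if name in shared_params
            else Param(f"{name}[{lang}]")
            for name in DECODER_WEIGHTS
        }
    return decoders


def count_unique(decoders):
    """Number of distinct weight matrices across all decoders."""
    return len({id(p) for dec in decoders.values() for p in dec.values()})


separate = build_decoders(["de", "tr"], shared=[])
self_attn = build_decoders(["de", "tr"], shared=["W_K1", "W_Q1", "W_V1", "W_F1"])
unified = build_decoders(["de", "tr"], shared=DECODER_WEIGHTS)
```

With two target languages, the separate-decoder case holds 20 distinct matrices per layer, sharing the self-attention sublayer reduces this to 16, and the fully shared case holds 10, which illustrates how the size of $\boldsymbol{\Theta}$ trades model capacity against parameter count.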

Experimental Setup

In this section, we first describe the datasets used in this work and the evaluation criteria. Then, we describe the training regimen followed in all our experiments. All of our models were implemented in the PyTorch framework Paszke et al. (2017) and were trained on a single GPU.

To perform multilingual translation experiments, we select six language pairs from the openly available TED talks dataset Qi et al. (2018) whose statistics are given in Table 1. This dataset already contains predefined splits for training, development, and test sets. Among these languages, Romanian (Ro) and French (Fr) are Romance languages, German (De) and Dutch (Nl) are Germanic languages, while Turkish (Tr) and Japanese (Ja) are unrelated languages that come from distant language families. For all language pairs, tokenization was carried out using the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer), except for Japanese, where word segmentation was performed using the KyTea tokenizer Neubig et al. (2011). To select training examples, we keep only sentences with at most $70$ tokens. For evaluation, we report the model’s performance using the standard BLEU score metric Papineni et al. (2002). We use the mtevalv14.pl script from the Moses toolkit to compute the tokenized BLEU scores.
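The length-based selection of training examples can be sketched as below. This is an illustration only: the function name is hypothetical, the text is assumed to be already tokenized and whitespace-separated, and applying the limit to both the source and the target side is an assumption, since the text does not say which side the limit applies to.

```python
def filter_by_length(pairs, max_len=70):
    """Keep (source, target) pairs whose token counts are both at most max_len.

    Assumes each sentence is a whitespace-separated string of tokens.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]


example_pairs = [
    ("a short sentence", "ein kurzer Satz"),
    (" ".join(["word"] * 71), "zu lang"),
]
kept = filter_by_length(example_pairs)
```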

2 Training Protocols

In this work, we follow the same training process for all the experiments. We jointly encode the source and target language words with subword units by applying byte pair encoding Gage (1994) with 32,000 merge operations Sennrich et al. (2016). These subword units restrict the vocabulary size and remove the need to explicitly handle out-of-vocabulary symbols, as the vocabulary can be used to represent any word. We use LeCun uniform initialization LeCun et al. (1998) for all the trainable model parameters. Embedding layer weights are randomly initialized according to a truncated Gaussian distribution $\boldsymbol{W_{\textit{E}}}\sim\mathcal{N}(0,{d_{\textit{m}}}^{-1/2})$ .
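The two initialization schemes mentioned above can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the authors' code: LeCun uniform initialization is taken to draw from $U(-\sqrt{3/\text{fan\_in}}, \sqrt{3/\text{fan\_in}})$, the value ${d_{\textit{m}}}^{-1/2}$ is interpreted as the standard deviation of the Gaussian, and truncation at two standard deviations is an assumed cutoff that the text does not specify. All function names are hypothetical.

```python
import math
import random


def lecun_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit), limit = sqrt(3 / fan_in)."""
    limit = math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def truncated_normal(std, rng, cutoff=2.0):
    """Resample until the draw lies within cutoff * std of zero (assumed cutoff)."""
    while True:
        x = rng.gauss(0.0, std)
        if abs(x) <= cutoff * std:
            return x


def init_embeddings(vocab_size, d_model, rng):
    """Embedding matrix with entries from a truncated N(0, std), std = d_model ** -0.5."""
    std = d_model ** -0.5
    return [[truncated_normal(std, rng) for _ in range(d_model)]
            for _ in range(vocab_size)]


rng = random.Random(0)
weights = lecun_uniform(4, 3, rng)
embeddings = init_embeddings(5, 16, rng)
```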

Each mini-batch consists of approximately $3,000$ source and $3,000$ target tokens such that similar length sentences are bucketed together. We train the models until convergence and save the best checkpoint using development set performance. For model regularization, we use label smoothing $(\epsilon=0.1)$ Pereyra et al. (2017) and apply dropout (with $p_{drop}=0.1$ ) Srivastava et al. (2014) to the word embeddings, attention coefficients, ReLU activation, and to the output of each sublayer before the residual connection. During decoding, we use beam search with beam width $5$ and length normalization with $\alpha=1$ Wu et al. (2016).
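The length normalization used during beam search can be illustrated with the length penalty of Wu et al. (2016), $lp(Y) = \big((5 + |Y|)/6\big)^{\alpha}$, by which the log-probability of a hypothesis is divided. The sketch below assumes this is the exact form the normalization takes here; the function names are hypothetical.

```python
def length_penalty(length, alpha=1.0):
    """Length penalty of Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def normalized_score(log_prob, length, alpha=1.0):
    """Length-normalized hypothesis score; higher is better."""
    return log_prob / length_penalty(length, alpha)


# Two hypothetical beam candidates given as (log-probability, length in tokens).
long_hyp = normalized_score(-4.0, 7)
short_hyp = normalized_score(-3.0, 1)
```

In this toy case the longer hypothesis wins after normalization even though its raw log-probability is lower, which is the bias toward short outputs that length normalization is meant to counteract.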

3 Multilingual Training

During the multilingual model’s training and inference, we include an additional token representing the desired target language at the start of each source sentence Johnson et al. (2017). The presence of this additional token helps the model learn which target language to translate into during decoding. For preprocessing, we apply byte pair encoding over the combined dataset of all the language pairs. We perform model training using balanced mini-batches, i.e., each mini-batch contains roughly an equal number of sentences for every target language. While training, we compute a weighted average cross-entropy loss where the weighting term is proportional to the total word count observed in each of the target language sentences.
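The two mechanisms described above, the target-language token and the word-count-weighted loss, can be sketched as follows. This is an illustration under assumptions: the token format `<2de>` and the function names are hypothetical, and the per-language losses are taken to be average cross-entropy values computed elsewhere.

```python
def add_target_token(source_tokens, target_lang):
    """Prepend a token naming the desired target language (token format is assumed)."""
    return [f"<2{target_lang}>"] + list(source_tokens)


def weighted_loss(per_lang_loss, per_lang_word_count):
    """Average the per-language losses, weighted by target-side word counts."""
    total_words = sum(per_lang_word_count.values())
    return sum(
        per_lang_loss[lang] * per_lang_word_count[lang] / total_words
        for lang in per_lang_loss
    )


tagged = add_target_token(["hello", "world"], "de")
loss = weighted_loss({"de": 2.0, "tr": 4.0}, {"de": 30, "tr": 10})
```

Here the German loss receives weight 30/40 and the Turkish loss 10/40, so the language that contributes more target words to the mini-batch contributes proportionally more to the gradient.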

Results

In this section, we will describe the results of our proposed parameter sharing techniques and later present the broader context by comparing them with bilingual translation models and previous benchmark methods.

Here, we first analyze the results of one-to-many multilingual translation experiments when there are two target languages and both of them belong to the same language family. The first set of experiments are on Romance languages (En $\rightarrow$ Ro $+$ Fr) and the second set of experiments are on Germanic languages (En $\rightarrow$ De $+$ Nl). We report the BLEU scores in Table 2 when different sets of parameters are shared in these experiments. We observe that sharing only the embedding layer weight between the multiple models leads to the lowest scores. Sharing the encoder weights results in significant improvement for En $\rightarrow$ Ro $+$ Fr but leads to a small decrease in En $\rightarrow$ De $+$ Nl scores.

2 Overall Comparison

In Table 3, we show an overall performance comparison of no parameter sharing, full parameter sharing for both GNMT Wu et al. (2016) and Transformer models, and the best approaches according to maximum BLEU score from our partial parameter sharing strategies. For training the GNMT models, we use its open-source implementation (https://github.com/tensorflow/nmt) Luong et al. (2017) with four layers and default parameter settings; we found that the four-layer GNMT model did not overfit and obtained the best BLEU scores. First, we note that the BLEU scores of the Transformer model are always better than those of the GNMT model by a significant margin for both bilingual (no sharing) and multilingual (full sharing) translation tasks. This reflects that the Transformer model is well-suited for both multilingual and bilingual translation tasks compared with the GNMT model. Surprisingly, we also note that the GNMT fully shared model is able to consistently obtain higher BLEU scores compared with its bilingual version irrespective of which families the target languages belong to.

However, for the one-to-many translation task when the target languages are from distant families, we observe that the fully shared Transformer model leads to a substantial drop or small gains in the BLEU score compared with the bilingual models. Specifically, for the En $\rightarrow$ De $+$ Tr setting, BLEU drops by 0.6 for En $\rightarrow$ De, while staying even for En $\rightarrow$ Tr. In contrast, our method of sharing embedding, encoder, decoder’s key, and query parameters leads to substantial increases in BLEU scores (1.4 $\uparrow$ for En $\rightarrow$ De and 1.1 $\uparrow$ for En $\rightarrow$ Tr). Similarly, for En $\rightarrow$ De $+$ Ja, using the fully shared Transformer model, we observe small gains of 0.3 and 0.5 BLEU points for En $\rightarrow$ De and En $\rightarrow$ Ja respectively, while our partial parameter sharing method again leads to significant improvements (1.5 $\uparrow$ for En $\rightarrow$ De and 1.1 $\uparrow$ for En $\rightarrow$ Ja). This demonstrates the utility of our proposed partial parameter sharing method.

We also note that fully shared Transformer models can be an effective strategy only when both the target languages are from the same family. For the task of En $\rightarrow$ Ro $+$ Fr, the fully shared model performs surprisingly well and yields significant improvements of 1.7 and 1.3 BLEU points compared with bilingual models for En $\rightarrow$ Ro and En $\rightarrow$ Fr respectively. A similar increase in performance can also be observed for the En $\rightarrow$ De $+$ Nl task, although for this task, our partial parameter sharing method (encoder, embedding, decoder’s key, and query weights) obtains even higher BLEU scores (1.4 $\uparrow$ for En $\rightarrow$ De and 1.6 $\uparrow$ for En $\rightarrow$ Nl).

3 Analysis

Here, we analyze the generated translations of the partial sharing and full sharing approaches for En $\rightarrow$ De when the one-to-many multilingual model was trained on the unrelated target language pair En $\rightarrow$ De $+$ Tr. These translations were obtained using the test set of the En $\rightarrow$ De task. Here, partial sharing refers to the specific approach of sharing the embedding, encoder, and decoder’s key and query parameters in the model.

We show example translations in Table 4 where the partial sharing method gets a high BLEU score (shown in parentheses) but the full sharing method does not. We see that sentences generated by the partial sharing method are both semantically and grammatically correct, while the full sharing method generates shorter sentences compared with the reference translations. As highlighted in the table cells, the partial sharing method is able to correctly translate a mention of relative time “half a year” and a co-reference expression “mich”. In contrast, the fully shared model generates an incorrect time expression, “eineinhalb Jahren” (one and a half years), and different verb forms (“schlägt” is generated vs “schlagen” in the reference).

We also perform a comparison of the F-measure of the target words for En $\rightarrow$ De, bucketed by frequency in the training set. As displayed in Figure 4, this shows that the partial parameter sharing approach improves the translation accuracy for the entire vocabulary, but in particular for words that have low frequency in the dataset.
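A word-level F-measure bucketed by training-set frequency can be computed along the following lines. This is a sketch under assumptions: the bucket edges, the per-sentence clipped matching, and the function names are all hypothetical, since the text does not specify how the numbers in Figure 4 were produced.

```python
from collections import Counter


def bucket_label(freq, edges=(1, 10, 100, 1000)):
    """Map a training-set frequency to a label such as '<1', '[1,10)', or '>=1000'."""
    lower = None
    for edge in edges:
        if freq < edge:
            return f"<{edge}" if lower is None else f"[{lower},{edge})"
        lower = edge
    return f">={edges[-1]}"


def bucketed_f_measure(refs, hyps, train_freq, edges=(1, 10, 100, 1000)):
    """Per-bucket F-measure of hypothesis words against reference words.

    refs and hyps are parallel lists of token lists; train_freq maps a word
    to its count in the training set (missing words count as 0).
    """
    match, ref_total, hyp_total = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        for word, n in ref_counts.items():
            bucket = bucket_label(train_freq.get(word, 0), edges)
            ref_total[bucket] += n
            # Clip matches so a word is credited at most as often as it was produced.
            match[bucket] += min(n, hyp_counts[word])
        for word, n in hyp_counts.items():
            hyp_total[bucket_label(train_freq.get(word, 0), edges)] += n
    scores = {}
    for bucket in set(ref_total) | set(hyp_total):
        precision = match[bucket] / hyp_total[bucket] if hyp_total[bucket] else 0.0
        recall = match[bucket] / ref_total[bucket] if ref_total[bucket] else 0.0
        denom = precision + recall
        scores[bucket] = 2 * precision * recall / denom if denom else 0.0
    return scores


scores = bucketed_f_measure(
    refs=[["das", "haus", "ist", "alt"]],
    hyps=[["das", "haus", "ist", "neu"]],
    train_freq={"das": 500, "haus": 5, "ist": 200},
)
```

In this toy example the frequent and mid-frequency words are all matched, while the unseen-word bucket scores zero because "alt" was replaced by "neu".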

Related Work

In this section, we will review the prior work related to MTL and multilingual translation.

Ando and Zhang (2005) obtained excellent results by adopting an MTL framework to jointly train linear models for NER, POS tagging, and language modeling tasks involving some degree of parameter sharing. Later, Collobert et al. (2011) applied MTL strategies to neural networks for tasks such as POS tagging, NER, and chunking by sharing the sequence encoder and reported moderate improvements in results. Recently, Luong et al. (2016) investigated MTL for tasks such as parsing, image captioning, and translation and observed large gains in the translation task. Similarly, for MT tasks, Niehues and Cho (2017) also leverage MTL by using additional linguistic information to improve the translation accuracy of NMT models. They share the encoder representations to perform joint training on translation, POS, and NER tasks. This kind of parameter sharing among multiple tasks was further extended in Zaremoodi and Haffari (2018); Zaremoodi et al. (2018), who also include semantic and syntactic parsing tasks and control the relative sharing of various parameters among the tasks to obtain accuracy gains in the MT task. MTL has also been widely applied to multilingual translation, which is discussed next.

2 Multilingual Translation

On the multilingual translation task, Dong et al. (2015) obtained significant performance gains by sharing the encoder parameters of the source language while having a separate decoder for each target language. Later, Firat et al. (2016) attempted the more challenging task of many-to-many translation by training a model that consisted of one shared encoder and decoder per language and a shared attention layer that was common to all languages. This approach obtained competitive BLEU scores on ten European language pairs while substantially reducing the total parameters. Recently, Johnson et al. (2017) proposed a unified model with full parameter sharing and obtained comparable or better performance compared with bilingual translation scores. During model training and decoding, the target language was specified by an additional token at the beginning of the source sentence. Turning to low-resource language translation, Zoph et al. (2016) used a transfer learning approach of fine-tuning the model parameters learned on a high-resource language pair of French $\rightarrow$ English and were able to significantly increase the translation performance on Turkish and Urdu. Recently, Gu et al. (2018) addressed the many-to-one translation problem for extremely low-resource languages by using a transfer learning approach such that all language pairs share the lexical and sentence-level representations. By performing joint training of the model with high-resource languages, large gains in the BLEU scores were reported for low-resource languages.

In this paper, we first experiment with the Transformer model for one-to-many multilingual translation on a variety of language pairs and demonstrate that the approaches of Johnson et al. (2017) and Dong et al. (2015) are not optimal for all kinds of target-side languages. Motivated by this, we introduce various parameter sharing strategies that strike a happy medium between full sharing and no sharing and show that they achieve the best translation accuracy.

Conclusion

In this work, we explore parameter sharing strategies for the task of multilingual machine translation using self-attentional MT models. Specifically, we examine the case when the target languages come from the same or distant language families. We show that the popular approach of full parameter sharing may perform well only when the target languages belong to the same family while a partial parameter sharing approach consisting of shared embedding, encoder, decoder’s key and query weights is generally applicable to all kinds of language pairs and achieves the best BLEU scores when the languages are from distant families.

For future work, we plan to extend our parameter sharing approach in two directions. First, we aim to increase the number of target languages to more than two such that they contain a mix of both similar and distant languages and analyze the performance of our proposed parameter sharing strategies on them. Second, we aim to experiment with additional parameter sharing strategies such as sharing the weights of some specific layers (e.g. the first or last layer) as different layers can encode different morphological information Belinkov et al. (2017) which can be helpful in better multilingual translation.

Acknowledgments

The authors would like to thank Mrinmaya Sachan, Emmanouil Antonios Platanios, and Soumya Wadhwa for useful comments about this work. We would also like to thank the anonymous reviewers for giving us their valuable feedback that helped to improve the paper.