On Optimal Transformer Depth for Low-Resource Language Translation

Elan van Biljon, Arnu Pretorius, Julia Kreutzer

Introduction

Transformers (Vaswani et al., 2017) have shown great promise as an approach to Neural Machine Translation (NMT) for low-resource languages (Abbott & Martinus, 2018; Martinus & Abbott, 2019). However, transformer models remain difficult to optimize and require careful tuning of hyper-parameters to be useful in this setting (Popel & Bojar, 2018; Nguyen & Salazar, 2019). Many NMT toolkits come with a set of default hyper-parameters, which researchers and practitioners often adopt for convenience and to avoid tuning. These configurations, however, have been optimized for large-scale machine translation data sets with several million parallel sentences for European languages like English and French.

In this work, we find that the current trend in the field towards very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming the observations of Murray et al. (2019) and Fan et al. (2019). Specifically, we compare shallower networks to deeper ones on three translation tasks: translating from English to Setswana, Sepedi (Northern Sotho), and Afrikaans. We achieve a new state-of-the-art BLEU score (Papineni et al., 2002) on some tasks (more than doubling the previous best score for Afrikaans) when using networks of appropriate depth. Furthermore, we provide a preliminary theoretical explanation for how performance varies as a function of depth. Overall, our findings seem to favour shallow-to-moderate-depth transformers for low-resource NMT.

Our intuition concerning the relationship between performance and depth stems from prior work on signal propagation theory in noise-regularised neural networks (Schoenholz et al., 2016; Pretorius et al., 2018). Specifically, Pretorius et al. (2018) showed that using Dropout (Srivastava et al., 2014) limits the depth to which information can stably propagate through neural networks when using ReLU activations. Since both dropout and ReLU have been core components of the transformer since its inception (Vaswani et al., 2017), this loss of information is likely to be taking place and should be taken into account when selecting the number of transformer layers. Although the architecture of a transformer is far more involved than those analysed by Pretorius et al. (2018), the fundamental building blocks remain the same. Thus, in this paper, we make use of the above theoretical insights as a guide to our analysis of depth’s influence on performance in transformers.
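To make this intuition concrete, the following is a minimal sketch of how pre-activation variance evolves with depth in a plain fully connected ReLU network with dropout, under the usual mean-field assumptions (Gaussian pre-activations, independent weights). It is our own simplification and not the analysis of Pretorius et al. (2018): it assumes inverted dropout (kept units are rescaled by 1 / keep_prob), it tracks variance only, and it does not model the correlations between inputs that determine how deep information can propagate. The function and parameter names are illustrative.

```python
def relu_dropout_variance(q0, depth, sigma_w2, keep_prob, sigma_b2=0.0):
    """Iterate the mean-field variance map for a ReLU layer with inverted dropout.

    Assumed recursion (a sketch, not taken from the paper):
        q_l = sigma_w2 * (1 / keep_prob) * (q_{l-1} / 2) + sigma_b2
    where q / 2 is E[relu(h)^2] for h ~ N(0, q), and 1 / keep_prob is the
    second moment of the inverted-dropout noise.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    q = q0
    history = [q]
    for _ in range(depth):
        q = sigma_w2 * (q / 2.0) / keep_prob + sigma_b2
        history.append(q)
    return history


keep_prob = 0.9  # hypothetical value, chosen only for illustration

# Weight variance 2 / fan_in (He initialisation) ignores the dropout noise,
# so the variance grows by a factor 1 / keep_prob at every layer.
he_init = relu_dropout_variance(1.0, 12, sigma_w2=2.0, keep_prob=keep_prob)

# Rescaling the weight variance by keep_prob keeps the variance constant.
rescaled = relu_dropout_variance(1.0, 12, sigma_w2=2.0 * keep_prob, keep_prob=keep_prob)

print(he_init[-1], rescaled[-1])
```

Even when the variance is stabilised in this way, the noise still affects how correlations between inputs evolve with depth, which is the part of the theory this sketch leaves out.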

We see our work as complementary to the Masakhane project (https://github.com/masakhane-io/); “Masakhane” means “We Build Together” in isiZulu. In this spirit, low-resource NMT systems are now being built by the community that needs them most. However, many in the community still have very limited access to the type of computational resources required for building the extremely large models promoted by industrial research. Therefore, by showing that transformer models perform well (and often best) at low-to-moderate depth, we hope to convince fellow researchers to devote fewer computational resources, and less time, to exploring overly large models during the development of these systems.

Results

We trained networks of three depths: shallow (2 transformer layers: 1 encoder layer and 1 decoder layer), medium (6 transformer layers: 3 encoder and 3 decoder), and deep (12 transformer layers: 6 encoder and 6 decoder, as used in Vaswani et al. (2017)). Each was trained on three translation tasks: English to (1) Setswana, (2) Sepedi, and (3) Afrikaans. The model configurations, weights, and code are all available online at https://github.com/ElanVB/optimal_transformer_depth.
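The experimental grid described above can be summarised in a few lines. This is only an illustrative sketch: the class and variable names are hypothetical and are not taken from the released code, and only the layer counts and target languages come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthConfig:
    """One depth setting; names here are illustrative, not from the repository."""
    name: str
    encoder_layers: int
    decoder_layers: int

    @property
    def total_layers(self) -> int:
        return self.encoder_layers + self.decoder_layers


DEPTHS = [
    DepthConfig("shallow", encoder_layers=1, decoder_layers=1),
    DepthConfig("medium", encoder_layers=3, decoder_layers=3),
    DepthConfig("deep", encoder_layers=6, decoder_layers=6),
]

# All tasks translate from English into one of these languages.
TARGET_LANGUAGES = ["Setswana", "Sepedi", "Afrikaans"]

# Every depth is trained on every task: 3 depths x 3 languages.
experiments = [(depth.name, lang) for depth in DEPTHS for lang in TARGET_LANGUAGES]

for depth in DEPTHS:
    print(depth.name, depth.total_layers)
```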

For comparability with previous work, all models were trained on the Autshumato data set (Groenewald & du Plooy, 2010) as preprocessed by Martinus & Abbott (2019), and hyper-parameter settings were kept as close as possible to those of previous work.

Table 1 presents a breakdown of the relevant languages in the Autshumato data set (Martinus & Abbott, 2019), highlighting how small these corpora are compared to many of those represented at the Conference on Machine Translation (WMT) (Ng et al., 2019).

Figure 1 shows the quality of test set translations in terms of BLEU. Medium-depth models outperform deeper ones, so we allowed the medium-depth networks to train for longer and report their performance, compared to previous work, in Table 2. The medium-depth networks achieve higher test BLEU scores than the previous baselines on two of the three tasks (English to Setswana and English to Afrikaans). This is preliminary evidence that transformers consisting of 3 encoder and 3 decoder layers may outperform the canonical configuration (Vaswani et al., 2017) of 6 encoder and 6 decoder layers.

Note that Table 2 compares our results only to work trained on the same data set and the same version of it. We are aware that better-performing models exist for English to Setswana translation (Abbott & Martinus, 2018; Ronald & Barnard, 2007), but as far as we can tell those models were trained either on closed data sets or on a different version of the Autshumato data set.

Notably, our English to Afrikaans translation model more than doubles the previous BLEU baseline and also seems to outperform the Statistical Machine Translation (SMT) system of van Niekerk (2014). However, the data set used by van Niekerk (2014) is not publicly available, so a direct comparison is not possible.

The BLEU metric, with its surface-based n-gram scoring, might not be expressive enough for agglutinative languages like Sepedi and Setswana. Therefore, we also include example model outputs in Appendix A for qualitative comparison.
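To illustrate what surface-based n-gram scoring means, the sketch below computes the clipped (modified) n-gram precision that underlies BLEU (Papineni et al., 2002). It is a simplified illustration and not the scoring pipeline used in this work: it handles a single reference, omits the brevity penalty and the geometric mean over n-gram orders, and uses made-up English sentences rather than data from our experiments.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


# Toy sentences (illustrative only): the hypothesis differs by one inflected form.
reference = "she walked home".split()
hypothesis = "she walks home".split()

unigram_precision = modified_precision(hypothesis, reference, 1)
bigram_precision = modified_precision(hypothesis, reference, 2)
print(unigram_precision, bigram_precision)
```

Because matching is exact at the token level, a word that shares its stem with the reference but differs in an affix earns no credit, and it also invalidates every higher-order n-gram that contains it. In languages where a single word carries several morphemes, such near misses are likely to be more frequent.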

Although our networks of moderate depth outperform the previous Setswana and Afrikaans baselines, our Sepedi model performs significantly worse than the previous baseline. It should be noted that we were unable to reproduce the baseline score obtained in Martinus & Abbott (2019). Even so, our attempts to reproduce their result yielded models with very similar performance to our network of moderate depth, with the two approaches usually within approximately 0.5 test BLEU of each other.

Discussion

Our exploration of networks of moderate depth was largely motivated by preliminary signal propagation analyses we performed on simplified transformer layers. We do not present these preliminary theoretical results here, leaving them to be explored further in future work. Instead, we refer the reader to very recent and concurrent work by Bachlechner et al. (2020) for a more complete motivation as well as a proposed solution.

Even though we do not achieve state-of-the-art results on all languages, we come very close, with approximately half the number of parameters and far less training time. We believe there is still some room for improving the performance of our moderate-depth models by tuning their hyper-parameters more carefully. However, we note that (1) finding stable learning rates can be very computationally expensive (and therefore beyond what the community might currently be able to afford), and (2) doing so may make our work less comparable to previous work.
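The "approximately half" figure can be checked with a rough parameter count of the layer stacks. The sketch below is a back-of-the-envelope estimate under stated assumptions, not the exact count for our models: it assumes the base transformer dimensions of Vaswani et al. (2017) (d_model = 512, d_ff = 2048), biases on all projections, and layer normalisation after each sub-layer, and it excludes embeddings and the output projection. All function names are illustrative.

```python
def attention_params(d_model):
    # Query, key, value and output projections: d_model x d_model weights plus a bias each.
    return 4 * (d_model * d_model + d_model)


def feed_forward_params(d_model, d_ff):
    # Two linear maps: d_model -> d_ff and d_ff -> d_model, each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layer_norm_params(d_model):
    # One gain and one bias vector.
    return 2 * d_model


def encoder_layer_params(d_model, d_ff):
    # Self-attention + feed-forward, each followed by layer normalisation.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))


def decoder_layer_params(d_model, d_ff):
    # Self-attention + encoder-decoder attention + feed-forward, three layer norms.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))


def stack_params(n_encoder, n_decoder, d_model=512, d_ff=2048):
    # Assumed dimensions; embeddings and the output projection are not counted.
    return (n_encoder * encoder_layer_params(d_model, d_ff)
            + n_decoder * decoder_layer_params(d_model, d_ff))


medium = stack_params(3, 3)
deep = stack_params(6, 6)
ratio = medium / deep
print(medium, deep, ratio)
```

Under these assumptions the layer stacks differ by exactly a factor of two. Embedding and output parameters are shared by both depths, so the ratio for complete models is somewhat above one half and depends on the vocabulary size.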

References

Appendix A: Qualitative results