A Focus on Neural Machine Translation for African Languages

Laura Martinus, Jade Z. Abbott

Introduction

Africa has over 2000 languages across the continent Eberhard et al. (2019). South Africa itself has 11 official languages. Unlike many major Western languages, the multitude of African languages are very low-resourced and the few resources that exist are often scattered and difficult to obtain.

Machine translation of African languages would not only enable the preservation of such languages, but also empower African citizens to contribute to and learn from global scientific, social and educational conversations, which are currently predominantly English-based Alexander (2010). Tools, such as Google Translate Google (2019), support a subset of the official South African languages, namely English, Afrikaans, isiZulu, isiXhosa and Southern Sotho, but do not translate the remaining six official languages.

Unfortunately, in addition to the languages being low-resourced, progress in machine translation of African languages has suffered from a number of problems. This paper discusses these problems and reviews existing machine translation research for African languages, which demonstrates them. To begin addressing the highlighted problems, we train models to perform machine translation of English to Afrikaans, isiZulu, Northern Sotho (N. Sotho), Setswana and Xitsonga, using state-of-the-art neural machine translation (NMT) architectures, namely, the Convolutional Sequence-to-Sequence (ConvS2S) and Transformer architectures.

Section 2 describes the problems facing machine translation for African languages, while the target languages are described in Section 3. Related work is presented in Section 4, and the methodology for training machine translation models is discussed in Section 5. Section 6 presents quantitative and qualitative results.

Problems

The difficulties hindering the progress of machine translation of African languages are discussed below.

Low availability of resources for African languages hinders the ability of researchers to do machine translation. Institutes such as the South African Centre for Digital Language Resources (SADiLaR) are attempting to change that by providing an open platform for technologies and resources for South African languages Bergh (2019). This, however, only addresses the 11 official languages of South Africa and not the greater problems within Africa.

Discoverability: The resources for African languages that do exist are hard to find. Often one needs to be associated with a specific academic institution in a specific country to gain access to the language data available for that country. This reduces the ability of countries and institutions to combine their knowledge and datasets to achieve better performance and innovations. The existing research itself is often hard to discover, since it is frequently published in smaller African conferences or journals, which are neither electronically available nor indexed by research tools such as Google Scholar.

Reproducibility: The data and code of existing research are rarely shared, which means researchers cannot reproduce the results properly. Examples of papers that do not publicly provide their data and code are described in Section 4.

Focus: According to Alexander (2009), African society does not see hope for indigenous languages to be accepted as a more primary mode for communication. As a result, there are few efforts to fund and focus on translation of these languages, despite their potential impact.

Lack of benchmarks: Due to the low discoverability and the lack of research in the field, there are no publicly available benchmarks or leader boards to compare new machine translation techniques to.

This paper aims to address some of the above problems as follows: We trained models to translate English to Afrikaans, isiZulu, N. Sotho, Setswana and Xitsonga, using modern NMT techniques. We have published the code, datasets and results for the above experiments on GitHub, and in doing so promote reproducibility, ensure discoverability and create a baseline leader board for the five languages, to begin to address the lack of benchmarks.

Languages

We provide a brief description of the Southern African languages addressed in this paper, since many readers may not be familiar with them. The isiZulu, N. Sotho, Setswana, and Xitsonga languages belong to the Southern Bantu group of African languages Mesthrie and Rajend (2002). The Bantu languages are agglutinative and all exhibit a rich noun class system, subject-verb-object word order, and tone Zerbian (2007). N. Sotho and Setswana are closely related and are highly mutually intelligible. Xitsonga is a language of the Vatsonga people, originating in Mozambique Bill (1984). The language of isiZulu is the second most spoken language in Southern Africa, belongs to the Nguni language family, and is known for its morphological complexity Keet and Khumalo (2017); Bosch and Pretorius (2017). Afrikaans is an analytic West-Germanic language that descended from the Dutch spoken by settlers Roberge (2002).

Related Work

This section details published research for machine translation for the South African languages. The existing research is technically incomparable to the results published in this paper, because the datasets used in that research (in particular the test sets) are not published. Table 1 shows the BLEU scores provided by the existing work.

Google Translate Google (2019), as of February 2019, provides translations for English, Afrikaans, isiZulu, isiXhosa and Southern Sotho, five of the official South African languages. Google Translate was tested with the Afrikaans and isiZulu test sets used in this paper to determine its performance. However, because it is uncertain how Google Translate was trained and which data it was trained on, there is a possibility that the system was trained on the test set used in this study, as this test set was created from publicly available governmental data. For this reason, we determined that this system is not comparable to this paper’s models for isiZulu and Afrikaans.

Abbott and Martinus (2018) trained Transformer models for English to Setswana on the parallel Autshumato dataset Groenewald and Fourie (2009). The data was not cleaned, nor was any additional data used. This is the only study reviewed that released datasets and code. Wilken et al. (2012) performed statistical phrase-based translation for English to Setswana. This research used linguistically-motivated pre- and post-processing of the corpus in order to improve the translations. The system was trained on the Autshumato dataset and also used an additional monolingual dataset.

McKellar (2014) used statistical machine translation for English to Xitsonga translation. The models were trained on the Autshumato data, as well as a large monolingual corpus. A factored machine translation system was used, making use of a combination of lemmas and part of speech tags.

van Niekerk (2014) used unsupervised word segmentation with phrase-based statistical machine translation models. These models translate from English to Afrikaans, N. Sotho, Xitsonga and isiZulu. The parallel corpora were created by crawling online sources and official government data and aligning these sentences using the HunAlign software package. Large monolingual datasets were also used.

Wolff and Kotze (2014) performed word translation for English to isiZulu. The translation system was trained on a combination of Autshumato, Bible, and data obtained from the South African Constitution. All of the isiZulu text was syllabified prior to the training of the word translation system.

It is evident that there is exceptionally little research available using machine translation techniques for Southern African languages. Only one of the mentioned studies provides code and datasets for its results. As a result, the BLEU scores obtained in this paper are technically incomparable to those obtained in past papers.

Methodology

The following section describes the methodology used to train the machine translation models for each language. Section 5.1 describes the datasets used for training and their preparation, while the algorithms used are described in Section 5.2.

The publicly-available Autshumato parallel corpora are aligned corpora of South African governmental data which were created for use in machine translation systems Groenewald and Fourie (2009). The datasets are available for download at the South African Centre for Digital Language Resources website (available online at: https://repo.sadilar.org/handle/20.500.12185/404). The datasets were created as part of the Autshumato project which aims to provide access to data to aid in the development of open-source translation systems in South Africa.

The Autshumato project provides parallel corpora for English to Afrikaans, isiZulu, N. Sotho, Setswana, and Xitsonga. These parallel corpora were aligned on the sentence level through a combination of automatic and manual alignment techniques.

The official Autshumato datasets contain many duplicates; therefore, to avoid data leakage between training, development and test sets, all duplicate sentences were removed (available online at: https://github.com/LauraMartinus/ukuxhumana). These clean datasets were then split into 70% for training and 30% for validation, with 3000 parallel sentences set aside for testing. Summary statistics for each dataset are shown in Table 2, highlighting how small each dataset is.

Even though duplicate sentences were removed from the datasets, further issues exist within the datasets which negatively affect models trained with this data. In particular, the isiZulu dataset is of low quality. Examples of issues found in the isiZulu dataset are explained in Table 3. The source and target sentences are provided from the dataset, the back translation from the target to the source sentence is given, and the issue pertaining to the translation is explained.
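The cleaning and splitting steps described above can be sketched roughly as follows. This is a minimal illustration rather than the script used for the published datasets: the function name `dedup_and_split` is hypothetical, and it assumes that duplicates are detected on whole (source, target) pairs, that the data is shuffled with a fixed seed, and that the 3000 test sentences are held out before the remainder is split 70/30.

```python
import random


def dedup_and_split(pairs, test_size=3000, train_frac=0.7, seed=0):
    """Remove duplicate sentence pairs, hold out a test set, split the rest.

    `pairs` is a list of (source, target) tuples. All defaults other than
    the 70% / 30% / 3000 figures are assumptions made for illustration.
    """
    # dict.fromkeys keeps the first occurrence of each pair, in order.
    unique = list(dict.fromkeys(pairs))

    # Shuffle deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(unique)

    # Hold out the test sentences first, then split what remains.
    test, rest = unique[:test_size], unique[test_size:]
    n_train = round(len(rest) * train_frac)
    return rest[:n_train], rest[n_train:], test


# Toy usage: 100 distinct pairs plus a few exact duplicates.
toy = [(f"sentence {i}", f"sin {i}") for i in range(100)]
toy += toy[:5]
train, valid, test = dedup_and_split(toy, test_size=10)
print(len(train), len(valid), len(test))
```

Because duplicates are removed before any splitting, no sentence pair can appear in more than one of the three sets, which is the leakage the cleaning step is meant to prevent.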

5.2 Algorithms

We trained translation models for two established NMT architectures for each language, namely, ConvS2S and Transformer. As the purpose of this work is to provide a baseline benchmark, we have not performed significant hyperparameter optimization, and have left that as future work.

The Fairseq(-py) toolkit was used to train the ConvS2S model Gehring et al. (2017). Fairseq’s named architecture “fconv” was used, with the default hyperparameters recommended by the Fairseq documentation as follows: the learning rate was set to 0.25, the dropout to 0.2, and the maximum number of tokens for each mini-batch to 4000. The dataset was preprocessed using Fairseq’s preprocess script to build the vocabularies and to binarize the dataset. To decode the test data, beam search was used, with a beam width of 5. For each language, a model was trained using traditional white-space tokenisation, as well as byte-pair encoding tokenisation (BPE). To appropriately select the number of tokens for BPE, for each target language, we performed an ablation study (described in Section 6.3).
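As a rough illustration of the ConvS2S configuration described above, the following sketch assembles the corresponding Fairseq command lines as Python argument lists. It only builds and prints the commands. The entry-point names (`fairseq-preprocess`, `fairseq-train`, `fairseq-generate`) are those of recent Fairseq releases and may differ from the scripts in the version used for this paper; the paths and the English-to-Afrikaans language pair are hypothetical placeholders.

```python
# Hypothetical locations; adjust to the local data layout.
DATA_BIN = "data-bin/en-af"
SAVE_DIR = "checkpoints/fconv-en-af"

# Build the vocabularies and binarize the parallel data.
preprocess_cmd = [
    "fairseq-preprocess",
    "--source-lang", "en",
    "--target-lang", "af",
    "--trainpref", "data/en-af/train",
    "--validpref", "data/en-af/valid",
    "--testpref", "data/en-af/test",
    "--destdir", DATA_BIN,
]

# Train the "fconv" architecture with the hyperparameters quoted in the text.
train_cmd = [
    "fairseq-train", DATA_BIN,
    "--arch", "fconv",
    "--lr", "0.25",
    "--dropout", "0.2",
    "--max-tokens", "4000",
    "--save-dir", SAVE_DIR,
]

# Decode the test set with a beam width of 5.
generate_cmd = [
    "fairseq-generate", DATA_BIN,
    "--path", f"{SAVE_DIR}/checkpoint_best.pt",
    "--beam", "5",
]

for cmd in (preprocess_cmd, train_cmd, generate_cmd):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where Fairseq is installed.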

The Tensor2Tensor implementation of Transformer was used Vaswani et al. (2018). The models were trained on a Google TPU, using Tensor2Tensor’s recommended parameters for training, namely, a batch size of 2048, an Adafactor optimizer with learning rate warm-up of 10K steps, and a max sequence length of 64. The model was trained for 125K steps. Each dataset was encoded using the Tensor2Tensor data generation algorithm which invertibly encodes a native string as a sequence of subtokens, using WordPiece, an algorithm similar to BPE Kudo and Richardson (2018). Beam search was used to decode the test data, with a beam width of 4.

Results

Section 6.1 describes the quantitative performance of the models by comparing BLEU scores, while a qualitative analysis is performed in Section 6.2 by analysing translated sentences as well as attention maps. Section 6.3 provides the results for an ablation study done regarding the effects of BPE.

The BLEU scores for each target language for both the ConvS2S and the Transformer models are reported in Table 4. For the ConvS2S model, we provide results for sentences tokenised by white spaces (Word), and when tokenised using the optimal number of BPE tokens (Best BPE), as determined in Section 6.3. The Transformer model uses the same number of WordPiece tokens as the number of BPE tokens which was deemed optimal during the BPE ablation study done on the ConvS2S model.

In general, the Transformer model outperformed the ConvS2S model for all of the languages, sometimes achieving 10 BLEU points or more over the ConvS2S models. The results also show that the translations using BPE tokenisation outperformed translations using standard word-based tokenisation. The relative performance of Transformer to ConvS2S models agrees with what has been seen in existing NMT literature Vaswani et al. (2017). This is also the case when using BPE tokenisation as compared to standard word-based tokenisation techniques Sennrich et al. (2015).

Overall, we notice that the performance of the NMT techniques on a specific target language is related to both the number of parallel sentences and the morphological typology of the language. In particular, isiZulu, N. Sotho, Setswana, and Xitsonga are all agglutinative languages, making them harder to translate, especially with very little data Chahuneau et al. (2013). Afrikaans is not agglutinative; thus, despite having fewer than half the number of parallel sentences that Xitsonga and Setswana have, the Transformer model still achieves reasonable performance. Xitsonga and Setswana are both agglutinative, but have significantly more data, so their models achieve much higher performance than N. Sotho or isiZulu.

The translation models for isiZulu achieved the worst performance when compared to the others, with a maximum BLEU score of 3.33. We attribute the poor performance to the morphological complexity of the language (as discussed in Section 3), the very small size of the dataset, and the poor quality of the data (as discussed in Section 5.1).
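For readers less familiar with the metric, the following is a minimal sketch of corpus-level BLEU as it is commonly defined: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is not the scorer used to produce Table 4; it assumes white-space tokenisation, a single reference per sentence, and no smoothing, and the function names are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # n-grams proposed by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tok, n)
            ref_counts = ngram_counts(ref_tok, n)
            # Counter intersection clips each n-gram to its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalise system output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"]))
```

Because every n-gram order must have at least one match, very short or very poor translations score zero under this unsmoothed form, which is one reason why scores on small, noisy test sets can be so low.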

6.2 Qualitative Results

We examine randomly sampled sentences from the test set for each language and translate them using the trained models. In order for readers to understand the accuracy of the translations, we provide back-translations of the generated translation to English. These back-translations were performed by a speaker of the specific target language. More examples of the translations are provided in the Appendix. Additionally, attention visualizations are provided for particular translations. The attention visualizations showed how the Transformer multi-head attention captured certain syntactic rules of the target languages.

In the Afrikaans example in Table 5, ConvS2S did not perform the translation successfully. Although the content was related to the topic of the original sentence, the meaning was not carried over. On the other hand, Transformer achieved an accurate translation. Interestingly, the target sentence used an abbreviation, whereas neither translation did. This is an example of how lazy target translations in the original dataset would negatively affect the BLEU score, and it motivates further improvement to the datasets. We plot an attention map in Figure 1 to demonstrate the success of Transformer in learning the English-to-Afrikaans sentence structure.

6.2.2 isiZulu

Despite the poor performance of the English-to-isiZulu models, we wanted to understand how they were performing. The translated sentences, given in Table 6, do not make sense, but all of the words are valid isiZulu words. Interestingly, the ConvS2S translation uses English words, perhaps due to English data occurring in the isiZulu dataset. The ConvS2S model did, however, attach the correct prefix “i-” to the English phrase. The Transformer translation includes invalid acronyms and mentions “disease”, which is not in the source sentence.

6.2.3 Northern Sotho

If we examine Table 7, the ConvS2S model struggled to translate the sentence and had many repeating phrases. Given that the sentence provided is a difficult one to translate, this is not surprising. The Transformer model translated the sentence well, except that it included the word “boithabišo”, which in this context can be translated to “fun” - a concept that was not present in the original sentence.

6.2.4 Setswana

Table 8 shows that the ConvS2S model translated the sentence very successfully. The word “khumo” directly means “wealth” or “riches”. A better synonym would be “letseno”, meaning income, or “letlotlo”, which means monetary assets. The Transformer model only had a single misused word (translated “shortage” into “necessity”), but otherwise translated successfully. The attention map visualization in Figure 2 suggests that the attention mechanism has learnt that the sentence structure of Setswana is the same as English.

6.2.5 Xitsonga

An examination of Table 9 shows that both models perform well translating the given sentence. However, the ConvS2S model had a slight semantic failure where the cause of the economic growth was attributed to unemployment, rather than vice versa.

6.3 Ablation Study over the Number of Tokens for Byte-pair Encoding

BPE Sennrich et al. (2015) and its variants, such as SentencePiece Kudo and Richardson (2018), aid translation of rare words in NMT systems. However, the choice of the number of tokens to generate for any particular language is not made obvious by the literature. Popular choices for the number of tokens are between 30,000 and 40,000: Vaswani et al. (2017) use 37,000 tokens for the WMT 2014 English-to-German translation task and 32,000 tokens for the WMT 2014 English-to-French translation task. Johnson et al. (2017) used 32,000 SentencePiece tokens across all source and target data. Unfortunately, no motivation for the choice for the number of tokens used when creating sub-words has been provided.

Initial experimentation suggested that the choice of the number of tokens used when running BPE tokenisation affected the model’s final performance significantly. In order to obtain the best results for the given datasets and models, we performed an ablation study, using subword-nmt Sennrich et al. (2015), over the number of tokens required by BPE, for each language, on the ConvS2S model. The results of the ablation study are shown in Figure 3.

As can be seen in Figure 3, the models for languages with the smallest datasets (namely isiZulu and N. Sotho) achieve higher BLEU scores when the number of BPE tokens is smaller, and their scores decrease as the number of BPE tokens increases. In contrast, the performance of the models for languages with larger datasets (namely Setswana, Xitsonga, and Afrikaans) improves as the number of BPE tokens increases. There is a decrease in performance at 20,000 BPE tokens for Setswana and Afrikaans, which the authors cannot yet explain and which requires further investigation. The optimal number of BPE tokens was used for each language, as indicated in Table 4.
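To make the quantity varied in this ablation concrete, the following is a minimal sketch of how BPE learns its merge operations, in the spirit of Sennrich et al. (2015). It is not the subword-nmt implementation used in the experiments: the function names are illustrative, the word counts are toy data, and ties between equally frequent pairs are broken by first occurrence. The `num_merges` argument plays the role of the number of BPE tokens, since each merge adds one new subword symbol.

```python
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def apply_merge(pair, symbols):
    """Replace every occurrence of `pair` in a symbol tuple by one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict."""
    # Start from characters, with an end-of-word marker on every word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {apply_merge(best, s): f for s, f in vocab.items()}
    return merges, vocab


# Toy usage: more merges give longer subwords and fewer symbols per word.
counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
for k in (0, 3, 10):
    merges, vocab = learn_bpe(counts, k)
    print(k, sorted({s for symbols in vocab for s in symbols}))
```

With few merges, words are split into short, frequently observed pieces. With many merges, the symbols approach whole words, each of which is seen only rarely in a small corpus. This may help explain why the smallest datasets favoured fewer BPE tokens, although the sketch itself does not test that.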

Future Work

Future work involves improving the current datasets, specifically the isiZulu dataset, and thus improving the performance of the current machine translation models.

As this paper only provides translation models for English to five of the South African languages and Google Translate provides translation for an additional two languages, further work needs to be done to provide translation for all 11 official languages. This would require performing data collection and incorporating unsupervised Lample et al. (2018); Lample and Conneau (2019), meta-learning Gu et al. (2018), or zero-shot techniques Johnson et al. (2017).

Conclusion

African languages are numerous and low-resourced. Existing datasets and research for machine translation are difficult to discover, and the research hard to reproduce. Additionally, very little attention has been given to the African languages so no benchmarks or leader boards exist, and few attempts at using popular NMT techniques exist for translating African languages.

This paper reviewed existing research in machine translation for South African languages and highlighted their problems of discoverability and reproducibility. In order to begin addressing these problems, we trained models to translate English to five South African languages, using modern NMT techniques, namely ConvS2S and Transformer. The results were promising for the languages that have more, higher-quality data (Xitsonga, Setswana, Afrikaans), while there is still extensive work to be done for isiZulu and N. Sotho, for which exceptionally little data exists and the data is of worse quality. Additionally, an ablation study over the number of BPE tokens was performed for each language. Given that all data and code for the experiments are published on GitHub, these benchmarks provide a starting point for other researchers to find, compare and build upon.

The source code and the data used are available at https://github.com/LauraMartinus/ukuxhumana.

Acknowledgements

The authors would like to thank Reinhard Cromhout, Guy Bosa, Mbongiseni Ncube, Seale Rapolai, and Vongani Maluleke for assisting us with the back-translations, and Jason Webster for Google Translate API assistance. Research supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).

References

Appendix A

Additional translation results from ConvS2S and Transformer are given in Table 10 along with their back-translations for Afrikaans, N. Sotho, Setswana, and Xitsonga. We include these additional sentences as we feel that the single sentence provided per language in Section 6.2 is not enough to demonstrate the capabilities of the models. Given the scarcity of research in this field, researchers might find the additional sentences insightful for understanding the real-world capabilities and potential of the models, even if BLEU scores are low.