A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi, Yvonne Wambui Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Koffi Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, Sam Manthalu

cs.CL

Introduction

Enormous efforts have been invested in making language and translation models more multilingual while leveraging the maximal amount of data for training, most prominently large crawls of monolingual and parallel data from the web El-Kishky et al. (2020); Schwenk et al. (2021b, a); Xue et al. (2021b). The resulting models are now capable of translating between hundreds of languages, including language pairs that in isolation do not have large collections of parallel data Tang et al. (2020); Xue et al. (2021a); Fan et al. (2021b). For example, M2M-100 (Fan et al., 2021b) can translate (with low accuracy) between Hausa and Yorùbá, two of the most widely spoken languages in Nigeria, even though there is barely any parallel data available for training. For languages that are not included in the set of training languages, the model would have no knowledge of how to generate translations. Does this mean there is no hope for languages that do not have a large presence on the web and are therefore not included in these pre-trained models?

We investigate how large-scale pre-trained models can be leveraged for the translation of unseen low-resource languages and domains. We address this question by studying 16 African languages that are largely underrepresented in NLP research Joshi et al. (2020); $\forall$ et al. (2020) and further have little to no training data available (§3). These languages provide an ideal testbed for two challenging knowledge transfer tasks: (1) How can pre-trained models create translations for languages unseen at training time? and (2) Since training data may only exist in a single domain (e.g. religious texts), how can a model be trained in one domain and translate another effectively at test time?

These questions are extremely relevant for our chosen languages because they all have millions of native speakers and a massive need for translation technologies. For example, news concerning the African continent is almost exclusively published in English, French, or Arabic, and is thereby inaccessible to speakers of only native African languages. This creates a bottleneck for information transmission, which becomes even more critical in times of crises (Öktem et al., 2020; Anastasopoulos et al., 2020; Öktem et al., 2021). Furthermore, the task of translating news has historically played a central role in translation research, e.g. in shared tasks since 2008 (Callison-Burch et al., 2008) and as a test for determining human parity (Hassan et al., 2018; Läubli et al., 2018; Toral et al., 2018). To spur the development of dedicated news translation models for Africa, we construct a benchmark of news translation for translating between 16 native African languages and English or French (§4).

This allows us to compare three approaches to leveraging large-scale multilingual models for the translation of previously unseen languages: (1) zero-shot transfer, (2) continual pre-training on monolingual data, and (3) multi-domain fine-tuning on parallel data (§5). We find that fine-tuning pre-trained models on a few thousand sentences of high quality bitext is remarkably effective, and can be further augmented with continual pre-training on African languages and fine-tuning on news domain data (§6). All data, models and code are publicly available at https://github.com/masakhane-io/lafand-mt under academic license. Our contributions are the following:

We create a new African news corpus for machine translation (following principles of participatory research $\forall$ et al. (2020)) covering 16 African languages.

We adapt several multilingual pre-trained models (MT5, ByT5, mBART, M2M-100) to these largely unseen languages, and evaluate their quality on news translation.

We quantify the effectiveness of small in-domain translation sets by measuring domain transfer effects and comparing fine-tuning strategies.

We find that having a targeted collection of translations is surprisingly effective, showcasing the power of local knowledge in so-called “zero-resource” scenarios (Bird, 2020). This paints a promising picture for the development of NLP technology for understudied languages: being able to customize these models for new languages of interest with as little as 2k sentences and a few fine-tuning steps, MT developers and users from any language community are less dependent on the choices and monetary interests of industry powerhouses from the Global North (Paullada, 2020).

Related Work

One of the major challenges of developing MT models for African languages is the lack of data. There have been many attempts to automatically crawl and align sentences from the web Schwenk et al. (2021a, b). Nevertheless, the resulting corpora for many African languages are typically small and of poor quality Kreutzer et al. (2021). Cleaner parallel data comes mostly from religious sources, like the Bible, covering over 1600 languages McCarthy et al. (2020), and JW300 Agić and Vulić (2019) from JW.org, covering 343 languages, including over 100 African languages. Apart from training datasets, evaluation datasets are needed to test the performance of multilingual MT models. The FLORES-101 Goyal et al. (2021) evaluation set, sourced from Wikipedia and manually translated, covers the largest number of languages, including 20 African languages. Finally, while other evaluation datasets for translating into or from African languages have been developed Siminyu et al. (2021); Emezue and Dossou (2020); Azunre et al. (2021b); Nyoni and Bassett (2021); Gezmu et al. (2021); Ali et al. (2021), only a few African languages have evaluation datasets in the news domain Adelani et al. (2021a); Mabuya et al. (2021); Ezeani et al. (2020). Ours covers 11 African languages (§4).

Low-resource MT.

Interest in low-resource MT has been increasing both within the MT research community Haddow et al. (2021) and in native speaker communities $\forall$ et al. (2020); Azunre et al. (2021a); Mager et al. (2021). On the modeling side, many techniques have been developed: unsupervised MT Lample et al. (2018) leverages monolingual data; single multilingual models are capable of translating between many languages Firat et al. (2016); Johnson et al. (2017); Aharoni et al. (2019); Fan et al. (2021a); and multilingual unsupervised models leverage a related language (with parallel data) to assist in translating a low-resource language that might not even have any monolingual data Ko et al. (2021). Unfortunately, unsupervised MT typically performs poorly on low-resource languages Marchisio et al. (2020).

Transfer learning from high-resource languages has achieved more promising results: Transfer from multilingual pre-trained language models (PLM), like mBART50 Tang et al. (2020) and MT5 Xue et al. (2021b), and large-scale multilingual MT often outperforms bilingual MT Tran et al. (2021); Yang et al. (2021). For low-resource languages this strategy outperforms the baseline (Transformer) models Birch et al. (2021); Adelani et al. (2021a); Lee et al. (2022). The performance can be further improved by large scale pre-training Reid et al. (2021); Emezue and Dossou (2021).

Focus Languages and Their Data

We focus on 16 African languages with varying quantities of available data Joshi et al. (2020), including moderately low-resource languages such as Swahili and Hausa, and very low-resource languages such as Ghomálá’ (spoken by an estimated 1.1M people in Cameroon, with the Bible being its largest available corpus). Table 1 provides an overview of the focus languages, including the language families, location and number of speakers, and the source and original language for our corpus. The languages are from four language families: Afro-Asiatic (e.g. Hausa), Nilo-Saharan (e.g. Luo), English Creole (e.g. Nigerian-Pidgin/Naija) and Niger-Congo. Most of the languages (13 out of 16) are from the Niger-Congo family, which is the largest language family in Africa. Six of the languages are predominantly spoken in Francophone countries of Africa, while the remainder are predominantly spoken in Anglophone countries of Africa. In contrast to previous work ( $\forall$ et al., 2020; Gowda et al., 2021), we do not focus exclusively on translation to/from English since this is not the primary language of the Francophone Africa community. All languages are spoken by at least one million speakers.

Language Characteristics.

All languages are written in Latin script, using letters of the basic Latin alphabet with a few omissions (e.g. “c”, “q”, “x”, “z”) and additions (e.g. “ɛ”, “ɔ”, “ŋ”, “ọ”, including digraphs like “gb”, “kp”, “gh”, and sometimes more than two-character letters). 13 of the languages are tonal, and about nine make use of diacritics. Many African languages are morphologically rich. For example, all Bantu languages are agglutinative. Fon, Mossi, and Yorùbá are highly isolating. All languages follow the Subject-Verb-Object sentence structure like English and French. Appendix A provides more details.

Existing Parallel Corpora.

We curate publicly available parallel data for our focus languages, which consists primarily of text in the religious domain. For most African languages, the largest available parallel corpus is JW300 Agić and Vulić (2019), sourced from jw.org, which publishes biblical texts as well as lifestyle and opinion columns. Varying quantities of data are available for 11 of the 16 focus languages. Éwé, Igbo, Swahili, Setswana, Twi, Yorùbá, and isiZulu have over 400K parallel sentences. Hausa and Mossi have slightly more than 200K parallel sentences, while Fon and Naija have around 30K sentences. For the remaining five languages that are not in the JW300 corpus, we make use of the Bible. (Some languages, like Luo and Luganda, are covered by JW300 but were no longer available at the time of paper writing.) The Bibles were crawled or downloaded from https://ebible.org/, except for Bambara, which we obtained from https://live.bible.is/, and Ghomálá’, which we obtained from www.beblia.com. We aligned the sentences automatically by the verses (around 31k in total). Ghomálá’ only has the New Testament, with 8k verses. Bambara and Wolof are missing some verses and books, leading to total sizes of 28K and 22K, respectively. Table 1 summarizes this information about the religious (REL) corpora.
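Aligning two Bible translations by verse can be illustrated with a short sketch. It is not the authors' script: the verse identifier format (`"GEN 1:1"`) and the function name `align_by_verse` are hypothetical, and the placeholder strings stand in for real verse text. The idea is only that a verse becomes a sentence pair when it is present and non-empty on both sides, which is also why translations with missing verses yield fewer pairs.

```python
from typing import Dict, List, Tuple

# Verse identifiers such as "GEN 1:1" are a hypothetical key format.
VerseId = str


def align_by_verse(
    source: Dict[VerseId, str], target: Dict[VerseId, str]
) -> List[Tuple[str, str]]:
    """Pair up verses that exist, non-empty, on both sides."""
    pairs = []
    for verse_id, src_text in source.items():
        src_text = src_text.strip()
        tgt_text = target.get(verse_id, "").strip()
        if src_text and tgt_text:
            pairs.append((src_text, tgt_text))
    return pairs


# Placeholder texts; real inputs would be full Bible translations.
english = {
    "GEN 1:1": "<verse 1 in English>",
    "GEN 1:2": "<verse 2 in English>",
    "GEN 1:3": "<verse 3 in English>",
}
other = {
    "GEN 1:1": "<verse 1 in the other language>",
    "GEN 1:3": "<verse 3 in the other language>",
}

bitext = align_by_verse(english, other)
print(len(bitext))  # 2: GEN 1:2 is missing on one side and is skipped
```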

MAFAND-MT African News Corpus

We introduce our newly translated news corpus, MAFAND-MT (Masakhane Anglo & Franco Africa News Dataset for Machine Translation). Table 1 gives the news source and data splits for 11 African languages, which include six languages (bam, bbj, ewe, fon, mos, wol) spoken predominantly in Francophone Africa and five languages (lug, luo, pcm, tsn, twi) spoken predominantly in Anglophone Africa. The MAFAND-MT corpus was created in three steps:

Crawling and preprocessing of news websites from local newspapers that are publishing in English and French. Raw texts from the web were segmented into sentences. Most languages were crawled from one or two sites, except for Wolof and Fon that were crawled from four and seven news websites respectively due to local French language newspapers having very few articles. We also ensured that the articles came from a variety of topics e.g. politics, sports, culture, technology, society, religion, and education. This was carried out by native speakers of the target language with source language proficiency.

Translation of 5k–8k sentences by professional translators. The translation process took one to four months depending on the availability of the translators.

Quality control was provided by native speakers, who discussed and, if possible, fixed problematic translations and ran automatic checks to detect misspellings, duplicated sentences, and alignment problems.
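The automatic checks in the quality-control step can be sketched as a simple filter that flags suspicious sentence pairs for a human reviewer. This is an illustration rather than the tooling used for the corpus: the function name `flag_problems` and the length-ratio threshold of 3.0 are assumptions, and spell checking is left out because it depends on language-specific resources.

```python
from typing import List, Tuple


def flag_problems(
    pairs: List[Tuple[str, str]], max_len_ratio: float = 3.0
) -> List[Tuple[int, str]]:
    """Return (index, reason) for pairs a human reviewer should look at."""
    flagged = []
    seen = set()
    for i, (src, tgt) in enumerate(pairs):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            flagged.append((i, "empty side"))
            continue
        if (src, tgt) in seen:
            flagged.append((i, "duplicate"))
            continue
        seen.add((src, tgt))
        # A very lopsided token count often signals a misalignment.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            flagged.append((i, "length ratio"))
    return flagged


sample = [
    ("a b c", "x y z"),
    ("a b c", "x y z"),        # exact duplicate of the first pair
    ("a", ""),                 # missing translation
    ("a b c d e f g h", "x"),  # 8 tokens against 1
]
print(flag_problems(sample))
# [(1, 'duplicate'), (2, 'empty side'), (3, 'length ratio')]
```

Flagged pairs are only candidates for review; as described above, native speakers decide whether and how to fix them.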

Following the recommendations of $\forall$ et al. (2020), we design the process to be participatory: Everyone involved in the corpus creation is a native speaker of the respective target languages and has societal knowledge about the communities that speak those languages. This is particularly important for curation and quality control to ensure that the resulting material is appropriate and relevant for stakeholders of the final MT models ( $\forall$ et al., 2020; Kreutzer et al., 2021). Furthermore, everyone received appropriate remuneration. To enable cross-disciplinary knowledge transfer between participants in the individual steps, every language was assigned a coordinator. The coordinator conducted the initial curation in the first step, and communicated with translators and quality checkers throughout the following steps.

We found five African languages with available parallel texts in the news domain: Hausa (https://www.statmt.org/wmt21/translation-task.html), Igbo Ezeani et al. (2020), Swahili (https://sw.globalvoices.org/), Yorùbá Adelani et al. (2021a), and isiZulu Mabuya et al. (2021). Table 1 provides the news source and the TRAIN, DEV and TEST splits. Appendix B provides details on the pre-processing of the available news corpora.

Monolingual News Corpus

To adapt available multilingual pre-trained models via continued pre-training to African languages, we curated texts from the 17 highest-resourced African languages and three non-native African languages that are widely spoken on the continent (Arabic, English, and French). The selection of African languages is based on their coverage in mC4 Xue et al. (2021b), AfriBERTa corpora Ogueji et al. (2021), and other publicly available news websites like VOA and BBC. We limited the size of the corpus extracted from mC4 to the first 30 million sentences (roughly 1GB of data) for Afrikaans, Amharic, Arabic, English, French, and Swahili. In total, we collected about 12.3 GB of data. Appendix C provides more details about the pre-training corpus.
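Capping a very large corpus at its first N sentences can be done while streaming, without loading the corpus into memory. The sketch below assumes the corpus is available as an iterable of sentences; the names `MC4_SENTENCE_CAP` and `first_n_sentences` are illustrative only.

```python
from itertools import islice
from typing import Iterable, Iterator

MC4_SENTENCE_CAP = 30_000_000  # "first 30 million sentences" per capped language


def first_n_sentences(
    sentences: Iterable[str], cap: int = MC4_SENTENCE_CAP
) -> Iterator[str]:
    """Yield at most `cap` sentences, in their original order."""
    return islice(sentences, cap)


# Tiny demonstration with a cap of 3 instead of 30 million.
sample = (f"sentence {i}" for i in range(10))
kept = list(first_n_sentences(sample, cap=3))
print(kept)  # ['sentence 0', 'sentence 1', 'sentence 2']
```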

Models and Methods

We experiment with pre-trained multilingual models and our own bilingual MT baselines. We focus on pre-trained models with approximately 500M parameters, both for computational feasibility and for comparability across models.

We train Transformer Vaswani et al. (2017) sequence-to-sequence models from scratch for each language pair using JoeyNMT Kreutzer et al. (2019). We tokenize the bitext using a joint SentencePiece (https://github.com/google/sentencepiece) unigram model Kudo (2018), with a character coverage of 1.0 and a maximum sentence length of 4096 tokens, and create a vocabulary of 10K subwords. Models are trained on the concatenation of REL and NEWS corpora for each language.
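The tokenizer settings above map fairly directly onto options of the SentencePiece Python wrapper. The sketch below is an assumption about how such a configuration could look, not the authors' script: the file names are hypothetical, and treating the reported maximum sentence length as SentencePiece's `max_sentence_length` option is a guess. The `sentencepiece` import is deferred so that the option dictionary can be inspected without the library installed.

```python
from typing import Dict, Union

Option = Union[str, int, float]


def sentencepiece_options(bitext_path: str, model_prefix: str) -> Dict[str, Option]:
    """Collect the tokenizer settings reported in the text."""
    return {
        "input": bitext_path,          # source and target text in one file (joint model)
        "model_prefix": model_prefix,  # output files: <prefix>.model / <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 10_000,
        "character_coverage": 1.0,
        "max_sentence_length": 4096,   # assumed mapping of the reported limit
    }


def train_tokenizer(bitext_path: str, model_prefix: str) -> None:
    """Train the tokenizer; needs `pip install sentencepiece`."""
    import sentencepiece as spm  # imported lazily on purpose

    spm.SentencePieceTrainer.train(**sentencepiece_options(bitext_path, model_prefix))


# Hypothetical file names, shown only to illustrate the call.
options = sentencepiece_options("en-hau.joint.txt", "en-hau.spm")
print(options["vocab_size"], options["model_type"])  # 10000 unigram
```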

Pre-trained Models.

We consider three language models, MT5 Xue et al. (2021b), ByT5 (a token-free T5) Xue et al. (2021a), mBART50 Tang et al. (2020), and the multilingual translation model M2M-100 Fan et al. (2021b) for our experiments. We use MT5-base and ByT5-base, and M2M-100 with 418M parameters. Table 2 gives the pre-trained model size, number of African languages covered, and the focus languages supported.

Transfer Learning Across Languages

We describe two methods for adding new languages to existing models: continual pre-training and many-to-many multilingual translation.

The effectiveness of PLMs is limited on extremely low-resource languages because they rarely, if ever, occur in the pre-training corpus Wang et al. (2020); Liu et al. (2021). As shown in Table 2, even for MT5 and M2M-100, which cover 100 languages, less than half of the African languages under study are included. To adapt the existing PLMs to our languages and domains, we apply continual pre-training Gururangan et al. (2020); Liu et al. (2021) using our collected monolingual corpus. Specifically, before fine-tuning on the parallel MT data, models are pre-trained with their original training objective and vocabulary on the monolingual corpus. (Changing the vocabulary Gururangan et al. (2020) to fit the languages, or adding MT-focused training objectives for word alignment Liu et al. (2021), can potentially improve the performance further, which we leave for future work.) Pre-training parameters can be found in the appendix. We refer to the models adapted to African languages as AfriMT5, AfriByT5, and AfriMBART.

Many-to-Many Translation.

We fine-tuned M2M-100 for African multilingual translation to create English- and French-centric models. For the English-centric model, the M2M-100 model was fine-tuned on the news data for en–{hau, ibo, lug, luo, pcm, swa, tsn, twi, yor, zul} while the French-centric model is trained on fr–{bam, bbj, ewe, fon, mos, wol}. Languages not included in the pre-trained M2M-100 model were assigned the language code of a language included in M2M-100 but excluded from our study.

Transfer Learning Across Domains

As there is very limited MT data on the news domain, we compare different methods that combine the large data from the religious domain (REL) and the small data from the NEWS domain (NEWS) to fine-tune M2M-100:

REL+NEWS: Fine-tuning on the aggregation of REL and NEWS.

REL $\rightarrow$ NEWS: Training on REL, followed by fine-tuning on NEWS.

REL+NEWS $\rightarrow$ NEWS: REL+NEWS, followed by additional fine-tuning on NEWS.

Each fine-tuning stage lasts for three epochs. We evaluate translation quality with BLEU Papineni et al. (2002) using SacreBLEU (Post, 2018) (with the “intl” tokenizer; all data comes untokenized) and ChrF Popović (2015).
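The three strategies differ only in which corpora are used in which stage, so they can be written down as staged schedules. The sketch below makes that structure explicit; the names `SCHEDULES` and `run_schedule` and the `fine_tune` callback are hypothetical stand-ins for an actual training loop.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]
EPOCHS_PER_STAGE = 3  # each fine-tuning stage lasts three epochs

# Each schedule is a list of stages; each stage names the corpora it trains on.
SCHEDULES: Dict[str, List[Tuple[str, ...]]] = {
    "REL+NEWS": [("REL", "NEWS")],
    "REL->NEWS": [("REL",), ("NEWS",)],
    "REL+NEWS->NEWS": [("REL", "NEWS"), ("NEWS",)],
}


def run_schedule(
    name: str,
    corpora: Dict[str, Sequence[Pair]],
    fine_tune: Callable[[List[Pair], int], None],
) -> None:
    """Run the stages of one strategy, continuing from the same model each time."""
    for stage in SCHEDULES[name]:
        data = [pair for corpus in stage for pair in corpora[corpus]]
        fine_tune(data, EPOCHS_PER_STAGE)


# Demonstration with toy corpora and a callback that only records what it was given.
corpora = {"REL": [("r", "r")] * 5, "NEWS": [("n", "n")] * 2}
log: List[Tuple[int, int]] = []
run_schedule("REL+NEWS->NEWS", corpora, lambda data, epochs: log.append((len(data), epochs)))
print(log)  # [(7, 3), (2, 3)]
```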

Results and Discussion

We successfully adapt several multilingual pre-trained models to previously unseen African languages and quantify the effectiveness of small in-domain translation datasets. We discuss the effects of domain shift and analyze mitigation strategies.

We demonstrate that fine-tuning on a few thousand sentences of high-quality bitext is effective for adding new languages to pre-trained models. Continued pre-training to specialize models to African languages improves performance further.

Table 3 and Table 4 give the results of zero-shot evaluation on NEWS. We evaluate only the M2M-100 model because it has been pre-trained on parallel texts with a few of our focus languages. We observe very poor performance ( $<5$ BLEU) on the languages except for zul ( $>13$ BLEU) and swa ( $>20$ BLEU) in both translation directions. For swa, it is likely that the performance is reasonable because M2M-100 has seen more bitext during pre-training ( $2.4$ M sentences in CCAligned El-Kishky et al. (2020)). Other African languages except for Afrikaans have less than 600K sentences in CCAligned, and these are also of lower quality (Kreutzer et al., 2021), which affects overall zero-shot performance.

Performance after Fine-tuning.

We found impressive performance after fine-tuning PLMs and M2M-100 on a few thousand sentences (mostly 2K–7K sentences, except for swa with 30K sentences), including for languages not seen during pre-training. For en/fr-xx, MT5 has poor transfer performance with an average BLEU of $7.2$ , despite being pre-trained on 101 languages. ByT5 outperforms MT5 by over $3$ BLEU on average, even though their performances were reported to be similar in previous work (Xue et al., 2021a). This indicates that ByT5 might be preferable over MT5 when translating low-resource languages. Surprisingly, mBART50, which was pre-trained on only 50 languages, including 2 African languages, outperformed MT5 and ByT5, which are pre-trained on 101 languages. Overall, we found M2M-100 to be the best model, most likely because it was pre-trained on a translation task. In general, BLEU scores are relatively low ( $<15$ BLEU for 9 out of 16 languages for en/fr-xx and 7 in xx-en/fr) even when fine-tuning M2M-100 on in-domain data, which suggests that developing more effective methods for fine-tuning might be a promising future direction. The languages with the best quality according to BLEU are pcm, swa and tsn on the target side, and pcm, zul, and swa on the source side.

BLEU scores are higher when translating from an African language, which is expected given the more frequent exposure to English and French on the target side during pre-training, and given that BLEU penalizes morphologically rich languages (like bbj, lug, swa, tsn, and zul) more heavily. The ChrF metric works better for them. For example, fine-tuning M2M-100 on NEWS and evaluating on zul gives a BLEU of $21.0$ in en/fr-xx and a BLEU of $37.8$ in xx-en/fr, showing a large gap in performance between the two directions. However, with ChrF, we find a smaller performance gap ( $51.2$ in en/fr-xx and $55.5$ in xx-en/fr).
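The size of the directional gap under each metric follows directly from the four zul scores quoted above. The short calculation below only restates that arithmetic; the dictionary layout and key names are illustrative, and since BLEU and ChrF are on different scales the two gaps should be read as indicative rather than strictly comparable.

```python
# zul scores for M2M-100 fine-tuned on NEWS, as quoted in the text.
scores = {
    "BLEU": {"en-zul": 21.0, "zul-en": 37.8},
    "ChrF": {"en-zul": 51.2, "zul-en": 55.5},
}

gaps = {}
for metric, by_direction in scores.items():
    # Gap between translating from zul and translating into zul.
    gaps[metric] = by_direction["zul-en"] - by_direction["en-zul"]
    print(f"{metric}: gap of {gaps[metric]:.1f} points")
# BLEU: gap of 16.8 points
# ChrF: gap of 4.3 points
```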

Continual Pre-training.

We observe an improvement in BLEU when we utilize AfriMT5 and AfriByT5, for languages included in our continual pre-training corpus (Appendix C). Other languages also benefit despite not being seen during continual pre-training, possibly due to language similarity. For example, AfriByT5 on fr-bam improved by $1.9$ BLEU over ByT5 and AfriMT5 on en-tsn improved by $3.6$ BLEU over MT5. On average, AfriMT5 improved over MT5 by $1.3$ BLEU in en/fr-xx and $2.4$ BLEU in the xx-en/fr. The improvement for AfriByT5 was much smaller: $0.6$ and $0.9$ BLEU in en/fr-xx and xx-en/fr translation directions. For AfriMBART, we did not see any improvement on average, only the performance of hau ( $1.5$ BLEU) and ibo ( $0.7$ BLEU) improved in en/fr-xx direction. However, in the xx-en/fr direction, fon, tsn, twi, and zul improved by 2.7–6.0 BLEU.

Many-to-Many Multilingual MT.

Training on the combined news corpus from all languages that use French or English separately does not appear to help much. We see slight improvements for most languages only in the xx-en/fr direction.

Adaptation to the News Domain

To improve over the baseline performance on NEWS, we train bilingual Transformer models (as a baseline) and M2M-100 on a combination of REL and NEWS. We chose M2M-100 because it was the best performing model. Table 5 gives the BLEU on three settings: REL+NEWS, REL $\rightarrow$ NEWS, and REL+NEWS $\rightarrow$ NEWS. In general, the improvement depends on the size of the REL corpus. For languages trained on the Bible, such as bbj, bam, lug, luo, and wol, the improvement is minimal. For M2M-100, the REL+NEWS performance does not improve over NEWS despite the larger quantity of training data. This demonstrates that increasing the amount of data in the target domain is the most helpful strategy (see Figure 2). Similarly, combining REL+NEWS is not very helpful for xx-en/fr. An alternative approach is REL $\rightarrow$ NEWS, which allows the model to develop a good understanding of the desired language before adapting to the news domain. We observe an increase of $1.1$ BLEU over REL+NEWS in the en/fr-xx direction. However, the best strategy is REL+NEWS $\rightarrow$ NEWS, especially for xx-en/fr, where it yields an improvement over NEWS and REL+NEWS of $2.0$ and $1.5$ BLEU, respectively.

Analysis of Domain Shift

If we train models only on previously available religious data, they are not capable of translating news well due to the strong domain bias. This is illustrated in Figure 1: All models perform much worse on NEWS than on the REL domain. When the quantity of religious training data is small, the loss in translation performance on the news test set is largest, cf. bbj (8k of REL data, -95.5% BLEU), bam (28k, -93.5%) and luo (31k, -93.5%). This indicates that when the REL training data is sparse, it is insufficient to teach the M2M-100 model the more general understanding required for translating NEWS. However, when the religious training data is larger, this loss is reduced, cf. translating to zul (667k, -67%), swa (872k, -69.3%), and tsn (870k, -71%). While this is the general trend, pcm, whose religious training data is small (23k), has the lowest drop in performance (-59.3%), which may be due to its strong similarity to its source language.
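The percentage drops above are relative changes in BLEU between the two test domains. The text does not spell out the formula, so the sketch below assumes the usual definition (change relative to the REL score); the function name and the example scores are hypothetical and are not taken from Figure 1.

```python
def relative_change(bleu_rel: float, bleu_news: float) -> float:
    """Percentage change in BLEU when moving from the REL to the NEWS test set.

    Assumed definition: 100 * (NEWS - REL) / REL, so a drop is negative.
    """
    if bleu_rel <= 0:
        raise ValueError("the REL score must be positive")
    return 100.0 * (bleu_news - bleu_rel) / bleu_rel


# Hypothetical scores: 40 BLEU on REL falling to 12 BLEU on NEWS.
drop = relative_change(bleu_rel=40.0, bleu_news=12.0)
print(f"{drop:.1f}%")  # -70.0%
```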

How many sentences in the target domain are required?

Figure 2 shows how, for three selected language pairs with a large (fr-bam), medium (eng-ibo) and relatively small (eng-swa) domain gap, the quality of target domain translations improves as we increase the size of the target domain corpus. For all three pairs, fine-tuning M2M-100 or ByT5 on 2.5k sentence pairs of in-domain data (NEWS) is sufficient to outperform the bilingual Transformer baselines that were additionally trained on larger amounts of out-of-domain data (REL). Surprisingly, this procedure not only works for languages included during pre-training (swa), but also for previously unseen languages (ibo, bam). M2M-100 tends to adapt to the new data more quickly than ByT5, but in all cases, models continue to learn with additional in-domain data. This shows how much more effectively a small number of in-domain translations can be used when they serve for fine-tuning multilingual pre-trained models rather than training bilingual MT models from scratch.

Examples of Domain Bias.

To illustrate the challenge of overcoming domain bias, we show examples translating from bam and lug in Table 7. The M2M-100 model fine-tuned only on REL succeeds in roughly capturing the meaning of the sources, but using biblical terms, such as “scroll” instead of “novel”. Adding our news corpus to fine-tuning resolves these issues (e.g. “book”).

How general is our news corpus?

Table 8 shows the zero-shot evaluation of M2M-100 fine-tuned on our small NEWS corpora on other domains: religious (REL) and Wikipedia (FLORES). We evaluated the Wikipedia domain on the FLORES devtest and the REL domain on either JW300 or Bible (lug, luo, wol). As a baseline, we evaluated the zero-shot performance of M2M-100 (not fine-tuned, ✗) on FLORES (except for Luo, which is not supported) using spBLEU (i.e. sentencepiece BLEU Goyal et al. (2021)). We noticed very poor performance except for Swahili, as discussed in §6.1. After fine-tuning on our new data (✓), transfer is largely improved across the board (up to +17 BLEU for en-ibo). The same trend holds for the religious domain. This shows that even though our data comes from the news domain, it helped the model generalize to other domains. Hence, expanding African news corpora and developing better MT models for news pays off even for other domains of interest.

Conclusion

We have created MAFAND-MT, a corpus of 16 African languages to study translation systems for low-resource languages in the news domain. We investigate how to most effectively adapt large-scale pre-trained models to incorporate new languages and new domains. Our findings suggest that as few as 2k sentences are sufficient for fine-tuning to improve performance, paving the way for others to create new translation systems without relying on large collections of web-sourced text. This has strong implications for languages that are spoken by millions but lack presence on the web.

In the future, we hope to expand our coverage to additional under-resourced languages, and to develop even more effective fine-tuning objectives. Currently, we are extending our corpus to Amharic, Chichewa, Kinyarwanda, Shona, and isiXhosa, including an expansion of the Hausa corpus; they will be released under the MAFAND-MT dataset name. We provide details on the evaluation datasets in Appendix H.

Acknowledgment

This work was carried out with support from Lacuna Fund, an initiative co-founded by The Rockefeller Foundation, Google.org, and Canada’s International Development Research Centre. David Adelani acknowledges the EU-funded Horizon 2020 projects: COMPRISE (http://www.compriseh2020.eu/) under grant agreement No. 3081705 and ROXANNE under grant number 833635. We thank Chester Palen-Michel and Constantine Lignos for providing the VOA corpus for this research, and Google for providing GCP credits to run some of the experiments. Finally, we thank Davor Orlič and Knowledge4All for their administrative support throughout the project.

References

Appendix A Language Characteristics

Table 9 provides the details about the language characteristics.

Appendix B Available Parallel Corpora

We found five African languages with publicly available parallel texts in the news domain: Hausa, Igbo, Swahili, Yorùbá, and isiZulu. Table 1 provides the news source and the TRAIN, DEV and TEST splits.

The Hausa Khamenei corpus (https://www.statmt.org/wmt21/translation-task.html) contains 5,898 sentences, which we split into TRAIN (3,098), DEV (1,300), and TEST (1,500).
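The split sizes can be checked and applied with a small helper. The sketch below cuts the corpus into contiguous parts; whether the original split was contiguous or shuffled is not stated, so that choice, like the function name `split_corpus`, is an assumption.

```python
from typing import List, Sequence, Tuple


def split_corpus(
    pairs: Sequence, n_train: int, n_dev: int, n_test: int
) -> Tuple[List, List, List]:
    """Cut a corpus into contiguous TRAIN/DEV/TEST parts of the given sizes."""
    if n_train + n_dev + n_test != len(pairs):
        raise ValueError("split sizes must add up to the corpus size")
    train = list(pairs[:n_train])
    dev = list(pairs[n_train:n_train + n_dev])
    test = list(pairs[n_train + n_dev:])
    return train, dev, test


corpus = list(range(5898))  # stand-in for the 5,898 Hausa sentence pairs
train, dev, test = split_corpus(corpus, 3098, 1300, 1500)
print(len(train), len(dev), len(test))  # 3098 1300 1500
```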

Igbo

The Igbo corpus Ezeani et al. (2020) has 9,998 sentences. We extract 6,998 sentences for TRAIN and use the remaining sentences for the DEV and TEST splits.

Swahili

The Global Voices corpus (https://sw.globalvoices.org/) contains 30,782 sentences, which we use for the TRAIN split. We additionally crawled newer (2019–2021) publications of Swahili articles from the Global Voices website, giving a total of 3,626 sentences that were aligned and manually verified by Swahili speakers. These are divided into the DEV and TEST splits.

Yorùbá

The MENYO-20k Adelani et al. (2021a) corpus contains sentences from different domains (TED talks, books, software localization, proverbs, and news), from which we select the news domain sentences for the TRAIN, DEV and TEST splits.

isiZulu

The Umsuka corpus Mabuya et al. (2021) contains 9,703 training sentences and 1,984 evaluation sentences. 4,739 training sentences were translated from English-isiZulu, and the remaining from isiZulu-English. We only keep the training sentences translated into isiZulu, and split them into 3,500 for TRAIN and 1,239 sentences for DEV. From the existing evaluation set we select only the 998 English-isiZulu translations for TEST. Umsuka provides two translations for each English sentence, but we use only the first.

Appendix C Monolingual Corpus for PLM Adaptation

Table 10 provides details about the monolingual corpus used to adapt the pre-trained language models (PLMs), including the size and source of each corpus. The African languages covered in continued pre-training are: Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Oromo, Naija, Kinyarwanda, Kirundi, Shona, Somali, Sesotho, Swahili, isiXhosa, Yorùbá, and isiZulu.

Appendix D Model Hyper-parameters and Reproducibility of Results

For the pre-trained models, we fine-tune the models using the HuggingFace Transformers library Wolf et al. (2020) with the default learning rate ( $5e-5$ ), a batch size of $10$ , a maximum source length and maximum target length of $200$ , a beam size of $10$ , and $3$ epochs, except for models trained on only NEWS, for which we use $10$ epochs. All the experiments were performed on a single GPU (Nvidia V100).
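For reproducibility, the fine-tuning settings listed above can be gathered in one place. The sketch below uses a plain dataclass rather than any specific HuggingFace class; the names `FineTuneConfig` and `config_for` and the field names are illustrative and would still need to be mapped onto the arguments of whichever training script is used.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    learning_rate: float = 5e-5
    batch_size: int = 10
    max_source_length: int = 200
    max_target_length: int = 200
    beam_size: int = 10
    num_epochs: int = 3


def config_for(training_data: str) -> FineTuneConfig:
    """NEWS-only runs use 10 epochs; every other setting uses 3."""
    return FineTuneConfig(num_epochs=10 if training_data == "NEWS" else 3)


print(asdict(config_for("NEWS"))["num_epochs"])      # 10
print(asdict(config_for("REL+NEWS"))["num_epochs"])  # 3
```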

When fine-tuning pre-trained models, the target language must be specified during decoding from among those that the model has seen during pre-training; this matters especially for mBART50, which only supports two African languages. We follow past works Madaan et al. (2020); Cahyawijaya et al. (2021); Lee et al. (2022) in selecting another closely-related language that is represented in the pre-trained model. For convenience, we make use of Swahili (sw) as the target language when an African language is not represented, since Swahili is represented in all the pre-trained models. The only exception is Nigerian-Pidgin, where we make use of French (fr) since it is closely related to English. When a language is represented in the pre-trained model, e.g. Yorùbá (yo) in M2M-100, we make use of the correct language code.
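The rule for picking a decoding language code can be summarized in a few lines. This is a sketch of the rule as described, not code from the released repository: the function name is hypothetical, the `supported` set in the example is a made-up subset rather than the real language list of any model, and the mix of two-letter codes with `pcm` for Nigerian-Pidgin simply follows the codes used in the text.

```python
from typing import Set


def decoding_language_code(lang: str, supported: Set[str]) -> str:
    """Pick the language code passed to the model at decoding time."""
    if lang in supported:
        return lang  # the model has seen this language: use its own code
    if lang == "pcm":
        return "fr"  # Nigerian-Pidgin is the stated exception
    return "sw"      # otherwise borrow Swahili's code


# Illustrative subset only; consult the model card for the real list.
supported = {"en", "fr", "sw", "yo"}
for lang in ("yo", "pcm", "bam"):
    print(lang, "->", decoding_language_code(lang, supported))
# yo -> yo
# pcm -> fr
# bam -> sw
```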

To train AfriMT5 and AfriByT5, we start from mT5 and ByT5, respectively. We pre-train with a learning rate of $1e-4$ , $10,000$ warm-up steps, and a batch size of $2048$ for one epoch. For mBART50, we pre-train with a learning rate of $5e-5$ for $50,000$ steps using Fairseq Ott et al. (2019) without modifying the mBART50 vocabulary. Table 11 lists the names of all the models that are publicly available on the HuggingFace Model Hub (https://huggingface.co/masakhane). In total, we have 357 models: 22 x 16 = 352 bilingual models, two English/French-centric models, and three models adapted to African languages (i.e., AfriMT5, AfriByT5, and AfriMBART).

Appendix E BLEU vs spBLEU

Table 12 and Table 13 compare the BLEU and spBLEU metrics for the domain transfer experiments. We observe that spBLEU gives higher scores than BLEU, especially in the en/fr-xx direction, which suggests that it may be better suited for evaluating African languages. However, further analysis and human evaluation are still needed to show that spBLEU is generally better. In the xx-en/fr direction, on the other hand, there is not much difference between the BLEU and spBLEU scores.

Appendix F Qualitative Analysis

The following examples from the Fon-to-French translations of the test set illustrate the advantage of multilingual modeling and its limitations:

Source (fon): Louis Guy Alimanyiɖokpo kpódÍssa Etchlekoun kpó ɔ, sín azǎn mɔkpán ɖye ɔ, ye ɖò wǔvɛ sè wɛ tawun ɖò agbaza mɛ, có ye ká tuun fí é azɔn nɛ lɛɛ gosin é ɔ ǎ.

Reference (fr): Les faits Louis Guy Alimagnidokpo et Issa Etchlekoun se plaignent depuis quelques jours de multiples douleurs, ignorant l’origine réelle de leurs maux.

Bilingual Transformer (REL+NEWS, fon $\rightarrow$ fr): on ne peut pas avoir une trentaine d’années ni un jeune homme ni un jeune homme d’âge pour un jeune homme qui soit 12 ans.

M2M-100 (REL+NEWS $\rightarrow$ NEWS, fon $\rightarrow$ fr): Louis Guy Alimanyion et Issa Etchlekoun ont depuis plusieurs jours souffert d’une maladie grave malgré les conséquences de cette maladie qu’ils ne connaissent pas.

M2M-100 (REL+NEWS $\rightarrow$ NEWS, fr $\rightarrow$ fon): Sín azǎn yɔywɛywɛ ɖé ɖye ɖokpóo wɛ́ nǔ è kàn Louis Guy Alimagnidokpo kpódó Issa Etchlɛkɛ́n kpán ɖè ɔ́ ɖò xó ɖɔ wɛ́ ɖɔ́ wǔvɛ́ gege wɛ́, ye ká tuun nǔ è wú wǔvɛ́ yetɔn ɖè ɔ́ ǎ.

The translation of the bilingual Transformer model is very poor and far from the Fon source, highlighting how poorly the model generalized from the few thousand training sentences. The M2M-100 model gives a more meaningful and adequate translation. M2M-100 makes a surprising but beautiful move, switching se plaignent depuis quelques jours de multiples douleurs (sín azǎn mɔkpán ɖye ɔ, ye ɖò wǔvɛ sè wɛ tawun ɖò agbaza mɛ) to ont depuis plusieurs jours souffert d’une maladie grave. The BLEU score here might be low, but the meaning is conserved and even more detailed than in the French reference. In fact, in this source context, wǔvɛ means souffrir, souffrance (suffer, suffering): the French reference uses se plaignent (complaining), which makes less sense than souffert used in the M2M-100 prediction. M2M-100 also learned the style of the sentence: có ye ká tuun fí é azɔn nɛ lɛɛ gosin (but they do know the origin of their sufferings) é ɔ ǎ (NOT); this last part is crucial for the meaning of the entire sentence. Given the structural and morphological differences between Fon and French, we expected it to be more complicated to predict. However, the translation is structurally wrong, even though any native French speaker would understand the conveyed message quickly and easily. In the M2M-100 translation, the word malgré is in the wrong place, corrupting the syntax and logic of the second clause. A perfect translation (in terms of the idea to be expressed) would be: "Louis Guy Alimanyion et Issa Etchlekoun ont depuis plusieurs jours souffert d’une maladie grave malgré (dont) ils ne connaissent pas les conséquences (causes/raisons) de cette maladie qu’ils ne connaissent pas."

In the opposite translation direction, fr $\rightarrow$ fon, M2M-100 (REL+NEWS $\rightarrow$ NEWS) still preserved some sense of logical reasoning and predicted the last part correctly: ye ká tuun nǔ è wú wǔvɛ́ yetɔn (they do know why they are suffering) ɖè ɔ́ ǎ (NOT). However, the model had some limitations: the names that are part of the translation are not spelled correctly, and some expressions are incomplete. For instance, sín azǎn + number means since xxx days, but yɛywɛ is not a number and does not have any meaning in this context.

Appendix G Limitations and Risks

Despite the promising results, our work has the following limitations:

Translation quality: Even the best model scores low BLEU on some of the reported languages (bbj, mos, zul), in particular when translating into them.

Evaluation: Our evaluation is focused on BLEU. We report ChrF results as well, but without a deeper human evaluation, we cannot make claims about the absolute quality of the translations. Manual inspection of translations, like the example discussed in Appendix F, gave us the impression that translations are surprisingly fluent and make good use of language-specific expressions when translating into English or French, but that errors in grammar and logic can be easily overlooked. Automatic reference-based metrics like BLEU and ChrF might not capture the semantic relatedness to the reference sufficiently, and might be tricked by word matches in incoherent phrases.

Language bias: We have shown that even when a language is not included in pre-training, and without large out-of-domain data, significant gains in translation quality can be achieved. However, language-specific biases, in terms of resourcedness, morphology, standardization, inclusion in pre-trained models and available corpora, or relatedness to other languages, still affect the relative quality of translations and require more effort to overcome.

Domain limitations: While we showed a rapid adaptation to the news domain and the auxiliary benefit of the religious domain, our study also revealed how automatically estimated translation quality drops when the test domain is narrow. Therefore, future work should aim to expand the study to multiple test domains and develop systematic methods for distilling knowledge from multiple narrow domains.

Language coverage: Africa has thousands of other languages that are not covered in our study but deserve the same attention. We hope that our work is encouraging enough to inspire native speakers of those languages not covered here to collect translations, run our code, and report their findings to the NLP research community, so that we can make joint progress in developing language technology for more people.

We believe that our translation models carry similar risks of causing harm by inaccurate and biased translations as the underlying large pre-trained models. M2M-100 is trained on large collections of texts crawled from the web, and the quality for most of the languages studied here is questionable (Kreutzer et al., 2021). Our fine-tuning successes show that some obvious biases can be overcome when the quality of the fine-tuning set is controlled (see the examples in Section 6.3), but we cannot guarantee that biases prevailing in the pre-training corpus or more subtle biases will not occur with other inputs. Together with a careful human evaluation, this should be the main concern for future work on the produced models. The methodology of rapid fine-tuning might also be misused to tune the models towards harmful content or purposes that harm the speakers of the languages presented here.

Appendix H New Evaluation Datasets

We translated about 1,500 English sentences selected from the Voice of America (VOA) news platform into four more African languages: Kinyarwanda (kin), Shona (sna), Chichewa (nya), and isiXhosa (xho). The 1,500 sentences were divided into DEV and TEST splits. Although the news articles are from VOA, which is based in the US, we ensured that the articles are related to events in Africa. We chose VOA because it has an open license. We also added a new evaluation dataset for Amharic (amh) and increased the training data for Hausa (hau) by over 2K sentences. Table 14 provides the data splits of the new evaluation data. We provide more details for Amharic and Hausa below.

Amharic (amh)

We combined the Global Voices corpus (https://opus.nlpl.eu/GlobalVoices.php) on OPUS (Tiedemann, 2012) with new articles from the Global Voices website (https://am.globalvoices.org/). In total, we have 1,936 parallel sentences that we divide into DEV and TEST splits.

Hausa (hau)

The Hausa Khamenei corpus (https://www.statmt.org/wmt21/translation-task.html) contains 5,898 sentences, which we split into TRAIN (3,098), DEV (1,300), and TEST (1,500). We noticed that this dataset was created in Iran, which is not where Hausa speakers are located. To diversify the texts, we added 2,767 newly translated sentences from the Global Voices and Premium Times news websites, which cover more Nigerian and West African news, from the region where native speakers of Hausa live. In total, the training set increased to 5,865 sentences.

H.2 Additional experiments

We extended the domain shift analysis in Figure 1 to the new languages. The results, shown in Figure 3, are quite similar.

Generalization of Hausa news corpus

In Table 8, we show the generalization of the M2M-100 model trained on the NEWS domain to other domains like REL and FLORES (Wikipedia domain). We observe that the Hausa news corpus based on the Khamenei corpus generalizes poorly to other domains. Table 15 shows that adding more contemporary news articles (2,767 sentences) from Premium Times and Global Voices improved spBLEU by a large margin, especially in the EN-HAU direction: $4.0\rightarrow 13.0$ for FLORES and $3.7\rightarrow 8.8$ for the REL domain (based on JW300). However, we observed a slight drop in the xx-en/fr direction for FLORES.