CoVoST 2 and Massively Multilingual Speech-to-Text Translation

Changhan Wang, Anne Wu, Juan Pino

Introduction

The development of benchmark datasets, such as MuST-C (Di Gangi et al., 2019), Europarl-ST (Iranzo-Sánchez et al., 2020) or CoVoST (Wang et al., 2020a), has greatly contributed to the increasing popularity of speech-to-text translation (ST) as a research topic. MuST-C provides translations of TED talks from English into 8 European languages, with data amounts ranging from 385 hours to 504 hours, thereby encouraging research into end-to-end ST (Berard et al., 2016) as well as one-to-many multilingual ST (Di Gangi et al., 2019). Europarl-ST offers translations between 6 European languages, with a total of 30 translation directions, enabling research into many-to-many multilingual ST (Inaguma et al., 2019). These two corpora involve European languages that are in general high-resource from the perspective of machine translation (MT) and speech. CoVoST is a multilingual and diversified ST corpus from 11 languages into English, based on the Common Voice project (Ardila et al., 2020). Unlike the previous corpora, it involves low-resource languages such as Mongolian, and it also enables many-to-one ST research. Nevertheless, for all corpora described so far, the number of languages involved is limited.

In this paper, we describe CoVoST 2, an extension of CoVoST (Wang et al., 2020a) that provides translations from English (En) into 15 languages—Arabic (Ar), Catalan (Ca), Welsh (Cy), German (De), Estonian (Et), Persian (Fa), Indonesian (Id), Japanese (Ja), Latvian (Lv), Mongolian (Mn), Slovenian (Sl), Swedish (Sv), Tamil (Ta), Turkish (Tr), Chinese (Zh)—and from 21 languages into English, including the 15 target languages as well as Spanish (Es), French (Fr), Italian (It), Dutch (Nl), Portuguese (Pt), Russian (Ru). The overall speech duration is extended from 700 hours to 2880 hours. The total number of speakers is increased from 11K to 78K. We make data available at https://github.com/facebookresearch/covost under CC0 license.

Dataset Creation

Translations are collected from professional translators in the same way as for CoVoST. We then conduct sanity checks based on language model perplexity, LASER (Artetxe and Schwenk, 2019) scores and a length ratio heuristic in order to ensure the quality of the translations. Length ratio and LASER score checks are conducted as in the original version of CoVoST. For language model perplexity checks, 20M lines are sampled from the OSCAR corpus (Ortiz Suárez et al., 2020) for each CoVoST 2 language, except for English and Russian, for which pre-trained language models (Ng et al., 2019) are used (https://github.com/pytorch/fairseq/tree/master/examples/language_model). 5K lines are reserved for validation and the rest for training. BPE vocabularies of size 20K are then built on the training data, with character coverage 0.9995 for Japanese and Chinese and 1.0 for other languages. A Transformer base model (Vaswani et al., 2017) is then trained for up to 800K updates. Professional translations are ranked by perplexity and the ones with the lowest perplexity are manually examined and sent for re-translation as appropriate. In the data release, we mark the sentences that cannot be translated properly. They are mostly extracted from articles without context and therefore lack the clarity needed for an appropriate translation.
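The checks above can be illustrated with a minimal sketch. It is not the pipeline used for CoVoST 2: the character-based length ratio, the `max_ratio` threshold and all function names are assumptions made for illustration, and the per-token log-probabilities are assumed to come from an already trained language model.

```python
import math

def length_ratio_ok(source, translation, max_ratio=3.0):
    """Length ratio heuristic: reject pairs whose character lengths differ too much.
    The threshold of 3.0 is hypothetical, not a value taken from the paper."""
    longer = max(len(source), len(translation))
    shorter = max(1, min(len(source), len(translation)))
    return longer / shorter <= max_ratio

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities given by a language model."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def rank_by_perplexity(scored):
    """Sort (translation, perplexity) pairs in ascending order of perplexity."""
    return sorted(scored, key=lambda pair: pair[1])
```

Translations at the relevant end of the ranking would then be handed to a human reviewer; the sketch only produces the ordering.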

Dataset Splitting

The original Common Voice (CV) dataset splits use only one sample per sentence, although multiple samples (speakers) may be available in the raw dataset. To allow higher data utilization and speaker diversity, we add part of the discarded samples back while keeping the speaker sets disjoint and the same sentence assignment across different splits. We refer to this extension as CoVoST splits. As a result, data utilization is increased from 44.2% (1273 hours) to 78.8% (2270 hours). By default, we use the CoVoST train split for model training and the CV dev (test) split for evaluation. The complementary CoVoST dev (test) split is useful in the multi-speaker evaluation (Wang et al., 2020a) to analyze model robustness, but the large number of repeated sentences (e.g. for English and German) may skew the overall BLEU (WER) scores.
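One way to satisfy both constraints (disjoint speakers and unchanged sentence assignment) is sketched below. This is an illustrative assumption rather than the released splitting script: the function name, the tuple layout and the rule that a previously unseen speaker is assigned to the split of the first sentence encountered are all hypothetical.

```python
def extend_splits(base_samples, extra_samples):
    """base_samples: (sentence, speaker, split) tuples from the original splits.
    extra_samples: (sentence, speaker) tuples that the original splits discarded.
    Returns the base samples plus every extra sample that keeps speakers disjoint."""
    sentence_split = {}
    speaker_split = {}
    for sentence, speaker, split in base_samples:
        sentence_split[sentence] = split
        speaker_split[speaker] = split
    extended = list(base_samples)
    for sentence, speaker in extra_samples:
        split = sentence_split.get(sentence)
        if split is None:
            continue  # the sentence has no split in the base assignment
        # setdefault assigns a previously unseen speaker to this split (assumption)
        if speaker_split.setdefault(speaker, split) == split:
            extended.append((sentence, speaker, split))
    return extended

base = [("s1", "a", "train"), ("s2", "b", "dev"), ("s3", "c", "test")]
extra = [("s1", "d"), ("s2", "a"), ("s2", "d"), ("s3", "f"), ("s4", "e")]
extended = extend_splits(base, extra)
```

In this toy example, the recordings of "s2" by speakers "a" and "d" are rejected because both speakers already belong to the train split, so the speaker sets stay disjoint.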

Statistics

Basic statistics of CoVoST 2 are listed in Table 1, including speech duration, speaker counts as well as token counts for both transcripts and translations. As we can see, CoVoST 2 is diversified, with large sets of speakers even for some of the low-resource languages (e.g. Persian, Welsh and Dutch). Moreover, the speakers are distributed widely across 66 accent groups, 8 age groups and 3 gender groups.

Models

Our speech recognition (ASR) and ST models share the same Transformer encoder-decoder architecture (Vaswani et al., 2017; Synnaeve et al., 2020), with 12 encoder layers and 6 decoder layers. A convolutional downsampler is applied to reduce the length of speech inputs by $\frac{3}{4}$ before they are fed into the encoder. In the multilingual setting (En $\rightarrow$ All and All $\rightarrow$ All), we follow Inaguma et al. (2019) and force decoding into a given language by using a target language ID token as the first token during decoding.
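The two mechanisms in this paragraph can be sketched as follows. The sketch assumes that the $\frac{3}{4}$ length reduction comes from two stacked convolutions with stride 2 and padding such that each layer rounds the length up; the text only states the overall reduction. The tag format `<lang:xx>` and both function names are hypothetical.

```python
def downsampled_length(num_frames, num_layers=2, stride=2):
    """Sequence length after stacked strided convolutions.
    Assumes padding such that each layer yields ceil(length / stride)."""
    length = num_frames
    for _ in range(num_layers):
        length = (length + stride - 1) // stride
    return length

def with_language_tag(target_tokens, lang):
    """Prepend a target language ID token so the decoder starts from it."""
    return [f"<lang:{lang}>"] + list(target_tokens)

# A 3,000-frame input shrinks to a quarter of its length.
print(downsampled_length(3000))
print(with_language_tag(["Guten", "Morgen"], "de"))
```

At inference time the same tag would be supplied as a forced first token, so that a single multilingual decoder can be steered towards the requested output language.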

For MT, we use a Transformer base architecture (Vaswani et al., 2017) with $l_{e}$ encoder layers, $l_{d}$ decoder layers, 0.3 dropout, and shared embeddings for encoder/decoder inputs and decoder outputs. For multilingual models, encoders and decoders are shared, as preliminary experimentation showed that this approach was competitive.

Experiments

We provide MT, cascaded ST and end-to-end ST baselines under bilingual settings as well as multilingual settings: All $\rightarrow$ En (A2E), En $\rightarrow$ All (E2A) and All $\rightarrow$ All (A2A). Similarly for ASR, we provide both monolingual and multilingual baselines. We implement all models in fairseq (Ott et al., 2019; Wang et al., 2020b) and open-source the training recipes at https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text.

For all texts, we normalize the punctuation and build vocabularies with SentencePiece (Kudo and Richardson, 2018) without pre-tokenization. For ASR and ST, character vocabularies with 100% coverage are used. For bilingual MT models, BPE (Sennrich et al., 2016) vocabularies of size 5k are learned jointly on both transcripts and translations. For multilingual MT models, BPE vocabularies of size 40k are created jointly on all available source and target text. For MT on a language pair $s$-$t$, we also compare training on $s$-$t$ data only with training on both $s$-$t$ and $t$-$s$ data. In the latter case, we remove any overlap between the $t$-$s$ training data and the $s$-$t$ development or test sets; this is also done for the A2A multilingual MT setting. The setting that adds the reverse direction is referred to as +Rev subsequently.
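A minimal sketch of how such a +Rev training set could be assembled is given below. It assumes that "overlap" means an exact string match on either side of a held-out pair, which is one possible reading and not a criterion stated in the text; the function and argument names are hypothetical.

```python
def build_rev_training_set(st_train, ts_train, st_heldout):
    """st_train: (s_text, t_text) training pairs for direction s-t.
    ts_train: (t_text, s_text) training pairs for the opposite direction t-s.
    st_heldout: (s_text, t_text) development and test pairs for s-t.
    Returns s-t training pairs augmented with flipped, non-overlapping t-s pairs."""
    heldout_sources = {src for src, _ in st_heldout}
    heldout_targets = {tgt for _, tgt in st_heldout}
    combined = list(st_train)
    for t_text, s_text in ts_train:
        # Skip pairs that would leak held-out text (exact match is an assumption).
        if s_text in heldout_sources or t_text in heldout_targets:
            continue
        combined.append((s_text, t_text))  # flip the pair into the s-t direction
    return combined

st_train = [("hello", "hallo")]
ts_train = [("guten morgen", "good morning"), ("danke", "thanks")]
st_heldout = [("good morning", "guten morgen")]
rev_train = build_rev_training_set(st_train, ts_train, st_heldout)
```

Here the pair for "good morning" is dropped because it appears in the held-out data for the $s$-$t$ direction, while the pair for "thanks" is flipped and kept.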

We extract 80-dimensional log mel-scale filter bank features (windows with 25ms size and 10ms shift) using Kaldi (Povey et al., 2011), with per-utterance CMVN (cepstral mean and variance normalization) applied. We remove training samples with more than 3,000 frames or more than 512 characters for GPU memory efficiency.
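To make the numbers concrete, the sketch below computes the frame count implied by a 25ms window and 10ms shift, applies the two length filters, and normalizes each feature dimension per utterance. It is a plain-Python illustration, not the Kaldi implementation: it assumes that only full windows are counted (no edge padding), and the function names are hypothetical.

```python
import statistics

def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Number of full analysis windows in an utterance (no edge padding assumed)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms

def keep_sample(n_frames, text, max_frames=3000, max_chars=512):
    """Training filter: drop samples that are too long in frames or characters."""
    return n_frames <= max_frames and len(text) <= max_chars

def cmvn(features):
    """Per-utterance mean and variance normalization.
    features: list of frames, each a list of floats of equal dimension."""
    dims = list(zip(*features))
    means = [statistics.fmean(d) for d in dims]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    stds = [statistics.pstdev(d) or 1.0 for d in dims]
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in features]
```

Under these assumptions, the 3,000-frame limit corresponds to roughly 30 seconds of audio.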

For ASR and ST, we set $d_{model}=256$ for bilingual models and set $d_{model}=512$ or $1024$ (denoted by a suffix “-M”/“-L” in the tables) for multilingual models. We adopt SpecAugment (Park et al., 2019) (LB policy without time warping) to alleviate overfitting. To accelerate model training, we pre-train non-English ASR as well as bilingual ST models with the English ASR encoder, and pre-train multilingual ST models with the multilingual ASR encoder. For MT, we set $l_{e}=l_{d}=3$ for bilingual models and $l_{e}=l_{d}=6$ for multilingual models.

For all models, we use a beam size of 5 and a length penalty of 1. We use the best checkpoint by validation loss for MT, and average the last 5 checkpoints for ASR and ST. For MT and ST, we report case-sensitive detokenized BLEU (Papineni et al., 2002) using sacreBLEU (Post, 2018) with default options, except for English-Chinese and English-Japanese, where we report character-level BLEU. For ASR, we report character error rate (CER) on Japanese and Chinese (no word segmentation) and word error rate (WER) on the other languages using VizSeq (Wang et al., 2019). Before calculating WER (CER), sentences are tokenized by sacreBLEU tokenizers, lowercased, and stripped of punctuation (except for apostrophes and hyphens).
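The WER and CER computation described above can be approximated with the standard library alone. The sketch below is not the VizSeq implementation: it replaces the sacreBLEU tokenizers with whitespace splitting and uses a regular expression for punctuation removal, both of which are simplifying assumptions, and all function names are hypothetical.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation except apostrophes and hyphens,
    then split on whitespace (a stand-in for a real tokenizer)."""
    text = re.sub(r"[^\w\s'-]", " ", text.lower())
    return text.split()

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def error_rate(ref_units, hyp_units):
    """Edit distance divided by the reference length."""
    return edit_distance(ref_units, hyp_units) / max(1, len(ref_units))

def wer(reference, hypothesis):
    """Word error rate on normalized, whitespace-split text."""
    return error_rate(normalize(reference), normalize(hypothesis))

def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace (no word segmentation)."""
    ref_chars = list("".join(normalize(reference)))
    hyp_chars = list("".join(normalize(hypothesis)))
    return error_rate(ref_chars, hyp_chars)
```

For instance, `wer("Don't stop, please!", "dont stop please")` counts one substitution over three reference words, since the apostrophe is kept and "don't" therefore differs from "dont".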

Monolingual and Bilingual Baselines

Table 2 reports monolingual baselines for ASR and bilingual baselines for MT, cascaded ST (C-ST), end-to-end ST trained from scratch (E-ST) and end-to-end ST pre-trained on ASR. As expected, the quality of transcriptions and translations depends strongly on the amount of training data per language pair. The poor results obtained on low-resource pairs can be improved by leveraging training data from the opposite direction for MT and C-ST. These results serve as baselines for the research community to improve upon, including with methods such as multilingual training, self-supervised pre-training and semi-supervised learning.

Multilingual Baselines

A2E, E2A and A2A baselines are reported in Table 3 for language pairs into English and in Table 4 for language pairs out of English. Multilingual modeling is shown to be a promising direction for improving low-resource ST.

Conclusion

We introduced CoVoST 2, the largest speech-to-text translation corpus to date in terms of language coverage and total volume, with 21 languages into English and English into 15 languages. We also provided extensive monolingual, bilingual and multilingual baselines for ASR, MT and ST. CoVoST 2 is free to use under a CC0 license and enables the research community to develop methods including, but not limited to, massively multilingual modeling, ST modeling for low-resource languages, self-supervision for multilingual ST and semi-supervised modeling for multilingual ST.