CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

Changhan Wang, Juan Pino, Anne Wu, Jiatao Gu

Introduction

End-to-end speech-to-text translation (ST) has attracted much attention recently [Berard et al., 2016, Duong et al., 2016, Weiss et al., 2017, Bansal et al., 2017, Bérard et al., 2018] given its simplicity compared with cascading automatic speech recognition (ASR) and machine translation (MT) systems. The lack of labeled data, however, has become a major blocker for bridging the performance gap between end-to-end models and cascading systems. Several corpora have been developed in recent years. ?) introduced a 180-hour Spanish-English ST corpus by augmenting the transcripts of the Fisher and Callhome corpora with English translations. ?) created the largest ST corpus to date from TED talks, but the language pairs involved are out of English only. Beilharz et al. [2019] created a 110-hour German-English ST corpus from LibriVox audiobooks. ?) created a Mboshi-French ST corpus as part of a rare language documentation effort. ?) provided an Amharic-English ST corpus in the tourism domain. ?) created a multilingual ST corpus involving 8 languages from a multilingual speech corpus based on Bible readings [Black, 2019]. Previous work involves either language pairs out of English, very specific domains, very low-resource languages or a limited set of language pairs. This limits the scope of study, including the latest explorations on end-to-end multilingual ST [Inaguma et al., 2019, Gangi et al., 2019]. Our work is most similar to, and concurrent with, ?) who created a multilingual ST corpus from the European Parliament proceedings. The corpus we introduce has longer speech durations and more translation tokens. It is diversified with multiple speakers per transcript/translation. Finally, we provide additional out-of-domain test sets.

In this paper, we introduce CoVoST, a multilingual ST corpus based on Common Voice [Ardila et al., 2019] for 11 languages into English, diversified with over 11,000 speakers and over 60 accents. It includes a total of 708 hours of French (Fr), German (De), Dutch (Nl), Russian (Ru), Spanish (Es), Italian (It), Turkish (Tr), Persian (Fa), Swedish (Sv), Mongolian (Mn) and Chinese (Zh) speech, with the French and German portions having the largest durations among existing public corpora. We also collect an additional evaluation corpus from Tatoeba (https://tatoeba.org/eng/downloads) for French, German, Dutch, Russian and Spanish, resulting in a total of 9.3 hours of speech. Both corpora are created at the sentence level and do not require additional alignment or segmentation. Using the official Common Voice train-development-test split, we also provide baseline models, including, to our knowledge, the first end-to-end many-to-one multilingual ST models. CoVoST is released under a CC0 license and is free to use. The Tatoeba evaluation samples are also available under friendly CC licenses. All the data can be obtained at https://github.com/facebookresearch/covost.

Data Collection and Processing

Common Voice [Ardila et al., 2019, CoVo] is a crowdsourced speech recognition corpus with an open CC0 license. Contributors record voice clips by reading from a bank of donated sentences. Each voice clip is validated by at least two other users. Most of the sentences are covered by multiple speakers, with potentially different genders, age groups or accents.

Raw CoVo data contains samples that passed validation as well as those that did not. To build CoVoST, we use only the former and reuse the official train-development-test partition of the validated data. As of January 2020, the latest CoVo release (2019-06-12) includes 29 languages. CoVoST is currently built on that release and covers the following 11 languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese.

Validated transcripts were sent to professional translators. Note that the translators had access to the transcripts but not the corresponding voice clips since clips would not carry additional information. Since transcripts were duplicated due to multiple speakers, we deduplicated the transcripts before sending them to translators. As a result, different voice clips of the same content (transcript) will have identical translations in CoVoST for train, development and test splits.

In order to control the quality of the professional translations, we applied various sanity checks to the translations [Guzmán et al., 2019]. 1) For German-English, French-English and Russian-English translations, we computed sentence-level BLEU [Chen and Cherry, 2014] with the NLTK [Bird et al., 2009] implementation between the human translations and the automatic translations produced by a state-of-the-art system [Ng et al., 2019] (the French-English system was a Transformer big [Vaswani et al., 2017] separately trained on WMT14). We applied this method to these three language pairs only as we are confident about the quality of the corresponding systems. Translations with a score that was too low were manually inspected and sent back to the translators when needed. 2) We manually inspected examples where the source transcript was identical to the translation. 3) We measured the perplexity of the translations using a language model trained on a large amount of clean monolingual data [Ng et al., 2019]. We manually inspected examples where the translation had a high perplexity and sent them back to translators accordingly. 4) We computed the ratio of English characters in the translations. We manually inspected examples with a low ratio and sent them back to translators accordingly. 5) Finally, we used VizSeq [Wang et al., 2019] to calculate similarity scores between transcripts and translations based on LASER cross-lingual sentence embeddings [Artetxe and Schwenk, 2019]. Samples with low scores were manually inspected and sent back for translation when needed.
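Checks 2) and 4) above are simple enough to sketch directly. The snippet below is only an illustration under stated assumptions: the function names and the `min_ratio` threshold of 0.5 are hypothetical, and the ratio is computed over non-whitespace characters, a detail the text does not specify.

```python
import string
from typing import List

ENGLISH_CHARS = set(string.ascii_letters)


def english_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in ENGLISH_CHARS for c in chars) / len(chars)


def flag_for_inspection(transcript: str, translation: str,
                        min_ratio: float = 0.5) -> List[str]:
    """Return the reasons (possibly none) to inspect a translation manually.

    min_ratio is a hypothetical threshold, not a value from the paper.
    """
    reasons = []
    # Check 2: the translation is a verbatim copy of the source transcript.
    if transcript.strip().lower() == translation.strip().lower():
        reasons.append("identical_to_source")
    # Check 4: too few English characters in the translation.
    if english_char_ratio(translation) < min_ratio:
        reasons.append("low_english_ratio")
    return reasons
```

Flagged samples would then go to manual inspection rather than being rejected automatically, which matches the workflow described above.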

We also checked the overlap between train, development and test sets in terms of transcripts and voice clips (via MD5 file hashing), and confirmed they are disjoint.
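A disjointness check of this kind can be sketched as follows. This is a minimal illustration, not the script used to build the corpus: `md5_of_file` and `split_overlaps` are hypothetical helper names, and the same `split_overlaps` function can be fed either transcripts or file hashes.

```python
import hashlib
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_overlaps(splits: Dict[str, Iterable[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of splits to the keys they share (empty pairs omitted).

    Keys can be transcripts or MD5 digests of voice clips.
    """
    as_sets = {name: set(keys) for name, keys in splits.items()}
    overlaps = {}
    for a, b in combinations(sorted(as_sets), 2):
        shared = as_sets[a] & as_sets[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result from `split_overlaps` means that the splits are pairwise disjoint for the chosen key.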

2. Tatoeba (TT)

Tatoeba (TT) is a community-built language learning corpus with sentences aligned across multiple languages and the corresponding speech partially available. Its sentences are on average shorter than those in CoVoST (see also Table 1) given the original purpose of language learning. Sentences in TT are licensed under CC BY 2.0 FR and part of the audio is available under various CC licenses.

We construct an evaluation set from TT (for French, German, Dutch, Russian and Spanish) as a complement to the CoVoST development and test sets. We collect (speech, transcript, English translation) triplets for the 5 languages and do not include those whose speech has a broken URL or is not CC licensed. We further filter these samples by sentence length (minimum 4 words including punctuation) to reduce the portion of short sentences. This makes the resulting evaluation set closer to real-world scenarios and more challenging.

We run the same quality checks for TT as for CoVoST but do not find poor quality translations according to our criteria. Finally, we report the overlap between CoVo transcripts and TT sentences in Table 2. We find a minimal overlap, which makes the TT evaluation set a suitable additional test set when training on CoVoST.
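The length filter can be illustrated with a few lines of Python. The tokenization here is an assumption: a simple regular expression that counts each word and each punctuation mark as one token stands in for whatever tokenizer was actually used, and the function names are hypothetical.

```python
import re
from typing import List

# Assumed tokenization: a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def count_tokens(sentence: str) -> int:
    """Number of words plus punctuation marks in a sentence."""
    return len(TOKEN_PATTERN.findall(sentence))


def keep_sentence(sentence: str, min_tokens: int = 4) -> bool:
    """True if the sentence has at least min_tokens tokens."""
    return count_tokens(sentence) >= min_tokens


def filter_by_length(sentences: List[str], min_tokens: int = 4) -> List[str]:
    """Drop sentences shorter than min_tokens tokens."""
    return [s for s in sentences if keep_sentence(s, min_tokens)]
```

With this tokenization, "Hi there." has three tokens and is dropped, while "Where is the station?" has five and is kept.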

Data Analysis

Basic statistics for CoVoST and TT are listed in Table 1, including (unique) sentence counts, speech durations, speaker demographics (partially available) as well as vocabulary and token statistics (based on sentences Moses-tokenized with sacreMoses, https://github.com/alvations/sacremoses) on both transcripts and translations. We see that CoVoST has over 327 hours of German speech and over 171 hours of French speech, which, to our knowledge, makes it the largest among existing public ST corpora for these two languages (the second largest is 110 hours [Beilharz et al., 2019] for German and 38 hours [Iranzo-Sanchez et al., 2019] for French). Moreover, CoVoST has a total of 18 hours of Dutch speech, contributing, to our knowledge, the first public Dutch ST resource. CoVoST also has around 27 hours of Russian speech, 37 hours of Italian speech and 67 hours of Persian speech, which is 1.8 times, 2.5 times and 13.3 times the size of the previous largest public corpus [Black, 2019], respectively. Most of the sentences (transcripts) in CoVoST are covered by multiple speakers with potentially different accents, resulting in rich diversity in the speech. For example, there are over 1,000 speakers and over 10 accents in the French and German development / test sets. This enables good coverage of speech variations in both model training and evaluation.

Speaker Diversity

As we can see from Table 1, CoVoST is diversified with a rich set of speakers and accents. We further inspect the speaker demographics in terms of sample distributions with respect to speaker counts, accent counts and age groups, shown in Figures 1, 2 and 3. We observe that for 8 of the 11 languages, at least 60% of the sentences (transcripts) are covered by multiple speakers. Over 80% of the French sentences have at least 3 speakers, and over 90% of the German sentences have at least 5 speakers. Similarly, we see that a large portion of sentences are spoken in multiple accents for French, German, Dutch and Spanish. Speakers of each language are also spread widely across different age groups (below 20, 20s, 30s, 40s, 50s, 60s and 70s).

Baseline Results

We provide baselines using the official train-development-test split on the following tasks: automatic speech recognition (ASR), machine translation (MT) and speech translation (ST).

We convert raw MP3 audio files from CoVo and TT into mono-channel waveforms, and downsample them to 16,000 Hz. For transcripts and translations, we normalize the punctuation, tokenize the text with sacreMoses and lowercase it. For transcripts, we further remove all punctuation markers except for apostrophes. We use character vocabularies on all the tasks, with 100% coverage of all the characters. Preliminary experimentation showed that character vocabularies provided more stable training than BPE. For MT, the vocabulary is created jointly on both transcripts and translations. We extract 80-channel log-mel filterbank features, computed with a 25ms window size and 10ms window shift using torchaudio (https://github.com/pytorch/audio). The features are normalized to zero mean and unit standard deviation. We remove samples having more than 3,000 frames or more than 256 characters for GPU memory efficiency (fewer than 25 samples are removed for all languages).
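To make the numbers above concrete, the sketch below works out how many feature frames a clip yields and applies the two length limits, together with the transcript normalization. It is an illustration under assumptions: the frame count assumes no padding at the clip edges (frames that do not fit are dropped), the function names are hypothetical, and the curly apostrophe is kept alongside the straight one, a detail the text does not state.

```python
import unicodedata

SAMPLE_RATE = 16_000                 # Hz, after downsampling
WINDOW = SAMPLE_RATE * 25 // 1000    # 25 ms window -> 400 samples
SHIFT = SAMPLE_RATE * 10 // 1000     # 10 ms shift  -> 160 samples
APOSTROPHES = {"'", "\u2019"}        # assumption: keep straight and curly forms


def num_frames(num_samples: int) -> int:
    """Frame count for a waveform, assuming no edge padding."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // SHIFT


def keep_sample(num_samples: int, text: str,
                max_frames: int = 3000, max_chars: int = 256) -> bool:
    """Apply the length limits: at most 3,000 frames and 256 characters."""
    return num_frames(num_samples) <= max_frames and len(text) <= max_chars


def normalize_transcript(text: str) -> str:
    """Lowercase and strip all punctuation except apostrophes."""
    kept = [
        c for c in text.lower()
        if c in APOSTROPHES or not unicodedata.category(c).startswith("P")
    ]
    return " ".join("".join(kept).split())
```

Under these assumptions a one-second clip gives 98 frames, so the 3,000-frame limit corresponds to roughly 30 seconds of audio.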

Model Training

Our ASR and ST models follow the architecture in ?), but have 3 decoder layers like that in ?). We pretrain their encoders on 120-hour English ASR data from Common Voice (2019-06-12 release). For MT, we use a Transformer base architecture [Vaswani et al., 2017], but with 3 encoder layers, 3 decoder layers and 0.3 dropout. We use a batch size of 10,000 frames for ASR and ST, and a batch size of 4,000 tokens for MT. We train all models using Fairseq [Ott et al., 2019] for up to 200,000 updates. We use SpecAugment [Park et al., 2019] for ASR and ST (LB policy without time warping) to alleviate overfitting.

Inference and Evaluation

We use a beam size of 5 for all models. We use the best checkpoint by validation loss for MT, and average the last 5 checkpoints for ASR and ST. For MT and ST, we report case-insensitive tokenized BLEU [Papineni et al., 2002] using sacreBLEU [Post, 2018]. For ASR, we report word error rate (WER) and character error rate (CER) using VizSeq where both the hypothesis and reference are tokenized, lowercased and with punctuation removed.
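For reference, WER and CER are both edit-distance ratios, over words and characters respectively. The following is a minimal per-sentence sketch, not the VizSeq implementation: details such as corpus-level aggregation and whitespace handling may differ there, and the function names are hypothetical. Inputs are assumed to be already tokenized, lowercased and stripped of punctuation.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against one reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of one hypothesis against one reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions are counted, both rates can exceed 1.0 when the hypothesis is much longer than the reference.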

2. Automatic Speech Recognition (ASR)

For simplicity, we use the same model architecture for ASR and ST. Table 3 shows the word error rate (WER) and character error rate (CER) for ASR models. We see that French and German perform the best given that they are the two highest-resource languages in CoVoST. Italian, which is mid-resource and has limited accents, is among the best as well. Persian is also mid-resource but is challenging because of its rich speaker diversity. Most of the other languages are low-resource (especially Swedish and Mongolian) and the ASR models have difficulty learning from this data even with pre-trained encoders.

3. Machine Translation (MT)

MT models take transcripts (without punctuation) as input and output translations (with punctuation). For simplicity, we do not change the text preprocessing methods for MT to correct this mismatch. Moreover, this mismatch also exists in cascading ST systems, where MT model inputs are the outputs of an ASR model. Table 4 shows the BLEU scores of MT models. We notice that the results are consistent with what we see from ASR models. For example, thanks to abundant training data, French has a decent BLEU score of 29.8/25.4. German does not perform well because its content (transcripts) is less rich. The other languages are relatively low-resource in CoVoST and it is difficult to train decent models without additional data or pre-training techniques.

4. Speech Translation (ST)

CoVoST is a many-to-one multilingual ST corpus. While end-to-end one-to-many and many-to-many multilingual ST models have been explored very recently [Inaguma et al., 2019, Gangi et al., 2019], many-to-one multilingual models, to our knowledge, have not. We hence use CoVoST to examine this setting. Tables 5 and 6 show the BLEU scores for both bilingual and multilingual end-to-end ST models trained on CoVoST. We observe that combining speech from multiple languages consistently brings gains to high-resource languages (Fr and De). Some mid-resource/low-resource languages (Ru, It and Zh) are improved as well. This includes combinations of distant languages, such as Ru+Fr and Zh+Fr. We simply provide the most basic many-to-one multilingual baselines here, and leave the full exploration of the best configurations to future work. Finally, we note that for some language pairs, absolute BLEU numbers are relatively low as we restrict model training to the supervised data. We encourage the community to improve upon those baselines, for example by leveraging semi-supervised training.

5. Multi-Speaker Evaluation

In CoVoST, a large portion of transcripts are covered by multiple speakers with different genders, accents and age groups. Besides the standard corpus-level BLEU scores, we also want to evaluate model output variance on the same content (transcript) but different speakers. We hence propose to group samples (and their sentence BLEU scores) by transcript, and then calculate the average per-group mean and the average coefficient of variation, defined as follows:

$$\textrm{BLEU}_{MS}=\frac{1}{|G|}\sum_{g\in G}\textrm{Mean}(g)$$

$$\textrm{CoefVar}_{MS}=\frac{1}{|G^{\prime}|}\sum_{g\in G^{\prime}}\frac{\textrm{Std}(g)}{\textrm{Mean}(g)}$$

where $G$ is the set of sentence BLEU scores grouped by transcript and $G^{\prime}=\{g|g\in G,|g|>1,\textrm{Mean}(g)>0\}$ .

$\textrm{BLEU}_{MS}$ provides a normalized quality score as opposed to corpus-level BLEU or an unnormalized average of sentence BLEU. $\textrm{CoefVar}_{MS}$ is a standardized measure of model stability against different speakers (the lower the better). Table 7 shows the $\textrm{BLEU}_{MS}$ and $\textrm{CoefVar}_{MS}$ of our ST models on the CoVoST test set. We see that German and Persian have the worst $\textrm{CoefVar}_{MS}$ (least stable) given their rich speaker diversity in the test set and relatively small train set (see also Figure 1 and Table 1). Dutch also has poor $\textrm{CoefVar}_{MS}$ because of the lack of training data. Multilingual models may improve $\textrm{BLEU}_{MS}$ but have comparable $\textrm{CoefVar}_{MS}$ .
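The two multi-speaker metrics can be sketched in a few lines, reading them as the average of per-transcript mean sentence BLEU and the average of per-transcript standard deviation divided by mean, the latter restricted to groups with more than one sample and a positive mean. The function name is hypothetical, and the use of the population standard deviation is an assumption; the sample standard deviation would be an equally plausible reading.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Iterable, Tuple


def multi_speaker_scores(samples: Iterable[Tuple[str, float]]) -> Tuple[float, float]:
    """Compute (BLEU_MS, CoefVar_MS) from (transcript, sentence BLEU) pairs."""
    groups = defaultdict(list)
    for transcript, score in samples:
        groups[transcript].append(score)
    if not groups:
        raise ValueError("no samples given")

    # BLEU_MS: average over all groups of the per-group mean.
    bleu_ms = mean([mean(g) for g in groups.values()])

    # CoefVar_MS: average coefficient of variation over groups that have
    # more than one speaker and a positive mean (the set G' in the text).
    eligible = [g for g in groups.values() if len(g) > 1 and mean(g) > 0]
    if eligible:
        # Assumption: population standard deviation.
        coef_var_ms = mean([pstdev(g) / mean(g) for g in eligible])
    else:
        coef_var_ms = float("nan")
    return bleu_ms, coef_var_ms
```

For example, with scores 10 and 30 for one transcript and a single score of 20 for another, both group means are 20, so the first metric is 20. Only the first group is eligible for the second metric, which comes out as 10 / 20 = 0.5.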

Conclusion

We introduce a multilingual speech-to-text translation corpus, CoVoST, for 11 languages into English, diversified with over 11,000 speakers and over 60 accents. We also provide baseline results, including, to our knowledge, the first end-to-end many-to-one multilingual model for spoken language translation. CoVoST is free to use with a CC0 license, and the additional Tatoeba evaluation samples are also CC-licensed.