Multilingual End-to-End Speech Translation

Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

Introduction

Breaking the language barrier for communication is one of the most attractive goals. For several decades, the speech translation (ST) task has been approached by processing speech with automatic speech recognition (ASR), text normalization (e.g., punctuation restoration and case normalization), and machine translation (MT) components in a cascading manner. Recently, end-to-end speech translation (E2E-ST) with a sequence-to-sequence model has attracted attention for its greatly simplified architecture, which avoids complicated pipeline systems. By directly translating speech signals in a source language to text in a target language, the model avoids error propagation from the ASR module and can also leverage acoustic cues in the source language, which have been shown to be useful for translation. Moreover, it is more memory- and computation-efficient, since the complicated decoding for the ASR module and the latency between the ASR and MT modules are bypassed.

Although end-to-end optimization demonstrates competitive results compared to traditional pipeline systems and even outperforms them on some corpora, these models are usually trained with a single language pair only (i.e., bilingual translation). In realistic applications of ST models, a speech utterance is often translated to multiple target languages, for example in the lecture, news reading, and conversation domains. TED talks, for instance, are mostly conducted in English and translated to more than 70 languages on the official website. In these cases, it is a natural choice to support translation of multiple language pairs from speech.

A practical approach for multilingual ST is to construct (mono- or multi-lingual) ASR and (bi- or multi-lingual) MT systems separately and combine them as in the conventional pipeline system. Thanks to recent advances in sequence-to-sequence modeling, we can build strong multilingual ASR and MT systems even with a single model. However, when speech utterances come from multiple languages, misidentification of the source language by the ASR system prevents the subsequent MT system from translating properly, since it is trained to consume text in the correct source language. (In the one-to-many situation, this does not occur since only monolingual ASR is required; however, error propagation from the ASR module and latency between the ASR and MT modules are still problematic.) In addition, text normalization, especially punctuation restoration, must be conducted for ASR outputs in each source language, from which additional errors could be propagated.

In this paper, we propose a simple and effective approach to perform multilingual E2E-ST by leveraging a universal sequence-to-sequence model (see Figure 1). Our framework is inspired by prior work in which all parameters are shared among all language pairs, which also enables zero-shot translation. Because the multilingual E2E-ST system is built with a universal architecture, it does not require source language identification, and the complexity of the training and decoding pipelines is drastically reduced. Furthermore, we do not have to decide which parameters to share among multiple language pairs; this can be learned automatically from training data. To the best of our knowledge, this is the first attempt to investigate multilingual training for the E2E-ST task.

We conduct experimental evaluations with three publicly available corpora: Fisher-CallHome Spanish (Es $\to$ En), Librispeech (En $\to$ Fr), and the Speech-Translation TED corpus (En $\to$ De). We evaluate one-to-many (O2M) and many-to-many (M2M) translations by combining these corpora and confirm significant improvements from multilingual training in both scenarios. Next, we evaluate the generalization of multilingual E2E-ST models by performing transfer learning to a very low-resource ST task: the Mboshi (Bantu C25) $\to$ Fr corpus (4.4 hours). We show that multilingual pre-training of the seed E2E-ST models improves performance on a low-resource language pair unseen during training, compared to bilingual pre-training. Our code is released in a public project so that results can be reproduced and strictly compared under the same pre-processing (e.g., data split, text normalization, and feature extraction), model implementation, and evaluation pipelines.

Background: Speech Translation

In this section, we describe the architectures of the pipeline and end-to-end speech translation (ST) systems. Our ASR, MT, and ST systems are all based on attention-based RNN encoder-decoder models. (We leave the investigation of Transformer architectures for future work; however, our framework is model-agnostic and can be applied to any sequence-to-sequence model.) Let ${\bm{x}}^{\rm src}$ be the input speech features in a source language, and let ${\bm{y}}^{\rm src}$ and ${\bm{y}}^{\rm tgt}$ be the corresponding reference transcription and translation, respectively. In this work, we adopt a character-level unit for both source and target references. (Although we also conducted experiments with byte-pair encoding (BPE), the character unit was better than BPE in all settings due to the data sparseness issue. Therefore, we only report results with the character-level unit.)

The pipeline ST model is composed of three modules: automatic speech recognition (ASR), text normalization, and neural machine translation (NMT) models.

We build the ASR module on the hybrid CTC/attention framework, where the attention-based encoder-decoder is encouraged to learn monotonic alignments by joint optimization with the Connectionist Temporal Classification (CTC) objective function. Our ASR model consists of three modules: the speech encoder, the transcription decoder, and the softmax layer for calculating the CTC loss. The speech encoder transforms input speech features ${\bm{x}}^{\rm src}$ into a high-level continuous representation, and then the transcription decoder generates a probability distribution $P_{\rm asr}({\bm{y}^{\rm src}}|{\bm{x}}^{\rm src})=\prod_{i}{P_{\rm asr}({y^{\rm src}_{i}}|y^{\rm src}_{<i},{\bm{x}}^{\rm src})}$, where each token is conditioned on all previously generated tokens. We adopt a location-based scoring function. During training, parameters are updated so as to minimize the linear interpolation of the negative log-likelihood $\mathcal{L}_{\rm att}=-\log P_{\rm att}({\bm{y}^{\rm src}}|{\bm{x}}^{\rm src})$ and the CTC loss $\mathcal{L}_{\rm ctc}=-\log P_{\rm ctc}({\bm{y}^{\rm src}}|{\bm{x}}^{\rm src})$ with a tunable parameter $\lambda\ (0\leq\lambda\leq 1)$: $\mathcal{L}_{\rm asr}=(1-\lambda)\mathcal{L}_{\rm att}+\lambda\mathcal{L}_{\rm ctc}$. During inference, left-to-right beam search decoding is performed jointly with scores from both an external recurrent neural network language model (RNNLM) (referred to as shallow fusion) and the CTC outputs. We refer the readers to for more details.
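The interpolated training objective above can be illustrated with a minimal Python sketch. The function names, the example probabilities, and the value of `ctc_weight` are assumptions made for illustration; they are not taken from the authors' code.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood -log P(y | x), given per-token probabilities
    P(y_i | y_<i, x) of the reference tokens."""
    return -sum(math.log(p) for p in token_probs)


def hybrid_asr_loss(loss_att, loss_ctc, ctc_weight):
    """L_asr = (1 - lambda) * L_att + lambda * L_ctc, with lambda in [0, 1]."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight (lambda) must lie in [0, 1]")
    return (1.0 - ctc_weight) * loss_att + ctc_weight * loss_ctc


if __name__ == "__main__":
    # Hypothetical per-token probabilities from the attention decoder.
    loss_att = sequence_nll([0.9, 0.8, 0.95])
    # Hypothetical CTC loss value for the same utterance.
    loss_ctc = 0.6
    print(hybrid_asr_loss(loss_att, loss_ctc, ctc_weight=0.3))
```

Setting `ctc_weight` to 0 recovers pure attention training, and setting it to 1 recovers pure CTC training.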

For multilingual ASR models, we prepend the corresponding language ID to reference labels so that the decoder can jointly identify the target language while recognizing speech explicitly, which can be regarded as multi-task learning with ASR and language identification tasks .

2.1.2 Text normalization

In this work, we skip punctuation restoration for simplicity. (We use lowercased references in this paper, so we do not consider truecasing as part of text normalization.) Instead, we train the MT model to translate source references without punctuation marks into target references with them. The text normalization task is thus conducted jointly with the MT task, which can be seen as multi-task learning. During inference, the MT model consumes hypotheses from the ASR model.

2.1.3 Neural machine translation (NMT)

Our NMT model consists of the source embedding, text encoder, and translation decoder. The source embedding layer embeds a sequence of source tokens ${\bm{y}}^{\rm src}$, and the text encoder maps the embedded sequence into a distributed representation. The translation decoder generates a probability distribution $P({\bm{y}^{\rm tgt}}|{\bm{y}^{\rm src}})$. The only difference between the transcription and translation decoders is the scoring function for the attention mechanism: the translation decoder adopts an additive scoring function. Optimization is performed so as to minimize the negative log-likelihood $-\log P({\bm{y}^{\rm tgt}}|{\bm{y}}^{\rm src})$.
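As an illustration of additive attention scoring, the sketch below uses the commonly used form $e_j = {\bm{v}}^{\top}\tanh({\bm{W}}_q {\bm{q}} + {\bm{W}}_k {\bm{h}}_j)$ followed by a softmax. The exact parameterization used in the paper is not given in this section, so the weight names `W_q`, `W_k`, and `v` and the toy dimensions are assumptions.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_scores(query, keys, W_q, W_k, v):
    """Additive attention energies e_j = v^T tanh(W_q q + W_k h_j)."""
    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return scores


def softmax(scores):
    """Normalize energies into attention weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    decoder_state = [0.0, 0.0]                # toy decoder state (query)
    encoder_states = [[0.0, 0.0], [1.0, 1.0]]  # toy encoder states (keys)
    energies = additive_scores(decoder_state, encoder_states,
                               identity, identity, [1.0, 1.0])
    print(softmax(energies))
```

The resulting weights would be used to form a context vector as a weighted sum of encoder states at each decoding step.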

2.2 End-to-end speech translation (E2E-ST)

Our end-to-end speech translation (E2E-ST) model is composed of the speech encoder and the translation decoder. For a strict comparison, we use the same speech encoder and translation decoder as in the ASR and NMT tasks, respectively. Parameters are updated so as to minimize the negative log-likelihood $-\log P({\bm{y}^{\rm tgt}}|{\bm{x}}^{\rm src})$.

Multilingual E2E speech translation

We now propose an efficient framework that extends the bilingual E2E-ST model described previously to a multilingual one.

We adopt a universal sequence-to-sequence architecture instead of preparing separate parameters per language pair for four reasons. First, E2E-ST can generally be considered a more challenging task than MT because of its more complex encoder, which requires more parameters (e.g., VGG+BLSTM). In addition, the number of training sentences in standard ST corpora is much smaller than in MT tasks ($<$ 300k), although input speech frame sequences are much longer than text. Therefore, by sharing all parts, the total number of parameters is reduced considerably and the E2E-ST model has more training samples for better translation performance. Furthermore, it is not necessary to change the existing architecture. Second, we do not have to carefully pre-define a mini-batch scheduler for the language cycle (see Section 3.3). Third, translation performance in low-resource directions can be improved with the aid of high-resource language pairs. Fourth, we can realize zero-shot translation in a direction that has never been seen during training.

3.2 Target language biasing

To perform translations for multiple target languages with a single decoder, we have to specify a target language to translate to. In , an artificial token to represent the target language (target language ID) is prepended in the source sentence. However, this is not suitable for the ST task since the ST encoder directly consumes speech features. Instead, we replace a start-of-sentence ( $\langle sos\rangle$ ) token in the decoder with a target language ID $\langle 2lang\rangle$ (see Figure 1). For example, when English speech is translated to French text, $\langle sos\rangle$ is replaced with French ID token $\langle 2fr\rangle$ .
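The token replacement described above can be sketched as follows for a character-level decoder. This is a minimal illustration; the `<eos>` symbol and the function names are assumptions, and only the `<2lang>` token format follows the text.

```python
def target_language_token(lang):
    """Build the target language ID token, e.g. 'fr' -> '<2fr>'."""
    return f"<2{lang}>"


def build_decoder_sequences(target_text, lang, eos="<eos>"):
    """Return (decoder_input, decoder_output) for teacher-forced training.

    The usual <sos> token at the start of the decoder input is replaced
    with the target language ID, so the decoder is biased toward `lang`.
    """
    chars = list(target_text)
    decoder_input = [target_language_token(lang)] + chars
    decoder_output = chars + [eos]
    return decoder_input, decoder_output


if __name__ == "__main__":
    # English speech translated to French text: <sos> becomes <2fr>.
    dec_in, dec_out = build_decoder_sequences("oui", "fr")
    print(dec_in)
    print(dec_out)
```

At inference time, the same token would be fed as the first decoder input to select the output language, while the speech encoder input stays unchanged.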

3.3 Mixed data training

We train multilingual models with mixed training data from multiple languages. Thus, each mini-batch may contain utterances from different language pairs. We bucket all samples so that each mini-batch contains utterances with similar numbers of speech frames, regardless of language pair. As a result, we can use the same training scheme as in the conventional ASR and bilingual ST tasks.
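A minimal sketch of length-based bucketing over mixed language pairs is shown below. The tuple layout `(utt_id, num_frames, lang_pair)`, the fixed batch size, and the optional batch shuffling are illustrative assumptions rather than the authors' exact batching scheme.

```python
import random


def make_minibatches(utterances, batch_size, seed=None):
    """Group utterances of similar length into mini-batches.

    `utterances` is a list of (utt_id, num_frames, lang_pair) tuples.
    The language pair is ignored when bucketing, so a batch may mix pairs.
    """
    ordered = sorted(utterances, key=lambda utt: utt[1])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if seed is not None:
        # Shuffle the order of batches, not their contents.
        random.Random(seed).shuffle(batches)
    return batches


if __name__ == "__main__":
    pool = [
        ("a", 300, "en-fr"),
        ("b", 120, "es-en"),
        ("c", 310, "en-de"),
        ("d", 125, "en-fr"),
    ]
    for batch in make_minibatches(pool, batch_size=2):
        print([(utt_id, pair) for utt_id, _, pair in batch])
```

Because the language pair plays no role in grouping, no language-specific batch scheduler is needed.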

Data

We build our systems on three speech translation corpora: Fisher-CallHome Spanish, Librispeech, and the Speech-Translation TED (ST-TED) corpus. To the best of our knowledge, these are the only publicly available corpora of reasonable size recorded with real speech. (We noticed a publicly available one-to-many multilingual ST corpus right before submission; however, that dataset contains English speech only.) The data statistics are summarized in Table 1.

(A) Fisher-CallHome Spanish: Es $\to$ En

This corpus contains about 170 hours of Spanish conversational telephone speech, the corresponding transcriptions, and the English translations (https://github.com/joshua-decoder/Fisher-CallHome-corpus). Following , we report results on the five evaluation sets: dev, dev2, and test in the Fisher corpus (with four references), and devtest and evltest in the CallHome corpus (with a single reference). We use Fisher/train as the training set and Fisher/dev as the validation set. All punctuation marks except for apostrophes are removed during evaluation in the ST and MT tasks to compare with previous works.

(B) Librispeech: En $\to$ Fr

This corpus is a subset of the original Librispeech corpus and contains 236-hours of English read speech, the corresponding transcription, and the French translations . We use the clean 99-hours of speech data for the training set . Translation references in the training set are augmented with Google Translate following , so we have two French references per utterance. We use the dev set as the validation set and report results on the test set.

(C) Speech-Translation TED (ST-TED): En $\to$ De

This corpus contains 271 hours of English lecture speech, the corresponding transcriptions, and the German translations (https://sites.google.com/site/iwsltevaluation2018/Lectures-task). Since the original training set includes many noisy utterances due to low alignment quality, we applied a data cleaning strategy. We first force-aligned all training utterances with the Gentle forced aligner (https://github.com/lowerquality/gentle), which is based on Kaldi, and then excluded every utterance in which not all words in the transcription were perfectly aligned with the corresponding audio signal. This process reduced the training set from 171,121 to 137,660 utterances. We sampled two sets of 2k utterances from the cleaned training data as the validation and test sets, respectively (4k utterances in total). Note that all sets have no text overlap and are disjoint with respect to speakers, and the data splits are available in our code. We report results on this test set and on tst2013, one of the test sets provided in the IWSLT2018 evaluation campaign. Since no human-annotated time alignments are provided for the IWSLT test sets, we decided to sample the disjoint test set from the training data with alignment information.
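The cleaning rule (keep an utterance only if every word was aligned) can be sketched as below. The record layout with an `aligned` list of per-word booleans is a hypothetical stand-in for the aligner output, not the Gentle output format; only the utterance counts come from the text.

```python
def keep_fully_aligned(utterances):
    """Keep utterances whose words were all aligned to the audio.

    Each utterance is a dict with an 'id' and an 'aligned' list holding
    one boolean per word (hypothetical layout for illustration).
    """
    return [utt for utt in utterances
            if utt["aligned"] and all(utt["aligned"])]


def removed_fraction(n_before, n_after):
    """Fraction of utterances discarded by the cleaning step."""
    return (n_before - n_after) / n_before


if __name__ == "__main__":
    toy = [
        {"id": "utt1", "aligned": [True, True, True]},
        {"id": "utt2", "aligned": [True, False, True]},
    ]
    print([utt["id"] for utt in keep_fully_aligned(toy)])
    # Counts reported in the text: 171,121 -> 137,660 utterances.
    print(round(removed_fraction(171121, 137660), 3))
```

With the counts reported above, the cleaning step discards roughly one fifth of the original training utterances.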

4.2 Multilingual translation

We perform experiments in two scenarios: one-to-many (O2M) and many-to-many (M2M). (For the many-to-one (M2O) scenario, no suitable combination exists among publicly available corpora, so we leave the exploration of this task for future work. However, O2M and M2M are the realistic scenarios for multilingual speech translation, as mentioned in Section 1.)

One-to-many (O2M)

For one-to-many (O2M) translation, speech utterances in a source language are translated to multiple target languages. We concatenate Librispeech (En $\to$ Fr) and ST-TED (En $\to$ De), and build models for En $\to$ {Fr, De} translations (see Table 1).

Many-to-many (M2M)

For many-to-many (M2M) translation, speech utterances in multiple source languages are translated to all target languages given in training. We can regard this task as a more challenging optimization problem than O2M and M2O translations. We concatenate Librispeech (En $\to$ Fr) and Fisher-CallHome Spanish (Es $\to$ En), then build models for {En, Es} $\to$ {Fr, En} translations (M2Ma). (Readers might think that this scenario is not suitable for M2M evaluation since French does not appear on the source side, unlike in the multilingual MT task. However, such public corpora are not currently available.) Other combinations, namely Fisher-CallHome Spanish and ST-TED ({En, Es} $\to$ {De, En}, M2Mb) and all three directions ({En, Es} $\to$ {Fr, De, En}, M2Mc), are also investigated.
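The corpus combinations used in the O2M and M2M settings can be summarized with a small lookup sketch. The dictionary layout and function name are illustrative assumptions; the corpus directions and setting names follow the text.

```python
# Translation direction (source, target) of each corpus.
CORPORA = {
    "Fisher-CallHome": ("es", "en"),
    "Librispeech": ("en", "fr"),
    "ST-TED": ("en", "de"),
}

# Corpora concatenated in each multilingual setting.
CONFIGS = {
    "O2M": ["Librispeech", "ST-TED"],
    "M2Ma": ["Librispeech", "Fisher-CallHome"],
    "M2Mb": ["Fisher-CallHome", "ST-TED"],
    "M2Mc": ["Fisher-CallHome", "Librispeech", "ST-TED"],
}


def languages(config):
    """Return sorted (source languages, target languages) for a setting."""
    pairs = [CORPORA[name] for name in CONFIGS[config]]
    sources = sorted({src for src, _ in pairs})
    targets = sorted({tgt for _, tgt in pairs})
    return sources, targets


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, languages(name))
```

Each target language in a setting corresponds to one target language ID token in the shared decoder vocabulary.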

Experimental evaluations

For data pre-processing of references in all languages, we lowercased the text and normalized punctuation, followed by tokenization with the tokenizer.perl script in the Moses toolkit (https://github.com/moses-smt/mosesdecoder). For source references, we further removed all punctuation marks except for apostrophes. We report case-insensitive BLEU computed with the multi-bleu.perl script in Moses. The character vocabulary was created jointly from both source and target languages.
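The source-side normalization and the joint character vocabulary can be approximated with the Python sketch below. It is not a replacement for the Moses scripts: it skips Moses tokenization and punctuation normalization, it only strips ASCII punctuation (so marks such as the Spanish inverted question mark would need extra handling), and the special symbols are assumptions.

```python
import string


def normalize_source(text):
    """Lowercase and remove ASCII punctuation except the apostrophe."""
    drop = string.punctuation.replace("'", "")
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    return " ".join(cleaned.split())


def build_char_vocab(source_texts, target_texts, specials=("<unk>", "<eos>")):
    """Build one character vocabulary shared by source and target text."""
    lines = list(source_texts) + list(target_texts)
    chars = sorted({ch for line in lines for ch in line})
    return {tok: idx for idx, tok in enumerate(list(specials) + chars)}


if __name__ == "__main__":
    src = [normalize_source("Hello, World! It's fine.")]
    tgt = ["bonjour le monde !"]
    print(src[0])
    print(len(build_char_vocab(src, tgt)))
```

In the multilingual setting, target language ID tokens such as `<2fr>` would be added to the same vocabulary as extra special symbols.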

We used 80-channel log-mel filterbank coefficients with 3-dimensional pitch features, computed with a 25 ms window shifted every 10 ms using Kaldi, resulting in 83-dimensional features per frame. The features were normalized by the mean and the standard deviation of each training set. We augmented the speech data by a factor of 3 with speed perturbation. We removed utterances having more than 3000 frames or more than 400 characters for GPU memory efficiency.
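The framing arithmetic and the length filter can be made concrete with a short sketch. The frame-count formula assumes that only complete windows are kept (no padding at the utterance edges), which is an assumption about the feature extractor configuration; the thresholds follow the text.

```python
FEATURE_DIM = 80 + 3  # 80 log-mel filterbank channels + 3 pitch features


def num_frames(duration_ms, window_ms=25, shift_ms=10):
    """Number of complete analysis windows in an utterance (no padding)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms


def keep_utterance(n_frames, n_chars, max_frames=3000, max_chars=400):
    """Apply the length filter: drop overly long speech or text."""
    return n_frames <= max_frames and n_chars <= max_chars


if __name__ == "__main__":
    frames = num_frames(12_000)  # a 12-second utterance
    print(FEATURE_DIM, frames, keep_utterance(frames, n_chars=150))
```

Under this framing, the 3000-frame limit corresponds to utterances of roughly 30 seconds.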

The speech encoders in the ASR and ST models were composed of two VGG-like blocks followed by 5 layers of 1024-dimensional (per direction) bidirectional long short-term memory (BLSTM). Each VGG-like block was composed of 2 CNN layers with a $3\times 3$ filter followed by a max-pooling layer with a stride of $2\times 2$, which resulted in a 4-fold time reduction overall. The text encoders in the MT models were composed of 2 layers of 1024-dimensional (per direction) BLSTM. Both transcription and translation decoders were two layers of unidirectional LSTM with 1024-dimensional memory cells. The dimensions of the attention layer and the decoder embeddings were set to 1024. We used a 2-layer LSTM LM with 1024 memory cells for shallow fusion, as discussed in Section 2.1.1.
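The 4-fold time reduction follows from applying a stride-2 max-pooling layer once per VGG-like block. The sketch below computes the resulting encoder sequence length; floor division at each pooling step is an assumption, since the rounding mode depends on the implementation.

```python
def encoder_output_length(n_frames, n_vgg_blocks=2, pool_stride=2):
    """Sequence length after the VGG-like blocks.

    Each block ends with max-pooling of stride `pool_stride` along time,
    so two blocks reduce the length by a factor of 2 * 2 = 4.
    """
    length = n_frames
    for _ in range(n_vgg_blocks):
        length //= pool_stride  # floor rounding is an assumption
    return length


def output_frame_shift_ms(shift_ms=10, n_vgg_blocks=2, pool_stride=2):
    """Effective time step between encoder outputs, in milliseconds."""
    return shift_ms * pool_stride ** n_vgg_blocks


if __name__ == "__main__":
    print(encoder_output_length(3000), output_frame_shift_ms())
```

With the 10 ms frame shift used for feature extraction, each BLSTM input step would therefore cover 40 ms of speech.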

Training was performed using Adadelta for sequence-to-sequence models and Adam for the RNNLM. For regularization, we adopted dropout, label smoothing, scheduled sampling, and weight decay. Beam search decoding was performed with a beam width of 20 with CTC and LM scores in the ASR task, as described in Section 2.1.1, and with a beam width of 10 and a length penalty in the ST and MT tasks. Detailed hyperparameter settings for training and decoding are available in our code.

5.2 Baseline results: Bilingual systems

First, we evaluate baseline bilingual MT and ST systems. Bilingual E2E-ST and pipeline-ST models are labeled (E-B-1) and (P-B) in each table, respectively.

(A) Fisher-CallHome Spanish: Es $\to$ En

We present our results on Fisher-CallHome Spanish (hereafter, Fisher-CallHome) in Table 3. ASR and NMT results were competitive with the previous work, while the E2E-ST and pipeline-ST models underperformed it. Note that our translation decoders in the E2E-ST and NMT models were trained to predict lowercased references with punctuation marks to allow comparison with multilingual models, unlike previous works, where all punctuation marks except for apostrophes are removed. Comparing our E2E-ST and pipeline-ST models, the baseline bilingual E2E-ST model (E-B-1) outperformed the pipeline-ST model (P-B) on the Fisher sets but underperformed it on the CallHome sets. To investigate this discrepancy, we evaluated both with a single reference on the Fisher sets, which resulted in 26.4/28.2/27.7 (Pipe-ST) vs. 23.5/25.2/24.8 (E2E-ST); the pipeline system was shown to be better. This is intuitive since the E2E-ST model skips the ASR decoder, the RNNLM in the source language, and the MT encoder.

In our preliminary experiments, we confirmed that the E2E-ST model can outperform the pipeline system when more BLSTM layers are stacked on top of the speech encoder to match the number of parameters between them. Moreover, pre-training the speech encoder and translation decoder with the corresponding ASR encoder and NMT decoder also drastically improved performance (see Table 5 in Section 5.3). However, our goal in this paper is to show the effectiveness of multilingual training for E2E-ST models, so we do not pursue these directions here.

(B) Librispeech: En $\to$ Fr

Next, results on Librispeech are shown in Table 3. Monolingual ASR, bilingual E2E-ST (E-B-1), and pipeline-ST (P-B) models outperformed the previous work. The baseline bilingual E2E-ST model (E-B-1) showed competitive performance compared to the pipeline-ST model (P-B).

(C) ST-TED: En $\to$ De

Results on ST-TED are shown in Table 4. Contrary to the above results, there is a large gap between the bilingual E2E-ST (E-B-1) and pipeline-ST (P-B) models in this corpus.

5.3 Main results: Multilingual systems

We now test multilingual models trained in two scenarios: many-to-many (M2M) and one-to-many (O2M) translations.

Many-to-many (M2M)

Results of M2M models on Fisher-CallHome, Librispeech, ST-TED are shown at the (*-Ma/Mb/Mc-1) lines in Table 3, Table 3, and Table 4, respectively. Ma, Mb, and Mc represent M2Ma, M2Mb, and M2Mc, respectively (see Table 1).

In Fisher-CallHome (Table 3), our M2M multilingual E2E-ST models (E-Mb/Mc-1) significantly outperformed the bilingual one (E-B-1), while (E-Ma-1) slightly outperformed (E-B-1) except on Fisher/test. Among the three M2M E2E-ST models, (E-Mc-1) showed the best performance, which confirms that additional training data from other language pairs is effective. Multilingual ASR models slightly outperformed the monolingual ASR model. The performance of the MT models was degraded by multilingual training due to the domain mismatch, especially for punctuation marks (see Table 1). In contrast, multilingual E2E-ST models were not affected by the domain mismatch issue since they are not conditioned on the source language text, which is one of the advantages of end-to-end models.

In all pipeline systems in Fisher-CallHome, we used the bilingual MT model since it showed the best performance. Pipeline systems with the multilingual ASR (P-M $\ast$ ) were consistently improved even though WER improvements were very small. Our multilingual E2E-ST models significantly outperformed all the pipeline models in the Fisher sets.

In Librispeech (Table 3), all M2M E2E-ST models (E-Ma/Mc-1) outperformed the bilingual one (E-B-1). Multilingual ASR models also outperformed the monolingual one. Pipeline systems (P-Ma/Mc) improved in proportion to the WER improvements. However, E2E-ST models gained more from multilingual training.

In ST-TED (Table 4), we also confirmed consistent BLEU improvements from the proposed multilingual framework. The trends are similar to those observed in Fisher-CallHome and Librispeech.

One-to-many (O2M)

Results of O2M models on Librispeech and ST-TED are shown in Table 3 and Table 4, respectively. As in the M2M scenario, we obtained significant improvements of the E2E-ST models from multilingual training on both corpora. The amount of training data added to ST-TED is 99 hours for O2M (+Librispeech) and 170 hours for M2Mb (+Fisher-CallHome), yet the O2M E2E-ST model is better than the M2Mb E2E-ST model in ST-TED (see Table 4). We can therefore conclude that O2M training is more effective than M2M training in terms of data efficiency. However, the combination of all training data (M2Mc) yielded a further small gain. We can also confirm the effectiveness of O2M training from WER improvements in the ASR task (6.6 vs. 8.6 at the second and third lines from the bottom in Table 3). Thus, further additional multilingual training data could lead to further improvement. Gains from multilingual training were larger in the E2E-ST model (E-O-1) than in the best pipeline model (P-O), i.e., the best monolingual ASR followed by the best bilingual NMT, in Table 4. Considering that the O2M NMT model underperformed the bilingual one, O2M multilingual training benefits not only from additional English speech data but also from direct optimization, which is one of our motivations in this work.

Pre-training with the ASR encoder

Finally, we show results of pre-training with the ASR encoder in Table 5. We observed improvements by pre-training both in bilingual and multilingual cases, similar to . Pre-training with the NMT decoder was not necessarily effective. The best multilingual E2E-ST with pre-training (E-Mc-2) outperformed the corresponding best pipeline system in most test sets.

In summary, the proposed multilingual framework has been shown to be effective regardless of the language combination, corpus domain, and data size. Although it is possible to improve the pipeline systems by carefully designing the source representations passed between the ASR and MT modules (e.g., by adding a punctuation restoration module), the same issue can be overcome by simply optimizing the direct mapping from source speech to target text with punctuation marks, as we have shown.

Transfer learning for a very low-resource language speech translation

In this section, we evaluate the generalization of multilingual ST models by performing transfer learning to a very low-resource ST task. We used the Mboshi-French corpus (https://github.com/besacier/mboshi-french-parallel-corpus), which contains 4.4 hours of spoken utterances with the corresponding Mboshi transcriptions and French translations. Mboshi is a Bantu C25 language spoken in Congo-Brazzaville and does not have a standard orthography. We sampled 100 utterances from the training set as the validation set, and report results on the dev set (514 utterances) as in .

We tried four different ways to transfer a non-Mboshi E2E-ST model to this task. In the bilingual case, we used the bilingual ST model in Librispeech ((E-B-1) in Table 3) as seed, then fine-tuned on the Mboshi-French data. In the multilingual case, we tried seeding with multilingual ST models in M2Ma (E-Ma-1), M2Mc (E-Mc-1), and O2M (E-O-1) settings. All parameters including the output layer are transferred from pre-trained ST models and we do not include any characters in Mboshi transcriptions in the vocabularies. Note that French references appear in the target side of all seed models during the pre-training stage.

Results are shown in Table 6. Multilingual E2E-ST models are more effective than the bilingual one, and O2M showed the best performance among three models. Although our transferred models underperformed , it is worth mentioning that they used other English ASR data (Switchboard corpus) and initialized the decoder with the French ASR decoder. Further improvements could be possible by leveraging Mboshi transcriptions, but we did not use any prior knowledge about Mboshi characters. This is a desired scenario for endangered language documentation and quite useful for automatic word discovery .

Related work

In , the E2E-ST model is simultaneously optimized with an auxiliary ASR task by sharing the whole encoder parameters. Pre-training approaches from the ASR encoder and MT decoder are also investigated in . proposed a data augmentation strategy, where weakly-supervised paired data is generated from monolingual source text data with text-to-speech (TTS) and MT systems (similar to back translation ) and speech data with a pipeline ST system (similar to knowledge distillation ). proposed an efficient framework to better leverage higher-level intermediate representations by jointly attending to speech encoder and transcription decoder states. The most relevant work to ours is , where well-trained ASR parameters from the other language are used to initialize ST models and improve the ST performance in low-resource scenarios. Our work is distinct in that we focus on exploiting corpora in the multilingual setting and show that it outperforms the bilingual setting.

Multilingual ASR

In multilingual ASR studies, language-independent acoustic representations can be obtained by sharing parameters and then adapted to low-resource languages. Recently, this approach has been extended to end-to-end ASR paradigms: Connectionist Temporal Classification (CTC) and attention-based encoder-decoder models. Our work adopts this multilingual ASR in the pipeline system.

Multilingual NMT

Cross-lingual parameter sharing approaches have been investigated by tying a part of the parameters, and even all parameters with a shared vocabulary among multiple languages. The main drawback of the shared vocabulary is that its size grows rapidly in proportion to the number of language pairs, or that the capacity per language shrinks when BPE units are used. A fully character-level multilingual framework has been proposed to overcome this issue to some extent. Our work is in line with this trend of utilizing a universal translation model, applied here to one-to-many and many-to-many ST scenarios.

Conclusion and future work

We performed multilingual training and end-to-end speech translation jointly, which has not been investigated before. We proposed a universal sequence-to-sequence framework that outperformed the bilingual end-to-end models and narrowed the gap to strong pipeline systems. Its effectiveness was also confirmed by performing transfer learning to a very low-resource speech translation task. To encourage further research on this topic, we will release our code in a public project. In future work, we will support more languages in our codebase and investigate multilingual training with unrelated languages such as Chinese and Japanese.
