Curriculum Pre-training for End-to-End Speech Translation

Chengyi Wang, Yu Wu, Shujie Liu, Ming Zhou, Zhenglu Yang

Introduction

Speech-to-Text translation (ST) is essential to breaking the language barrier for communication. It aims to translate a segment of source language speech to the target language text. To perform this task, prior works either employ a cascaded method where an automatic speech recognition (ASR) model and a machine translation (MT) model are chained together or an end-to-end approach where a single model converts the source language audio sequence to the target language text sequence directly Berard et al. (2016).

Because it alleviates error propagation and lowers latency, the end-to-end ST model has been a hot topic in recent years. However, training such a model requires a large amount of paired source audio and target sentences, which is not available for most language pairs. To address this issue, previous works resort to pre-training techniques Berard et al. (2018); Bansal et al. (2019): they leverage the available ASR and MT data to pre-train an ASR model and an MT model respectively, and then initialize the ST model with the ASR encoder and the MT decoder. This strategy brings faster convergence and better results.

The end-to-end ST encoder has three inherent roles: transcribe the speech, extract the syntactic and semantic knowledge of the source sentence, and then map it to a semantic space, based on which the decoder can generate the correct target sentence. This places a heavy burden on the encoder, which can be alleviated by pre-training. However, we argue that the current pre-training method restricts the power of pre-trained representations. The encoder pre-trained on the ASR task mainly focuses on transcription: it learns the alignment between acoustic features and phonemes or words, but has no ability to capture linguistic knowledge or understand the semantics, both of which are essential for translation.

In order to teach the model to understand the sentence and incorporate the required knowledge, extra courses should be taken before learning translation. Motivated by this, we propose a curriculum pre-training method for end-to-end ST. As shown in Figure 1, we first teach the model transcription through the ASR task. After that, we design two tasks, named the frame-based masked language model (FMLM) task and the frame-based bilingual lexicon translation (FBLT) task, to enable the encoder to understand the meaning of a sentence and map words between languages. Finally, we fine-tune the model on ST data to obtain the translation ability.

For the FMLM task, we mask several segments of the input speech feature, each of which corresponds to a complete word. Then we let the encoder predict the masked word. This task aims to force the encoder to recognize the content of the utterance and understand the inner meaning of the sentence. In FBLT, for each speech segment that aligns with a complete word, whether or not it is masked, we ask the encoder to predict the corresponding target word. In this task, we give the model more explicit and strong cross-lingual training signals. Thus, the encoder has the ability to perform simple word translation and the burden on the ST decoder is greatly reduced. Besides, we adopt a hierarchical manner where different layers are guided to perform different tasks (first 8 layers for ASR and FMLM pre-training, and another 4 layers for FBLT pre-training). This is mainly because the three pre-training tasks have different requirements for language understanding and different output spaces. The hierarchical pre-training method can make the division of labor more clear and separate the incorporation of source semantic knowledge and cross-lingual alignments.

We conduct experiments on the LibriSpeech En-Fr and IWSLT18 En-De speech translation tasks, demonstrating the effectiveness of our pre-training method. The contributions of our paper are as follows: (1) We propose a novel curriculum pre-training method with three courses: transcription, understanding and mapping, in order to force the encoder to have the ability to generate necessary features for the decoder. (2) We propose two new tasks to learn linguistic features, FMLM and FBLT, which explicitly teach the encoder to do source language understanding and target language meaning mapping. (3) Experiments show that both the proposed courses are helpful for speech translation, and our proposed curriculum pre-training leads to significant improvements.

Related Work

Early work on speech translation used a cascade of an ASR model and an MT model Ney (1999); Matusov et al. (2005); Mathias and Byrne (2006), which exposes the MT model to ASR errors. Recent successes of end-to-end models in the MT field Bahdanau et al. (2015); Luong et al. (2015); Vaswani et al. (2017) and the ASR field Chan et al. (2016); Chiu et al. (2018) inspired research on end-to-end speech-to-text translation systems, which avoid error propagation and high latency.

In this research line, Berard et al. (2016) give the first proof of the potential for an end-to-end ST model. After that, pre-training, multitask learning, attention-passing and knowledge distillation have been applied to improve the ST performance Anastasopoulos et al. (2016); Duong et al. (2016); Berard et al. (2018); Weiss et al. (2017); Bansal et al. (2018, 2019); Sperber et al. (2019); Liu et al. (2019); Jia et al. (2019). However, none of them attempt to guide the encoder to learn linguistic knowledge explicitly. Recently, Wang et al. (2019b) propose to stack an ASR encoder and an MT encoder as a new ST encoder, which incorporates acoustic and linguistic knowledge respectively. However, the gap between these two encoders is hard to bridge by simply concatenating the encoders. Kano et al. (2017) propose structured-based curriculum learning for English-Japanese speech translation, where they use a new decoder to replace the ASR decoder and to learn the output from the MT decoder (fast track) or encoder (slow track). They formalize learning strategies from easier networks to more difficult network structures. In contrast, we focus on the curriculum learning in pre-training and increase the difficulty of pre-training tasks.

Curriculum Learning

Curriculum learning is a learning paradigm that starts from simple patterns and gradually increases to more complex patterns. This idea is inspired by the human learning process and is first applied in the context of machine learning by Bengio et al. (2009). The study shows that this training approach results in better generalization and speeds up the convergence. Its effectiveness has been verified in multiple tasks, including shape recognition Bengio et al. (2009), object classification Gong et al. (2016), question answering Graves et al. (2017), etc. However, most studies focus on how to control the difficulty of the training samples and organize the order of the learning data in the context of single-task learning.

Our method differs from previous works in two ways: (1) We leverage the idea of curriculum learning for pre-training. (2) We do not train the model on the ST task directly with more and more difficult training examples or use more and more difficult structures. Instead, we design a series of tasks with increased difficulty to teach the encoder to incorporate diverse knowledge.

Method

The overview of our training process is shown in Figure 2. It can be divided into three steps. First, we train the model towards the ASR objective $\mathcal{L}_{ASR}$ to learn transcription. We refer to this as the elementary course. Next, we design two advanced courses (tasks) to teach the model to understand a sentence and to map words between two languages, named the Frame-based Masked Language Model (FMLM) task and the Frame-based Bilingual Lexicon Translation (FBLT) task. In the FMLM task, we mask some speech segments and ask the encoder to predict the masked words. In the FBLT task, we ask the encoder to predict the target word for each speech segment that corresponds to a complete source word. In this stage, the encoder is updated by $\mathcal{L}_{ADV}$ . We adopt a hierarchical training scheme in which the bottom $N$ encoder blocks perform the ASR and FMLM tasks, since both require outputs in the source word space, and all $N_{e}$ blocks are used in the FBLT task. After the two-phase pre-training, the encoder is finally combined with a new decoder or a pre-trained MT decoder to perform the ST task towards $\mathcal{L}_{ST}$ .

The speech translation corpus usually contains speech-transcription-translation triples, denoted as $\mathcal{S}=\{(\bm{x},\bm{y^{s}},\bm{y^{t}})\}$ . Specifically, $\bm{x}=(x_{1},\cdots,x_{T_{x}})$ is a sequence of acoustic features extracted from the speech signal. $\bm{y^{s}}=(y_{1}^{s},\cdots,y_{T_{s}}^{s})$ and $\bm{y^{t}}=(y_{1}^{t},\cdots,y_{T_{t}}^{t})$ represent the corresponding transcription in the source language and the translation in the target language respectively. To pre-train the encoder, an extra ASR dataset $\mathcal{A}=\{(\bm{x},\bm{y^{s}})\}$ can be leveraged. Finally, the data for encoder pre-training is denoted as $\{(\bm{x},\bm{y^{s}})|(\bm{x},\bm{y^{s}})\in\mathcal{A}\vee(\bm{x},\bm{y^{s}},\bm{y^{t}})\in\mathcal{S}\}$ .
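The set definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Example` container and both function names are hypothetical and do not come from the paper or from ESPnet, and features and tokens are represented as plain lists.

```python
from typing import List, NamedTuple, Optional, Tuple

Frames = List[List[float]]


class Example(NamedTuple):
    """Hypothetical container for one corpus entry."""
    x: Frames                          # acoustic feature frames
    y_s: List[str]                     # source-language transcription tokens
    y_t: Optional[List[str]] = None    # target-language tokens (ST corpus only)


def pretraining_pairs(asr_data: List[Example],
                      st_data: List[Example]) -> List[Tuple[Frames, List[str]]]:
    """(x, y^s) pairs taken from the ASR set A and from the ST triples S."""
    pairs = [(ex.x, ex.y_s) for ex in asr_data]
    # Following the set definition, only (x, y^s) is kept from each ST triple.
    pairs += [(ex.x, ex.y_s) for ex in st_data]
    return pairs


def finetuning_pairs(st_data: List[Example]) -> List[Tuple[Frames, List[str]]]:
    """(x, y^t) pairs used for ST fine-tuning, drawn from S only."""
    return [(ex.x, ex.y_t) for ex in st_data]
```

The split mirrors the text: the extra ASR data only ever contributes to encoder pre-training, while fine-tuning sees the ST triples alone.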

After the encoder is pre-trained, we fine-tune the model using only $\mathcal{S}$ , to enable it to generate $\bm{y^{t}}$ from $\bm{x}$ directly. The model is updated using the cross-entropy loss $\mathcal{L}_{ST}=-\log P(\bm{y^{t}}|\bm{x})$ .

In this work, we adopt the architecture of Transformer as in Karita et al. (2019). The encoder is a stack of two $3\times 3$ 2D CNN layers with stride 2 and $N_{e}$ Transformer encoder blocks. The CNN layers result in downsampling by a factor of 4. The decoder is a stack of $N_{d}$ Transformer decoder blocks.
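To make the factor-of-4 downsampling concrete, the following sketch computes how many time steps survive two stride-2 convolutions with a kernel of size 3. The padding is not stated in the text, so it is exposed as a parameter and treated as an assumption; both function names are hypothetical.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2, padding: int = 0) -> int:
    """Output length along the time axis of a single convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def encoder_frames(num_input_frames: int, padding: int = 0) -> int:
    """Time steps left after two 3x3, stride-2 CNN layers.

    ASSUMPTION: the padding value is not given in the paper; 0 is used here.
    """
    after_first = conv_out_len(num_input_frames, padding=padding)
    return conv_out_len(after_first, padding=padding)
```

For a 3000-frame utterance this gives 749 steps without padding and 750 with a padding of 1, both close to 3000 / 4.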

Elementary Course: Transcription

In the elementary course, we train an end-to-end ASR model, which has a similar architecture to the ST model. The ASR encoder consists of $N$ blocks, and these blocks are used to initialize the bottom $N$ blocks of the ST encoder. For the ASR task, we follow Karita et al. (2019) and employ a multi-task learning strategy in which both the E2E decoder and a CTC module predict the source sentence. Offline experiments indicate that the CTC objective is crucial for attentional encoder-decoder based ASR models. The final objective combines the CTC loss $\mathcal{L}_{ctc}$ and the cross-entropy loss $\mathcal{L}_{CE}$ :

In this work, we set $\alpha$ to 0.3. The CTC loss works on the encoder output and pushes the encoder to learn the frame-wise alignment between speech and words. The cross-entropy loss works on both the encoder and the ASR decoder.
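The weighted combination of the two ASR losses can be sketched as below. Which term $\alpha$ multiplies is not recoverable from the text here, so the sketch assumes that $\alpha$ weights the CTC term and $1-\alpha$ the cross-entropy term; the function name is hypothetical.

```python
def asr_loss(l_ctc: float, l_ce: float, alpha: float = 0.3) -> float:
    """Interpolate the CTC and cross-entropy losses for the elementary course.

    ASSUMPTION: alpha weights the CTC loss and (1 - alpha) the cross-entropy
    loss. The opposite assignment is equally compatible with the prose.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l_ctc + (1.0 - alpha) * l_ce
```

Under this assumption, $\alpha=0.3$ leaves most of the gradient signal with the attention decoder while the CTC term regularizes the encoder towards monotonic, frame-wise alignments.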

Advanced Courses: Understanding and Word Mapping

With the ability of transcription, we further propose two new tasks for the advanced courses.

The design of the Frame-based Masked Language Model task is inspired by the Masked Language Model (MLM) objective of BERT Devlin et al. (2019), and semantic mask for ASR task Wang et al. (2019a). This task enables the encoder to understand the inner meaning of a segment of speech.

After that, for a masked piece $[{x}_{s_{j}}:{x}_{e_{j}}]$ , we average the corresponding output hidden states $[h_{\lfloor\frac{s_{j}}{4}\rfloor}:h_{\lceil\frac{e_{j}}{4}\rceil}]$ (the position indices are divided by 4 due to downsampling) and compute the probability distribution over source words as follows:

In practice, the sentence is represented in BPE tokens and $W\in\mathcal{R}^{d_{model}\times|V_{s}|}$ , where $|V_{s}|$ is the size of source vocabulary. In this way, a speech piece can be aligned with one or more tokens. We compute KL-Divergence loss as:

$q({y}^{s}_{j})\in\mathcal{R}^{|V_{s}|}$ is a distribution over all BPE tokens in the source vocabulary $V_{s}$ and is defined as:

where $pos$ represents the dimension index and $n_{j}$ is the total number of BPE tokens contained in word ${y}^{s}_{j}$ .
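The FMLM computation described in this subsection can be sketched in plain Python. Several details are assumptions rather than the authors' formulation: the hidden-state span is treated as inclusive, the projection uses $W$ without a bias, $q$ places mass $1/n_{j}$ on each BPE token of the word (repeated tokens accumulate), and the divergence is taken as $\mathrm{KL}(q\,\|\,p)$. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def hidden_span(start_frame: int, end_frame: int, factor: int = 4) -> Tuple[int, int]:
    """Map input frames [s_j, e_j] to encoder positions [floor(s/4), ceil(e/4)]."""
    return start_frame // factor, math.ceil(end_frame / factor)


def average_states(states: Sequence[Sequence[float]], lo: int, hi: int) -> List[float]:
    """Mean of hidden states h_lo..h_hi (inclusive, clipped to the sequence)."""
    hi = min(hi, len(states) - 1)
    span = states[lo:hi + 1]
    return [sum(h[d] for h in span) / len(span) for d in range(len(span[0]))]


def project(h_bar: Sequence[float], W: Sequence[Sequence[float]]) -> List[float]:
    """Logits over the source vocabulary; W has shape d_model x |V_s| (no bias assumed)."""
    return [sum(h_bar[d] * W[d][v] for d in range(len(h_bar))) for v in range(len(W[0]))]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def target_distribution(word_token_ids: Sequence[int], vocab_size: int) -> List[float]:
    """q over BPE tokens: each of the n_j tokens of the word adds 1 / n_j (assumption)."""
    n_j = len(word_token_ids)
    q = [0.0] * vocab_size
    for tok in word_token_ids:
        q[tok] += 1.0 / n_j
    return q


def kl_divergence(q: Sequence[float], p: Sequence[float], eps: float = 1e-12) -> float:
    """KL(q || p); the direction is an assumption. Terms with q = 0 contribute nothing."""
    return sum(qi * math.log(qi / max(pi, eps)) for qi, pi in zip(q, p) if qi > 0.0)
```

For example, a word aligned to input frames 10 to 37 maps to encoder positions 2 to 10, and a word split into two BPE tokens receives a target with mass 0.5 on each of them.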

In this work, we use a mask ratio of 15% following BERT, and each masked speech piece is filled with the mean value of the whole utterance following Park et al. (2019). Because FMLM focuses on understanding the source language, we compute its loss at the $N$ -th layer of the encoder (the same layer as the ASR loss), in the hope that the bottom $N$ layers are concerned only with the source language.
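A minimal sketch of the masking step follows. It assumes that the 15% ratio is applied to the number of word-aligned segments (the text does not say whether it counts words or frames), that spans are inclusive frame indices from a forced aligner, and that at least one segment is masked per utterance; `mask_word_segments` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def mask_word_segments(features: Sequence[Sequence[float]],
                       word_spans: Sequence[Tuple[int, int]],
                       mask_ratio: float = 0.15,
                       seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Replace the frames of randomly chosen words with the utterance mean.

    Returns the masked copy of the features and the indices of the masked words.
    """
    masked = [list(frame) for frame in features]   # leave the input untouched
    if not word_spans:
        return masked, []
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    # ASSUMPTION: the ratio counts words, and at least one word is masked.
    n_mask = max(1, round(mask_ratio * len(word_spans)))
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(word_spans)), n_mask))
    for idx in chosen:
        start, end = word_spans[idx]
        for t in range(start, end + 1):
            masked[t] = list(mean)
    return masked, chosen
```

Masking whole word-aligned segments, rather than arbitrary frame ranges, is what lets each masked piece be paired with a complete word as its prediction target.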

Frame-based Bilingual Lexicon Translation

We compute $\mathcal{L}_{FBLT}$ at the top layer of the encoder, so that the top $N_{e}-N$ layers are responsible for bilingual word mapping. The final training objective in the advanced course, $\mathcal{L}_{ADV}$ , combines the FMLM and FBLT losses.
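The hierarchical supervision can be pictured as tapping the encoder stack at two depths: the output of block $N$ feeds the ASR and FMLM losses, and the output of the top block feeds the FBLT loss. The sketch below uses hypothetical callables in place of Transformer blocks, and it assumes an unweighted sum for the advanced-course objective, which the text does not confirm.

```python
from typing import Callable, List, Sequence, Tuple

States = List[List[float]]
Block = Callable[[States], States]


def encode_with_taps(blocks: Sequence[Block], inputs: States,
                     n_source: int) -> Tuple[States, States]:
    """Run all N_e blocks and return (output of block N, output of the top block)."""
    if not 1 <= n_source <= len(blocks):
        raise ValueError("n_source must be between 1 and the number of blocks")
    hidden = inputs
    source_tap = inputs
    for depth, block in enumerate(blocks, start=1):
        hidden = block(hidden)
        if depth == n_source:
            source_tap = hidden      # supervised by the ASR / FMLM losses
    return source_tap, hidden        # top output is supervised by the FBLT loss


def advanced_loss(l_fmlm: float, l_fblt: float) -> float:
    """ASSUMPTION: the two advanced-course losses are simply added."""
    return l_fmlm + l_fblt
```

With $N_{e}=12$ and $N=8$, the first tap is taken after the eighth block, leaving four blocks between the source-side supervision and the cross-lingual supervision.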

Experiments

We conduct experiments on two publicly available speech translation datasets: the LibriSpeech En-Fr Corpus Kocabiyikoglu et al. (2018) and the IWSLT En-De Corpus Niehues et al. (2018).

The LibriSpeech En-Fr corpus is a subset of the LibriSpeech ASR corpus Panayotov et al. (2015) aligned with French e-books, and contains 236 hours of speech in total. Following previous works, we use the 100-hour clean training set and double the ST data by concatenating the aligned references with the provided Google Translate references, resulting in 90k training instances. We validate on the dev set and report results on the test set (2048 utterances).

The IWSLT En-De corpus contains 271 hours of data, with English audio, an English transcription, and a German translation in each example. We follow Inaguma et al. (2019) to remove utterances of low alignment quality, resulting in 137k utterances. We sample 2k segments from the ST-TED corpus as the dev set and use tst2013 as the test set (993 utterances).

We run ESPnet (https://github.com/espnet/espnet) Watanabe et al. (2018) recipes to perform data pre-processing. For both tasks, our acoustic features are 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features, extracted with a step size of 10ms and a window size of 25ms. The features are normalized by the mean and the standard deviation of each training set. Utterances of more than 3000 frames are discarded. We perform speed perturbation with factors 0.9 and 1.1. The alignments between speech and transcriptions are obtained with the Montreal Forced Aligner McAuliffe et al. (2017).
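The numbers in this paragraph translate into simple arithmetic, sketched below. The frame count assumes that no padding is added at the utterance edges, which is an assumption about the feature extractor rather than something stated in the text; all names are hypothetical.

```python
from typing import List, Sequence

FEATURE_DIM = 80 + 3      # log-Mel filterbanks plus pitch features
MAX_FRAMES = 3000         # longer utterances are discarded


def num_frames(duration_s: float, window_ms: int = 25, step_ms: int = 10) -> int:
    """Frames produced for an utterance, assuming no padding at the edges."""
    total_ms = round(duration_s * 1000)
    if total_ms < window_ms:
        return 0
    return (total_ms - window_ms) // step_ms + 1


def keep_utterance(duration_s: float) -> bool:
    """Apply the 3000-frame length filter."""
    return num_frames(duration_s) <= MAX_FRAMES


def perturbed_durations(duration_s: float,
                        factors: Sequence[float] = (0.9, 1.1)) -> List[float]:
    """A speed factor f changes the duration to duration / f."""
    return [duration_s / f for f in factors]
```

With a 10ms step, the 3000-frame limit corresponds to roughly 30 seconds of audio, so a 30-second utterance is kept and a 31-second one is dropped. Note that slowing an utterance down by a factor of 0.9 lengthens it, which can push it over the limit.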

For reference pre-processing, we tokenize and lowercase all the text with the Moses scripts. For the pre-training tasks, the vocabulary is generated using sentencepiece Kudo and Richardson (2018) with a fixed size of 5k tokens for all languages, and the punctuation is removed. For the ST task, we normalize the punctuation using Moses and use a character-level vocabulary due to its better performance Berard et al. (2018). Since no human-annotated segmentation is provided for IWSLT tst2013, we use two methods to segment the audio: 1) Following ESPnet, we segment each audio with the LIUM SpkDiarization tool Meignier and Merlin (2010). For evaluation, the hypotheses and references are aligned using the MWER method with the RWTH toolkit Bender et al. (2004). 2) We perform sentence-level forced alignment between audio and transcription using the aeneas tool (https://www.readbeyond.it/aeneas) and segment the audio according to the alignment results.

Baselines

Experiments are conducted in two settings: base setting and expanded setting. In base setting, only the corpus described in Section 4.1 is used for each task. In the expanded setting, additional ASR and/or MT data can be used. All results are reported on case-insensitive BLEU with the multi-bleu.perl script unless noted.

We mainly compare our method with the conventional encoder pre-training method which uses only the ASR task to pre-train the encoder. Besides, we also compare with the results of the other works in the literature by copying their numbers.

In the base setting, Berard et al. (2018) and ESPnet have reported results on an LSTM-based ST model with pre-training and/or multi-task learning strategies. Liu et al. (2019) use a Transformer ST model and a knowledge distillation method. Wang et al. (2019b) stack an ASR encoder and an MT encoder for the final ST task, named TCEN. Regarding the expanded setting, Bahar et al. (2019) apply SpecAugment to the ST task. They use the total 236h of speech for ASR pre-training. Inaguma et al. (2019) combine three ST datasets with 472h of training data (LibriSpeech En-Fr, IWSLT En-De and Fisher-CallHome Es-En) to train a multilingual ST model. In our work, we use the LibriSpeech ASR corpus as additional pre-training data, including 960h of speech. As the $dev$ and $test$ sets of the LibriSpeech ST task are extracted from the 960h corpus, we exclude all training utterances from speakers that appear in the $dev$ or $test$ sets.
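The speaker-based exclusion can be sketched as a simple filter. The sketch assumes utterance identifiers in the LibriSpeech style `speaker-chapter-utterance`, so the speaker is the first hyphen-separated field; the identifiers in the usage line are made up for illustration, and the function names are hypothetical.

```python
from typing import Iterable, List


def speaker_of(utt_id: str) -> str:
    """Speaker field of an ID such as '1272-128104-0000' (format is an assumption)."""
    return utt_id.split("-")[0]


def exclude_heldout_speakers(train_ids: Iterable[str],
                             heldout_ids: Iterable[str]) -> List[str]:
    """Drop every training utterance spoken by a speaker seen in dev or test."""
    heldout_speakers = {speaker_of(u) for u in heldout_ids}
    return [u for u in train_ids if speaker_of(u) not in heldout_speakers]


# Illustrative IDs only: speaker 26 appears in the held-out set and is removed.
kept = exclude_heldout_speakers(
    ["19-198-0001", "26-495-0003", "19-227-0002"],
    ["26-496-0010"],
)
```

Filtering by speaker rather than by utterance is the stricter choice: it removes other recordings by the same voices, not only the exact utterances that overlap with the evaluation sets.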

For the IWSLT En-De task, since previous works use different segmentation methods and BLEU-score scripts, it is unfair to copy their numbers. In our work, we choose the ESPnet results as the base setting baseline, and the multilingual model and TCEN-LSTM model as expanded baselines. Inaguma et al. (2019) use the same multilingual model as described in the LibriSpeech baselines, and Wang et al. (2019b) use an additional 272h TEDLIUM2 Rousseau et al. (2014) ASR corpus and 41M parallel data from WMT18 and WIT3 (https://wit3.fbk.eu/mt.php?release=2017-01-trnted). All of them use the ESPnet code, the LIUM segmentation method and the multi-bleu.perl script. We follow Wang et al. (2019b) in using another 272h of ASR data for encoder pre-training and a subset of WMT18 (Europarl v7, Common Crawl, News Commentary v13 and the Rapid corpus of EU press releases) for decoder pre-training. We use the same processing method for the MT data, resulting in 4M parallel sentences in total. We also re-implement the CL-fast track of Kano et al. (2017) using our model architecture and data as another baseline.

Cascaded Baselines

For the LibriSpeech ST task, we use the results of Berard et al. (2018), Inaguma et al. (2019) and Liu et al. (2019) as base cascaded baselines. The first two use LSTM models for ASR and MT, while the last trains Transformer ASR and MT models. We build an expanded cascaded system with the pre-trained Transformer ASR model and an LSTM MT model with the default setting in the ESPnet recipe. For the IWSLT ST task, we use Inaguma et al. (2019) as the base cascaded baseline, which is based on the LSTM architecture, and we implement a Transformer-based baseline using our pre-trained ASR and MT models in the expanded setting.

Implementation Details

All our models are implemented based on ESPnet. We set the model dimension $d_{model}$ to 256, the head number $H$ to 4, the feed forward layer size $d_{ff}$ to 2048. For LibriSpeech expanded setting, $d_{model}=512$ and $H=8$ . For all the ST models, we set the number of encoder blocks $N_{e}=12$ and the number of decoder blocks $N_{d}=6$ . Unless noted, we use $N=8$ encoder blocks to perform the ASR and the FMLM pre-training tasks. For MT model used in IWSLT expanded setting, we use the Transformer architecture in Vaswani et al. (2017) with $N_{e}=6,N_{d}=6,H=4,d_{model}=256$ .

We train the model with 4 Tesla P40 GPUs and batch size is set to 64 per GPU. The pre-training takes 50 and 20 epochs for each phase and the final ST task takes another 50 epochs (a total of 120 epochs). We use the Adam optimizer with warmup steps 25000 in each phase. The learning rate decays proportionally to the inverse square root of the step number after 25000 steps. We save checkpoints every epoch and average the last 5 checkpoints as the final model. To avoid over-fitting, SpecAugment strategy Park et al. (2019) is used in ASR pre-training with frequency masking (F = 30, mF = 2) and time masking (T = 40, mT=2). The decoding process uses a beam size of 10 and a length penalty of 0.2.
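The learning-rate schedule and checkpoint averaging described above can be sketched as follows. The text gives the warmup length and the inverse-square-root decay but not the peak learning rate or the warmup shape, so `peak_lr` and the linear warmup are assumptions; checkpoints are reduced to dictionaries of scalar parameters purely for illustration.

```python
from typing import Dict, List


def learning_rate(step: int, peak_lr: float = 1.0e-3, warmup_steps: int = 25000) -> float:
    """Warm up to peak_lr, then decay with the inverse square root of the step.

    ASSUMPTIONS: linear warmup and the value of peak_lr (both hypothetical).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


def average_checkpoints(checkpoints: List[Dict[str, float]],
                        last_k: int = 5) -> Dict[str, float]:
    """Average each parameter over the last_k saved checkpoints."""
    recent = checkpoints[-last_k:]
    return {name: sum(ckpt[name] for ckpt in recent) / len(recent)
            for name in recent[0]}
```

Under these assumptions the rate reaches its peak at step 25000 and has halved by step 100000, since the decay depends on the square root of the step ratio.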

Experimental Results

LibriSpeech En-Fr: The results on LibriSpeech En-Fr test set are listed in Table 1. In base setting, our method improves the “Transformer+ASR pre-train” baseline by 1.7 BLEU and beats all the previous works, even though we do not pre-train the decoder. It indicates that through a well-designed learning process, the encoder has a strong potential to incorporate large amount of knowledge. Our method beats a knowledge distillation baseline, where an MT model is utilized to teach the ST model. The reason, we believe, is that our method gives the model more training signals and makes it easier to learn. We also outperform a TCEN baseline which includes two encoders. Compared to them, our method is more flexible and incorporates all information into a single encoder, which avoids the representation gap between the two encoders.

As the ASR data size increases, the model performs better. In the expanded setting, we find the FBLT task performs poorly compared with the base setting. This is because the target word prediction task is dictionary-supervised in expanded setting rather than reference-supervised as in base setting. However, our method still outperforms the simple pre-training method by a large margin. Besides, it is surprising to find that the end-to-end ST model is approaching the performance of an MT model, which is the upper bound of the ST model since it accepts golden source sentence without any ASR errors. This further verifies the effectiveness of our method.

IWSLT En-De: The results on IWSLT tst2013 are listed in Table 2, showing a similar trend to the LibriSpeech dataset. We find that the segmentation method has a big influence on the final results. In the base setting, our method improves the ASR pre-training baseline by 0.9 to 2.2 BLEU, depending on the segmentation method. In the expanded setting, we find that when combined with decoder pre-training, the performance is further improved and beats other expanded baselines.

Comparison with Cascaded Baselines

Table 3 shows the comparison with cascaded ST systems. In the base setting of both tasks, our end-to-end model achieves results comparable to or better than cascaded methods. This shows that the end-to-end model has powerful learning capabilities and combines the functions of two models. In the LibriSpeech expanded setting, when more ASR data is available, we also obtain competitive performance. This indicates that our method can make good use of the ASR corpus and learn valuable linguistic knowledge beyond simple acoustic information. However, when additional MT data is used, there is still a gap between the end-to-end method and the cascaded method. How to utilize bilingual parallel sentences to improve the E2E ST model is worth further study.

Analysis and Discussion

Ablation Study To better understand the contribution of each component, we perform an ablation study on the LibriSpeech expanded setting. The results are shown in Table 4. On the one hand, we show that both of our proposed pre-training tasks are beneficial: in the “-FMLM task” and “-FBLT task” settings, we perform single-task pre-training for the advanced course (in “-FBLT task”, we use a 12-layer encoder for ASR and FMLM pre-training for a fair comparison). The performance drops when we remove either one of them. On the other hand, we show that the two-phase pre-training paradigm is necessary: the “-phase 2” experiment degenerates to the simple ASR pre-training baseline. In the “-phase 1” setting, we find that without ASR pre-training, the training accuracy on the FMLM and FBLT tasks drops considerably, which in turn hurts ST performance. This means the ASR task is necessary for both the advanced courses and ST. In the “Multi3” setting, we pre-train the model on the ASR, FMLM and FBLT tasks in one phase. In this setting, we observe that multi-task learning also decreases individual task performance (ASR, FMLM and FBLT) compared to curriculum learning. One reasonable explanation is that it is hard to train on the FMLM and FBLT tasks, which take masked input, from randomly initialized parameters, and this also leads to performance degradation on the ST task.

Hyper-parameter $\mathbf{N}$ The layer at which the ASR and FMLM losses are computed during pre-training is an important hyper-parameter. We conduct experiments on the LibriSpeech base setting to explore the influence of different choices. We keep $N_{e}=12$ unchanged and always use the top layer to perform the FBLT task. Then we vary the hyper-parameter $N$ . We find that if $N=6$ , the model has difficulty converging during ST training. This may be because the bottom 6 encoder layers are too far from the decoder, so that the valuable source linguistic knowledge cannot be well utilized. Moreover, the model performs poorly if $N$ is 10 or 12, yielding 16.47 and 16.14 BLEU respectively, since too few blocks are left for the FBLT task. The model achieves the best performance when we choose $N=8$ . Thus, we use this setting in our main experiments.

Unlabeled Speech Data In this work, we also explore how to utilize unlabeled speech data in pre-training, but obtain only negative results. We conduct exploratory experiments on the LibriSpeech ST task. We treat the $(\bm{x},\bm{y^{s}})$ pairs from the 100h ST corpus as labeled pre-training data and $(\bm{x})$ from the 960h LibriSpeech ASR corpus as unlabeled data. Following Jiang et al. (2019), we design an unsupervised pre-training task for the elementary course, in which we randomly mask 15% of the fbank features and let the bottom 4 encoder layers predict the masked part. We compute the L1 loss between the predicted and ground-truth filterbanks. However, we find that this method does not help the final ST task: it yields a BLEU score of 16.85, lower than our base setting model (pre-trained without extra data). How to use unlabeled speech data remains an open question.
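The unsupervised objective explored here can be sketched as masked-frame reconstruction with an L1 loss. The text does not say whether whole frames or individual feature values are masked, nor what the masked positions are filled with; the sketch assumes whole frames replaced by zeros and computes the loss on masked positions only. Both function names are hypothetical, and the predictor itself is left out.

```python
import random
from typing import List, Sequence, Tuple


def mask_frames(features: Sequence[Sequence[float]], ratio: float = 0.15,
                seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Zero out a random 15% of the frames (whole-frame masking is an assumption)."""
    masked = [list(frame) for frame in features]
    if not masked:
        return masked, []
    n_mask = max(1, round(ratio * len(masked)))
    positions = sorted(random.Random(seed).sample(range(len(masked)), n_mask))
    for t in positions:
        masked[t] = [0.0] * len(masked[t])
    return masked, positions


def l1_loss(prediction: Sequence[Sequence[float]],
            target: Sequence[Sequence[float]],
            positions: Sequence[int]) -> float:
    """Mean absolute error between predicted and true filterbanks at masked frames."""
    total, count = 0.0, 0
    for t in positions:
        for p, g in zip(prediction[t], target[t]):
            total += abs(p - g)
            count += 1
    return total / count if count else 0.0
```

A model that simply echoed the zero-filled input would score the mean absolute value of the masked frames, which gives a trivial reference point for the reconstruction loss.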

Conclusion and Future Work

This paper investigates the end-to-end method for ST. We propose a curriculum pre-training method, consisting of an elementary course with an ASR loss and two advanced courses with a frame-based masked language model loss and a frame-based bilingual lexicon translation loss, in order to teach the model syntactic and semantic knowledge in the pre-training stage. Empirical studies demonstrate that our model significantly outperforms the baselines. In the future, we will explore how to leverage unlabeled speech data and large bilingual text data to further improve the performance. We also expect that the idea of curriculum pre-training can be applied to other NLP tasks.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. U1636116 and the Ministry of Education Humanities and Social Science project under Grant 16YJC790123.