Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Renjie Zheng, Junkun Chen, Mingbo Ma, Liang Huang

Introduction

In recent years, task-agnostic text representation learning (Peters et al., 2018; Devlin et al., 2019; Sun et al., 2019) has attracted much attention in the NLP community due to its strong performance on many downstream tasks. More recently, unsupervised speech representation learning (Baevski et al., 2020; Chen et al., 2020; Liu et al., 2020a) has also improved many speech-related tasks, such as speech recognition and speech translation.

However, all these existing methods can handle only one modality, either text or speech, while a joint acoustic and text representation is desired for many end-to-end spoken language processing tasks, such as spoken question answering (Chuang et al., 2019) and end-to-end speech-to-text translation (Liu et al., 2020b). For example, end-to-end speech translation (ST) is desired due to its advantages over the pipeline paradigm, such as low latency, alleviation of error propagation, and fewer parameters (Weiss et al., 2017; Bérard et al., 2018; Jia et al., 2019; Sperber et al., 2017; Zheng et al., 2020; Chen et al., 2021). However, its translation quality is limited by the scarcity of large-scale parallel speech translation data, while sufficient data exists for speech recognition and text machine translation (Fig. 1). It would be helpful if source speech and bilingual text could be encoded into a unified representation via abundant speech recognition and text machine translation data. Liu et al. (2020b) show that jointly training a multi-modal ST encoder can largely improve translation quality. However, their representation learning method is constrained to the sequence-to-sequence framework, and no experiment shows whether it can benefit from extra speech recognition and machine translation data.

Inspired by recent cross-lingual language model pre-training work (Lample & Conneau, 2019), which shows the potential to unify the representations of different languages in one encoder, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM). This model jointly learns a unified representation for both acoustic and text input. In this way, we extend, for the first time, the masked language model’s input from acoustic-only or text-only data to multimodal corpora containing both, such as speech recognition and speech translation corpora (Fig. 1).

We further extend this Fused Acoustic and Text encoder to a sequence-to-sequence framework and present an end-to-end Speech Translation model (FAT-ST). This enables the model to be trained on both speech translation and text machine translation data within a single encoder-decoder model. Meanwhile, this model can also learn from speech recognition data using an extra FAT-MLM loss. This resolves the limitation of existing single encoder-decoder speech translation models, which can only learn from scarce parallel speech translation data and neglect the much larger-scale speech recognition and text machine translation data (Fig. 1).

We make the following contributions:

We propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), which can learn a unified acoustic and text representation.

Based on FAT-MLM, we propose the Fused Acoustic and Text Speech Translation model (FAT-ST), which can do speech recognition and machine translation in a single encoder-decoder framework.

Spontaneous speech translation experiments on three language pairs show that, by finetuning from FAT-MLM, FAT-ST improves over the end-to-end speech translation baseline by $+4.65$ BLEU on average and achieves state-of-the-art results. This is the first time that an end-to-end speech translation model achieves performance similar to a strong cascaded system in these three translation directions of this dataset, while still maintaining a smaller model size and faster decoding time.

We show that FAT-MLM trained with additional speech recognition, machine translation, and monolingual text data can improve FAT-ST by $+1.25$ BLEU. FAT-ST can be further improved by using additional speech recognition and machine translation data.

Previous Work

Radford et al. (2018), Howard & Ruder (2018) and Devlin et al. (2019) investigate language modeling for pretraining Transformer encoders. Unlike Radford et al. (2018), who use unidirectional language models for pretraining, Devlin et al. (2019) propose BERT, which enables deep bidirectional representation pretraining through a masked language modeling (MLM) objective inspired by the Cloze task (Taylor, 1953): some of the input tokens are randomly masked, and the objective is to recover each masked token based only on its context. These approaches lead to drastic improvements on several natural language understanding tasks, including text classification (Wang et al., 2018) and question answering (Rajpurkar et al., 2016).

Translation Language Modeling

Lample & Conneau (2019) extend MLM to cross-lingual pretraining by proposing two methods: an unsupervised one that relies only on monolingual data, and a supervised one that leverages parallel data with a new cross-lingual language model objective called the Translation Language Model (TLM). As shown in Fig. 2(b), TLM encodes both the source and target sentences of a parallel pair after masking several tokens with [MASK], and then learns to recover the masked tokens. Experiments show that TLM achieves state-of-the-art results on cross-lingual classification and on unsupervised and supervised machine translation.
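The way a TLM training example is assembled can be illustrated with a minimal sketch. The function name `build_tlm_example`, the language ids 0/1, and the masking probability of 0.15 are assumptions made for illustration, not details from Lample & Conneau (2019); the sketch also always substitutes [MASK], whereas BERT-style training mixes in random and unchanged tokens.

```python
import random

MASK = "[MASK]"

def build_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Concatenate a parallel sentence pair and mask random tokens.

    mask_prob and the 0/1 language ids are illustrative assumptions.
    """
    rng = random.Random(seed)
    tokens = list(src_tokens) + list(tgt_tokens)
    # Language ids tell the encoder which side each token comes from.
    lang_ids = [0] * len(src_tokens) + [1] * len(tgt_tokens)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return masked, lang_ids, targets
```

Because both sentences sit in the same input sequence, a masked source token can be recovered by attending to its translation on the target side, which is what encourages aligned representations.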

Masked Acoustic Model

Recently, Chen et al. (2020) propose to learn a speech encoder in a self-supervised fashion on the speech side, which can utilize speech data without transcription. This technique, termed Masked Acoustic Modeling (MAM), can also perform pretraining on any acoustic signals (including non-speech ones) without annotation. Fig. 2(c) illustrates the architecture of MAM. Similar to MLM, MAM replaces a span of the speech spectrogram with mask tokens [MASK]. After a 2D convolution layer and a Transformer encoder, MAM learns to recover the masked spectrogram via a 2D deconvolution layer during training. Chen et al. (2020) show that MAM can improve end-to-end speech translation as either an additional loss or a pretraining model. In parallel to MAM, Baevski et al. (2020) propose the wav2vec 2.0 pretraining model, which masks the speech input in the latent space and pretrains the model via a contrastive task defined over a quantization of the latent representations.
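Span masking over a spectrogram can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mask_spans`, the fixed span length of 10 frames, and the constant mask value are hypothetical choices, and the 30% ratio is the masking ratio reported later in the training details.

```python
import random

def mask_spans(frames, mask_ratio=0.3, span_len=10, mask_value=0.0, seed=0):
    """Replace random contiguous spans of frames with a constant mask vector.

    frames: list of per-frame feature vectors (lists of floats).
    span_len and mask_value are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(frames)
    is_masked = [False] * n
    target = int(n * mask_ratio)
    attempts = 0
    # Keep drawing spans until roughly mask_ratio of the frames are covered;
    # the attempt cap keeps the loop finite when spans overlap heavily.
    while sum(is_masked) < target and attempts < 10 * n:
        start = rng.randrange(n)
        for i in range(start, min(start + span_len, n)):
            is_masked[i] = True
        attempts += 1
    dim = len(frames[0]) if frames else 0
    masked = [[mask_value] * dim if m else list(f)
              for f, m in zip(frames, is_masked)]
    return masked, is_masked
```

The returned `is_masked` flags mark the frames the model is asked to reconstruct; the unmasked frames are passed through unchanged.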

Fused Acoustic and Text Masked Language Model (FAT-MLM)

Although existing pretraining models show a strong representation learning ability and significantly improve many downstream tasks, each of them can learn a representation for only one modality, either text or speech. However, a unified speech and text multi-modal representation is useful for many end-to-end spoken language processing tasks.

To address this problem, we propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), a multimodal pretraining model that encodes acoustic and text input into a unified representation. The idea is similar to that of Lample & Conneau (2019), who propose to learn a unified representation of different languages. Their first method relies on a shared sub-word vocabulary to align the representations of different languages. However, this is inapplicable in our case because of the modality difference. We therefore propose a method that, like their second approach (TLM), relies on parallel data, in our case speech recognition data. In the following sections, we first introduce the monolingual FAT-MLM and then show how to extend it to the translation scenario.

As shown in Fig. 3(c), we first randomly mask several spans of $\mathbf{s}$ by a random masking function over the input $\mathbf{s}$:

where $g$ is a reconstruction function (we use 2D deconvolution in this work) which tries to recover the original signal from the encoded representation $f([e_{\hat{\mathbf{s}}};\hat{\mathbf{x}}])$. We use mean squared error to measure the difference between $\mathbf{s}$ and the reconstructed spectrogram. For the transcription input $\mathbf{x}$, following Devlin et al. (2019), we use a cross-entropy loss, denoted as

to reconstruct the masked token. The final loss for monolingual FAT-MLM is:
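Using the notation of this subsection, the masking steps and the losses can be written out as follows. This is a reconstruction from the surrounding text rather than a verbatim statement: the names $\text{Mask}_{\text{span}}$, $\ell_{\mathbf{s}}$, $\ell_{\mathbf{x}}$ and $\ell_{\text{FAT-MLM}}$ are assumed labels (with $\text{Mask}_{\text{span}}$ chosen by analogy with $\text{Mask}_{\text{token}}$), and the unweighted sum in the last line is assumed by analogy with the translation variant.

```latex
\begin{align}
\hat{\mathbf{s}} &\sim \text{Mask}_{\text{span}}(\mathbf{s},\lambda), \qquad
\hat{\mathbf{x}} \sim \text{Mask}_{\text{token}}(\mathbf{x},\lambda) \\
\ell_{\mathbf{s}}(D_{\mathbf{s},\mathbf{x}}) &=
  \sum_{(\mathbf{s},\mathbf{x})\in D_{\mathbf{s},\mathbf{x}}}
  \big\| \mathbf{s} - g\big(f([e_{\hat{\mathbf{s}}};\hat{\mathbf{x}}])\big) \big\|_2^2 \\
\ell_{\mathbf{x}}(D_{\mathbf{s},\mathbf{x}}) &=
  -\sum_{(\mathbf{s},\mathbf{x})\in D_{\mathbf{s},\mathbf{x}}}
  \log p\big(\mathbf{x} \mid [e_{\hat{\mathbf{s}}};\hat{\mathbf{x}}]\big) \\
\ell_{\text{FAT-MLM}}(D_{\mathbf{s},\mathbf{x}}) &=
  \ell_{\mathbf{s}}(D_{\mathbf{s},\mathbf{x}}) + \ell_{\mathbf{x}}(D_{\mathbf{s},\mathbf{x}})
\end{align}
```

Here $e_{\hat{\mathbf{s}}}$ denotes the acoustic embeddings of the masked speech $\hat{\mathbf{s}}$, $[\cdot\,;\cdot]$ is concatenation along the sequence axis, and $f$ is the shared Transformer encoder.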

Translation FAT-MLM

To support multimodal crosslingual tasks such as speech translation, we propose Translation FAT-MLM, which extends Monolingual FAT-MLM by additionally taking the target-language translation of the source-language transcription as input. Formally, Translation FAT-MLM takes $D_{\mathbf{s},\mathbf{x},\mathbf{y}}=\{(\mathbf{s},\mathbf{x},\mathbf{y})\}$ as input, where $\mathbf{y}=[y_{1},...,y_{|y|}]$ denotes the sequence of target language translation. This kind of triplet input is very common in speech translation corpora.

As shown in Fig. 3(d), we incorporate a source language embedding $e_{\text{src}}$ and a target language embedding $e_{\text{tgt}}$ to indicate the language of each input. Similar to Monolingual FAT-MLM, Translation FAT-MLM randomly masks the translation input, $\hat{\mathbf{y}}\sim\text{Mask}_{\text{token}}(\mathbf{y},\lambda)$, and concatenates it with the other two embeddings:

Then we reconstruct the masked input from the concatenated embeddings $\mathbf{h}_{\mathbf{s},\mathbf{x},\mathbf{y}}$ via a Transformer encoder. The reconstruction losses for the different masked inputs are:

We sum these loss functions for the final loss function of Translation FAT-MLM:
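One form of these definitions that is consistent with the text is sketched below. It rests on assumptions: the language embeddings are taken to be added to each segment before concatenation (as in cross-lingual language models), and the loss names $\ell_{\mathbf{s}}$, $\ell_{\mathbf{x}}$, $\ell_{\mathbf{y}}$, $\ell_{\text{FAT-MLM}}$ are assumed labels.

```latex
\begin{align}
\mathbf{h}_{\mathbf{s},\mathbf{x},\mathbf{y}} &=
  [\, e_{\hat{\mathbf{s}}} + e_{\text{src}};\;
      \hat{\mathbf{x}} + e_{\text{src}};\;
      \hat{\mathbf{y}} + e_{\text{tgt}} \,] \\
\ell_{\mathbf{s}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}}) &=
  \sum_{(\mathbf{s},\mathbf{x},\mathbf{y})\in D_{\mathbf{s},\mathbf{x},\mathbf{y}}}
  \big\| \mathbf{s} - g\big(f(\mathbf{h}_{\mathbf{s},\mathbf{x},\mathbf{y}})\big) \big\|_2^2 \\
\ell_{\mathbf{x}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}}) &=
  -\sum_{(\mathbf{s},\mathbf{x},\mathbf{y})\in D_{\mathbf{s},\mathbf{x},\mathbf{y}}}
  \log p\big(\mathbf{x} \mid \mathbf{h}_{\mathbf{s},\mathbf{x},\mathbf{y}}\big) \\
\ell_{\mathbf{y}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}}) &=
  -\sum_{(\mathbf{s},\mathbf{x},\mathbf{y})\in D_{\mathbf{s},\mathbf{x},\mathbf{y}}}
  \log p\big(\mathbf{y} \mid \mathbf{h}_{\mathbf{s},\mathbf{x},\mathbf{y}}\big) \\
\ell_{\text{FAT-MLM}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}}) &=
  \ell_{\mathbf{s}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}})
  + \ell_{\mathbf{x}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}})
  + \ell_{\mathbf{y}}(D_{\mathbf{s},\mathbf{x},\mathbf{y}})
\end{align}
```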

To fully utilize the corpora for different tasks, FAT-MLM can take any combination of the speech, transcription, and translation triplet $D_{2^{\{\mathbf{s},\mathbf{x},\mathbf{y}\}}}$ as input, where $2^{\{\mathbf{s},\mathbf{x},\mathbf{y}\}}$ is the power set of the $\{\mathbf{s},\mathbf{x},\mathbf{y}\}$ triplet. Specifically, these combinations include speech-only data $\{\mathbf{s}\}$, monolingual text data $\{\mathbf{x}\}$ or $\{\mathbf{y}\}$, speech and transcription tuples $\{(\mathbf{s},\mathbf{x})\}$ for speech recognition, transcription and translation tuples $\{(\mathbf{x},\mathbf{y})\}$ for machine translation, speech and translation tuples $\{(\mathbf{s},\mathbf{y})\}$ for direct speech translation, and speech, transcription, and translation triplets $\{(\mathbf{s},\mathbf{x},\mathbf{y})\}$. For each combination of input, FAT-MLM encodes the full concatenation of the embeddings and recovers the masked portion. The loss function is:
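The seven input combinations listed above are the non-empty subsets of the triplet, which a short sketch can enumerate. The helper names `input_combinations` and `active_losses` and the `loss_*` labels are hypothetical, and the rule of one reconstruction loss per modality present is an assumption drawn from the description of recovering the masked portion of whatever input is given.

```python
from itertools import combinations

MODALITIES = ("s", "x", "y")  # speech, transcription, translation

def input_combinations(items=MODALITIES):
    """All non-empty subsets of the triplet (the power set minus the empty set)."""
    return [c for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

# Task that each combination corresponds to, as listed in the text.
TASK = {
    ("s",): "speech-only data",
    ("x",): "monolingual source text",
    ("y",): "monolingual target text",
    ("s", "x"): "speech recognition",
    ("x", "y"): "machine translation",
    ("s", "y"): "direct speech translation",
    ("s", "x", "y"): "speech translation triplets",
}

def active_losses(combo):
    """Assumed rule: one reconstruction loss per modality present in the input."""
    return ["loss_" + m for m in combo]
```

Calling `input_combinations()` yields exactly the keys of `TASK`, so every corpus type in the list can be routed through the same encoder with only the set of active losses changing.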

Attention Visualization

To demonstrate FAT-MLM’s ability to unify the representations of different modalities and languages, we show the self-attention layers of a translation FAT-MLM in Fig. 4 and 5. The clear monotonic attention in Fig. 4 shows that our proposed method can learn a good representation for speech (Chen et al., 2020). Fig. 5(a) shows that FAT-MLM can learn a good crosslingual alignment between two languages, such as "and" to "Und" and "you" to "Sie". Fig. 5(b) shows that FAT-MLM is able to learn a clear monotonic speech-to-text crossmodal attention, like many speech recognition models.

Fused Acoustic and Text Speech Translation (FAT-ST)

In this section, we present how to adapt FAT-MLM to speech translation and enable speech translation models to learn from speech recognition and text machine translation.

At training time, we maximize the conditional probability of each ground-truth target sentence $\mathbf{y}^{\star}$ given input $\mathbf{x}$ over the whole training data $D_{\mathbf{x},\mathbf{y}}$, or equivalently minimize the following loss:

Different from text machine translation, speech translation takes speech features $\mathbf{s}=(s_{1},...,s_{|\mathbf{s}|})$ as input. As in the speech input portion of FAT-MLM, these speech features are converted from the speech signals (e.g., a spectrogram). Formally, the decoding and training of speech translation models can be defined as follows:
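In standard notation, the text translation loss and the speech translation decoding and training objectives described above can be written as below. The loss names $\ell_{\text{MT}}$ and $\ell_{\text{ST}}$ and the dataset symbol $D_{\mathbf{s},\mathbf{y}}$ are assumed labels; the forms themselves are the usual negative log-likelihood and maximum a posteriori decoding.

```latex
\begin{align}
\ell_{\text{MT}}(D_{\mathbf{x},\mathbf{y}}) &=
  -\sum_{(\mathbf{x},\mathbf{y}^{\star})\in D_{\mathbf{x},\mathbf{y}}}
  \log p(\mathbf{y}^{\star} \mid \mathbf{x}) \\
\hat{\mathbf{y}} &= \operatorname*{argmax}_{\mathbf{y}} \; p(\mathbf{y} \mid \mathbf{s}) \\
\ell_{\text{ST}}(D_{\mathbf{s},\mathbf{y}}) &=
  -\sum_{(\mathbf{s},\mathbf{y}^{\star})\in D_{\mathbf{s},\mathbf{y}}}
  \log p(\mathbf{y}^{\star} \mid \mathbf{s})
\end{align}
```

The only difference between the two losses is the conditioning input: the source text $\mathbf{x}$ for machine translation and the speech features $\mathbf{s}$ for speech translation.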

FAT-ST

To boost the performance of end-to-end speech translation, we propose to enable speech translation to encode both acoustic and text features as input by simply adapting the architecture of monolingual FAT-MLM to a Fused Acoustic and Text Speech Translation model (FAT-ST).

Please note that the speech recognition and machine translation data can either be part of the speech translation data or come from additional datasets. In practice, we also find that the CTC loss (Graves et al., 2006) helps improve translation quality, so we include it in all experiments.

Finetuning FAT-ST from Translation FAT-MLM

Similar to Lample & Conneau (2019), we can further improve FAT-ST by finetuning from FAT-MLM. Since the FAT-ST decoder predicts text only, we initialize it from the acoustic and text shared Transformer encoder. Although the Transformer decoder is unidirectional, unlike the bidirectional FAT-MLM, it still benefits from FAT-MLM in our experiments. This is also observed by Lample & Conneau (2019) and Devlin et al. (2019).
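One way to picture this initialization is a name-based weight copy. The sketch below is hypothetical: parameters are plain Python lists keyed by made-up names, and it assumes that decoder parameters with an encoder counterpart (for example self-attention and feed-forward weights) are copied while decoder-only parameters such as cross-attention keep their random initialization. The text does not specify this detail.

```python
def init_decoder_from_encoder(encoder_params, decoder_params):
    """Copy encoder weights into decoder slots that share a name and size.

    Both arguments map parameter names to flat lists of floats
    (an illustrative stand-in for real tensors).
    """
    initialized = dict(decoder_params)  # shallow copy; inputs stay untouched
    copied = []
    for name, value in encoder_params.items():
        if name in decoder_params and len(value) == len(decoder_params[name]):
            initialized[name] = list(value)
            copied.append(name)
    return initialized, copied
```

Parameters that appear only in the decoder are left as they were, so the returned `copied` list shows exactly which weights came from the pretrained encoder.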

Experiments

We conducted speech translation experiments in 3 directions: English to German (En $\to$ De), English to Spanish (En $\to$ Es), and English to Dutch (En $\to$ Nl) to show the translation quality of baselines and our proposed methods.

We use 5 corpora with different modalities and languages: speech translation data $D_{\mathbf{s},\mathbf{x},\mathbf{y}}$ from Must-C (Di Gangi et al., 2019), speech recognition data $D_{\mathbf{s},\mathbf{x}}$ from Librispeech (Panayotov et al., 2015), machine translation and monolingual text data $D_{\mathbf{x},\mathbf{y}},D_{\mathbf{x}},D_{\mathbf{y}}$ from Europarl V7 (Koehn, 2005), speech-only data $D_{\mathbf{s}}$ from Libri-Light (medium version) (Kahn et al., 2020), and monolingual text data from Wiki Text (only for Nl). Statistics of the datasets are shown in Table 1. We evaluate our models on the Must-C dev and test sets. Note that Must-C is collected from spontaneous speech (TED talks), which is very different from the audiobook speech datasets used in our experiments. Spontaneous speech is much harder to translate than audiobook speech such as Libri-trans (Kocabiyikoglu et al., 2018). This is one of the reasons why end-to-end speech translation lags much further behind cascaded systems on Must-C than on other speech translation corpora.

Training Detail

Raw audio files are processed by Kaldi (Povey et al., 2011) to extract 80-dimensional log-Mel filterbanks stacked with 3-dimensional pitch features using a window size of 25 ms and a step size of 10 ms. We train sentencepiece (Kudo & Richardson, 2018) models with a joint vocabulary size of 8K for the text in each dataset. Training samples with more than 3000 frames are ignored for GPU efficiency. Our basic Transformer-based E2E-ST framework has settings similar to ESPnet-ST (Inaguma et al., 2020). The speech input is first down-sampled with 2 layers of 2D convolution of size 3 with a stride of 2. Then a standard 12-layer Transformer with feed-forward layers of hidden size 2048 bridges the source and target sides. We use only 4 attention heads on each side of the Transformer, each with a dimensionality of 256. We also show the results of a FAT-ST big model with a hidden size of 4096 for the feed-forward layers of all Transformer layers. For the speech reconstruction module, we simply linearly project the outputs of the Transformer encoder to another latent space, then upsample the latent representation with a 2-layer deconvolution to match the size of the original input signal. We choose 30% for the random masking ratio $\lambda$ across all the experiments, including pretraining. During inference, we do not perform any masking over the speech input. We average the last 5 checkpoints for testing. For decoding, we use beam search with a beam size of 5 and a length penalty of 0.6 for German, 0.0 for Spanish and 0.3 for Dutch.
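Two of the settings above lend themselves to a small worked example: the sequence-length reduction from the two stride-2 convolutions, and checkpoint averaging. The sketch below makes assumptions that the text does not state: the convolutions are taken to use no padding, and a checkpoint is represented as a dictionary of flat float lists. The function names are hypothetical.

```python
def conv_out_len(n, kernel=3, stride=2, padding=0):
    """Output length of one convolution along one axis (padding=0 is assumed)."""
    return (n + 2 * padding - kernel) // stride + 1

def downsample(n, layers=2, **conv_kwargs):
    """Apply the same convolution length formula `layers` times."""
    for _ in range(layers):
        n = conv_out_len(n, **conv_kwargs)
    return n

def average_checkpoints(checkpoints):
    """Element-wise mean of parameter dicts that share the same keys."""
    k = len(checkpoints)
    return {
        name: [sum(vals) / k for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }
```

Under the no-padding assumption, `downsample(3000)` gives 749, so the longest allowed utterance of 3000 frames reaches the Transformer as 749 positions, and `downsample(83)` gives 20 along the 83-dimensional feature axis (80 filterbank plus 3 pitch features).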

Translation Quality Comparisons

We compare the translation accuracy of FAT-ST against the following baselines in Table 2 and Table 3:

ST: this is the vanilla speech translation system which does not use transcriptions.

ST + ASR MTL: ST model with an additional ASR decoder, trained with ASR multi-task learning using the transcriptions.

ST + ASR & MT MTL: ST model with an additional ASR decoder and an MT encoder. It is trained with ASR and MT multi-task learning.

ST + MAM + ASR MTL: ST trained with MAM loss and ASR multi-task learning.

Liu et al. (2020b): An end-to-end ST system with a multimodal encoder.

Le et al. (2020): The state-of-the-art end-to-end ST model with an extra ASR decoder.

Cascade: cascaded model which first transcribes the speech and then passes the transcription to a machine translation system.

ST + ASR & MT pretraining: the encoder of ST is initialized from a pretrained ASR encoder and the decoder from a pretrained MT decoder.

Pino et al. (2020): They propose to leverage additional speech data by generating pseudo-translations using a cascaded or an end-to-end speech translation model.

Table 4 shows the number of parameters of different pretraining models. We can see that our FAT-MLM base model is a little bit larger than the MAM pretraining model, and the FAT-MLM big model is much larger than the base model.

In Table 2, with no pretraining, our proposed FAT-ST base model achieves the best results except for Le et al. (2020) and the cascaded model. However, our base model has far fewer parameters than both of them. Models with ASR or MT MTL and Liu et al. (2020b) all use the transcription data in the Must-C dataset but perform worse, which suggests that our model uses transcription data more efficiently. Similar to other open-source ST implementation results on Must-C (ESPnet: https://github.com/espnet/espnet), our implementation of ST + ASR & MT MTL is worse than ST + ASR MTL.

We also compare the performance of models finetuned from different pretraining models. When pretraining on Must-C only, FAT-ST (base) improves by 0.85 BLEU when finetuned from FAT-MLM, while its performance drops when finetuned from MAM. Meanwhile, our proposed methods achieve much better performance than the ASR & MT pretraining baselines. We also note that our FAT-ST base model for the first time achieves performance similar to the Cascade baselines in these three translation directions of Must-C, while being much smaller in size and faster in inference than the cascaded model (see Fig. 7).

Pretraining with Additional Data

Table 3 shows that FAT-MLM can further improve FAT-ST by simply adding speech recognition data $D_{\mathbf{s},\mathbf{x}}$ (Librispeech), text machine translation data $D_{\mathbf{x},\mathbf{y}}$ (Europarl), and even speech-only data $D_{\mathbf{s}}$ (Libri-light) and monolingual text data $D_{\mathbf{x}}\cup D_{\mathbf{y}}$. This shows the good representation learning ability of our proposed FAT-MLM models. With more data, the performance of our big model increases much faster than that of the base model, because the base model has too few parameters to learn from such large data.

Finetuning with Additional Data

The last part of Table 2 shows that FAT-ST can be improved by learning from extra speech recognition and machine translation data. This is promising because speech translation data is very limited compared with the much more abundant speech recognition and machine translation data. Different from Pino et al. (2020), who propose to leverage additional speech data by generating pseudo-translations, our method does not use any pseudo-labels. Our best model outperforms their result on En $\to$ De while using a $7\times$ smaller model and almost $10\times$ less speech data.

Performance of Auxiliary MT Task

Table 5 shows the translation quality of auxiliary MT task of FAT-ST. Although our models trained with Must-C are worse than the MT baseline, by using FAT-MLM trained with more data, our proposed methods can easily outperform the MT baseline. Note that these models’ parameters are tuned to optimize speech translation task and MT is just an auxiliary task.

Ablation Study

Table 6 shows an ablation study of our proposed method. We can see that all the components contribute to the final performance.

English $\to$ Chinese Speech Translation

We also compare several models on the TED English $\to$ Chinese speech translation task (Liu et al., 2019), with 524 hours of speech in the training set, a 1.5-hour validation set (dev2010) and a 2.5-hour test set (tst2015). We follow our previous experiments to preprocess the data. As in previous work, we evaluate performance with character-level BLEU. Table 8 shows that our proposed model largely outperforms the other baselines. Table 7 shows one example from this dataset. The translation of the cascaded model is wrong because of errors in its ASR output (their $\to$ their, of who $\to$ to do), while our FAT-ST produces the right translation.

Decoding Speed

Fig. 7 shows the decoding speed comparison between the Cascade model and our proposed FAT-ST. Our proposed FAT-ST model is almost $2\times$ faster than the Cascade system which needs to wait for the speech recognition module to finish before starting to translate. The decoding time of FAT-ST (big) is almost the same as FAT-ST (base) because we only increase the feedforward network in Transformers.

Conclusion

In this paper, we propose the Fused Acoustic and Text Masked Language Model (FAT-MLM), which learns a unified representation for text and speech from any data that combines speech and text. We further extend this framework to a sequence-to-sequence speech translation model, which enables learning from speech recognition and text-based machine translation data for the first time. Our results show significant improvement on three translation directions of the Must-C dataset and outperform the cascaded baseline.

Acknowledgements

We thank Kenneth Church and Jiahong Yuan for discussions, Juneki Hong for proofreading, and the anonymous reviewers for suggestions.