Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier
Introduction
While cascade speech-to-text translation (ST) systems operate in two steps, source language automatic speech recognition (ASR) followed by source-to-target text machine translation (MT), recent works have attempted to build end-to-end ST without using source language transcription during decoding [Bérard et al., 2016, Weiss et al., 2017, Bérard et al., 2018]. After two years of extensions to these pioneering works, the latest results of the IWSLT 2020 shared task on offline speech translation [Ansari et al., 2020] demonstrate that end-to-end models are now on par with (if not better than) their cascade counterparts. Such a finding motivates even more strongly the work on multilingual (one-to-many, many-to-one, many-to-many) ST [Gangi et al., 2019, Inaguma et al., 2019, Wang et al., 2020a], for which end-to-end models are well adapted by design. Moreover, the two approaches sit at opposite ends: the cascade offers only a very loose integration of ASR and MT (even if lattices or word confusion networks were used between ASR and MT before end-to-end models appeared), while most end-to-end approaches simply ignore the ASR subtask and try to translate directly from source speech to target text. We believe that these are two extreme design choices and that a tighter coupling of ASR and MT is desirable for future end-to-end ST applications, in which the display of transcripts alongside translations can be beneficial to the users [Sperber et al., 2020].
This paper addresses multilingual ST and investigates more closely the interactions between speech transcription (ASR) and speech translation (ST) in a multilingual end-to-end architecture based on the Transformer. While those interactions were previously investigated as a simple multi-task framework for a bilingual case [Anastasopoulos and Chiang, 2018], we propose a dual-decoder with an ASR decoder tightly coupled with an ST decoder and evaluate its effectiveness on one-to-many ST. Our model is inspired by Liu et al. [2020], but the interaction between ASR and ST decoders is much tighter. (The model of Liu et al. [2020] does not have interaction between internal hidden states of the decoders, cf. Section 3.4.) Finally, experiments show that our model outperforms theirs on the MuST-C benchmark [Di Gangi et al., 2019].
Our contributions are summarized as follows: (1) a new model architecture for joint ASR and multilingual ST; (2) an integrated beam search decoding strategy which jointly transcribes and translates, and that is extended to a wait-$k$ strategy where the ASR hypothesis is ahead of the ST hypothesis by $k$ tokens and vice-versa; and (3) competitive performance on the MuST-C dataset in both bilingual and multilingual settings and improvements on previous joint ASR/ST work.
Related Work
Multilingual translation [Johnson et al., 2016] consists in translating between different language pairs with a single model, thereby improving maintainability and the quality of low resource languages. Gangi et al. [2019] adapt this method to one-to-many multilingual speech translation by adding a language embedding to each source feature vector. They also observe that using the source language (English) as one of the target languages improves performance. Inaguma et al. [2019] simplify the previous approach by prepending a target language token to the decoder and apply it to one-to-many and many-to-many speech translation. They do not investigate many-to-one due to the lack of a large corpus for this. To fill this void, Wang et al. [2020a] release the CoVoST dataset for ST from 11 languages into English and demonstrate the effectiveness of many-to-one ST.
Joint ASR and ST
Joint ASR and ST decoding was first proposed by Anastasopoulos and Chiang [2018] through a multi-task learning framework. Chuang et al. [2020] improve multitask ST by using word embedding as an intermediate level instead of text. A two-stage model that first performs ASR and then passes the decoder states as input to a second ST model was also studied previously [Anastasopoulos and Chiang, 2018, Sperber et al., 2019]. This architecture is closer to cascaded translation while maintaining end-to-end trainability. Sperber et al. [2020] introduce the notion of consistency between transcripts and translations and propose metrics to gauge it. They evaluate different model types for the joint ASR and ST task and conclude that end-to-end models with a coupled inference procedure are able to achieve strong consistency. In addition to existing models having coupled architectures, they also investigate a model where the transcripts are concatenated to the translations, and the shared encoder-decoder network learns to predict these concatenated outputs. It should be noted that our models have lower latency compared to this approach, since the concatenation of outputs makes the two tasks sequential in nature. Our work is closely related to that of Liu et al. [2020], who propose an interactive attention mechanism which enables ASR and ST to be performed synchronously. Both the ASR and ST decoders rely not only on their previous outputs but also on the outputs predicted in the other task. We highlight three differences between their work and ours: (a) we propose a more general framework in which [Liu et al., 2020] is a special case; (b) tighter integration of ASR and ST is proposed in our work; and (c) we experiment in a multilingual ST setting while previous works on joint ASR and ST only investigated bilingual ST.
Dual-decoder Transformer for Joint ASR and Multilingual ST
We now present the proposed dual-decoder Transformer for jointly performing ASR and multilingual ST. Our models are based on the Transformer architecture [Vaswani et al., 2017] but consist of two decoders. Each decoder is responsible for one task (ASR or ST). The intuition is that the problem at hand consists in solving two different tasks with different characteristics and different levels of difficulty (multilingual ST is considered more difficult than ASR). Having different decoders specialized in different tasks may thus produce better results. In addition, since these two tasks can be complementary, it is natural to allow the decoders to help each other. Therefore, in our models, we introduce a dual-attention mechanism: in addition to attending to the encoder, the decoders also attend to each other.
The model takes as input a sequence of speech features $x = (x_1, \dots, x_{T_x})$ in a specific source language (e.g. English) and outputs a transcription $y = (y_1, \dots, y_{T_y})$ in the same language as well as translations $z^1, z^2, \dots, z^M$ in $M$ different target languages (e.g. French, Spanish, etc.). When $M = 1$, this corresponds to joint ASR and bilingual ST [Liu et al., 2020]. For simplicity, our presentation considers only a single target language with output $z = (z_1, \dots, z_{T_z})$. All results, however, apply to the general multilingual case. In the sequel, denote $y_{<t} = (y_1, \dots, y_{t-1})$ and $y_{>t} = (y_{t+1}, \dots, y_{T_y})$ ($y_t$ is included if “$<$” and “$>$” are replaced by “$\le$” and “$\ge$” respectively). In addition, assume that $y_t$ is ignored if $t$ is outside of the interval $[1, T_y]$. Notations apply to $x$ and $z$ as well.
The dual-decoder model jointly predicts the transcript and translation in an autoregressive fashion.
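The autoregressive factorization can be sketched as follows. The notation is an assumption made for this sketch: $x$ denotes the input speech features, $y$ and $z$ the transcript and the translation, $y_{<t}$ and $z_{<t}$ the tokens produced before step $t$, and $T$ the number of joint decoding steps (for instance the length of the longer output, with out-of-range tokens ignored).

```latex
% Sketch under assumed notation: x = speech features, y = transcript,
% z = translation, T = number of joint decoding steps.
\begin{equation*}
p(y, z \mid x) \;=\; \prod_{t=1}^{T} p\bigl(y_t, z_t \mid y_{<t}, z_{<t}, x\bigr)
\end{equation*}
```

Each factor predicts one transcript token and one translation token together, conditioned on everything both decoders have produced so far.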
A natural model would consist of a single decoder followed by a softmax layer. However, even if the capacity of the decoder were large enough for handling both ASR and ST generation, a single softmax would require a very large joint vocabulary (with size $V_y \times V_z$, where $V_y$ and $V_z$ are respectively the vocabulary sizes for the transcript $y$ and the translation $z$). Instead, our dual-decoder consists of two sub-decoders that are specialized in producing outputs tailored to the ASR and ST tasks separately. Formally, our model predicts the next pair of output tokens given a pair of previous outputs, with each sub-decoder producing the distribution for its own task.
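To make the vocabulary argument concrete, the following minimal sketch compares the number of output classes of one joint softmax with that of two task-specific softmaxes. The function name and the vocabulary sizes are hypothetical and chosen only for illustration.

```python
def output_layer_sizes(asr_vocab: int, st_vocab: int) -> dict:
    """Compare output classes: one joint softmax vs. two task-specific ones."""
    return {
        # One class per (ASR token, ST token) pair.
        "joint_softmax": asr_vocab * st_vocab,
        # One class per token, separately for each task.
        "two_softmaxes": asr_vocab + st_vocab,
    }


# Hypothetical vocabulary sizes, for illustration only.
sizes = output_layer_sizes(asr_vocab=8000, st_vocab=8000)
print(sizes)
```

The joint softmax grows with the product of the two vocabularies, while the two separate softmaxes grow only with their sum.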
We have also assumed so far that the sub-decoders start at the same time, which is the most basic configuration. In practice, however, one may allow one sequence to advance $k$ steps compared to the other, known as the wait-$k$ policy [Ma et al., 2019]. For example, if ST waits for ASR to produce its first $k$ tokens, then at each step the ST decoder predicts its $t$-th token while the ASR decoder predicts its $(t+k)$-th token, and the joint distribution changes accordingly.
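The timing implied by a wait-$k$ policy can be illustrated with a small scheduling sketch for the case where ST waits for ASR. The function `wait_k_schedule` and its 0-based positions are assumptions made here for illustration; they describe which token each decoder predicts at every joint step, not the model itself.

```python
from typing import List, Optional, Tuple


def wait_k_schedule(
    asr_len: int, st_len: int, k: int
) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return, for each joint decoding step, the (ASR position, ST position)
    being predicted when ST starts only after ASR has produced k tokens.
    A position of None means that decoder is idle at that step."""
    if k < 0:
        raise ValueError("k must be non-negative")
    total_steps = max(asr_len, st_len + k)
    schedule = []
    for step in range(total_steps):
        asr_pos = step if step < asr_len else None
        st_pos = step - k if 0 <= step - k < st_len else None
        schedule.append((asr_pos, st_pos))
    return schedule


# ASR produces 4 tokens, ST produces 3 tokens, ST waits for k = 2 ASR tokens.
print(wait_k_schedule(asr_len=4, st_len=3, k=2))
# [(0, None), (1, None), (2, 0), (3, 1), (None, 2)]
```

With `k = 0` both decoders advance in lockstep, which corresponds to the basic configuration in which the sub-decoders start at the same time.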
In the next section, we propose two concrete architectures for the dual-decoder, corresponding to different levels of dependencies between the two sub-decoders (ASR and ST). Then, we show that several known models in the literature are special cases of these architectures.
3.2 Parallel and cross dual-decoder Transformers
The first architecture is called parallel dual-decoder Transformer, which has the highest level of dependencies: one decoder uses the hidden states of the other to compute its outputs, as illustrated in Figure 1(a). The encoder consists of an input embedding layer followed by a positional embedding and a number of self-attention and feed-forward network (FFN) layers whose inputs are normalized [Ba et al., 2016]. This is almost the same as the encoder of the original Transformer [Vaswani et al., 2017] (we refer to the corresponding paper for further details), except that the embedding layer in our encoder is a small convolutional neural network (CNN) [Fukushima and Miyake, 1982, LeCun et al., 1989] of two strided layers with ReLU activations, which reduces the length of the input sequence. (All the illustrations in this paper are for the so-called pre-LayerNorm configuration, in which the input of the layer is normalized. Likewise, if the output is normalized instead, the configuration is called post-LayerNorm. Since pre-LayerNorm is known to perform better than post-LayerNorm [Wang et al., 2019, Nguyen and Salazar, 2019], we only conducted experiments for the former, although our implementation supports both.)
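The length reduction performed by a convolutional embedding layer can be estimated with the standard convolution output-length formula. The sketch below is only illustrative: the kernel size of 3, the stride of 2 and the absence of padding are assumed values, not settings taken from the text.

```python
def conv_output_length(length: int, kernel: int = 3, stride: int = 2, padding: int = 0) -> int:
    """Standard output-length formula for a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


def subsampled_length(num_frames: int, num_layers: int = 2, **conv_kwargs: int) -> int:
    """Sequence length after stacking `num_layers` strided convolutions."""
    length = num_frames
    for _ in range(num_layers):
        length = conv_output_length(length, **conv_kwargs)
    return length


# With the assumed settings, 1000 input frames shrink to roughly a quarter.
print(subsampled_length(1000))  # 249
```

Shortening the sequence before the self-attention layers matters because the cost of self-attention grows quadratically with the sequence length.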
Our second proposed architecture is called cross dual-decoder Transformer, which is similar to the previous one, except that now the dual-attention layers receive the previous decoding-step outputs of the other decoder, as illustrated in Figure 1(c). Thanks to this design, each prediction step can be performed separately on the two decoders. The hidden representations in (2) produced by the decoders can thus be decomposed into one term per decoder. This decomposition is clearly not possible for the parallel dual-decoder Transformer.
3.3 Special cases
In this section, we present some special cases of our dual-decoder architecture and discuss their links to existing models in the literature.
When there is no dual-attention, the two decoders become independent. In this case, the joint prediction probability factorizes simply as the product of the ASR and ST sequence probabilities. Therefore, all prediction steps are separable and thus this model is the most computationally efficient. In the literature, this model is often referred to as multi-task [Anastasopoulos and Chiang, 2018, Sperber et al., 2020].
Chained decoders
Another special case corresponds to the extreme wait-$k$ policy, in which one decoder waits for the other to completely finish before starting its own decoding. For example, if ST waits for ASR, then the joint prediction probability is the probability of the whole transcript multiplied by the probability of the translation conditioned on that transcript. This model is called triangle in previous work [Anastasopoulos and Chiang, 2018, Sperber et al., 2020]. A special case of this model is when the second decoder in the chain is not directly connected to the encoder, also referred to as two-stage [Sperber et al., 2019, Sperber et al., 2020]. (It is also called cascade by Anastasopoulos and Chiang [2018]; we omit this term to avoid confusion with the common cascade models that are typically not trained end-to-end.) Note that our chained decoders (both triangle and two-stage) are end-to-end.
To summarize, the presented models encode joint probability distributions with different levels of dependencies between the two decoders. A similar formalization for the wait-$k$ policy (7) can be obtained in a straightforward manner. Note that for independent decoders, the distribution is the same as in the non-wait-$k$ case.
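The cases discussed in this section can be contrasted as follows. This is a sketch under assumed notation ($x$ the input speech features, $y$ the transcript, $z$ the translation, out-of-range tokens ignored), written to show the dependency structure rather than to reproduce the exact equations of the paper.

```latex
% Sketch under assumed notation; requires the amsmath package.
\begin{align*}
\text{dual-decoder:} \quad
  & p(y, z \mid x) = \prod_{t} p\bigl(y_t, z_t \mid y_{<t}, z_{<t}, x\bigr) \\
\text{independent decoders:} \quad
  & p(y, z \mid x) = \prod_{t} p\bigl(y_t \mid y_{<t}, x\bigr)\, p\bigl(z_t \mid z_{<t}, x\bigr) \\
\text{chained decoders (ST waits for ASR):} \quad
  & p(y, z \mid x) = \prod_{t} p\bigl(y_t \mid y_{<t}, x\bigr) \prod_{t} p\bigl(z_t \mid z_{<t}, y, x\bigr)
\end{align*}
```

In the independent case neither decoder conditions on the other's output, whereas in the chained case the translation conditions on the complete transcript.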
3.4 Variants
The previous section presents special cases of our formulation at a high level. In this section, we introduce different fine-grained variants of the dual-decoder Transformers used in the experiments (Section 5).
Instead of using all the dual-attention layers, one may want to allow only one-way attention: either ASR attends to ST or the reverse, but not both.
At-self or at-source dual-attention
In each decoder block, there are two different attention layers, which we respectively call self-attention (bottom) and source-attention (top). (Source-attention is also referred to as cross-attention in the literature; we use a different name to avoid confusion with the cross dual-decoder.) For each, there is an associated dual-attention, named respectively dual-attention at self and dual-attention at source. In the experiments, we study the case where either only the at-self or at-source attention layers are retained.
Merging operators
For the sum operator, in particular, we perform experiments with either a learnable or a fixed merging weight.
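The two merging operators used in the experiments, sum and concat, can be sketched on plain vectors as follows. This is a minimal illustration under assumptions made here: `main` stands for the output of a decoder's own attention layer, `dual` for the output of the associated dual-attention layer, `lam` for the merging weight, and `weight` for a linear projection; none of these names come from the paper, and biases are omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def merge_sum(main: Vector, dual: Vector, lam: float = 1.0) -> Vector:
    """Sum merging: add the dual-attention output, scaled by lam, to the main one."""
    return [m + lam * d for m, d in zip(main, dual)]


def merge_concat(main: Vector, dual: Vector, weight: Matrix) -> Vector:
    """Concat merging: concatenate both outputs, then project back to the
    model dimension with a linear map of shape (d_model, 2 * d_model)."""
    stacked = main + dual  # list concatenation, length 2 * d_model
    return [sum(w * s for w, s in zip(row, stacked)) for row in weight]


main = [1.0, 2.0]
dual = [0.5, -1.0]
print(merge_sum(main, dual, lam=0.5))                     # [1.25, 1.5]
print(merge_concat(main, dual, [[1.0, 0.0, 1.0, 0.0],
                                [0.0, 1.0, 0.0, 1.0]]))  # [1.5, 1.0]
```

In a trained model `lam` would be either a constant or a learnable scalar, and `weight` would be learned; the sum operator adds no projection parameters, while concat adds a projection per merging point.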
The model proposed by Liu et al. [2020] is a special case of our cross dual-decoder Transformer with no dual-attention at source, no layer normalization for the input embeddings (Figure 1(c)), and sum merging with a fixed weight.
Training and Decoding
4.2 Decoding
We present the beam search strategy used by our model. Since there are two different outputs (ASR and ST), one may naturally think about two different beams (with possibly some interactions). However, we found that a single joint beam works best for our model. In this beam search strategy, each hypothesis includes a tuple of ASR and ST sub-hypotheses. The two sub-hypotheses are expanded together and the score is computed based on the sum of log probabilities of the output token pairs. For a beam size $B$, the $B$ best hypotheses are retained based on this score. In this setup, both sub-hypotheses evolve jointly, which resembles the training process more closely than the case of two different beams. A limitation of this joint-beam strategy is that, in extreme cases, one of the tasks (ASR or ST) may only have a single hypothesis. Indeed, at a decoding step $t$, we take the $B$ best predictions $(y_t, z_t)$ in terms of their sum of scores; it can happen that, e.g., some ASR token $y_t$ has such a dominant score that it is selected for all the hypotheses, i.e. the $B$ (different) hypotheses have a single $y_t$ and $B$ different ST tokens $z_t$. We leave the design of a joint-beam strategy with enforced diversity to future work. Finally, to produce translations for multiple target languages in our system, it suffices to feed different language-specific tokens to the dual-decoder at decoding time.
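One expansion step of such a joint beam can be sketched as follows. This is a simplified illustration rather than the actual implementation: the function `joint_beam_step`, the dictionary-based log-probability inputs and the toy tokens are assumptions, and practical details such as end-of-sequence handling, candidate pruning and length penalties are left out.

```python
import heapq
from typing import Dict, List, Tuple

# (score, ASR tokens, ST tokens)
Hypothesis = Tuple[float, Tuple[str, ...], Tuple[str, ...]]


def joint_beam_step(
    beam: List[Hypothesis],
    asr_logprobs: List[Dict[str, float]],
    st_logprobs: List[Dict[str, float]],
    beam_size: int,
) -> List[Hypothesis]:
    """Expand every hypothesis with all (ASR token, ST token) pairs and keep
    the beam_size candidates with the highest summed log probability."""
    candidates: List[Hypothesis] = []
    for (score, asr, st), asr_lp, st_lp in zip(beam, asr_logprobs, st_logprobs):
        for a_tok, a_lp in asr_lp.items():
            for s_tok, s_lp in st_lp.items():
                candidates.append((score + a_lp + s_lp, asr + (a_tok,), st + (s_tok,)))
    return heapq.nlargest(beam_size, candidates, key=lambda h: h[0])


# Toy example: one ASR token dominates, so every kept hypothesis shares it.
beam = [(0.0, (), ())]
asr_lp = [{"hello": -0.01, "hallo": -5.0}]
st_lp = [{"bonjour": -0.7, "salut": -0.9, "allo": -2.0}]
new_beam = joint_beam_step(beam, asr_lp, st_lp, beam_size=2)
for _, asr_tokens, st_tokens in new_beam:
    print(asr_tokens, st_tokens)
```

The toy example reproduces the limitation described above: both retained hypotheses contain the same ASR token and differ only in the ST token.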
Experiments
To build a one-to-many model that can jointly transcribe and translate, we use MuST-C [Di Gangi et al., 2019], which is currently the largest publicly available one-to-many speech translation dataset. (Smaller datasets include Europarl-ST [Iranzo-Sánchez et al., 2020] and MaSS [Boito et al., 2020]. Recently, a very large many-to-many dataset called CoVoST-2 [Wang et al., 2020b] has been released, while its predecessor CoVoST [Wang et al., 2020a] only covers the many-to-one scenario.) MuST-C covers language pairs from English to eight different target languages: Dutch, French, German, Italian, Portuguese, Romanian, Russian, and Spanish. Each language direction includes a triplet of source input speech, source transcription, and target translation, with size ranging from 385 hours (Portuguese) to 504 hours (Spanish). We refer to the original paper for more details.
5.2 Training and decoding details
Our implementation is based on the ESPnet-ST toolkit [Inaguma et al., 2020] (https://github.com/espnet/espnet). In the following, we provide details for reproducing the results. The pipeline is identical for all experiments.
All experiments use the same encoder architecture with 12 layers. The decoder has 6 layers, except for the independent-decoder model where we also include an 8-layer version (independent++) to compare the effects of dual-attention against simply increasing the number of model parameters.
Text pre-processing
Transcriptions and translations were normalized and tokenized using the Moses tokenizer [Koehn et al., 2007]. The transcription was lower-cased and the punctuation was stripped. A joint BPE [Sennrich et al., 2016] with 8000 merge operations was learned on the concatenation of the English transcription and all target languages. We also experimented with two separate dictionaries (one for English and another for all target languages), but found that the results are worse.
Speech features
We used Kaldi [Povey et al., 2011] to extract 83-dimensional features (80-channel log Mel filter-bank coefficients and 3-dimensional pitch features) that were normalized by the mean and standard deviation computed on the training set. Following common practice [Inaguma et al., 2019, Wang et al., 2020c], utterances having more than 3000 frames or more than 400 characters were removed. For data augmentation, we used speed perturbation [Ko et al., 2015] with three speed factors and SpecAugment [Park et al., 2019] with three types of deterioration: time warping, time masking, and frequency masking.
Optimization
Following standard practice for training Transformer models, we used the Adam optimizer [Kingma and Ba, 2015] with the Noam learning rate schedule [Vaswani et al., 2017], in which the learning rate is linearly increased for the first 25K warm-up steps and then decreased proportionally to the inverse square root of the step counter. The initial learning rate and the Adam parameters (with $\epsilon = 10^{-9}$) were the same for all experiments. For decoding, we used a beam size of 10 and a length penalty of 0.5: for a hypothesis of length $L$, a penalty $pL$ with $p = 0.5$ is added to the (log probability) score of that hypothesis (Section 4.2). Therefore, longer hypotheses are favored, or equivalently, shorter hypotheses are “penalized”.
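The shape of the warm-up schedule described above can be sketched as follows. The function `noam_lr` and its `peak_lr` parameter are a hypothetical reparameterization introduced here for illustration: the schedule is expressed through the learning rate reached at the end of warm-up instead of the model dimension and scaling factor of the original formulation, and the value of `peak_lr` is not taken from the text.

```python
def noam_lr(step: int, peak_lr: float, warmup_steps: int = 25_000) -> float:
    """Linear warm-up to peak_lr, then decay proportional to 1 / sqrt(step)."""
    if step < 1:
        raise ValueError("step counts from 1")
    warmup = step / warmup_steps            # linear increase during warm-up
    decay = (warmup_steps / step) ** 0.5    # inverse square-root decay afterwards
    return peak_lr * min(warmup, decay)


# Hypothetical peak learning rate of 1.0, so the outputs read as fractions of the peak.
for step in (12_500, 25_000, 100_000):
    print(step, noam_lr(step, peak_lr=1.0))
```

The two branches meet at `step == warmup_steps`, where the learning rate reaches its maximum.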
5.3 Results and analysis
In this section, we report detokenized case-sensitive BLEU [Papineni et al., 2002] on the MuST-C dev sets (Table 1). (We also tried sacreBLEU [Post, 2018] and found that the results are identical.) Results on the test sets are discussed in Section 5.4. Following previous work [Inaguma et al., 2020], we remove non-verbal tokens in evaluation; this is for a fair comparison with the results of Inaguma et al. [2020], presented in Section 5.4. In Table 1, there are 3 main groups of models, corresponding to independent-decoder, cross dual-decoder (crx), and parallel dual-decoder (par), respectively. In particular, independent++ corresponds to an 8-decoder-layer model and will serve as our strongest baseline for comparison. Figure 2 shows the relative performance of some representative models with respect to this baseline, together with their validation accuracies. In the following, when comparing models, we implicitly mean “on average” (over the 8 languages), unless otherwise specified.
Under the same configurations, parallel models outperform their cross counterparts in terms of translation (line 5 vs. line 13, line 6 vs. line 14, and line 7 vs. line 16), showing an improvement in BLEU on average. In terms of recognition, however, the parallel architecture has on average a 0.33% higher (worse) WER compared to the cross models. On the other hand, parallel dual-decoders perform better than independent decoders in both translation and recognition tasks, except for the asymmetric case (line 12), the at-self and at-source with sum merging configuration (line 17), and the wait-$k$ model where ST is ahead of ASR (line 19). This shows that both the translation and recognition tasks can benefit from the tight interaction between the two decoders, i.e. the parallel models can improve on the independent architecture without trading off BLEU against WER. This is not the case, however, for the cross dual-decoders, which feature weaker interaction than the parallel ones. Interestingly, there is a slight trade-off between the parallel and cross designs: the parallel models are better in terms of BLEU but worse in terms of WER. This is to some extent similar to previous work, where models exhibit different types of trade-offs between BLEU and WER [He et al., 2011, Sperber et al., 2020, Chuang et al., 2020]. It should be emphasized that most of the dual-decoder models have fewer or the same number of parameters compared to independent++. This confirms our intuition that the tight connection between the two decoders in the parallel architecture improves performance. The cross dual-decoders perform relatively well compared to the baseline of two independent decoders with the same number of layers (6), but not so well compared to the stronger baseline with 8 layers.
Symmetric vs. asymmetric
In some experiments, we only allow the ST decoder to attend to the ASR decoder. For the cross dual-decoder, this did not yield noticeable improvements in terms of BLEU (line 4 vs. line 6), while for the parallel architecture, the results are worse (line 12 vs. line 16). The symmetric models also outperform the asymmetric counterparts in terms of WER (line 4 vs. line 6, line 12 vs. line 16). It is confirmed again that the two tasks are complementary and can help each other: removing the ASR-to-ST attention hurts performance. In fact, examining the learnable weights in the sum merging operator shows that the decoders learn to attend to each other, though at different rates. We observed that for the same layer depth, the ST decoder always attends more to the ASR one, and that for both of them the weight increases with the depth of the layer.
At-self dual-attention vs. at-source dual-attention
For the parallel dual-decoder, the at-source dual-attention produces better results than the at-self counterpart in terms of both BLEU and WER (line 14 vs. line 15), while the combination of both does not improve the results (line 17). For the concat merging, using both yields better results in terms of translation but slightly hurts the recognition task (line 16 vs. line 13).
Sum vs. concat merging
The impact of merging operators is not consistent across different models. If we focus on the parallel dual-decoder, sum is better for models with only at-source attention (line 13 vs. line 14) and concat is better for models using both at-self and at-source attention (line 16 vs. line 17).
Input normalization and learnable sum
Some experiments confirm the importance of normalizing the input fed to the dual-attention layers (i.e. the LayerNorm layers shown in Figure 1(c)). The results show that normalization substantially improves the performance in terms of both BLEU and WER (line 8 vs. line 11). It is also beneficial to use learnable weights rather than a fixed value for the sum merging operator (Equation (14)), as shown by the BLEU scores (line 9 vs. line 10). Note that the fixed weight and non-normalization configuration corresponds to the model of Liu et al. [2020] (line 10).
Wait-$k$ policy
We compare a non-wait-$k$ parallel dual-decoder (line 14) with its wait-$k$ ($k = 3$) counterparts. From the results, one can observe that letting ASR be ahead of ST (line 18) improves the performance in terms of both BLEU and WER, while letting ST be ahead of ASR (line 19) considerably worsens the results. This confirms our intuition that the ST task is more difficult and should not take the lead in the dual-decoder models.
ASR results
From the results (last column of Table 1), one can observe that the dual-decoder models outperform the baseline independent++, except for the asymmetric case and the wait-$k$ model where ST is 3 steps ahead of ASR. While using a single decoder leads to an average of 14.2% WER, all other symmetric architectures with two decoders (except the ASR-waits-for-ST) have better and rather stable WERs (from 12.1% to 13.0%). Detailed results for each data subset are provided in the Appendix.
5.4 Comparison to state-of-the-art
To avoid a hyper-parameter search over the test set, we only select three of our best models together with the baseline independent++ for evaluation. All three models are symmetric parallel dual-decoders: the first one has at-source dual-attention with sum merging, the second one has both at-self and at-source dual-attentions with concat merging, and the last one is a wait-$k$ model in which ASR is 3 steps ahead of ST. These models correspond to lines 5, 6, and 7 of Table 2, and will be referred to respectively as par++, par, and par in the sequel. For par++ we increase the number of decoder layers from 6 to 8, thus increasing the number of parameters from 48M to 51.2M, matching that of the baseline. We do not do this for par (48M) as this model already has a higher latency due to the wait-$k$ policy. All models are trained for 550K steps, corresponding to 25 epochs. Following Inaguma et al. [2020], we use the average of five checkpoints with the best validation accuracies on the dev sets for evaluation.
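The checkpoint averaging step can be sketched as below. This is a toy illustration and not the toolkit's implementation: the function `average_best_checkpoints` and the representation of a checkpoint as a dictionary of flat parameter lists are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]


def average_best_checkpoints(
    checkpoints: List[Tuple[float, Params]], n_best: int = 5
) -> Params:
    """Average, parameter by parameter, the n_best checkpoints with the
    highest validation accuracy. Each checkpoint is (accuracy, parameters)."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    best = sorted(checkpoints, key=lambda c: c[0], reverse=True)[:n_best]
    averaged: Params = {}
    for name in best[0][1]:
        # Group the i-th value of this parameter across all selected checkpoints.
        columns = zip(*(params[name] for _, params in best))
        averaged[name] = [sum(col) / len(best) for col in columns]
    return averaged


# Toy checkpoints with hypothetical accuracies and a single parameter "w".
ckpts = [
    (0.60, {"w": [0.0, 0.0]}),
    (0.70, {"w": [1.0, 2.0]}),
    (0.80, {"w": [3.0, 4.0]}),
]
print(average_best_checkpoints(ckpts, n_best=2))  # {'w': [2.0, 3.0]}
```

Averaging the weights yields a single model, so evaluation costs the same as for one checkpoint, unlike an ensemble.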
We compare the results with the previous work [Gangi et al., 2019] in the multilingual setting. In addition, to demonstrate the competitive performance of our models, we also include the best existing translation performance on MuST-C [Inaguma et al., 2020], although these results were obtained with bilingual systems and from a sophisticated training recipe. Indeed, to obtain the results for each language pair (e.g. en-de), Inaguma et al. [2020] pre-trained an ASR model and an MT model to initialize the weights of (respectively) the encoder and decoder for ST training. This means that to obtain the results for the 8 language pairs, 24 independent trainings had to be performed in total (3 for each language pair).
The results in Table 2 show that our models achieved very competitive performance compared to bilingual one-to-one models [Inaguma et al., 2020], despite being trained for only half the number of epochs. In particular, the par++ model achieved the best results, consistently surpassing the others on all languages (except on Russian where it is outperformed by the bilingual model). Our results also surpassed those of Gangi et al. [2019] by a large margin. We observe the largest improvements on Portuguese (lines 5, 6, and 7, compared to the bilingual result at line 1), which has the least data among the 8 language pairs in MuST-C. This phenomenon is also common in multilingual neural machine translation where multilingual joint training has been shown to improve performance on low-resource languages [Johnson et al., 2017].
Conclusion
We introduced a novel dual-decoder Transformer architecture for synchronous speech recognition and multilingual speech translation. Through a dual-attention mechanism, the decoders in this model are able to specialize in their tasks while at the same time being helpful to each other. The proposed model also generalizes previously proposed approaches using two independent (or weakly tied) decoders or chaining ASR and ST. It is also flexible enough to experiment with settings where ASR is ahead of ST, which makes it promising for (one-to-many) simultaneous speech translation. Experiments on the MuST-C dataset showed that our model achieved very competitive performance compared to the state of the art.
Acknowledgements
This work was supported by a Facebook AI SRA grant, and was granted access to the HPC resources of IDRIS under the allocations 2020-AD011011695 and 2020-AP011011765 made by GENCI. It was also done as part of the Multidisciplinary Institute in Artificial Intelligence MIAI@Grenoble-Alpes (ANR-19-P3IA-0003). We thank the anonymous reviewers for their insightful feedback.
References
Appendix
We present the detailed model architecture for the cross dual-decoder Transformer in Figure 3. Detailed WER results on the MuST-C dev set are presented in Table 3.