Direct speech-to-speech translation with a sequence-to-sequence model
Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu
Introduction
We address the task of speech-to-speech translation (S2ST): translating speech in one language into speech in another. This application is highly beneficial for breaking down communication barriers between people who do not share a common language. Specifically, we investigate whether it is possible to train a model to accomplish this task directly, without relying on an intermediate text representation. This is in contrast to conventional S2ST systems, which are often broken down into three components: automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis.
Cascaded systems have the potential problem of errors compounding between components, e.g. recognition errors leading to larger translation errors. Direct S2ST models avoid this issue by training to solve the task end-to-end. They also have advantages over cascaded systems in terms of reduced computational requirements and lower inference latency since only one decoding step is necessary, instead of three. In addition, direct models are naturally capable of retaining paralinguistic and non-linguistic information during translation, e.g. maintaining the source speaker’s voice, emotion, and prosody, in the synthesized translated speech. Finally, directly conditioning on the input speech makes it easy to learn to generate fluent pronunciations of words which do not need to be translated, such as names.
However, solving the direct S2ST task is especially challenging for several reasons. Fully-supervised end-to-end training requires collecting a large set of input/output speech pairs. Such data are more difficult to collect compared to parallel text pairs for MT, or speech-text pairs for ASR or TTS. Decomposing into smaller tasks can take advantage of the lower training data requirements compared to a monolithic speech-to-speech model, and can result in a more robust system for a given training budget. Uncertain alignment between two spectrograms whose underlying spoken content differs also poses a major training challenge.
In this paper we demonstrate Translatotron, a direct speech-to-speech translation model which is trained end-to-end (audio samples are available at https://google-research.github.io/lingvo-lab/translatotron). To facilitate training without predefined alignments, we leverage high level representations of the source or target content in the form of transcriptions, essentially multitask training with speech-to-text tasks. However, no intermediate text representation is used during inference. The model does not perform as well as a baseline cascaded system. Nevertheless, it demonstrates a proof of concept and serves as a starting point for future research.
Extensive research has studied methods for combining different sub-systems within cascaded speech translation systems. One line of work gave MT access to the lattice of the ASR. Another integrated acoustic and translation models using a stochastic finite-state transducer which can decode the translated text directly using Viterbi search. For synthesis, prior work used unsupervised clustering to find F0-based prosody features and transfer intonation from source speech and target; augmented MT to jointly predict translated words and emphasis, in order to improve expressiveness of the synthesized speech; used a neural network to transfer duration and power from the source speech to the target; and transferred the source speaker’s voice to the synthesized translated speech by mapping hidden Markov model states from ASR to TTS. Similarly, recent work on neural TTS has focused on adapting to new voices with limited reference data.
Initial approaches to end-to-end speech-to-text translation (ST) performed worse than a cascade of an ASR model and an MT model. Later work achieved better end-to-end performance by leveraging weakly supervised data with multitask learning. Subsequent work further showed that the use of synthetic training data can work better than multitask training. In this work we take advantage of both synthetic training targets and multitask training.
The proposed model resembles recent sequence-to-sequence models for voice conversion, the task of recreating an utterance in another person’s voice. For example, one such approach proposes an attention-based model to generate spectrograms in the target voice based on input features (spectrogram concatenated with ASR bottleneck features) from the source voice. In contrast to S2ST, the input-output alignment for voice conversion is simpler and approximately monotonic. That approach also trains models that are specific to each input-output speaker pair (i.e. one-to-one conversion), whereas we explore many-to-one and many-to-many speaker configurations. Finally, an attention-based direct S2ST model has previously been demonstrated on a toy dataset with a 100 word vocabulary. In this work we train on real speech, including spontaneous telephone conversations, at a much larger scale.
Speech-to-speech translation model
An overview of the proposed Translatotron model architecture is shown in Figure 1. Following prior work, it is composed of several separately trained components: 1) an attention-based sequence-to-sequence network (blue) which generates target spectrograms, 2) a vocoder (red) which converts target spectrograms to time-domain waveforms, and 3) optionally, a pretrained speaker encoder (green) which can be used to condition the decoder on the identity of the source speaker, enabling cross-language voice conversion simultaneously with translation.
The sequence-to-sequence encoder stack maps 80-channel log-mel spectrogram input features into hidden states which are passed through an attention-based alignment mechanism to condition an autoregressive decoder, which predicts 1025-dim log spectrogram frames corresponding to the translated speech. Two optional auxiliary decoders, each with their own attention components, predict source and target phoneme sequences.
Following recent speech translation and recognition models, the encoder is composed of a stack of 8 bidirectional LSTM layers. As shown in Fig. 1, the final layer output is passed to the primary decoder, whereas intermediate activations are passed to auxiliary decoders predicting phoneme sequences. We hypothesize that early layers of the encoder are more likely to represent the source content well, while deeper layers might learn to encode more information about the target content.
The spectrogram decoder uses an architecture similar to the Tacotron 2 TTS model, including pre-net, autoregressive LSTM stack, and post-net components. We make several changes to it in order to adapt to the more challenging S2ST task. We use multi-head additive attention with 4 heads instead of location-sensitive attention, which shows better performance in our experiments. We also use a significantly narrower 32-dimensional pre-net bottleneck compared to the 256-dim bottleneck in Tacotron 2, which we find to be critical in picking up attention during training. We also use a reduction factor of 2, i.e. predicting two spectrogram frames for each decoding step. Finally, consistent with results on translation tasks, we find that using a deeper decoder containing 4 or 6 LSTM layers leads to good performance.
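The reduction factor can be illustrated with a small sketch: each decoder step emits one vector that is split into two consecutive spectrogram frames. The helper name `unpack_decoder_steps` and the toy frame size are assumptions made for illustration; this is not the paper's Lingvo implementation.

```python
from typing import List


def unpack_decoder_steps(step_outputs: List[List[float]],
                         frame_dim: int,
                         reduction_factor: int = 2) -> List[List[float]]:
    """Split each decoder-step output into `reduction_factor` frames."""
    frames = []
    for step in step_outputs:
        if len(step) != frame_dim * reduction_factor:
            raise ValueError("step output has unexpected size")
        for r in range(reduction_factor):
            frames.append(step[r * frame_dim:(r + 1) * frame_dim])
    return frames


# Toy example: 3 decoder steps with 4-dim frames
# (the model described above predicts 1025-dim frames).
steps = [[float(i)] * 8 for i in range(3)]
frames = unpack_decoder_steps(steps, frame_dim=4)
print(len(frames))  # 6 frames from 3 decoder steps
```

With a reduction factor of 2, the decoder needs half as many autoregressive steps to cover the same output duration.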
We find that multitask training is critical in solving the task, which we accomplish by integrating auxiliary decoder networks to predict phoneme sequences corresponding to the source and/or target speech. Losses computed using these auxiliary recognition networks are used during training, which help the primary spectrogram decoder to learn attention. They are not used during inference. In contrast to the primary decoder, the auxiliary decoders use 2-layer LSTMs with single-head additive attention. All three decoders use attention dropout and LSTM zoneout regularization, all with probability 0.1. Training uses the Adafactor optimizer with a batch size of 1024.
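As a rough sketch of how such a multitask objective might be combined, the function below adds the auxiliary phoneme losses to the primary spectrogram loss. The weighted-sum form, the function name `multitask_loss`, and the single shared `aux_weight` are assumptions for illustration; the text above does not specify how the losses are combined.

```python
def multitask_loss(spectrogram_loss: float,
                   source_phoneme_loss: float = 0.0,
                   target_phoneme_loss: float = 0.0,
                   aux_weight: float = 1.0) -> float:
    """Training objective: primary loss plus weighted auxiliary losses.

    Assumption: a plain weighted sum. Setting an auxiliary loss to 0.0
    corresponds to training without that auxiliary decoder.
    """
    auxiliary = source_phoneme_loss + target_phoneme_loss
    return spectrogram_loss + aux_weight * auxiliary


# Toy loss values, chosen only to show the arithmetic.
total = multitask_loss(2.0, source_phoneme_loss=0.5,
                       target_phoneme_loss=1.5, aux_weight=0.5)
print(total)  # 3.0
```

At inference time only the spectrogram decoder is run, so the auxiliary terms add no cost to decoding.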
Since we are only demonstrating a proof of concept, we primarily rely on the low-complexity Griffin-Lim vocoder in our experiments. However, we use a WaveRNN neural vocoder when evaluating speech naturalness in listening tests.
Finally, in order to control the output speaker identity we incorporate an optional speaker encoder network, as in prior work. This network is discriminatively pretrained on a speaker verification task and is not updated during the training of Translatotron. We use the dvector V3 model, trained on a larger set of 851K speakers across 8 languages including English and Spanish. The model computes a 256-dim speaker embedding from the speaker reference utterance, which is passed into a linear projection layer (trained with the sequence-to-sequence model) to reduce the dimensionality to 16. This is critical to generalizing to source language speakers which are unseen during training.
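The bottleneck on the speaker embedding amounts to a single learned linear map from 256 to 16 dimensions. The sketch below shows the shape of that operation with randomly initialized weights; the function name `linear_projection`, the initialization, and the use of a bias term are assumptions, and in the actual model the projection is trained jointly with the sequence-to-sequence network.

```python
import random


def linear_projection(embedding, weights, bias):
    """Compute weights @ embedding + bias using plain Python lists."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
# Stand-in for a 256-dim d-vector produced by the frozen speaker encoder.
d_vector = [rng.gauss(0.0, 1.0) for _ in range(256)]
# Hypothetical projection parameters (random here, learned in practice).
weights = [[rng.gauss(0.0, 0.05) for _ in range(256)] for _ in range(16)]
bias = [0.0] * 16

projected = linear_projection(d_vector, weights, bias)
print(len(projected))  # 16
```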
Experiments
We study two Spanish-to-English translation datasets: the large scale “conversational” corpus of parallel text and read speech pairs, and the Spanish Fisher corpus of telephone conversations and corresponding English translations, which is smaller and more challenging due to the spontaneous and informal speaking style. In Sections 3.1 and 3.2, we synthesize target speech from the target transcript using a single (female) speaker English TTS system; in Section 3.4, we use real human target speech for voice transfer experiments on the conversational dataset. Models were implemented using the Lingvo framework. See Table 1 for dataset-specific hyperparameters.
To evaluate speech-to-speech translation performance we compute BLEU scores as an objective measure of speech intelligibility and translation quality, by using a pretrained ASR system to recognize the generated speech, and comparing the resulting transcripts to ground truth reference translations. Due to potential recognition errors (see Figure 2), this can be thought of as a lower bound on the underlying translation quality. We use a 16k Word-Piece attention-based ASR model trained on the 960 hour LibriSpeech corpus, which obtained word error rates of 4.7% and 13.4% on the test-clean and test-other sets, respectively. In addition, we conduct listening tests to measure subjective speech naturalness mean opinion score (MOS), as well as speaker similarity MOS for voice transfer.
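The scoring step of this evaluation can be sketched as follows, starting from ASR transcripts of the generated speech. This is a simplified corpus-level BLEU (single reference, whitespace tokenization, no smoothing) written for illustration; it is an assumption that it approximates the scorer used in the experiments, and the example strings are made up rather than taken from the datasets.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 100], one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.lower().split(), ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g])
                                  for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t)
                        for m, t in zip(matches, totals)) / max_n
    brevity_penalty = (1.0 if hyp_len > ref_len
                       else math.exp(1.0 - ref_len / hyp_len))
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Hypothetical ASR transcripts of synthesized speech and their references.
asr_transcripts = ["the cat sat on the mat today"]
reference_translations = ["the cat sat on the mat"]
print(corpus_bleu(asr_transcripts, reference_translations))
```

Because an ASR error in the transcript lowers the n-gram overlap even when the synthesized speech is correct, scores computed this way underestimate translation quality.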
3.1 Conversational Spanish-to-English
This proprietary dataset was obtained by crowdsourcing humans to read both sides of a conversational Spanish-English MT dataset. In this section, instead of using the human target speech, we use a TTS model to synthesize target speech in a single female English speaker’s voice in order to simplify the learning objective. We use an English Tacotron 2 TTS model but use a Griffin-Lim vocoder for expediency. In addition, we augment the input source speech by adding background noise and reverberation, following prior work.
The resulting dataset contains 979k parallel utterance pairs, containing 1.4k hours of source speech and 619 hours of synthesized target speech. The total target speech duration is much smaller because the TTS output is better endpointed, and contains fewer pauses. 9.6k pairs are held out for testing.
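The gap between source and target durations is easier to see per utterance. The short calculation below uses only the rounded totals quoted above, so the averages are approximate.

```python
# Rounded totals quoted in the text.
num_pairs = 979_000
source_hours = 1_400
target_hours = 619

avg_source_sec = source_hours * 3600 / num_pairs
avg_target_sec = target_hours * 3600 / num_pairs

print(round(avg_source_sec, 1))  # about 5.1 seconds per source utterance
print(round(avg_target_sec, 1))  # about 2.3 seconds per synthesized target
```

On average the synthesized target is less than half as long as the source recording.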
Input feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram, following prior work. The speaker encoder was not used in these experiments since the target speech always came from the same speaker.
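Frame stacking can be sketched as below. The text does not say whether the stacked windows overlap, so the stride of 3 (non-overlapping groups, trailing remainder dropped) is an assumption, as is the helper name `stack_frames`.

```python
def stack_frames(frames, stack=3, stride=3):
    """Concatenate `stack` adjacent frames into one input feature vector.

    Assumption: windows advance by `stride` frames and an incomplete
    trailing window is dropped.
    """
    stacked = []
    for start in range(0, len(frames) - stack + 1, stride):
        merged = []
        for frame in frames[start:start + stack]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


# Toy spectrogram: 10 frames of 80 log-mel channels each.
log_mel = [[float(t)] * 80 for t in range(10)]
stacked = stack_frames(log_mel)
print(len(stacked), len(stacked[0]))  # 3 240
```

Under this assumption the encoder sees 240-dim inputs at one third of the original frame rate, which shortens the sequence the attention mechanism has to align.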
Table 2 shows performance of the model trained using different combinations of auxiliary losses, compared to a baseline ST→TTS cascade model using a speech-to-text translation model trained on the same data, and the same Tacotron 2 TTS model used to synthesize training targets. Note that the ground truth BLEU score is below 100 due to ASR errors during evaluation, or TTS failure when synthesizing the ground truth.
Training without auxiliary losses leads to extremely poor performance. The model correctly synthesizes common words and simple phrases, e.g. translating “hola” to “hello”. However, it does not consistently translate full utterances. While it always generates plausible speech sounds in the target voice, the output can be independent of the input, composed of a string of nonsense syllables. This is consistent with failure to learn to attend to the input, and reflects the difficulty of the direct S2ST task.
Integrating auxiliary phoneme recognition tasks helped regularize the encoder and enabled the model to learn attention, dramatically improving performance. The target phoneme PER is much higher than on source phonemes, reflecting the difficulty of the corresponding translation task. Training using both auxiliary tasks achieved the best quality, but the performance difference between different combinations is small. Overall, there remains a gap of 6 BLEU points to the baseline, indicating room for improvement. Nevertheless, the relatively narrow gap demonstrates the potential of the end-to-end approach.
3.2 Fisher Spanish-to-English
This dataset contains about 120k parallel utterance pairs (a subset of the Fisher data, due to TTS errors on target text), spanning 127 hours of source speech. Target speech is synthesized using Parallel WaveNet with the same voice as the previous section. The result contains 96 hours of synthetic target speech.
Following prior work, input features were constructed by stacking 80-channel log-mel spectrograms, with deltas and accelerations. Given the small size of the dataset compared to that in Sec. 3.1, we found that obtaining good performance required significantly more careful regularization and tuning. As shown in Table 1, we used a narrower encoder dimension of 256, a shallower 4-layer decoder, and added Gaussian weight noise to all LSTM weights as regularization. The model was especially sensitive to the auxiliary decoder hyperparameters, with the best performance coming when passing activations from intermediate layers of the encoder stack as inputs to the auxiliary decoders, using slightly more aggressive dropout of 0.3, and decaying the auxiliary loss weight over the course of training in order to encourage the model training to fit the primary S2ST task.
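One way to decay the auxiliary loss weight is an exponential schedule, sketched below. The schedule shape, the function name `auxiliary_loss_weight`, and every default value are assumptions for illustration; the text does not state which schedule or constants were used.

```python
def auxiliary_loss_weight(step: int,
                          initial_weight: float = 1.0,
                          final_weight: float = 0.1,
                          decay_steps: int = 100_000) -> float:
    """Exponentially interpolate from initial_weight to final_weight.

    The weight stays at final_weight once `step` exceeds `decay_steps`.
    All defaults are hypothetical.
    """
    progress = min(max(step / decay_steps, 0.0), 1.0)
    return initial_weight * (final_weight / initial_weight) ** progress


for step in (0, 50_000, 100_000, 200_000):
    print(step, auxiliary_loss_weight(step))
```

Early in training the phoneme losses dominate and help the model learn attention; as the weight shrinks, the optimizer focuses on the primary spectrogram loss.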
Experiment results are shown in Table 3. Once again using two auxiliary losses works best, but in contrast to Section 3.1, there is a large performance boost relative to using either one alone. Performance using only the source recognition loss is very poor, indicating that learning alignment on this task is especially difficult without strong supervision on the translation task.
We found that 4-head attention works better than one head, unlike the conversational task, where both attention mechanisms had similar performance. Finally, consistent with prior work, we find that pre-training the bottom 6 encoder layers on an ST task improves BLEU scores by over 5 points. This is the best performing direct S2ST model, obtaining 76% of the baseline performance.
3.3 Subjective evaluation of speech naturalness
To evaluate synthesis quality of the best performing models from Tables 2 and 3 we use the framework from prior work to crowdsource 5-point MOS evaluations based on subjective listening tests. 1k examples were rated for each dataset, each one by a single rater. Although this evaluation is expected to be independent of the correctness of the translation, translation errors can result in low scores for examples raters describe as “not understandable”.
Results are shown in Table 4, comparing different vocoders where results with Griffin-Lim correspond to identical model configurations as Sections 3.1 and 3.2. As expected, using WaveRNN vocoders dramatically improves ratings over Griffin-Lim into the “Very Good” range (above 4.0). Note that it is most fair to compare the Griffin-Lim results to the ground truth training targets since they were generated using corresponding lower quality vocoders. In such a comparison it is clear that the S2ST models do not score as highly as the ground truth.
Finally, we note the similar performance gap between Translatotron and the baseline under this evaluation. In part, this is a consequence of the different types of errors made by the two models. For example, Translatotron sometimes mispronounces words, especially proper nouns, using pronunciations from the source language, e.g. mispronouncing the /ae/ vowel in “Dan” as /ah/, consistent with Spanish but sounding less natural to English listeners, whereas by construction, the baseline consistently projects results to English. Figure 2 demonstrates other differences in behavior, where Translatotron reproduces the input “eh” disfluency (transcribed as “a”, between sec in the bottom row of the figure), but the cascade does not. It is also interesting to note that the cascade translates “Guillermo” to its English form “William”, whereas Translatotron speaks the Spanish name (although the ASR model mistranscribes it as “of the ermo”), suggesting that the direct model might have a bias toward more directly reconstructing the input. Similarly, in example 7 on the companion page Translatotron reconstructs “pasejo” as “passages” instead of “tickets”, potentially reflecting a bias for cognates. We leave detailed analysis to future work.
3.4 Cross language voice transfer
In our final experiment, we synthesize translated speech using the voice of the source speaker by training the full model depicted in Figure 1. The speaker encoder is conditioned on the ground truth target speaker during training. We use a subset of the data from Sec. 3.1 for which we have paired source and target recordings. Note that the source and target speakers for each pair are always different; the data was not collected from bilingual speakers. This dataset contains 606k utterance pairs, resampled to 16 kHz, with 863 and 493 hours of source and target speech, respectively; 6.3k pairs, a subset of that from Sec. 3.1, are held out for testing. Since target recordings contained noise, we apply the denoising and volume normalization from prior work to improve output quality.
Table 5 compares performance using different conditioning strategies. The top row transfers the source speaker’s voice to the translated speech, while row two is a “cheating” configuration since the speaker embedding can potentially leak information about the target content to the decoder. To verify that this does not negatively impact performance, we also condition on random target utterances in row three. In all cases performance is worse than models trained on synthetic targets in Tables 2 and 3. This is because the task of synthesizing arbitrary speakers is more difficult; the training targets are much noisier and the training set is much smaller; and the ASR model used for evaluation makes more errors on the noisy, multispeaker targets. In terms of BLEU score, the difference between conditioning on ground truth and random targets is very small, verifying that content leak is not a concern (in part due to the low speaker embedding dimension). However, conditioning on the source trails by 1.8 BLEU points, reflecting the mismatch in conditioning language between the training and inference configurations. Naturalness MOS scores are close in all cases. However, conditioning on the source speaker significantly reduces similarity MOS by 1.4 points. Again, this suggests that using English speaker embeddings during training does not generalize well to Spanish speakers.
Conclusions
We present a direct speech-to-speech translation model, trained end-to-end. We find that it is important to use speech transcripts during training, but no intermediate speech transcription is necessary for inference. Exploring alternate training strategies which alleviate this requirement is an interesting direction for future work. The model achieves high translation quality on two Spanish-to-English datasets, although performance is not as good as a baseline cascade of ST and TTS models.
In addition, we demonstrate a variant which simultaneously transfers the source speaker’s voice to the translated speech. The voice transfer does not work as well as in a similar TTS context, reflecting the difficulty of the cross-language voice transfer task, as well as evaluation. Potential strategies to improve voice transfer performance include improving the speaker encoder by adding a language adversarial loss, or by incorporating a cycle-consistency term into the S2ST loss.
Other future work includes utilizing weak supervision to scale up training with synthetic data or multitask learning, and transferring prosody and other acoustic factors from the source speech to the translated speech.
Acknowledgements
The authors thank Quan Wang, Jason Pelecanos and the Google Speech team for providing the multilingual speaker encoder, Tom Walters and the Deepmind team for help with WaveNet TTS, Quan Wang, Heiga Zen, Patrick Nguyen, Yu Zhang, Jonathan Shen, Orhan Firat, and the Google Brain team for helpful discussions, and Mengmeng Niu for data collection support.